Implementation:Neuml Txtai RetrieveTask

From Leeroopedia


Knowledge Sources
Domains: Workflow, File Retrieval, ETL
Last Updated: 2026-02-10 01:00 GMT

Overview

Concrete tool, provided by txtai, for retrieving files from remote and local URLs into a local directory as a workflow step.

Description

The RetrieveTask class extends UrlTask to download files from URLs (local or remote) to a local directory. It only accepts elements that match a URL pattern, a filter inherited from the parent UrlTask. For each URL, it extracts the path component, derives a local target path (by default flattening the directory structure down to the file name), and retrieves the file using Python's urllib.request.urlretrieve. When no output directory is specified, a temporary directory is created automatically and kept for the task's lifetime. The task replaces each input URL with the local path of the downloaded file.
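
The path handling described above can be sketched with the standard library alone. This is a minimal illustration, not the txtai source: the function name target_path and the "downloads" directory are hypothetical, and the nested branch assumes the URL path is mirrored below the target directory.

```python
import os
from urllib.parse import urlparse


def target_path(url, directory, flatten=True):
    """Map a URL to the local path a download would be written to (hypothetical helper)."""
    # Keep only the path component, e.g. "/data/subdir/file1.pdf"
    path = urlparse(url).path

    if flatten:
        # Flattened layout: keep just the file name
        return os.path.join(directory, os.path.basename(path))

    # Nested layout: mirror the URL path below the target directory
    return os.path.join(directory, os.path.normpath(path.lstrip("/")))


url = "https://example.com/data/subdir/file1.pdf"
flat = target_path(url, "downloads")
nested = target_path(url, "downloads", flatten=False)
print(flat)
print(nested)
```

In this sketch, two URLs that end in the same file name map to the same local path when flatten is True, so the nested layout is the safer choice when file names may repeat across source directories.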

Usage

Use RetrieveTask in txtai workflows when you need to download files from URLs before processing them in subsequent workflow steps. This is common in ETL pipelines where source data resides on remote servers or cloud storage accessible via HTTP URLs. Configure it with a target directory and optionally disable directory flattening to preserve the source directory structure.

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/workflow/task/retrieve.py

Signature

class RetrieveTask(UrlTask):
    def register(self, directory=None, flatten=True)
    def prepare(self, element)

Import

from txtai.workflow.task.retrieve import RetrieveTask

I/O Contract

Inputs

Name | Type | Required | Description
elements | list | Yes | List of URL strings (local file:// or remote http(s)://) to retrieve
directory | str | No (register) | Local directory to store retrieved files; defaults to an auto-created temporary directory
flatten | bool | No (register) | If True (default), flattens the directory structure using only the basename; if False, preserves the URL path structure
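
The default for directory can be illustrated with a short sketch. It assumes the temporary directory is backed by Python's tempfile.TemporaryDirectory, which the page does not state; the helper name resolve_directory is hypothetical.

```python
import os
import tempfile


def resolve_directory(directory=None):
    """Return (path, handle); handle is a TemporaryDirectory or None (hypothetical helper)."""
    handle = None
    if not directory:
        # No directory given: create a temporary one and keep the handle,
        # because the directory is removed once the handle is cleaned up
        handle = tempfile.TemporaryDirectory()
        directory = handle.name

    # Create the output directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)
    return directory, handle


path, handle = resolve_directory()
print(os.path.isdir(path))  # True while the handle is alive
handle.cleanup()
print(os.path.isdir(path))  # False after cleanup
```

Under this assumption, files in an auto-created directory should be treated as short-lived: anything that must outlive the task is better written to an explicit directory.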

Outputs

Name | Type | Description
local paths | list[str] | List of local file paths where the retrieved files were stored, replacing the original URL elements
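
The URL-in, path-out contract can be tried without network access by pointing urllib.request.urlretrieve at a file:// URL. The sketch below is a stand-in for the task rather than the task itself: retrieve_all is a hypothetical function, and it implements only the flattened layout.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def retrieve_all(urls, directory):
    """Copy each URL into directory and return the local paths, in input order."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for url in urls:
        # Flattened layout: name the local file after the last URL path segment
        path = os.path.join(directory, os.path.basename(urlparse(url).path))
        urlretrieve(url, path)
        paths.append(path)
    return paths


# Build a throwaway source file and address it with a file:// URL
source_dir = tempfile.mkdtemp()
source = Path(source_dir) / "report.txt"
source.write_text("hello")

target_dir = tempfile.mkdtemp()
local_paths = retrieve_all([source.as_uri()], target_dir)
print(local_paths)
```

The returned list has one entry per input URL, in the same order, which is what lets a later workflow step consume local paths in place of the original URLs.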

Usage Examples

from txtai.workflow.task.retrieve import RetrieveTask

# Create a retrieve task that downloads files to a specific directory
retrieve = RetrieveTask(
    directory="/tmp/downloads",
    flatten=True
)

# Download files from URLs (the example.com addresses are placeholders; substitute reachable URLs)
local_paths = retrieve([
    "https://example.com/data/file1.pdf",
    "https://example.com/data/file2.txt"
])
# With reachable URLs, returns ["/tmp/downloads/file1.pdf", "/tmp/downloads/file2.txt"]

# Preserve directory structure
retrieve_nested = RetrieveTask(
    directory="/tmp/downloads",
    flatten=False
)

local_paths = retrieve_nested([
    "https://example.com/data/subdir/file1.pdf"
])
# With a reachable URL, returns ["/tmp/downloads/data/subdir/file1.pdf"]
