Implementation:Neuml Txtai RetrieveTask
| Knowledge Sources | |
|---|---|
| Domains | Workflow, File Retrieval, ETL |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete tool, provided by txtai, for retrieving files from remote or local URLs into a local directory as a workflow step.
Description
The RetrieveTask class extends UrlTask to retrieve files from URLs (local or remote) into a local directory. It only accepts elements that match a URL pattern, a check inherited from the parent UrlTask. For each URL, it extracts the path component, optionally flattens the directory structure down to the file name, and retrieves the file using Python's urllib.request.urlretrieve. When no output directory is specified, a temporary directory is created automatically and kept for the task's lifetime. The task replaces each input URL with the local path of the retrieved file.
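The steps described above can be approximated with the standard library alone. The following is a minimal sketch, not the txtai source: the helper names `accepts` and `retrieve_flat` and the `URL_PREFIX` pattern are assumptions made for illustration, and a local `file://` URL is used so the example runs without network access.

```python
import os
import re
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Assumed URL pattern (scheme followed by "://"); the real check lives in UrlTask
URL_PREFIX = re.compile(r"\w+://")


def accepts(element):
    """Hypothetical filter: True when element is a string that looks like a URL."""
    return isinstance(element, str) and bool(URL_PREFIX.match(element.lower()))


def retrieve_flat(url, directory):
    """Hypothetical helper: copy the resource at url into directory, keeping only the file name."""
    target = os.path.join(directory, os.path.basename(urlparse(url).path))
    urlretrieve(url, target)
    return target


# Both temporary directories are removed when the with block exits
with tempfile.TemporaryDirectory() as source_dir, tempfile.TemporaryDirectory() as work_dir:
    # Create a local source file and express its location as a file:// URL
    source = Path(source_dir).resolve() / "report.txt"
    source.write_text("hello", encoding="utf-8")
    url = source.as_uri()

    # Only URL-like elements are retrieved; each is replaced by its local path
    elements = [url, "not a url"]
    local_paths = [retrieve_flat(e, work_dir) for e in elements if accepts(e)]

    # Read the retrieved copy back while the temporary directory still exists
    with open(local_paths[0], encoding="utf-8") as handle:
        content = handle.read()
```

Tying the temporary directory to a `with` block mirrors the idea that retrieved files stay available only as long as their owner is in scope; paths returned by a task that uses an auto-created temporary directory should be consumed while the task object is still alive.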
Usage
Use RetrieveTask in txtai workflows when you need to download files from URLs before processing them in subsequent workflow steps. This is common in ETL pipelines where source data resides on remote servers or cloud storage accessible via HTTP URLs. Configure it with a target directory and optionally disable directory flattening to preserve the source directory structure.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/workflow/task/retrieve.py
Signature
class RetrieveTask(UrlTask):
    def register(self, directory=None, flatten=True)
    def prepare(self, element)
Import
from txtai.workflow.task.retrieve import RetrieveTask
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| elements | list | Yes | List of URL strings (local file:// or remote http(s)://) to retrieve |
| directory | str | No (register) | Local directory to store retrieved files; defaults to an auto-created temporary directory |
| flatten | bool | No (register) | If True (default), flattens the directory structure using only the basename; if False, preserves the URL path structure |
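The effect of `flatten` on the stored path can be sketched as a pure function. This is an illustration under stated assumptions rather than the library's code: `local_path` is a hypothetical helper, and the URLs and directory are placeholders.

```python
import os
from urllib.parse import urlparse


def local_path(url, directory, flatten=True):
    """Hypothetical helper: map a URL to the local path a retrieved file would be stored at."""
    path = urlparse(url).path
    if flatten:
        # Keep only the file name
        return os.path.join(directory, os.path.basename(path))
    # Keep the URL path below directory; the leading slash is stripped
    # because os.path.join would otherwise discard directory
    return os.path.join(directory, os.path.normpath(path.lstrip("/")))


url = "https://example.com/data/subdir/file1.pdf"
flat = local_path(url, "/tmp/downloads")
nested = local_path(url, "/tmp/downloads", flatten=False)

# With flattening, URLs that share a file name map to the same local path
first = local_path("https://example.com/a/x.txt", "/tmp/downloads")
second = local_path("https://example.com/b/x.txt", "/tmp/downloads")
collision = first == second
```

Because flattened paths keep only the basename, two source files with the same name in different remote directories would land on the same local file. Setting `flatten=False` avoids that by preserving the URL path.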
Outputs
| Name | Type | Description |
|---|---|---|
| local paths | list[str] | List of local file paths where the retrieved files were stored, replacing the original URL elements |
Usage Examples
from txtai.workflow.task.retrieve import RetrieveTask

# Create a retrieve task that downloads files to a specific directory
retrieve = RetrieveTask(
    directory="/tmp/downloads",
    flatten=True
)

# Download files from URLs
local_paths = retrieve([
    "https://example.com/data/file1.pdf",
    "https://example.com/data/file2.txt"
])
# Returns ["/tmp/downloads/file1.pdf", "/tmp/downloads/file2.txt"]

# Preserve directory structure
retrieve_nested = RetrieveTask(
    directory="/tmp/downloads",
    flatten=False
)

local_paths = retrieve_nested([
    "https://example.com/data/subdir/file1.pdf"
])
# Returns ["/tmp/downloads/data/subdir/file1.pdf"]