Workflow: Neuml Txtai Workflow Orchestration
| Knowledge Sources | |
|---|---|
| Domains | Workflow_Orchestration, Pipeline_Composition, NLP |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
End-to-end process for composing and executing deterministic multi-step data processing workflows using txtai's pipeline and task framework.
Description
This workflow covers the creation of deterministic processing pipelines that chain multiple AI and data transformation tasks together. Unlike agents, which make dynamic tool choices at runtime, workflows follow a fixed sequence of steps, making them predictable and reproducible. Each step is a Task wrapping either a pipeline (text extraction, summarization, translation, LLM generation, etc.) or a custom function. Workflows process data in batches, support concurrent execution of task actions, and can be scheduled using cron expressions.
Usage
Execute this workflow when you need a predictable, repeatable data processing pipeline that chains multiple operations. Workflows are ideal for ETL-style processes such as extracting text from URLs, summarizing content, translating output, and exporting results. Use workflows instead of agents when the processing steps are known in advance and do not require dynamic reasoning.
Execution Steps
Step 1: Define Pipelines
Instantiate the individual pipeline components that will form the workflow steps. txtai provides pipelines for text extraction (Textractor), summarization (Summary), translation (Translation), LLM generation (LLM), and many more. Each pipeline is a self-contained unit that processes input data and returns output.
Available pipeline categories:
- Data: Textractor, Segmentation, Tabular, Tokenizer, FileToHTML, HTMLToMarkdown
- Text: Summary, Translation, Entity, Labels, Similarity, Reranker
- LLM: LLM (prompt-based generation), RAG (retrieval-augmented generation)
- Audio: Transcription, TextToSpeech, TextToAudio
- Image: Caption, Objects, ImageHash
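The key contract is that every pipeline, once instantiated, is called like a function on an element or a batch of elements. A minimal pure-Python sketch of that contract, using a word-truncating stand-in in place of a real txtai pipeline such as Summary (the class and names below are illustrative, not the txtai API):

```python
# Sketch of the pipeline contract: a pipeline is a callable that maps
# input text (or a batch of texts) to transformed output.
# TruncateSummary is a stand-in for a real txtai pipeline such as Summary().

class TruncateSummary:
    """Illustrative pipeline: 'summarizes' by keeping the first N words."""

    def __init__(self, maxwords=5):
        self.maxwords = maxwords

    def __call__(self, texts):
        # Pipelines accept a single element or a batch; handle both.
        if isinstance(texts, str):
            return self.transform(texts)
        return [self.transform(text) for text in texts]

    def transform(self, text):
        return " ".join(text.split()[: self.maxwords])

summary = TruncateSummary(maxwords=3)
print(summary("the quick brown fox jumps over the lazy dog"))
# → the quick brown
```

Because every pipeline exposes this uniform callable interface, any of them (or any plain function) can later be dropped into a Task without adapter code.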
Step 2: Create Tasks
Wrap each pipeline in a Task object. Tasks control how data flows through the workflow, including filtering (only processing certain element types), merging results from multiple actions, and error handling. Specialized task types exist for common patterns.
Task types:
- Task: base task, wraps any callable
- UrlTask: filters for URL inputs before processing
- FileTask: filters for local file paths
- ImageTask: filters for image files
- RetrieveTask: downloads URLs to local temporary files
- ServiceTask: makes HTTP requests to external services
- StorageTask: lists files from cloud storage providers
- TemplateTask: applies text templates for prompt generation
- ExportTask: writes output to files
- ConsoleTask: prints output for debugging
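The filtering behavior these task types share can be sketched in pure Python: a task wraps an action plus an optional predicate, applies the action only to matching elements, and passes everything else through unchanged. This mirrors the UrlTask pattern described above, but the class here is illustrative, not txtai's Task implementation:

```python
# Sketch of Task filtering: a task wraps an action and only applies it to
# elements matching an optional predicate; non-matching elements pass through.
# Modeled on the Task/UrlTask behavior described above (illustrative only).

class SimpleTask:
    def __init__(self, action, select=None):
        self.action = action
        self.select = select  # predicate deciding which elements to process

    def __call__(self, elements):
        return [
            self.action(x) if (self.select is None or self.select(x)) else x
            for x in elements
        ]

# UrlTask-like filter: only process elements that look like URLs
urltask = SimpleTask(lambda x: f"fetched:{x}", select=lambda x: x.startswith("http"))
print(urltask(["https://example.com", "plain text"]))
# → ['fetched:https://example.com', 'plain text']
```

Mixed-type inputs can therefore flow through a single workflow: each task picks out the elements it knows how to handle and leaves the rest for later tasks.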
Step 3: Compose the Workflow
Create a Workflow instance by providing the ordered list of tasks. Configure batch size for processing efficiency and optionally set the number of concurrent workers for parallel task execution within each step.
Configuration options:
- tasks: ordered list of Task objects defining the pipeline
- batch: number of elements to process per batch (default 100)
- workers: number of concurrent workers for parallel action execution
- name: optional workflow name for logging and scheduling
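The configuration surface above can be captured in a small sketch. The dataclass below mirrors the shape of txtai's `Workflow(tasks, batch, workers, name)` signature but is illustrative only; it shows how the four options fit together rather than reimplementing the library:

```python
# Sketch of workflow composition: an ordered task list plus batching config.
# Field names mirror the options described above; the class itself is
# illustrative, not txtai's Workflow implementation.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class WorkflowConfig:
    tasks: List[Callable]           # ordered tasks, applied in sequence
    batch: int = 100                # elements processed per batch
    workers: Optional[int] = None   # concurrent workers (None = serial)
    name: Optional[str] = None      # optional name for logging/scheduling

config = WorkflowConfig(
    tasks=[str.strip, str.lower],   # plain functions work as task actions
    batch=2,
    name="cleanup",
)
print(config.name, config.batch, len(config.tasks))
# → cleanup 2 2
```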
Step 4: Execute the Workflow
Run the workflow by calling it with an iterable of input elements. The workflow processes elements in batches, passing each batch sequentially through all tasks. The result is a generator that yields transformed elements.
What happens:
- Input elements are split into batches
- Each batch flows through every task in sequence
- Tasks transform, filter, or enrich elements
- Results are yielded as a generator for memory-efficient processing
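The execution loop described above can be sketched as a pair of pure-Python generators: split the input into batches, pass each batch through every task in order, and yield results lazily. This is illustrative logic only; txtai's Workflow implements the same flow with Task objects:

```python
# Sketch of workflow execution: inputs are split into batches, each batch
# flows through every task in sequence, and results are yielded lazily
# so large inputs never need to be materialized at once.

def run_workflow(tasks, elements, batch=100):
    buffer = []
    for element in elements:
        buffer.append(element)
        if len(buffer) == batch:
            yield from process(tasks, buffer)
            buffer = []
    if buffer:
        # Flush the final, possibly partial batch
        yield from process(tasks, buffer)

def process(tasks, batch):
    # Pass the whole batch through each task in order
    for task in tasks:
        batch = [task(element) for element in batch]
    yield from batch

tasks = [str.strip, str.upper]
results = run_workflow(tasks, ["  hello ", " world  ", " txtai "], batch=2)
print(list(results))
# → ['HELLO', 'WORLD', 'TXTAI']
```

Because `run_workflow` is a generator, nothing executes until the caller iterates it, which matches the lazy, memory-efficient behavior described above.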
Step 5: Schedule Recurring Execution
Optionally schedule the workflow for recurring execution using a cron expression. The scheduler runs the workflow at specified intervals, passing the same input elements each time. This enables automated, periodic data processing.
Scheduling features:
- Standard cron expression syntax
- Optional iteration limit
- Error logging without stopping the schedule
- Runs using local timezone
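A scheduling loop with these properties can be sketched in pure Python: run the workflow on the same inputs at each tick, log errors without stopping, and honor an optional iteration limit. txtai's scheduler parses real cron expressions; in this illustrative sketch a fixed sleep interval stands in for the cron schedule:

```python
# Sketch of recurring execution: re-run a workflow on the same inputs,
# log failures without breaking the schedule, stop after `iterations` runs.
# A fixed interval stands in for cron parsing (illustrative only).

import logging
import time

def schedule(workflow, elements, interval=0.0, iterations=None):
    runs = 0
    while iterations is None or runs < iterations:
        try:
            # Drain the generator so every task actually executes
            list(workflow(elements))
        except Exception:
            # Errors are logged, but the schedule keeps running
            logging.exception("workflow run failed")
        runs += 1
        time.sleep(interval)
    return runs

calls = []
def workflow(elements):
    calls.append(list(elements))  # record each run for demonstration
    yield from elements

print(schedule(workflow, ["doc1", "doc2"], iterations=3))
# → 3
```

Omitting `iterations` makes the loop run indefinitely, matching the default behavior of an unbounded recurring schedule.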