Implementation: Neuml Txtai Pipeline Constructors
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Workflow |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tooling, provided by the txtai library, for creating reusable NLP processing pipeline instances. This page covers the constructors for the four primary pipeline types used in workflow chaining: Textractor, Summary, Translation, and LLM.
Description
Each pipeline constructor initializes the underlying model, tokenizer, and supporting infrastructure needed for its specific NLP task. Constructors accept configuration parameters that control model selection, hardware placement (GPU vs. CPU), quantization for reduced memory usage, and operation-specific options. Once constructed, each pipeline instance is a callable object that transforms input data through its designated operation.
- Textractor -- Sets up a FileToHTML backend and an HTMLToMarkdown renderer. The `sections` flag controls whether output preserves document section structure. The `backend` parameter selects the file parsing engine (defaults to the best available).
- Summary -- Delegates to the HFPipeline base class, requesting a `"summarization"` pipeline from Hugging Face Transformers. Optionally loads a custom model path.
- Translation -- Loads the M2M100 multilingual model by default (or a custom model), initializes language detection, and prepares a model cache for source-target translation pairs.
- LLM -- Uses a GenerationFactory to instantiate the appropriate backend (Transformers, llama.cpp, LiteLLM, or custom) based on the provided `path` and `method` parameters. A sketch pulling these behaviors together follows this list.
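The sketch below constructs each pipeline with an explicit configuration. The model ids and the `docling` backend choice are illustrative assumptions, not library defaults, and assume the corresponding optional dependencies are installed.

```python
from txtai.pipeline import LLM, Summary, Textractor, Translation

# Explicit parsing backend (assumption: the docling extra is installed)
textractor = Textractor(sections=True, backend="docling")

# Custom summarization model (illustrative hub id, not the default)
summary = Summary(path="sshleifer/distilbart-cnn-12-6")

# Same model the constructor would load by default
translate = Translation(path="facebook/m2m100_418M")

# Backend selected explicitly via method (placeholder model file)
llm = LLM("/models/model.gguf", method="llama.cpp")
```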
Usage
Use these constructors when you need to instantiate a pipeline for text extraction, summarization, translation, or language generation. Each constructed instance can be called directly or passed as an action to a Task for inclusion in a Workflow.
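Pipelines can also be chained by hand before committing to a Workflow. A minimal sketch, with a placeholder file path:

```python
from txtai.pipeline import Summary, Textractor

textractor = Textractor(paragraphs=True)
summary = Summary()

# Extract paragraphs from a document, then summarize the combined text
paragraphs = textractor("report.pdf")  # placeholder path
print(summary(" ".join(paragraphs)))
```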
Code Reference
Source Location
- Repository: txtai
- File (Textractor): `src/python/txtai/pipeline/data/textractor.py` (lines 23-49)
- File (Summary): `src/python/txtai/pipeline/text/summary.py` (lines 15-16)
- File (Translation): `src/python/txtai/pipeline/text/translation.py` (lines 28-53)
- File (LLM): `src/python/txtai/pipeline/llm/llm.py` (lines 25-39)
Signature
Textractor:
```python
class Textractor(Segmentation):
    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):
```
Summary:
```python
class Summary(HFPipeline):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs):
```
Translation:
```python
class Translation(HFModel):
    def __init__(self, path=None, quantize=False, gpu=True, batch=64,
                 langdetect=None, findmodels=True):
```
LLM:
```python
class LLM(Pipeline):
    def __init__(self, path=None, method=None, **kwargs):
```
Import
```python
from txtai.pipeline import Textractor, Summary, Translation, LLM
```
I/O Contract
Inputs
Textractor Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| sentences | bool | No | Segment output by sentences |
| lines | bool | No | Segment output by lines |
| paragraphs | bool | No | Segment output by paragraphs |
| minlength | int | No | Minimum segment length to include |
| join | bool | No | Join segmented text into a single string |
| sections | bool | No | Preserve document section structure in output |
| cleantext | bool | No | Apply text cleaning rules (default True) |
| chunker | callable | No | Custom chunking function |
| headers | dict | No | HTTP headers for remote URL retrieval |
| backend | str | No | File parsing backend, defaults to "available" (best available) |
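A hedged sketch of how the segmentation parameters combine; the file path is a placeholder. With `paragraphs=True` the call returns a list of segments, and `join=True` collapses them back into one string.

```python
from txtai.pipeline import Textractor

# Paragraph segmentation, dropping fragments shorter than 50 characters
textractor = Textractor(paragraphs=True, minlength=50)
paragraphs = textractor("whitepaper.pdf")  # placeholder path; returns a list

# join=True produces one combined string instead of a list
text = Textractor(paragraphs=True, join=True)("whitepaper.pdf")
```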
Summary Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Hugging Face model hub id or local model path |
| quantize | bool | No | Enable model quantization (default False) |
| gpu | bool | No | Enable GPU acceleration (default True) |
| model | object | No | Pre-loaded model instance |
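For instance, a CPU-only Summary instance with an explicit model; the hub id is an illustrative choice, not the pipeline default:

```python
from txtai.pipeline import Summary

# Force CPU execution and load a specific summarization model
summary = Summary(path="sshleifer/distilbart-cnn-12-6", gpu=False)
print(summary("Long article text goes here...", maxlength=100))
```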
Translation Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path, defaults to "facebook/m2m100_418M" |
| quantize | bool | No | Enable model quantization (default False) |
| gpu | bool | No | Enable GPU acceleration (default True) |
| batch | int | No | Batch size for incremental processing (default 64) |
| langdetect | callable or str | No | Custom language detection function or model path |
| findmodels | bool | No | Search Hugging Face Hub for source-target models (default True) |
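A sketch of batch translation over a list input; the source language is detected automatically unless a custom langdetect is supplied:

```python
from txtai.pipeline import Translation

# Smaller batch size to limit peak memory on constrained hardware
translate = Translation(batch=32)
print(translate(["Bonjour le monde", "Hallo Welt"], target="en"))
```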
LLM Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path, defaults to "ibm-granite/granite-4.0-350m" |
| method | str | No | LLM framework (inferred from path if not provided) |
| **kwargs | dict | No | Additional model keyword arguments passed to GenerationFactory |
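The method parameter can typically be omitted. A hedged sketch, assuming a local GGUF file (which the library is expected to route to the llama.cpp backend) alongside a standard Transformers hub id; the local path is a placeholder:

```python
from txtai.pipeline import LLM

# Explicit framework selection for a local GGUF model
# (assumes llama-cpp-python is installed; path is a placeholder)
llm = LLM("/models/granite.gguf", method="llama.cpp")

# Backend inferred from the path for a Transformers model
llm = LLM("microsoft/Phi-3-mini-4k-instruct")
```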
Outputs
| Name | Type | Description |
|---|---|---|
| pipeline instance | Textractor / Summary / Translation / LLM | A callable object. When called with text input, returns transformed text output. |
Usage Examples
Basic Example
```python
from txtai.pipeline import Textractor, Summary, Translation, LLM

# Create a text extraction pipeline with section awareness
textractor = Textractor(sections=True)
content = textractor("path/to/document.pdf")

# Create a summarization pipeline
summary = Summary()
result = summary("Long article text goes here...", maxlength=150)

# Create a translation pipeline
translate = Translation()
translated = translate("Hola, como estas?", target="en")

# Create an LLM generation pipeline
llm = LLM("microsoft/Phi-3-mini-4k-instruct")
response = llm("Explain pipeline chaining in txtai.")
```
Pipelines as Workflow Actions
```python
from txtai.pipeline import Textractor, Summary
from txtai.workflow import Task, Workflow

# Create pipelines
textractor = Textractor(sections=True)
summary = Summary()

# Wrap pipelines in tasks
extract_task = Task(action=textractor)
summarize_task = Task(action=summary)

# Compose into a workflow
workflow = Workflow([extract_task, summarize_task])

# Execute: extract text from files then summarize
results = list(workflow(["report.pdf", "article.html"]))
```