Implementation: Neuml Txtai Pipeline Constructors
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Workflow |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tooling, provided by the txtai library, for creating reusable NLP processing pipeline instances. This page covers the constructors for the four primary pipeline types used in workflow chaining: Textractor, Summary, Translation, and LLM.
Description
Each pipeline constructor initializes the underlying model, tokenizer, and supporting infrastructure needed for its specific NLP task. Constructors accept configuration parameters that control model selection, hardware placement (GPU vs. CPU), quantization for reduced memory usage, and operation-specific options. Once constructed, each pipeline instance is a callable object that transforms input data through its designated operation.
- Textractor -- Sets up a FileToHTML backend and an HTMLToMarkdown renderer. The `sections` flag controls whether output preserves document section structure. The `backend` parameter selects the file parsing engine (defaults to the best available).
- Summary -- Delegates to the HFPipeline base class, requesting a `"summarization"` pipeline from Hugging Face Transformers. Optionally loads a custom model path.
- Translation -- Loads the M2M100 multilingual model by default (or a custom model), initializes language detection, and prepares a model cache for source-target translation pairs.
- LLM -- Uses a GenerationFactory to instantiate the appropriate backend (Transformers, llama.cpp, LiteLLM, or custom) based on the provided `path` and `method` parameters. A sketch pulling these behaviors together follows this list.
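The sketch below constructs each pipeline with an explicit configuration. The model ids and the `docling` backend choice are illustrative assumptions, not library defaults, and assume the corresponding optional dependencies are installed.

```python
from txtai.pipeline import LLM, Summary, Textractor, Translation

# Explicit parsing backend (assumption: the docling extra is installed)
textractor = Textractor(sections=True, backend="docling")

# Custom summarization model (illustrative hub id, not the default)
summary = Summary(path="sshleifer/distilbart-cnn-12-6")

# Same model the constructor would load by default
translate = Translation(path="facebook/m2m100_418M")

# Backend selected explicitly via method (placeholder model file)
llm = LLM("/models/model.gguf", method="llama.cpp")
```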
Usage
Use these constructors when you need to instantiate a pipeline for text extraction, summarization, translation, or language generation. Each constructed instance can be called directly or passed as an action to a Task for inclusion in a Workflow.
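Pipelines can also be chained by hand before committing to a Workflow. A minimal sketch, with a placeholder file path:

```python
from txtai.pipeline import Summary, Textractor

textractor = Textractor(paragraphs=True)
summary = Summary()

# Extract paragraphs from a document, then summarize the combined text
paragraphs = textractor("report.pdf")  # placeholder path
print(summary(" ".join(paragraphs)))
```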
Code Reference
Source Location
- Repository: txtai
- File (Textractor): `src/python/txtai/pipeline/data/textractor.py` (lines 23-49)
- File (Summary): `src/python/txtai/pipeline/text/summary.py` (lines 15-16)
- File (Translation): `src/python/txtai/pipeline/text/translation.py` (lines 28-53)
- File (LLM): `src/python/txtai/pipeline/llm/llm.py` (lines 25-39)
Signature
Textractor:
```python
class Textractor(Segmentation):
    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):
```
Summary:
```python
class Summary(HFPipeline):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs):
```
Translation:
```python
class Translation(HFModel):
    def __init__(self, path=None, quantize=False, gpu=True, batch=64,
                 langdetect=None, findmodels=True):
```
LLM:
```python
class LLM(Pipeline):
    def __init__(self, path=None, method=None, **kwargs):
```
Import
```python
from txtai.pipeline import Textractor, Summary, Translation, LLM
```
I/O Contract
Inputs
Textractor Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| sentences | bool | No | Segment output by sentences |
| lines | bool | No | Segment output by lines |
| paragraphs | bool | No | Segment output by paragraphs |
| minlength | int | No | Minimum segment length to include |
| join | bool | No | Join segmented text into a single string |
| sections | bool | No | Preserve document section structure in output |
| cleantext | bool | No | Apply text cleaning rules (default True) |
| chunker | callable | No | Custom chunking function |
| headers | dict | No | HTTP headers for remote URL retrieval |
| backend | str | No | File parsing backend, defaults to "available" (best available) |
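A hedged sketch of how the segmentation parameters combine; the file path is a placeholder. With `paragraphs=True` the call returns a list of segments, and `join=True` collapses them back into one string.

```python
from txtai.pipeline import Textractor

# Paragraph segmentation, dropping fragments shorter than 50 characters
textractor = Textractor(paragraphs=True, minlength=50)
paragraphs = textractor("whitepaper.pdf")  # placeholder path; returns a list

# join=True produces one combined string instead of a list
text = Textractor(paragraphs=True, join=True)("whitepaper.pdf")
```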
Summary Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Hugging Face model hub id or local model path |
| quantize | bool | No | Enable model quantization (default False) |
| gpu | bool | No | Enable GPU acceleration (default True) |
| model | object | No | Pre-loaded model instance |
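For instance, a CPU-only Summary instance with an explicit model; the hub id is an illustrative choice, not the pipeline default:

```python
from txtai.pipeline import Summary

# Force CPU execution and load a specific summarization model
summary = Summary(path="sshleifer/distilbart-cnn-12-6", gpu=False)
print(summary("Long article text goes here...", maxlength=100))
```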
Translation Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path, defaults to "facebook/m2m100_418M" |
| quantize | bool | No | Enable model quantization (default False) |
| gpu | bool | No | Enable GPU acceleration (default True) |
| batch | int | No | Batch size for incremental processing (default 64) |
| langdetect | callable or str | No | Custom language detection function or model path |
| findmodels | bool | No | Search Hugging Face Hub for source-target models (default True) |
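A sketch of batch translation over a list input; the source language is detected automatically unless a custom langdetect is supplied:

```python
from txtai.pipeline import Translation

# Smaller batch size to limit peak memory on constrained hardware
translate = Translation(batch=32)
print(translate(["Bonjour le monde", "Hallo Welt"], target="en"))
```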
LLM Constructor:
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path, defaults to "ibm-granite/granite-4.0-350m" |
| method | str | No | LLM framework (inferred from path if not provided) |
| **kwargs | dict | No | Additional model keyword arguments passed to GenerationFactory |
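The method parameter can typically be omitted. A hedged sketch, assuming a local GGUF file (which the library is expected to route to the llama.cpp backend) alongside a standard Transformers hub id; the local path is a placeholder:

```python
from txtai.pipeline import LLM

# Explicit framework selection for a local GGUF model
# (assumes llama-cpp-python is installed; path is a placeholder)
llm = LLM("/models/granite.gguf", method="llama.cpp")

# Backend inferred from the path for a Transformers model
llm = LLM("microsoft/Phi-3-mini-4k-instruct")
```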
Outputs
| Name | Type | Description |
|---|---|---|
| pipeline instance | Textractor / Summary / Translation / LLM | A callable object. When called with text input, returns transformed text output. |
Usage Examples
Basic Example
```python
from txtai.pipeline import Textractor, Summary, Translation, LLM

# Create a text extraction pipeline with section awareness
textractor = Textractor(sections=True)
content = textractor("path/to/document.pdf")

# Create a summarization pipeline
summary = Summary()
result = summary("Long article text goes here...", maxlength=150)

# Create a translation pipeline
translate = Translation()
translated = translate("Hola, como estas?", target="en")

# Create an LLM generation pipeline
llm = LLM("microsoft/Phi-3-mini-4k-instruct")
response = llm("Explain pipeline chaining in txtai.")
```
Pipelines as Workflow Actions
```python
from txtai.pipeline import Textractor, Summary
from txtai.workflow import Task, Workflow

# Create pipelines
textractor = Textractor(sections=True)
summary = Summary()

# Wrap pipelines in tasks
extract_task = Task(action=textractor)
summarize_task = Task(action=summary)

# Compose into a workflow
workflow = Workflow([extract_task, summarize_task])

# Execute: extract text from files then summarize
results = list(workflow(["report.pdf", "article.html"]))
```