
Implementation:Neuml Txtai Pipeline Constructors

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Workflow
Last Updated 2026-02-09 00:00 GMT

Overview

The txtai library provides constructors for creating reusable NLP processing pipeline instances. This page covers the constructors for the four primary pipeline types used in workflow chaining: Textractor, Summary, Translation, and LLM.

Description

Each pipeline constructor initializes the underlying model, tokenizer, and supporting infrastructure needed for its specific NLP task. Constructors accept configuration parameters that control model selection, hardware placement (GPU vs. CPU), quantization for reduced memory usage, and operation-specific options. Once constructed, each pipeline instance is a callable object that transforms input data through its designated operation.
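The shared configure-once, call-many pattern described above can be sketched with a toy stand-in. This is not txtai code; `ToyPipeline` and its defaults are invented here purely to illustrate the convention the real constructors follow:

```python
# Illustrative stand-in (NOT txtai code): shows the shared constructor
# pattern the pipelines follow -- configure once, then call the instance.
class ToyPipeline:
    """Hypothetical pipeline mirroring the txtai constructor conventions."""

    def __init__(self, path=None, quantize=False, gpu=True, **kwargs):
        # Model selection: fall back to a default when no path is given
        self.path = path if path else "default-model"
        # Hardware placement and memory options are recorded at build time
        self.device = "cuda" if gpu else "cpu"
        self.quantize = quantize
        self.options = kwargs

    def __call__(self, text):
        # Real pipelines run model inference here; this stub just tags input
        return f"[{self.path}@{self.device}] {text}"

pipeline = ToyPipeline(gpu=False)
print(pipeline("hello"))   # [default-model@cpu] hello
```

The real constructors do substantially more work (loading models and tokenizers), but the interface contract is the same: all configuration happens at construction time, and the instance itself is the operation.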

  • Textractor -- Sets up a FileToHTML backend and an HTMLToMarkdown renderer. The sections flag controls whether output preserves document section structure. The backend parameter selects the file parsing engine (defaults to the best available).
  • Summary -- Delegates to the HFPipeline base class, requesting a "summarization" pipeline from Hugging Face Transformers. Optionally loads a custom model path.
  • Translation -- Loads the M2M100 multilingual model by default (or a custom model), initializes language detection, and prepares a model cache for source-target translation pairs.
  • LLM -- Uses a GenerationFactory to instantiate the appropriate backend (Transformers, llama.cpp, LiteLLM, or custom) based on the provided path and method parameters.

Usage

Use these constructors when you need to instantiate a pipeline for text extraction, summarization, translation, or language generation. Each constructed instance can be called directly or passed as an action to a Task for inclusion in a Workflow.
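Why any constructed pipeline can serve as a workflow action follows from the callable contract alone. The `Task` and `Workflow` classes below are simplified sketches, not txtai's actual implementations, assuming only that pipelines are plain callables:

```python
# Simplified sketches of Task and Workflow (not txtai's real classes)
# to show why any callable pipeline can serve as a workflow action.
class Task:
    def __init__(self, action):
        self.action = action            # any callable: pipeline, lambda, function

    def __call__(self, elements):
        return [self.action(x) for x in elements]

class Workflow:
    def __init__(self, tasks):
        self.tasks = tasks

    def __call__(self, elements):
        # Feed the output of each task into the next task
        for task in self.tasks:
            elements = task(elements)
        return elements

# Two toy "pipelines" standing in for Textractor and Summary
extract = lambda doc: doc.upper()
summarize = lambda text: text[:5]

workflow = Workflow([Task(extract), Task(summarize)])
print(workflow(["hello world"]))   # ['HELLO']
```

Because the chaining mechanism only requires `__call__`, custom functions and lambdas compose with the built-in pipelines interchangeably.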

Code Reference

Source Location

  • Repository: txtai
  • File (Textractor): src/python/txtai/pipeline/data/textractor.py (lines 23-49)
  • File (Summary): src/python/txtai/pipeline/text/summary.py (lines 15-16)
  • File (Translation): src/python/txtai/pipeline/text/translation.py (lines 28-53)
  • File (LLM): src/python/txtai/pipeline/llm/llm.py (lines 25-39)

Signature

Textractor:

class Textractor(Segmentation):
    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):

Summary:

class Summary(HFPipeline):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs):

Translation:

class Translation(HFModel):
    def __init__(self, path=None, quantize=False, gpu=True, batch=64,
                 langdetect=None, findmodels=True):

LLM:

class LLM(Pipeline):
    def __init__(self, path=None, method=None, **kwargs):

Import

from txtai.pipeline import Textractor, Summary, Translation
from txtai.pipeline import LLM

I/O Contract

Inputs

Textractor Constructor:

Name Type Required Description
sentences bool No Segment output by sentences
lines bool No Segment output by lines
paragraphs bool No Segment output by paragraphs
minlength int No Minimum segment length to include
join bool No Join segmented text into a single string
sections bool No Preserve document section structure in output
cleantext bool No Apply text cleaning rules (default True)
chunker callable No Custom chunking function
headers dict No HTTP headers for remote URL retrieval
backend str No File parsing backend, defaults to "available" (best available)

Summary Constructor:

Name Type Required Description
path str No Hugging Face model hub id or local model path
quantize bool No Enable model quantization (default False)
gpu bool No Enable GPU acceleration (default True)
model object No Pre-loaded model instance

Translation Constructor:

Name Type Required Description
path str No Model path, defaults to "facebook/m2m100_418M"
quantize bool No Enable model quantization (default False)
gpu bool No Enable GPU acceleration (default True)
batch int No Batch size for incremental processing (default 64)
langdetect callable or str No Custom language detection function or model path
findmodels bool No Search Hugging Face Hub for source-target models (default True)
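The Description section notes that Translation prepares a model cache for source-target pairs. The idea can be sketched as a keyed cache; the class and names below are invented for illustration and are not txtai's internals:

```python
# Hypothetical sketch of the source-target model cache idea from the
# Description section; names and structure are invented, not txtai's.
class TranslationCache:
    def __init__(self):
        self.models = {}            # (source, target) -> loaded model
        self.loads = 0

    def lookup(self, source, target):
        key = (source, target)
        if key not in self.models:
            self.loads += 1         # simulate an expensive model load
            self.models[key] = f"model:{source}->{target}"
        return self.models[key]

cache = TranslationCache()
cache.lookup("es", "en")
cache.lookup("es", "en")            # cache hit, no second load
print(cache.loads)                  # 1
```

Caching per language pair matters because translating a mixed-language batch would otherwise reload models repeatedly for recurring source-target combinations.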

LLM Constructor:

Name Type Required Description
path str No Model path, defaults to "ibm-granite/granite-4.0-350m"
method str No LLM framework (inferred from path if not provided)
**kwargs dict No Additional model keyword arguments passed to GenerationFactory
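The table above says `method` is inferred from `path` when not provided. That inference step can be approximated as follows; the dispatch rules here are illustrative assumptions, not GenerationFactory's exact logic:

```python
# Illustrative approximation of LLM backend inference from path;
# the real rules live in txtai's GenerationFactory.
def infer_method(path, method=None):
    if method:                      # explicit method always wins
        return method
    if path.endswith(".gguf"):      # local GGUF file -> llama.cpp
        return "llama.cpp"
    if "/" in path:                 # looks like a Hugging Face hub id
        return "transformers"
    return "litellm"                # provider-style string

print(infer_method("ibm-granite/granite-4.0-350m"))  # transformers
print(infer_method("model.gguf"))                    # llama.cpp
```

Passing `method` explicitly is the reliable route whenever a path is ambiguous between frameworks.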

Outputs

Name Type Description
pipeline instance Textractor / Summary / Translation / LLM A callable object. When called with text input, returns transformed text output.

Usage Examples

Basic Example

from txtai.pipeline import Textractor, Summary, Translation, LLM

# Create a text extraction pipeline with section awareness
textractor = Textractor(sections=True)
content = textractor("path/to/document.pdf")

# Create a summarization pipeline
summary = Summary()
result = summary("Long article text goes here...", maxlength=150)

# Create a translation pipeline
translate = Translation()
translated = translate("¿Hola, cómo estás?", target="en")

# Create an LLM generation pipeline
llm = LLM("microsoft/Phi-3-mini-4k-instruct")
response = llm("Explain pipeline chaining in txtai.")

Pipelines as Workflow Actions

from txtai.pipeline import Textractor, Summary
from txtai.workflow import Task, Workflow

# Create pipelines
textractor = Textractor(sections=True)
summary = Summary()

# Wrap pipelines in tasks
extract_task = Task(action=textractor)
summarize_task = Task(action=summary)

# Compose into a workflow
workflow = Workflow([extract_task, summarize_task])

# Execute: extract text from files then summarize
results = list(workflow(["report.pdf", "article.html"]))

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
