Implementation:Neuml Txtai Textractor Call

Knowledge Sources	txtai, txtai Documentation
Domains	NLP, RAG
Last Updated	2026-02-10 00:00 GMT

Overview

Concrete tool, provided by the txtai library, for extracting text from files, URLs, and raw input and segmenting it into indexable chunks.

Description

The Textractor class extends Segmentation to provide a unified pipeline for converting heterogeneous document sources into clean, structured text. It accepts local file paths, remote URLs, and raw HTML/text strings as input. The extraction process converts content to HTML via a configurable backend (e.g., Apache Tika), then transforms the HTML into Markdown. The inherited segmentation logic then splits the result into sentences, lines, paragraphs, sections, or custom chunks based on the constructor parameters.

The __init__ method configures the extraction backend, HTML-to-Markdown converter, and HTTP headers for remote fetches. The __call__ method (inherited from Segmentation) orchestrates the full pipeline: for each input, it calls the text method (overridden by Textractor) to perform extraction, then applies parsing, cleaning, and optional filtering.
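The division of labor described above (the base class owns `__call__`, the subclass overrides `text`) can be illustrated with a minimal standard-library sketch. This is not txtai's source: `SegmentationSketch`, `TextractorSketch`, and `_TagStripper` are hypothetical names, and tag stripping is only a crude stand-in for the real HTML-to-Markdown conversion.

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects text nodes; a crude stand-in for HTML-to-Markdown conversion."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Treat the end of a block element as a paragraph break
        if tag in ("p", "div", "h1", "h2", "h3"):
            self.parts.append("\n\n")


class SegmentationSketch:
    """Hypothetical base class: owns __call__, parsing, cleaning, filtering."""

    def __init__(self, paragraphs=False, minlength=None, join=False):
        self.paragraphs, self.minlength, self.join = paragraphs, minlength, join

    def text(self, text):
        # Hook for subclasses; the base class passes input through unchanged
        return text

    def clean(self, text):
        # Collapse runs of whitespace into single spaces
        return re.sub(r"\s+", " ", text).strip()

    def parse(self, text):
        if not self.paragraphs:
            return self.clean(text)

        # Split on blank lines, clean each chunk, drop empty or short chunks
        chunks = [self.clean(x) for x in re.split(r"\n{2,}", text)]
        chunks = [x for x in chunks if x and (self.minlength is None or len(x) >= self.minlength)]
        return " ".join(chunks) if self.join else chunks

    def __call__(self, text):
        # A list input yields one result per element
        if isinstance(text, list):
            return [self.parse(self.text(x)) for x in text]
        return self.parse(self.text(text))


class TextractorSketch(SegmentationSketch):
    """Hypothetical subclass: overrides only the extraction step."""

    def text(self, text):
        stripper = _TagStripper()
        stripper.feed(text)
        stripper.close()
        return "".join(stripper.parts)
```

Calling `TextractorSketch(paragraphs=True)` on an HTML string returns a list of cleaned paragraphs, while passing a list of HTML strings returns a list of such lists, which mirrors the output shapes documented for the real class.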

Usage

Import and use Textractor when building the document ingestion stage of a RAG pipeline. It serves as the first step before embedding and indexing, converting raw documents into text segments suitable for an embeddings index.

Code Reference

Source Location

Repository: txtai
File: src/python/txtai/pipeline/data/textractor.py
Lines: 17-139

Signature

class Textractor(Segmentation):

    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):

    # Inherited from Segmentation
    def __call__(self, text):

Import

from txtai.pipeline import Textractor

I/O Contract

Inputs

__init__ Parameters

Name	Type	Required	Description
sentences	bool	No	Tokenize text into sentences if True, defaults to False
lines	bool	No	Tokenize text into lines if True, defaults to False
paragraphs	bool	No	Tokenize text into paragraphs if True, defaults to False
minlength	int	No	Require at least minlength characters per text element, defaults to None
join	bool	No	Join tokenized sections back together if True, defaults to False
sections	bool	No	Tokenize text into sections (splits on section or page breaks) if True, defaults to False
cleantext	bool	No	Apply text cleaning rules, defaults to True
chunker	str	No	Name of a third-party chunker to tokenize text, defaults to None
headers	dict	No	HTTP headers for remote URL requests, defaults to None (no additional headers)
backend	str	No	File-to-HTML conversion backend, defaults to "available"
**kwargs	dict	No	Additional keyword arguments passed to Segmentation and chunker

__call__ Parameters

Name	Type	Required	Description
text	str or list	Yes	A file path, URL, raw text/HTML string, or a list of these inputs
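Because a single string argument may be a URL, a local path, or raw content, the pipeline has to decide which one it received. The sketch below shows one plausible way to make that decision with the standard library. It is an assumption for illustration, not txtai's actual logic, and `classify_input` is a hypothetical helper.

```python
import os
from urllib.parse import urlparse


def classify_input(text):
    """Return "url", "path", or "text" for a single input string."""
    # Only http(s) schemes count as remote; "Note: ..." must not look like a URL
    if urlparse(text).scheme in ("http", "https"):
        return "url"

    # os.path.exists returns False (rather than raising) for invalid paths
    if os.path.exists(text):
        return "path"

    # Anything else is treated as raw text or HTML
    return "text"
```

Under this scheme a path to a file that does not exist falls through to the raw-text branch, so checking paths before calling the pipeline avoids indexing a file name as content.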

Outputs

Name	Type	Description
result	str, list, or list of lists	Segmented text. For a string input: a string if no tokenization is enabled or if join is True, otherwise a list of strings. For a list input: a list with one such result per element.

Usage Examples

Basic Example

from txtai.pipeline import Textractor

# Extract text from a file, split into paragraphs
textractor = Textractor(paragraphs=True)
paragraphs = textractor("path/to/document.pdf")

for paragraph in paragraphs:
    print(paragraph)

Sentence-Level Extraction from URL

from txtai.pipeline import Textractor

# Extract and split into sentences with minimum length filtering
textractor = Textractor(sentences=True, minlength=20)
sentences = textractor("https://example.com/article.html")

for sentence in sentences:
    print(sentence)

Batch Processing with Join

from txtai.pipeline import Textractor

# Extract from multiple files, join segments into single strings
textractor = Textractor(paragraphs=True, join=True)
texts = textractor(["doc1.pdf", "doc2.html", "doc3.txt"])

# Each element in texts is a single joined string
for text in texts:
    print(text[:200])

RAG Pipeline Integration

from txtai.pipeline import Textractor
from txtai import Embeddings

# Step 1: Extract and segment documents
textractor = Textractor(paragraphs=True, minlength=50)
chunks = textractor("corpus.pdf")

# Step 2: Index chunks for RAG
embeddings = Embeddings({"content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(chunks)])
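When several documents are extracted in one batch, the result is a list of chunk lists, and plain integer ids no longer say which document a chunk came from. The following sketch flattens such a result into `(id, text, tags)` tuples of the kind passed to `embeddings.index` above. The `source#position` id format and the `to_index_rows` helper are assumptions made for illustration, and `chunk_lists` stands in for the output of a batch call with `paragraphs=True`.

```python
def to_index_rows(sources, chunk_lists):
    """Flatten per-document chunk lists into (id, text, tags) tuples."""
    rows = []
    for source, chunks in zip(sources, chunk_lists):
        for position, chunk in enumerate(chunks):
            # Hypothetical id scheme: source name plus chunk position
            rows.append((f"{source}#{position}", chunk, None))
    return rows


# Stand-in data shaped like the result of textractor(["doc1.pdf", "doc2.html"])
sources = ["doc1.pdf", "doc2.html"]
chunk_lists = [["Intro paragraph.", "Second paragraph."], ["Only paragraph."]]
rows = to_index_rows(sources, chunk_lists)
```

The resulting `rows` list could then be passed to `embeddings.index(rows)`, so that search results can be traced back to a document and a position within it.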

Related Pages

Implements Principle

Principle:Neuml_Txtai_Text_Extraction
