Implementation:Neuml Txtai Textractor

Knowledge Sources	txtai, txtai Documentation
Domains	NLP, RAG
Last Updated	2026-02-09 00:00 GMT

Overview

Textractor is a concrete tool, provided by the txtai library, for extracting text from diverse file formats and segmenting it into chunks.

Description

The Textractor class is a txtai pipeline that converts local files (such as PDF, DOCX, and HTML), URLs, and raw text into clean, segmented Markdown text. It extends the Segmentation base class and adds file-format handling through a three-stage internal pipeline: FileToHTML converts binary documents to HTML, HTMLToMarkdown normalizes the HTML to Markdown, and the inherited Segmentation logic splits the Markdown into chunks.
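
The three-stage flow can be illustrated with a minimal standard-library sketch. This is not txtai's code: `file_to_html`, `html_to_markdown`, `segment`, and `SimpleMarkdown` are hypothetical stand-ins, and the toy converter handles only headings and paragraphs.

```python
from html.parser import HTMLParser


class SimpleMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical): headings and paragraphs only."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished Markdown blocks
        self.prefix = None  # Markdown prefix of the open block, None outside a block
        self.buffer = []    # text fragments of the open block

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
            self.buffer = []
        elif tag == "p":
            self.prefix = ""
            self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p") and self.prefix is not None:
            # Collapse runs of whitespace inside the block
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None


def file_to_html(content):
    """Stage 1 stand-in: assume the document has already been converted to HTML."""
    return content


def html_to_markdown(html):
    """Stage 2 stand-in: normalize HTML into Markdown blocks separated by blank lines."""
    parser = SimpleMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


def segment(markdown):
    """Stage 3 stand-in: split Markdown into paragraph-level chunks."""
    return [block for block in markdown.split("\n\n") if block]


def extract(content):
    """Run the three stages in order."""
    return segment(html_to_markdown(file_to_html(content)))


chunks = extract("<h1>Report</h1><p>First  paragraph.</p><p>Second paragraph.</p>")
```

Keeping each stage as a separate function mirrors the description above: the segmentation step only ever sees Markdown, regardless of the input format.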

Textractor auto-detects whether each input is a local file path, an HTTP/HTTPS URL, or raw text/HTML. For URLs, it downloads content to a temporary file before processing. For local files, it uses the configured backend (Apache Tika or the best available parser) to convert to HTML. For raw text or HTML strings, it passes the content directly to the Markdown converter.
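
The input detection described above can be approximated as follows. This is a sketch under assumptions, not the library's logic: `classify_input` is a hypothetical helper, and the rules (URL prefix check first, then a file-existence check) are one plausible ordering.

```python
import os
import tempfile


def classify_input(value):
    """Return "url", "file", or "text" for a single input string (hypothetical helper)."""
    if value.lower().startswith(("http://", "https://")):
        return "url"
    if os.path.exists(value):
        return "file"
    # Anything else is treated as raw text or HTML
    return "text"


# Demonstrate all three cases, using a temporary file as the local path
with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
    inputs = ("https://example.com/article.html", handle.name, "<p>Hello</p>")
    kinds = [classify_input(value) for value in inputs]
```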

Chunking is controlled by four mutually exclusive boolean flags: sentences, lines, paragraphs, and sections. Set at most one of them. When none is set and no chunker is given, the full extracted text is returned as a single string. The minlength parameter discards segments shorter than the specified character count. The chunker parameter integrates third-party chunking libraries such as Chonkie.
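
A rough sketch of flag-driven chunking with a minimum-length filter is shown below. The function `split_text` and its split patterns are assumptions for illustration (newline runs for lines and paragraphs, a form feed or three or more newlines for sections); sentence mode is left out because it needs a sentence tokenizer.

```python
import re


def split_text(text, lines=False, paragraphs=False, sections=False, minlength=None):
    """Split text according to the first flag that is set (hypothetical helper)."""
    if lines:
        parts = re.split(r"\n+", text)
    elif paragraphs:
        parts = re.split(r"\n{2,}", text)
    elif sections:
        # Assumption: a form feed marks a page break, otherwise 3+ newlines mark a section
        parts = re.split(r"\f" if "\f" in text else r"\n{3,}", text)
    else:
        # No chunking mode: return the whole text as one string
        return text.strip()

    # Trim whitespace, drop empty segments, then apply the minimum length
    parts = [part.strip() for part in parts]
    return [part for part in parts if part and (minlength is None or len(part) >= minlength)]


sample = "Title\n\nFirst paragraph with enough text.\n\nok\n\n\nSecond section starts here."
paragraph_chunks = split_text(sample, paragraphs=True, minlength=10)
section_chunks = split_text(sample, sections=True)
```

With `minlength=10`, the short segments "Title" and "ok" are dropped from the paragraph output, which is the filtering effect described above.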

Usage

Use Textractor when you need to:

Extract readable text from PDF, DOCX, XLSX, PPTX, or other binary document formats.
Scrape and clean content from web URLs.
Segment extracted text into sentences, lines, paragraphs, or sections for indexing.
Prepare text chunks as input for an embeddings index in a RAG pipeline.
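
For the last use case, chunks usually need stable identifiers before indexing. The sketch below builds `(id, text, tags)` tuples, a shape that txtai's embeddings index accepts; the helper name `to_records` and the `source#position` id scheme are assumptions chosen for illustration.

```python
def to_records(source, chunks):
    """Pair each chunk with an id built from its source and position (hypothetical scheme)."""
    return [(f"{source}#{index}", chunk, None) for index, chunk in enumerate(chunks)]


# Chunks as they might come back from a paragraph-level extraction
records = to_records("manual.pdf", ["Intro text", "Setup steps"])
```

Ids that encode the source document make it possible to trace a retrieved chunk back to the file it came from.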

Code Reference

Source Location

Repository: txtai
File: src/python/txtai/pipeline/data/textractor.py
Lines: L23-79

Signature

class Textractor(Segmentation):
    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):
        ...

    def __call__(self, text):
        ...

Import

from txtai.pipeline import Textractor

I/O Contract

Inputs

Name	Type	Required	Description
text	`str` or `list[str]`	Yes	File path(s), URL(s), or raw text/HTML string(s) to extract from
sentences	`bool`	No	Tokenize output into sentences using NLTK. Default: `False`
lines	`bool`	No	Split output on line breaks. Default: `False`
paragraphs	`bool`	No	Split output on double line breaks (paragraph boundaries). Default: `False`
sections	`bool`	No	Split output on section/page breaks. Default: `False`
minlength	`int` or `None`	No	Minimum character length for segments; shorter segments are discarded. Default: `None`
join	`bool`	No	Rejoin tokenized sections into a single string. Default: `False`
cleantext	`bool`	No	Apply text cleaning rules to extracted text. Default: `True`
chunker	`str` or `None`	No	Name of a chunker from a third-party chunking library such as Chonkie, used for custom tokenization. Default: `None`
headers	`dict` or `None`	No	HTTP headers to use when downloading remote URLs. Default: `None`
backend	`str`	No	Parser backend: `"tika"` for Apache Tika, `"available"` to auto-detect. Default: `"available"`

Outputs

Name	Type	Description
result	`str` or `list[str]`	Extracted Markdown text. Returns a single string if no chunking mode is set or if `join` is `True`; returns a `list[str]` of segments if a chunking mode (sentences, lines, paragraphs, sections, or chunker) is enabled. When `text` is a list, one result is returned per input item.
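
Because the return shape depends on the configuration, downstream code may want to normalize it. The helper below is a hypothetical convenience, not part of txtai; it assumes results are strings, lists of strings, or lists of such lists (as when several inputs are processed at once).

```python
def flatten(result):
    """Normalize str, list[str], or nested lists into a flat list of non-empty strings."""
    if isinstance(result, str):
        return [result] if result else []
    flat = []
    for item in result:
        flat.extend(flatten(item))
    return flat


single = flatten("whole document text")
nested = flatten([["a", "b"], "c", ["d", ""]])
```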

Usage Examples

Basic Example: Extract Full Text from a PDF

from txtai.pipeline import Textractor

# Create a Textractor with no chunking
textractor = Textractor()

# Extract text from a PDF file
text = textractor("/data/documents/report.pdf")
print(text)  # Full document text as Markdown string

Paragraph Chunking with Minimum Length

from txtai.pipeline import Textractor

# Create a Textractor that splits into paragraphs
textractor = Textractor(paragraphs=True, minlength=100)

# Extract and chunk a document
chunks = textractor("/data/documents/whitepaper.pdf")
# chunks is a list of paragraph strings, each at least 100 chars
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")

Extracting from URLs

from txtai.pipeline import Textractor

# Create a Textractor for sentence-level extraction
textractor = Textractor(sentences=True)

# Extract sentences from a web page
sentences = textractor("https://example.com/article.html")
print(f"Extracted {len(sentences)} sentences")

Section-Based Chunking for RAG

from txtai.pipeline import Textractor

# Section-level chunking, often a good fit for RAG pipelines
textractor = Textractor(sections=True, minlength=50)

# Process multiple documents
documents = [
    "/data/docs/manual.pdf",
    "/data/docs/guide.docx",
    "https://example.com/faq.html",
]

all_chunks = []
for doc in documents:
    chunks = textractor(doc)
    if isinstance(chunks, list):
        all_chunks.extend(chunks)
    else:
        all_chunks.append(chunks)

print(f"Total chunks: {len(all_chunks)}")

Using Custom HTTP Headers

from txtai.pipeline import Textractor

# Set custom headers for authenticated URL access
textractor = Textractor(
    paragraphs=True,
    headers={"Authorization": "Bearer YOUR_TOKEN"}
)

# With paragraphs=True the result is a list of paragraph strings
chunks = textractor("https://api.example.com/protected/document.html")
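
To show what downloading a URL to a temporary file with custom headers can look like, here is a standard-library sketch. It is not txtai's implementation: `build_request` and `download` are hypothetical helpers, and `download` is defined but not called because it needs network access.

```python
import tempfile
import urllib.request


def build_request(url, headers=None):
    """Create a request object that carries the given HTTP headers."""
    return urllib.request.Request(url, headers=headers or {})


def download(url, headers=None, suffix=""):
    """Download url to a temporary file and return the file path.

    The caller is responsible for deleting the file afterwards.
    """
    request = build_request(url, headers)
    with urllib.request.urlopen(request) as response:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as output:
            output.write(response.read())
            return output.name


# Build (but do not send) a request with an authorization header
request = build_request(
    "https://api.example.com/protected/document.html",
    {"Authorization": "Bearer YOUR_TOKEN"},
)
```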

Related Pages

Implements Principle

Principle:Neuml_Txtai_Text_Extraction
