Implementation:Neuml Txtai Textractor

From Leeroopedia


Knowledge Sources
Domains NLP, RAG
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool for extracting text from diverse file formats and segmenting it into chunks, provided by the txtai library.

Description

The Textractor class is a txtai pipeline that converts documents in various formats (PDF, DOCX, HTML, URLs, raw text) into clean, segmented Markdown text. It extends the Segmentation base class and adds file-format handling through a three-stage internal pipeline: FileToHTML converts binary documents to HTML, HTMLToMarkdown normalizes HTML to Markdown, and the inherited Segmentation logic splits the Markdown into chunks.

Textractor auto-detects whether each input is a local file path, an HTTP/HTTPS URL, or raw text/HTML. For URLs, it downloads content to a temporary file before processing. For local files, it uses the configured backend (Apache Tika or the best available parser) to convert to HTML. For raw text or HTML strings, it passes the content directly to the Markdown converter.
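The routing described above can be sketched in a few lines. This is a simplified illustration, not txtai's actual implementation: `classify_input` is a hypothetical helper, and the rule that anything which is neither an existing path nor an HTTP/HTTPS URL counts as raw text is an assumption drawn from the description.

```python
import os
from urllib.parse import urlparse


def classify_input(text):
    """Return "file", "url", or "text" for one input string (hypothetical helper)."""
    # An existing local path is treated as a file to parse with the backend
    if os.path.exists(text):
        return "file"

    # HTTP/HTTPS URLs would be downloaded to a temporary file first
    if urlparse(text).scheme in ("http", "https"):
        return "url"

    # Everything else is passed through as raw text or HTML
    return "text"
```

Checking for an existing path first means a local file always takes precedence, which is a design choice of this sketch only.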

The chunking behavior is controlled by four boolean flags, of which at most one should be set: sentences, lines, paragraphs, and sections. When none is set and no chunker is given, the full extracted text is returned as a single string. The minlength parameter discards segments shorter than the given character count. The chunker parameter enables integration with third-party chunking libraries such as Chonkie.
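To make the chunking modes concrete, here is a minimal stand-in written with the standard library only. It is not txtai's code: `segment` is a hypothetical function, the sentence splitter is a naive regular expression (the library uses NLTK), and the section delimiter (a form feed or three or more newlines) is an assumption.

```python
import re


def segment(text, sentences=False, lines=False, paragraphs=False, sections=False, minlength=None):
    """Simplified illustration of the chunking flags (hypothetical, not txtai's code)."""
    if sentences:
        # Naive split after ., ! or ? followed by whitespace
        chunks = re.split(r"(?<=[.!?])\s+", text)
    elif lines:
        chunks = text.split("\n")
    elif paragraphs:
        # Paragraphs are separated by blank lines
        chunks = re.split(r"\n{2,}", text)
    elif sections:
        # Assumption: sections end at a form feed or at 3+ newlines
        chunks = re.split(r"\f|\n{3,}", text)
    else:
        # No mode set: return the whole text as one string
        return text.strip()

    # Trim whitespace, then drop empty chunks and chunks below minlength
    chunks = [chunk.strip() for chunk in chunks]
    return [c for c in chunks if c and (minlength is None or len(c) >= minlength)]
```

In this sketch the first flag that is set wins, so passing several flags at once has no additional effect.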

Usage

Use Textractor when you need to:

  • Extract readable text from PDFs, DOCX, XLSX, PPTX, or other binary document formats.
  • Scrape and clean content from web URLs.
  • Segment extracted text into sentences, lines, paragraphs, or sections for indexing.
  • Prepare text chunks as input for an embeddings index in a RAG pipeline.

Code Reference

Source Location

  • Repository: txtai
  • File: src/python/txtai/pipeline/data/textractor.py
  • Lines: L23-79

Signature

class Textractor(Segmentation):
    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):
        ...

    def __call__(self, text):
        ...

Import

from txtai.pipeline import Textractor

I/O Contract

Inputs

Name | Type | Required | Description
text | str or list[str] | Yes | File path(s), URL(s), or raw text/HTML string(s) to extract from
sentences | bool | No | Tokenize output into sentences using NLTK. Default: False
lines | bool | No | Split output on line breaks. Default: False
paragraphs | bool | No | Split output on double line breaks (paragraph boundaries). Default: False
sections | bool | No | Split output on section/page breaks. Default: False
minlength | int or None | No | Minimum character length for segments; shorter segments are discarded. Default: None
join | bool | No | Rejoin tokenized sections into a single string. Default: False
cleantext | bool | No | Apply text cleaning rules to extracted text. Default: True
chunker | str or None | No | Name of a third-party chunker (e.g., Chonkie) for custom tokenization. Default: None
headers | dict or None | No | HTTP headers to use when downloading remote URLs. Default: None
backend | str | No | Parser backend: "tika" for Apache Tika, "available" to auto-detect. Default: "available"

Outputs

Name | Type | Description
result | str or list[str] | Extracted Markdown text. Returns a single string if no chunking mode is set; returns a list[str] of segments if a chunking mode (sentences, lines, paragraphs, sections, or chunker) is enabled.
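Because the result may be either a string or a list, downstream code often normalizes it before indexing. The helper below is a hypothetical sketch, not part of txtai. It also flattens nested lists, on the assumption that a call could return one list per input.

```python
def as_chunks(result):
    """Normalize a str-or-list result into a flat list of strings (hypothetical helper)."""
    if isinstance(result, str):
        # A single string becomes a one-element list
        return [result]

    chunks = []
    for item in result:
        # Nested lists are flattened recursively
        chunks.extend(as_chunks(item))
    return chunks
```

With such a helper, a caller can treat every result uniformly, regardless of which chunking mode was configured.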

Usage Examples

Basic Example: Extract Full Text from a PDF

from txtai.pipeline import Textractor

# Create a Textractor with no chunking
textractor = Textractor()

# Extract text from a PDF file
text = textractor("/data/documents/report.pdf")
print(text)  # Full document text as Markdown string

Paragraph Chunking with Minimum Length

from txtai.pipeline import Textractor

# Create a Textractor that splits into paragraphs
textractor = Textractor(paragraphs=True, minlength=100)

# Extract and chunk a document
chunks = textractor("/data/documents/whitepaper.pdf")
# chunks is a list of paragraph strings, each at least 100 chars
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")

Extracting from URLs

from txtai.pipeline import Textractor

# Create a Textractor for sentence-level extraction
textractor = Textractor(sentences=True)

# Extract sentences from a web page
sentences = textractor("https://example.com/article.html")
print(f"Extracted {len(sentences)} sentences")

Section-Based Chunking for RAG

from txtai.pipeline import Textractor

# Section-level chunking ideal for RAG pipelines
textractor = Textractor(sections=True, minlength=50)

# Process multiple documents
documents = [
    "/data/docs/manual.pdf",
    "/data/docs/guide.docx",
    "https://example.com/faq.html",
]

all_chunks = []
for doc in documents:
    chunks = textractor(doc)
    if isinstance(chunks, list):
        all_chunks.extend(chunks)
    else:
        all_chunks.append(chunks)

print(f"Total chunks: {len(all_chunks)}")

Using Custom HTTP Headers

from txtai.pipeline import Textractor

# Set custom headers for authenticated URL access
textractor = Textractor(
    paragraphs=True,
    headers={"Authorization": "Bearer YOUR_TOKEN"}
)

# With paragraphs=True the result is a list of paragraph strings
paragraphs = textractor("https://api.example.com/protected/document.html")
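For intuition, the following sketch shows one way custom headers could be attached when a URL is downloaded to a temporary file. It uses only the Python standard library and is an assumption about the general approach, not txtai's download code. `build_request` and `download` are hypothetical names.

```python
import shutil
import tempfile
import urllib.request


def build_request(url, headers=None):
    """Create a request that carries the optional custom headers (hypothetical helper)."""
    return urllib.request.Request(url, headers=headers or {})


def download(url, headers=None):
    """Download url to a temporary file and return its path (performs network I/O)."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request) as response, tempfile.NamedTemporaryFile(delete=False) as output:
        # Stream the response body into the temporary file
        shutil.copyfileobj(response, output)
        return output.name
```

Separating request construction from the download keeps the header handling easy to inspect without touching the network.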

Related Pages

Implements Principle