
Implementation:Neuml Txtai Textractor Call

From Leeroopedia
Revision as of 16:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Neuml_Txtai_Textractor_Call.md)


Knowledge Sources
Domains NLP, RAG
Last Updated 2026-02-10 00:00 GMT

Overview

Concrete tool, provided by the txtai library, for extracting text from files, URLs, and raw input and segmenting it into indexable chunks.

Description

The Textractor class extends Segmentation to provide a unified pipeline for converting heterogeneous document sources into clean, structured text. It accepts local file paths, remote URLs, and raw HTML/text strings as input. The extraction process converts content to HTML via a configurable backend (e.g., Apache Tika), then transforms the HTML into Markdown. The inherited segmentation logic then splits the result into sentences, lines, paragraphs, sections, or custom chunks based on the constructor parameters.

The __init__ method configures the extraction backend, HTML-to-Markdown converter, and HTTP headers for remote fetches. The __call__ method (inherited from Segmentation) orchestrates the full pipeline: for each input, it calls the text method (overridden by Textractor) to perform extraction, then applies parsing, cleaning, and optional filtering.
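The call flow described above can be pictured with a small standalone sketch. The classes ToySegmentation and ToyExtractor below are hypothetical stand-ins written for illustration, not txtai's source. The tag-stripping step is only a placeholder for the real file-to-HTML and HTML-to-Markdown conversion, and the splitting and cleaning rules are simplifying assumptions.

```python
import re


class ToySegmentation:
    """Hypothetical stand-in for a segmentation base class (not txtai code)."""

    def __init__(self, paragraphs=False, minlength=None, join=False):
        self.paragraphs = paragraphs
        self.minlength = minlength
        self.join = join

    def text(self, value):
        # Base behavior: treat the input as text that is already extracted
        return value

    def clean(self, value):
        # Collapse repeated spaces and trim surrounding whitespace
        value = re.sub(r" +", " ", value).strip()
        # Drop elements shorter than minlength when a minimum is set
        if self.minlength and len(value) < self.minlength:
            return None
        return value

    def parse(self, value):
        if not self.paragraphs:
            return self.clean(value)
        # Split on blank lines, clean each piece, discard empty results
        parts = [self.clean(p) for p in re.split(r"\n{2,}", value)]
        parts = [p for p in parts if p]
        return " ".join(parts) if self.join else parts

    def __call__(self, text):
        # Accept one input or a list of inputs
        inputs = text if isinstance(text, list) else [text]
        results = [self.parse(self.text(value)) for value in inputs]
        # Mirror the input shape: one result for a string, a list otherwise
        return results[0] if isinstance(text, str) else results


class ToyExtractor(ToySegmentation):
    """Hypothetical subclass that overrides only the extraction hook."""

    def text(self, value):
        # Placeholder extraction step: strip HTML-like tags from the input
        return re.sub(r"<[^>]+>", "", value)


extractor = ToyExtractor(paragraphs=True, minlength=5)
result = extractor("<p>First paragraph.</p>\n\n<p>Hi</p>\n\n<p>Second   paragraph.</p>")
print(result)
```

In this arrangement the subclass replaces only the extraction hook, while dispatch over inputs, splitting, cleaning, and filtering stay in the base class.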

Usage

Import and use Textractor when building the document ingestion stage of a RAG pipeline. It serves as the first step before embedding and indexing, converting raw documents into text segments suitable for an embeddings index.

Code Reference

Source Location

  • Repository: txtai
  • File: src/python/txtai/pipeline/data/textractor.py
  • Lines: 17-139

Signature

class Textractor(Segmentation):

    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):

    # Inherited from Segmentation
    def __call__(self, text):

Import

from txtai.pipeline import Textractor

I/O Contract

Inputs

__init__ Parameters

| Name | Type | Required | Description |
|---|---|---|---|
| sentences | bool | No | Tokenize text into sentences if True, defaults to False |
| lines | bool | No | Tokenize text into lines if True, defaults to False |
| paragraphs | bool | No | Tokenize text into paragraphs if True, defaults to False |
| minlength | int | No | Require at least minlength characters per text element, defaults to None |
| join | bool | No | Join tokenized sections back together if True, defaults to False |
| sections | bool | No | Tokenize text into sections (splits on section or page breaks) if True, defaults to False |
| cleantext | bool | No | Apply text cleaning rules, defaults to True |
| chunker | str | No | Name of a third-party chunker to tokenize text, defaults to None |
| headers | dict | No | HTTP headers for remote URL requests, defaults to None (treated as an empty dict) |
| backend | str | No | File-to-HTML conversion backend, defaults to "available" |
| **kwargs | dict | No | Additional keyword arguments passed to Segmentation and chunker |

__call__ Parameters

| Name | Type | Required | Description |
|---|---|---|---|
| text | str or list | Yes | A file path, URL, raw text/HTML string, or a list of these inputs |
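Because a single string argument can be a path, a URL, or literal content, it can help to check how a value is likely to be interpreted before passing it in. The helper below is a hypothetical sketch using only the standard library. classify_input is not part of txtai, and Textractor's own detection rules may differ.

```python
import os
from urllib.parse import urlparse


def classify_input(value):
    """Hypothetical helper: label an input as 'url', 'file', or 'text'."""
    try:
        parsed = urlparse(value)
    except ValueError:
        # Malformed URL-like strings are treated as raw text
        return "text"
    # An http(s) scheme with a host is treated as a remote URL
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # An existing local path is treated as a file
    if os.path.isfile(value):
        return "file"
    # Anything else is assumed to be raw text or HTML
    return "text"


print(classify_input("https://example.com/article.html"))
print(classify_input("<p>Raw HTML content</p>"))
```

A path that does not exist on disk falls through to "text" in this sketch, so a mistyped file name would be treated as literal content rather than raising an error.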

Outputs

| Name | Type | Description |
|---|---|---|
| result | str, list, or list of lists | Segmented text. Returns a string if no tokenization is enabled and input is a string. Returns a list of strings if tokenization is enabled and input is a string. Returns a list of results if input is a list. |
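Since the return shape depends on both the input type and the tokenization settings, downstream code may find it convenient to normalize results to a flat list of strings. flatten_segments below is a hypothetical helper, not a txtai function, shown as one possible approach.

```python
def flatten_segments(result):
    """Hypothetical helper: normalize any documented return shape to a flat list."""
    # A single string (no tokenization, string input) becomes a one-item list
    if isinstance(result, str):
        return [result]
    flat = []
    # Lists may hold strings or nested lists (list input with tokenization)
    for item in result:
        flat.extend(flatten_segments(item))
    return flat


print(flatten_segments("whole document"))
print(flatten_segments(["first", "second"]))
print(flatten_segments([["a1", "a2"], ["b1"]]))
```

Flattening discards the boundary between source documents, so it suits cases where chunks are indexed without tracking which input they came from.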

Usage Examples

Basic Example

from txtai.pipeline import Textractor

# Extract text from a file, split into paragraphs
textractor = Textractor(paragraphs=True)
paragraphs = textractor("path/to/document.pdf")

for paragraph in paragraphs:
    print(paragraph)

Sentence-Level Extraction from URL

from txtai.pipeline import Textractor

# Extract and split into sentences with minimum length filtering
textractor = Textractor(sentences=True, minlength=20)
sentences = textractor("https://example.com/article.html")

for sentence in sentences:
    print(sentence)

Batch Processing with Join

from txtai.pipeline import Textractor

# Extract from multiple files, join segments into single strings
textractor = Textractor(paragraphs=True, join=True)
texts = textractor(["doc1.pdf", "doc2.html", "doc3.txt"])

# Each element in texts is a single joined string
for text in texts:
    print(text[:200])

RAG Pipeline Integration

from txtai.pipeline import Textractor
from txtai import Embeddings

# Step 1: Extract and segment documents
textractor = Textractor(paragraphs=True, minlength=50)
chunks = textractor("corpus.pdf")

# Step 2: Index chunks for RAG
embeddings = Embeddings({"content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(chunks)])

Related Pages

Implements Principle
