Implementation:Neuml Txtai Textractor Call
| Knowledge Sources | |
|---|---|
| Domains | NLP, RAG |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool, provided by the txtai library, for extracting text from files, URLs, and raw input and segmenting it into indexable chunks.
Description
The Textractor class extends Segmentation to provide a unified pipeline for converting heterogeneous document sources into clean, structured text. It accepts local file paths, remote URLs, and raw HTML/text strings as input. The extraction process converts content to HTML via a configurable backend (e.g., Apache Tika), then transforms the HTML into Markdown. The inherited segmentation logic then splits the result into sentences, lines, paragraphs, sections, or custom chunks based on the constructor parameters.
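The segmentation flags described above can be illustrated with a small standard-library sketch. It is a simplified stand-in, not txtai's implementation: the `segment` function and its splitting rules (single newlines for lines, blank lines for paragraphs) are assumptions made for illustration, and sentence, section, and chunker modes are omitted.

```python
import re


def segment(text, lines=False, paragraphs=False, minlength=None, join=False):
    """Simplified stand-in for flag-driven segmentation (not txtai's code)."""

    def clean(chunk):
        # Collapse whitespace runs and trim the ends
        chunk = re.sub(r"\s+", " ", chunk).strip()
        # Drop chunks shorter than minlength, when a minimum is set
        if minlength and len(chunk) < minlength:
            return None
        return chunk

    if lines:
        parts = re.split(r"\n+", text)
    elif paragraphs:
        parts = re.split(r"\n{2,}", text)
    else:
        # No tokenization requested: return one cleaned string
        return clean(text)

    # Keep only chunks that survived cleaning and length filtering
    chunks = [c for c in (clean(p) for p in parts) if c]
    return " ".join(chunks) if join else chunks


markdown = "# Title\n\nFirst paragraph,\nwrapped over two lines.\n\nOk.\n\nSecond paragraph."
print(segment(markdown, paragraphs=True))
print(segment(markdown, paragraphs=True, minlength=10))
print(segment(markdown, lines=True, join=True))
```

In this sketch, `minlength` filters after cleaning, and `join` only changes the final shape: the surviving chunks are merged into one string instead of being returned as a list.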
The __init__ method configures the extraction backend, HTML-to-Markdown converter, and HTTP headers for remote fetches. The __call__ method (inherited from Segmentation) orchestrates the full pipeline: for each input, it calls the text method (overridden by Textractor) to perform extraction, then applies parsing, cleaning, and optional filtering.
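The division of labor between the inherited `__call__` and the overridden `text` method can be sketched as a template-method pattern. The classes below (`MiniSegmentation`, `MiniTextractor`) are hypothetical and use only the standard library; the regex-based HTML stripping is a crude placeholder for the real backend and HTML-to-Markdown converter.

```python
import html
import re


class MiniSegmentation:
    """Hypothetical base class: __call__ drives text() then parse()."""

    def __init__(self, paragraphs=False):
        self.paragraphs = paragraphs

    def __call__(self, text):
        # A single string yields one result; a list yields one result per item
        inputs = [text] if isinstance(text, str) else text
        results = [self.parse(self.text(item)) for item in inputs]
        return results[0] if isinstance(text, str) else results

    def text(self, text):
        # Base behavior: the input is already plain text
        return text

    def parse(self, text):
        # Split on blank lines when paragraph mode is on, else return one string
        if self.paragraphs:
            return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
        return text.strip()


class MiniTextractor(MiniSegmentation):
    """Hypothetical subclass: only the extraction hook changes."""

    def text(self, text):
        # Crude HTML-to-text step standing in for the real extraction stage
        text = re.sub(r"</p\s*>", "\n\n", text, flags=re.IGNORECASE)
        text = re.sub(r"<[^>]+>", "", text)
        return html.unescape(text)


extractor = MiniTextractor(paragraphs=True)
single = extractor("<p>Alpha &amp; beta</p><p>Gamma</p>")
batch = extractor(["<p>One</p>", "<p>Two</p><p>Three</p>"])
print(single)
print(batch)
```

The subclass never touches `__call__`, so input handling and segmentation stay in one place while extraction can vary.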
Usage
Import and use Textractor when building the document ingestion stage of a RAG pipeline. It serves as the first step before embedding and indexing, converting raw documents into text segments suitable for an embeddings index.
Code Reference
Source Location
- Repository: txtai
- File: src/python/txtai/pipeline/data/textractor.py
- Lines: 17-139
Signature
class Textractor(Segmentation):
    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):

    # Inherited from Segmentation
    def __call__(self, text):
Import
from txtai.pipeline import Textractor
I/O Contract
Inputs
__init__ Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| sentences | bool | No | Tokenize text into sentences if True, defaults to False |
| lines | bool | No | Tokenize text into lines if True, defaults to False |
| paragraphs | bool | No | Tokenize text into paragraphs if True, defaults to False |
| minlength | int | No | Require at least minlength characters per text element, defaults to None |
| join | bool | No | Join tokenized sections back together if True, defaults to False |
| sections | bool | No | Tokenize text into sections (splits on section or page breaks) if True, defaults to False |
| cleantext | bool | No | Apply text cleaning rules, defaults to True |
| chunker | str | No | Name of a third-party chunker to tokenize text, defaults to None |
| headers | dict | No | HTTP headers for remote URL requests, defaults to None (no extra headers) |
| backend | str | No | File-to-HTML conversion backend, defaults to "available" |
| **kwargs | dict | No | Additional keyword arguments passed to Segmentation and chunker |
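As a rough illustration of what the `headers` parameter is for, the sketch below attaches custom HTTP headers to a request object with the standard library, without sending anything. The `build_request` helper is hypothetical, and it is an assumption that txtai applies headers to remote fetches in a comparable way.

```python
from urllib.request import Request


def build_request(url, headers=None):
    """Hypothetical helper: attach optional headers without sending the request."""
    # None is treated the same as an empty dict, i.e. no extra headers
    return Request(url, headers=headers if headers else {})


request = build_request(
    "https://example.com/article.html",
    headers={"User-Agent": "docs-example/1.0"},
)

# urllib normalizes header names, so "User-Agent" is stored as "User-agent"
print(request.get_header("User-agent"))
```

A custom `User-Agent` is a common reason to set headers, since some sites reject requests that carry a default client identifier.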
__call__ Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| text | str or list | Yes | A file path, URL, raw text/HTML string, or a list of these inputs |
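Because a single string argument may be a URL, a local path, or inline content, some form of input classification has to happen before extraction. The sketch below shows one plausible way to make that decision with the standard library; the `classify` helper and its rules are assumptions for illustration, not txtai's actual checks.

```python
import os
import tempfile
from urllib.parse import urlparse


def classify(source):
    """Hypothetical helper: label one input as 'url', 'file', or 'raw'."""
    try:
        parsed = urlparse(source)
    except ValueError:
        # Malformed URL-like text is handled as raw content
        parsed = None
    if parsed and parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if os.path.exists(source):
        return "file"
    # Anything else is treated as inline text or HTML
    return "raw"


# A temporary file stands in for a real local document
with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
    local_kind = classify(handle.name)

url_kind = classify("https://example.com/article.html")
raw_kind = classify("<p>Inline HTML</p>")
print(local_kind, url_kind, raw_kind)
```

One consequence of this kind of fallback is that a mistyped file path is silently treated as raw text, so it is worth validating paths before passing them in.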
Outputs
| Name | Type | Description |
|---|---|---|
| result | str, list, or list of lists | Segmented text. Returns a string if no tokenization is enabled and input is a string. Returns a list of strings if tokenization is enabled and input is a string. Returns a list of results if input is a list. |
Usage Examples
Basic Example
from txtai.pipeline import Textractor
# Extract text from a file, split into paragraphs
textractor = Textractor(paragraphs=True)
paragraphs = textractor("path/to/document.pdf")
for paragraph in paragraphs:
    print(paragraph)
Sentence-Level Extraction from URL
from txtai.pipeline import Textractor
# Extract and split into sentences with minimum length filtering
textractor = Textractor(sentences=True, minlength=20)
sentences = textractor("https://example.com/article.html")
for sentence in sentences:
    print(sentence)
Batch Processing with Join
from txtai.pipeline import Textractor
# Extract from multiple files, join segments into single strings
textractor = Textractor(paragraphs=True, join=True)
texts = textractor(["doc1.pdf", "doc2.html", "doc3.txt"])
# Each element in texts is a single joined string
for text in texts:
    print(text[:200])
RAG Pipeline Integration
from txtai.pipeline import Textractor
from txtai import Embeddings
# Step 1: Extract and segment documents
textractor = Textractor(paragraphs=True, minlength=50)
chunks = textractor("corpus.pdf")
# Step 2: Index chunks for RAG
embeddings = Embeddings({"content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(chunks)])