Implementation:Neuml Txtai Textractor
| Knowledge Sources | |
|---|---|
| Domains | NLP, RAG |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for extracting text from diverse file formats and segmenting it into chunks, provided by the txtai library.
Description
The Textractor class is a txtai pipeline that converts documents in various formats (PDF, DOCX, HTML, URLs, raw text) into clean, segmented Markdown text. It extends the Segmentation base class and adds file-format handling through a three-stage internal pipeline: FileToHTML converts binary documents to HTML, HTMLToMarkdown normalizes HTML to Markdown, and the inherited Segmentation logic splits the Markdown into chunks.
Textractor auto-detects whether each input is a local file path, an HTTP/HTTPS URL, or raw text/HTML. For URLs, it downloads content to a temporary file before processing. For local files, it uses the configured backend (Apache Tika or the best available parser) to convert to HTML. For raw text or HTML strings, it passes the content directly to the Markdown converter.
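The detection order described above can be illustrated with a minimal, standard-library sketch. The helper name `classify_input` is hypothetical and the checks are assumptions made for illustration; this is not txtai's actual implementation.

```python
import os
import tempfile
from urllib.parse import urlparse


def classify_input(text: str) -> str:
    """Return "url", "file", or "text" for a single input string.

    Hypothetical helper for illustration only; not part of txtai.
    """
    # HTTP/HTTPS URLs would be downloaded to a temporary file first
    if urlparse(text).scheme.lower() in ("http", "https"):
        return "url"
    # Paths that exist on disk would go through the file-to-HTML backend
    if os.path.exists(text):
        return "file"
    # Anything else is treated as raw text or HTML
    return "text"


# Classify one example of each input kind
with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
    kinds = [
        classify_input(value)
        for value in ("https://example.com/article.html", handle.name, "<p>Hello</p>")
    ]
print(kinds)
```

Checking for a URL scheme before touching the filesystem keeps remote inputs from being mistaken for local paths.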
The chunking behavior is controlled by mutually exclusive boolean flags: sentences, lines, paragraphs, and sections. When none are set, the full extracted text is returned as a single string. The minlength parameter filters out segments shorter than a specified character count. The chunker parameter allows integration with third-party chunking libraries such as Chonkie.
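To make the chunking flags concrete, here is a rough pure-Python sketch of how such modes could behave. The function `segment`, its split patterns, and the use of a form feed as the section/page marker are assumptions for illustration, not the library's code; sentence splitting is omitted because it relies on NLTK.

```python
import re


def segment(text, lines=False, paragraphs=False, sections=False, minlength=None):
    """Split text by the first enabled mode; return the whole string if none is set.

    Illustrative only: the split patterns are assumptions, not txtai's code.
    """
    if lines:
        parts = re.split(r"\n+", text)
    elif paragraphs:
        parts = re.split(r"\n{2,}", text)
    elif sections:
        # Assumption: a form feed character marks a section/page break
        parts = text.split("\f")
    else:
        return text.strip()

    # Trim whitespace, drop empty segments, then apply the minimum length filter
    parts = [part.strip() for part in parts if part.strip()]
    if minlength:
        parts = [part for part in parts if len(part) >= minlength]
    return parts


sample = "# Title\n\nFirst paragraph, long enough to keep.\n\nShort.\fSecond page text."

print(segment(sample))                                # single string, no chunking
print(segment(sample, sections=True))                 # split on the form feed
print(segment(sample, paragraphs=True, minlength=20)) # short segments dropped
```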
Usage
Use Textractor when you need to:
- Extract readable text from PDFs, DOCX, XLSX, PPTX, or other binary document formats.
- Scrape and clean content from web URLs.
- Segment extracted text into sentences, lines, paragraphs, or sections for indexing.
- Prepare text chunks as input for an embeddings index in a RAG pipeline.
Code Reference
Source Location
- Repository: txtai
- File: src/python/txtai/pipeline/data/textractor.py
- Lines: L23-79
Signature
class Textractor(Segmentation):
    def __init__(
        self,
        sentences=False,
        lines=False,
        paragraphs=False,
        minlength=None,
        join=False,
        sections=False,
        cleantext=True,
        chunker=None,
        headers=None,
        backend="available",
        **kwargs
    ):
        ...

    def __call__(self, text):
        ...
Import
from txtai.pipeline import Textractor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str or list[str] | Yes | File path(s), URL(s), or raw text/HTML string(s) to extract from |
| sentences | bool | No | Tokenize output into sentences using NLTK. Default: False |
| lines | bool | No | Split output on line breaks. Default: False |
| paragraphs | bool | No | Split output on double line breaks (paragraph boundaries). Default: False |
| sections | bool | No | Split output on section/page breaks. Default: False |
| minlength | int or None | No | Minimum character length for segments; shorter segments are discarded. Default: None |
| join | bool | No | Rejoin tokenized sections into a single string. Default: False |
| cleantext | bool | No | Apply text cleaning rules to extracted text. Default: True |
| chunker | str or None | No | Name of a third-party chunker (e.g., Chonkie) for custom tokenization. Default: None |
| headers | dict or None | No | HTTP headers to use when downloading remote URLs. Default: None |
| backend | str | No | Parser backend: "tika" for Apache Tika, "available" to auto-detect. Default: "available" |
Outputs
| Name | Type | Description |
|---|---|---|
| result | str or list[str] | Extracted Markdown text. Returns a single string if no chunking mode is set; returns a list[str] of segments if a chunking mode (sentences, lines, paragraphs, sections, or chunker) is enabled. |
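Because the return type depends on the configuration, downstream code may want to normalize it. The helper below, `as_chunks`, is a hypothetical convenience function (not part of txtai) that sketches one way to always work with a list of non-empty strings.

```python
def as_chunks(result):
    """Return a list of non-empty strings from a str or list[str] result.

    Hypothetical helper for illustration; not part of txtai.
    """
    if isinstance(result, str):
        # No chunking mode: wrap the single string, skipping empty output
        return [result] if result else []
    # Chunking mode: keep the segments, dropping any empty ones
    return [chunk for chunk in result if chunk]


print(as_chunks("Full document text"))
print(as_chunks(["First segment", "", "Second segment"]))
```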
Usage Examples
Basic Example: Extract Full Text from a PDF
from txtai.pipeline import Textractor
# Create a Textractor with no chunking
textractor = Textractor()
# Extract text from a PDF file
text = textractor("/data/documents/report.pdf")
print(text) # Full document text as Markdown string
Paragraph Chunking with Minimum Length
from txtai.pipeline import Textractor
# Create a Textractor that splits into paragraphs
textractor = Textractor(paragraphs=True, minlength=100)
# Extract and chunk a document
chunks = textractor("/data/documents/whitepaper.pdf")
# chunks is a list of paragraph strings, each at least 100 chars
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
Extracting from URLs
from txtai.pipeline import Textractor
# Create a Textractor for sentence-level extraction
textractor = Textractor(sentences=True)
# Extract sentences from a web page
sentences = textractor("https://example.com/article.html")
print(f"Extracted {len(sentences)} sentences")
Section-Based Chunking for RAG
from txtai.pipeline import Textractor
# Section-level chunking ideal for RAG pipelines
textractor = Textractor(sections=True, minlength=50)
# Process multiple documents
documents = [
    "/data/docs/manual.pdf",
    "/data/docs/guide.docx",
    "https://example.com/faq.html",
]
all_chunks = []
for doc in documents:
    chunks = textractor(doc)
    if isinstance(chunks, list):
        all_chunks.extend(chunks)
    else:
        all_chunks.append(chunks)
print(f"Total chunks: {len(all_chunks)}")
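When chunks from several documents are pooled like this, it can help to record which source each chunk came from before indexing. The following is a minimal sketch under stated assumptions: `to_records` is a hypothetical helper, the `source#position` id scheme is arbitrary, and the chunk strings are placeholders standing in for real Textractor output. The `(id, text, tags)` tuple layout is assumed here as the record shape passed on to an embeddings index.

```python
def to_records(source, chunks):
    """Build (id, text, tags) tuples with ids derived from the source and position.

    Hypothetical helper for illustration; not part of txtai.
    """
    return [(f"{source}#{index}", chunk, None) for index, chunk in enumerate(chunks)]


# Placeholder chunks standing in for extracted sections
extracted = {
    "/data/docs/manual.pdf": ["Installation steps", "Troubleshooting"],
    "/data/docs/guide.docx": ["Getting started"],
}

records = []
for source, chunks in extracted.items():
    records.extend(to_records(source, chunks))

for record in records:
    print(record)
```

Deriving the id from the source and position keeps ids stable across runs as long as the chunking settings do not change.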
Using Custom HTTP Headers
from txtai.pipeline import Textractor
# Set custom headers for authenticated URL access
textractor = Textractor(
    paragraphs=True,
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)
text = textractor("https://api.example.com/protected/document.html")