Implementation:Neuml Txtai Document Collection Pattern
| Knowledge Sources | |
|---|---|
| Domains | NLP, RAG |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete pattern for gathering source documents (file paths, URLs, or raw text strings) as the first step of a txtai RAG pipeline.
Description
Document collection in a txtai RAG workflow is user-defined code, not a library API. The user is responsible for assembling a list[str] of inputs that will be passed to the Textractor pipeline for extraction and chunking. Each string in the list can be a local file path, an HTTP/HTTPS URL, or a raw text string.
The Textractor pipeline (the next stage) accepts each of these input types transparently: it detects whether a string is a valid local file, a URL, or raw text, and handles each case accordingly. This means the document collection step simply needs to produce a flat list of strings -- no special formatting, wrapping, or object construction is required.
Because this is a pattern rather than a specific API, users have full freedom in how they construct the list. Common approaches include directory scanning with os.walk or glob, reading paths from a configuration file, querying a database for URLs, or combining multiple sources programmatically.
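As an illustration of the configuration-file approach, the sketch below reads source references from a plain-text file, one per line. The helper name `load_sources` and the file format (blank lines and `#` comment lines are ignored) are assumptions made for this example; neither is part of txtai.

```python
from pathlib import Path


def load_sources(list_file: str) -> list[str]:
    """Read one source reference per line, skipping blanks and # comments.

    Hypothetical helper: the one-entry-per-line format is an assumption.
    """
    documents = []
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        # Ignore empty lines and comment lines
        if entry and not entry.startswith("#"):
            documents.append(entry)
    return documents
```

The returned list can mix file paths and URLs freely, since the collection step only needs to produce strings.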
Usage
Use this pattern at the start of any txtai RAG pipeline to assemble the input document set. Apply it when you need to:
- Scan a directory tree for files of specific types (PDF, DOCX, HTML, TXT).
- Collect URLs from a web scraping step or a URL list file.
- Combine in-memory text with file-based sources into a single input list.
- Filter or deduplicate source references before passing them to extraction.
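Deduplication can be sketched as follows. The helper name `deduplicate` is hypothetical, and the choice to compare existing local files by their resolved path (so that `./a.pdf` and `a.pdf` count as the same source) is an assumption of this example rather than anything txtai requires.

```python
import os


def deduplicate(sources: list[str]) -> list[str]:
    """Drop repeated source references while preserving first-seen order."""
    seen = set()
    unique = []
    for source in sources:
        # Existing local files are compared by resolved path;
        # URLs and raw text are compared as exact strings.
        key = os.path.realpath(source) if os.path.exists(source) else source
        if key not in seen:
            seen.add(key)
            unique.append(source)
    return unique
```

Order is preserved so that downstream chunk ordering stays stable across runs.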
Code Reference
Source Location
- Repository: txtai
- File: User-defined code (no specific library file)
- Lines: N/A -- this is a usage pattern, not a library API
Interface Contract
The output of the document collection step must conform to the following interface:
documents: list[str]
# Each element is one of:
# - A local file path: "/data/reports/q3_report.pdf"
# - An HTTP/HTTPS URL: "https://example.com/article.html"
# - A raw text string: "This is some inline text content."
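A lightweight check of this contract might look like the sketch below. The function name `validate_documents` is hypothetical, and rejecting empty or whitespace-only strings is an extra assumption of this example, not a rule stated by txtai.

```python
def validate_documents(documents) -> list[str]:
    """Verify that documents is a flat list of non-empty strings."""
    if not isinstance(documents, list):
        raise TypeError("documents must be a list")
    for index, item in enumerate(documents):
        if not isinstance(item, str):
            raise TypeError(f"documents[{index}] is not a str")
        # Assumption: an empty entry is treated as a collection bug
        if not item.strip():
            raise ValueError(f"documents[{index}] is empty")
    return documents
```

Running a check like this before extraction surfaces mistakes such as nested lists or `None` entries at the point where they were introduced.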
Accepted Input Types
| Input Type | Example | How Textractor Handles It |
|---|---|---|
| Local file path | "/data/docs/report.pdf" | Detects that the file exists on disk, parses via FileToHTML backend |
| HTTP/HTTPS URL | "https://example.com/page.html" | Downloads content, parses via FileToHTML or reads directly |
| File URL | "file:///data/docs/report.pdf" | Strips file:// prefix, treats as local path |
| Raw text/HTML | "This is some inline text content." | Detects that the string is not a valid path or URL, treats it as raw HTML input |
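To preview how a collected list breaks down across the categories in the table, a classifier along these lines can be useful. This is a rough approximation written for this page, not Textractor's actual detection code; the function name `classify_source` and the returned labels are hypothetical.

```python
import os


def classify_source(source: str) -> str:
    """Roughly bucket a source string into one of the table's input types."""
    lowered = source.lower()
    if lowered.startswith("file://"):
        return "file_url"
    if lowered.startswith(("http://", "https://")):
        return "url"
    if os.path.exists(source):
        return "local_file"
    # Anything else is assumed to be raw text or HTML
    return "raw_text"
```

A mistyped file path falls into the `raw_text` bucket here, which makes this a quick way to spot paths that do not exist before extraction starts.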
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source_directories | list[str] | No | Directory paths to scan for files |
| file_extensions | list[str] | No | File extensions to include (e.g., [".pdf", ".docx"]) |
| urls | list[str] | No | Explicit URLs to include |
| raw_texts | list[str] | No | In-memory text strings to include |
Outputs
| Name | Type | Description |
|---|---|---|
| documents | list[str] | Flat list of file paths, URLs, or raw text strings ready for Textractor |
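One possible way to wire these inputs and the output together is sketched below. The function name `collect_documents` is hypothetical, and sorting filenames within each directory is an assumption added for reproducible ordering; the parameter names follow the Inputs table.

```python
import os


def collect_documents(source_directories=None, file_extensions=None,
                      urls=None, raw_texts=None) -> list[str]:
    """Assemble a flat list[str] from the optional inputs described above."""
    documents = []
    # None means "accept every extension"
    extensions = {ext.lower() for ext in file_extensions} if file_extensions else None

    for directory in source_directories or []:
        for root, _dirs, files in os.walk(directory):
            for filename in sorted(files):
                ext = os.path.splitext(filename)[1].lower()
                if extensions is None or ext in extensions:
                    documents.append(os.path.join(root, filename))

    # URLs and raw text are passed through unchanged
    documents.extend(urls or [])
    documents.extend(raw_texts or [])
    return documents
```

Every input is optional, so calling the function with no arguments returns an empty list.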
Usage Examples
Basic Example: Scanning a Directory
import glob
# Collect all PDF files from a directory
documents = glob.glob("/data/reports/**/*.pdf", recursive=True)
# Result: ["/data/reports/q1.pdf", "/data/reports/q2.pdf", ...]
Collecting URLs
# Collect documents from a list of URLs
documents = [
    "https://example.com/article1.html",
    "https://example.com/article2.html",
    "https://example.com/whitepaper.pdf",
]
Mixed Sources
import glob
# Combine local files, URLs, and raw text
local_files = glob.glob("/data/docs/*.pdf")
urls = ["https://example.com/faq.html"]
raw_text = ["This is an in-memory note about project requirements."]
documents = local_files + urls + raw_text
Filtered Collection with os.walk
import os
# Collect files with filtering by extension and size
allowed_extensions = {".pdf", ".docx", ".html", ".txt"}
max_size_bytes = 50 * 1024 * 1024  # 50 MB
documents = []
for root, dirs, files in os.walk("/data/corpus"):
    for filename in files:
        filepath = os.path.join(root, filename)
        _, ext = os.path.splitext(filename)
        if ext.lower() in allowed_extensions:
            if os.path.getsize(filepath) <= max_size_bytes:
                documents.append(filepath)
Full Pipeline Integration
import glob
from txtai.pipeline import Textractor
# Step 1: Document collection (user-defined)
documents = glob.glob("/data/knowledge_base/**/*.pdf", recursive=True)
# Step 2: Pass collected documents to Textractor
textractor = Textractor(paragraphs=True, minlength=100)
chunks = []
for doc in documents:
    result = textractor(doc)
    if isinstance(result, list):
        chunks.extend(result)
    else:
        chunks.append(result)
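In larger collections a single unreadable file can abort the whole loop. The sketch below wraps the same flattening logic in per-document error handling. The helper name `extract_all` is hypothetical, and the extractor is passed in as a plain callable so the example does not depend on txtai being installed.

```python
from typing import Callable, Iterable


def extract_all(documents: Iterable[str], extractor: Callable[[str], object]):
    """Run extractor on each document and return (chunks, failures)."""
    chunks, failures = [], []
    for doc in documents:
        try:
            result = extractor(doc)
        except Exception as error:  # broad on purpose: parsers fail in many ways
            failures.append((doc, str(error)))
            continue
        # Flatten list results, keep single results as one chunk
        if isinstance(result, list):
            chunks.extend(result)
        else:
            chunks.append(result)
    return chunks, failures
```

A Textractor instance such as the one created in Step 2 can be passed as `extractor`; the `failures` list then records which sources could not be processed and why.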