Implementation:Neuml Txtai Document Collection Pattern

Knowledge Sources	txtai, txtai Documentation
Domains	NLP, RAG
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete pattern for gathering source documents (file paths, URLs, or raw text strings) as the first step of a txtai RAG pipeline.

Description

Document collection in a txtai RAG workflow is user-defined code, not a library API. The user is responsible for assembling a list[str] of inputs that will be passed to the Textractor pipeline for extraction and chunking. Each string in the list can be a local file path, an HTTP/HTTPS URL, or a raw text string.

The Textractor pipeline (the next stage) accepts each of these input types transparently: it detects whether a string is a valid local file, a URL, or raw text, and handles each case accordingly. This means the document collection step simply needs to produce a flat list of strings -- no special formatting, wrapping, or object construction is required.

Because this is a pattern rather than a specific API, users have full freedom in how they construct the list. Common approaches include directory scanning with os.walk or glob, reading paths from a configuration file, querying a database for URLs, or combining multiple sources programmatically.

Usage

Use this pattern at the start of any txtai RAG pipeline to assemble the input document set. Apply it when you need to:

Scan a directory tree for files of specific types (PDF, DOCX, HTML, TXT).
Collect URLs from a web scraping step or a URL list file.
Combine in-memory text with file-based sources into a single input list.
Filter or deduplicate source references before passing them to extraction.
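
The deduplication case in the last item can be sketched as follows. This is a minimal illustration, not part of txtai: `deduplicate_sources` is a hypothetical helper name, and comparing local files by absolute path while comparing URLs and raw text by exact string is an assumption about what "duplicate" should mean.

```python
import os


def deduplicate_sources(sources):
    """Return sources with duplicates removed, preserving first-seen order.

    Hypothetical helper (not a txtai API). Entries that exist as local files
    are compared by absolute path; URLs and raw text are compared as-is.
    """
    seen = set()
    unique = []
    for source in sources:
        # Normalize only real files so "./a.pdf" and "a.pdf" collapse into one
        key = os.path.abspath(source) if os.path.isfile(source) else source
        if key not in seen:
            seen.add(key)
            unique.append(source)
    return unique
```

Keeping the first occurrence preserves the order in which sources were collected, which can make later extraction output easier to trace back to its inputs.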

Code Reference

Source Location

Repository: txtai
File: User-defined code (no specific library file)
Lines: N/A -- this is a usage pattern, not a library API

Interface Contract

The output of the document collection step must conform to the following interface:

documents: list[str]
# Each element is one of:
#   - A local file path: "/data/reports/q3_report.pdf"
#   - An HTTP/HTTPS URL: "https://example.com/article.html"
#   - A raw text string: "This is some inline text content."
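
Since the contract is only a convention, a small check before extraction can catch mistakes such as passing `pathlib.Path` objects instead of strings. The sketch below assumes a hypothetical helper named `validate_documents`. Rejecting empty strings is an extra assumption, not something the contract above states.

```python
def validate_documents(documents):
    """Check that documents is a flat list of non-empty strings.

    Hypothetical helper (not a txtai API). Returns the list unchanged
    so it can be used inline.
    """
    if not isinstance(documents, list):
        raise TypeError("documents must be a list")
    for index, item in enumerate(documents):
        if not isinstance(item, str):
            raise TypeError(
                f"documents[{index}] is {type(item).__name__}, expected str"
            )
        # Assumption: an empty or whitespace-only entry is a collection bug
        if not item.strip():
            raise ValueError(f"documents[{index}] is empty or whitespace")
    return documents
```

When paths come from `pathlib`, converting each one with `str(path)` before validation satisfies the check.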

Accepted Input Types

Input Type	Example	How Textractor Handles It
Local file path	`"/data/docs/report.pdf"`	Detects that the file exists on disk and parses it via the FileToHTML backend
HTTP/HTTPS URL	`"https://example.com/page.html"`	Downloads the content, then parses it via FileToHTML or reads it directly
File URL	`"file:///data/docs/report.pdf"`	Strips the `file://` prefix and treats the remainder as a local path
Raw text/HTML	`"Some HTML content "`	Detects that the string is not a valid path or URL and treats it as raw HTML input
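
When debugging a collection step, it can help to label each entry by the category it is expected to fall into before handing the list to Textractor. The sketch below is a rough approximation of the table for logging purposes only. It is not Textractor's actual detection logic, and `classify_source` and its labels are hypothetical names.

```python
import os


def classify_source(source):
    """Label a source string as 'url', 'file_url', 'file', or 'text'.

    Hypothetical pre-check for logging; it only approximates the
    categories in the table and is not txtai's own detection code.
    """
    lowered = source.lower()
    if lowered.startswith(("http://", "https://")):
        return "url"
    if lowered.startswith("file://"):
        return "file_url"
    # Anything that exists on disk as a regular file counts as a local path
    if os.path.isfile(source):
        return "file"
    # Everything else would be handed over as raw text/HTML
    return "text"
```

A mistyped file path lands in the "text" category here, which is a useful signal: such an entry would most likely be treated as raw text rather than raise an error.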

I/O Contract

Inputs

Name	Type	Required	Description
source_directories	`list[str]`	No	Directory paths to scan for files
file_extensions	`list[str]`	No	File extensions to include (e.g., `[".pdf", ".docx"]`)
urls	`list[str]`	No	Explicit URLs to include
raw_texts	`list[str]`	No	In-memory text strings to include

Outputs

Name	Type	Description
documents	`list[str]`	Flat list of file paths, URLs, or raw text strings ready for Textractor
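
One possible way to wire the inputs and output above into a single function is sketched below. The function name `collect_documents` and its behavior are illustrative assumptions, in particular sorting filenames within each directory and including every file when no extension filter is given. Nothing here is a txtai API.

```python
import os


def collect_documents(source_directories=None, file_extensions=None,
                      urls=None, raw_texts=None):
    """Assemble a flat list[str] from the optional inputs in the table above.

    Hypothetical helper (not a txtai API).
    """
    documents = []
    extensions = (
        {ext.lower() for ext in file_extensions} if file_extensions else None
    )

    for directory in source_directories or []:
        for root, _dirs, files in os.walk(directory):
            for filename in sorted(files):
                ext = os.path.splitext(filename)[1].lower()
                # With no extension filter, every file is included
                if extensions is None or ext in extensions:
                    documents.append(os.path.join(root, filename))

    # URLs and raw text are passed through unchanged
    documents.extend(urls or [])
    documents.extend(raw_texts or [])
    return documents
```

A call such as `collect_documents(["/data/docs"], [".pdf"], urls=["https://example.com/faq.html"])` returns the matching local PDF paths followed by the URL.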

Usage Examples

Basic Example: Scanning a Directory

import glob

# Collect all PDF files from a directory
documents = glob.glob("/data/reports/**/*.pdf", recursive=True)

# Example result (glob does not guarantee ordering):
# ["/data/reports/q1.pdf", "/data/reports/q2.pdf", ...]

Collecting URLs

# Collect documents from a list of URLs
documents = [
    "https://example.com/article1.html",
    "https://example.com/article2.html",
    "https://example.com/whitepaper.pdf",
]
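
When the URLs live in a URL list file rather than in code, a short reader is enough. The sketch below assumes one URL per line and treats lines starting with `#` as comments. Both the file layout and the helper name `read_url_list` are assumptions made for illustration.

```python
def read_url_list(path):
    """Read one URL per line, skipping blank lines and '#' comments.

    Hypothetical helper; the file format is an assumed convention.
    """
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip()
            if entry and not entry.startswith("#"):
                urls.append(entry)
    return urls
```

The returned list already satisfies the interface contract, so it can be used directly as `documents` or concatenated with other sources.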

Mixed Sources

import glob

# Combine local files, URLs, and raw text
local_files = glob.glob("/data/docs/*.pdf")
urls = ["https://example.com/faq.html"]
raw_text = ["This is an in-memory note about project requirements."]

documents = local_files + urls + raw_text

Filtered Collection with os.walk

import os

# Collect files with filtering by extension and size
allowed_extensions = {".pdf", ".docx", ".html", ".txt"}
max_size_bytes = 50 * 1024 * 1024  # 50 MB

documents = []
for root, _dirs, files in os.walk("/data/corpus"):
    for filename in files:
        filepath = os.path.join(root, filename)
        _, ext = os.path.splitext(filename)
        # Keep only allowed file types that fit within the size limit
        if ext.lower() in allowed_extensions and os.path.getsize(filepath) <= max_size_bytes:
            documents.append(filepath)

Full Pipeline Integration

import glob
from txtai.pipeline import Textractor

# Step 1: Document collection (user-defined)
documents = glob.glob("/data/knowledge_base/**/*.pdf", recursive=True)

# Step 2: Pass collected documents to Textractor
textractor = Textractor(paragraphs=True, minlength=100)
chunks = []
for doc in documents:
    result = textractor(doc)
    if isinstance(result, list):
        chunks.extend(result)
    else:
        chunks.append(result)

Related Pages

Implements Principle

Principle:Neuml_Txtai_Document_Collection
