Implementation:Neuml Txtai Document Collection Pattern


Knowledge Sources
Domains: NLP, RAG
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete pattern for gathering source documents (file paths, URLs, or raw text strings) as the first step of a txtai RAG pipeline.

Description

Document collection in a txtai RAG workflow is user-defined code, not a library API. The user is responsible for assembling a list[str] of inputs that will be passed to the Textractor pipeline for extraction and chunking. Each string in the list can be a local file path, an HTTP/HTTPS URL, or a raw text string.

The Textractor pipeline (the next stage) accepts each of these input types transparently: it detects whether a string is a valid local file, a URL, or raw text, and handles each case accordingly. This means the document collection step simply needs to produce a flat list of strings -- no special formatting, wrapping, or object construction is required.

Because this is a pattern rather than a specific API, users have full freedom in how they construct the list. Common approaches include directory scanning with os.walk or glob, reading paths from a configuration file, querying a database for URLs, or combining multiple sources programmatically.
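
As an illustration of the configuration-file approach mentioned above, the following is a minimal sketch rather than a txtai API. The function name `load_sources` and the JSON layout (a top-level `sources` key holding a list) are assumptions made for this example.

```python
import json
from pathlib import Path


def load_sources(config_path: str) -> list[str]:
    """Read a flat list of source strings from a JSON config file.

    Assumes a hypothetical layout: {"sources": ["path-or-url", ...]}.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    sources = config.get("sources", [])
    # Keep only non-empty strings so the result satisfies list[str]
    return [s.strip() for s in sources if isinstance(s, str) and s.strip()]
```

The returned list can be concatenated with lists built by other means, such as a directory scan, before it is handed to the next stage.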

Usage

Use this pattern at the start of any txtai RAG pipeline to assemble the input document set. Apply it when you need to:

  • Scan a directory tree for files of specific types (PDF, DOCX, HTML, TXT).
  • Collect URLs from a web scraping step or a URL list file.
  • Combine in-memory text with file-based sources into a single input list.
  • Filter or deduplicate source references before passing them to extraction.
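
The last item, deduplication, could be sketched as follows. This is one possible approach under stated assumptions: `deduplicate` is a hypothetical helper, local paths are compared by their resolved real path, and URLs and raw text are compared as exact strings.

```python
import os
from urllib.parse import urlparse


def deduplicate(sources: list[str]) -> list[str]:
    """Drop repeated sources while keeping first-seen order."""
    seen = set()
    unique = []
    for source in sources:
        is_url = urlparse(source).scheme in ("http", "https")
        # Existing local paths are normalized so different spellings of the
        # same file compare equal; URLs and raw text are compared as-is.
        if is_url or not os.path.exists(source):
            key = source
        else:
            key = os.path.realpath(source)
        if key not in seen:
            seen.add(key)
            unique.append(source)
    return unique
```

Keeping first-seen order means the output stays predictable when several sources are concatenated before deduplication.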

Code Reference

Source Location

  • Repository: txtai
  • File: User-defined code (no specific library file)
  • Lines: N/A -- this is a usage pattern, not a library API

Interface Contract

The output of the document collection step must conform to the following interface:

documents: list[str]
# Each element is one of:
#   - A local file path: "/data/reports/q3_report.pdf"
#   - An HTTP/HTTPS URL: "https://example.com/article.html"
#   - A raw text string: "This is some inline text content."

Accepted Input Types

Input Type | Example | How Textractor Handles It
Local file path | "/data/docs/report.pdf" | Detects file exists on disk, parses via FileToHTML backend
HTTP/HTTPS URL | "https://example.com/page.html" | Downloads content, parses via FileToHTML or reads directly
File URL | "file:///data/docs/report.pdf" | Strips file:// prefix, treats as local path
Raw text/HTML | "Some HTML content" | Detects not a valid path/URL, treats as raw HTML input
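
The detection rules in the table can be approximated in user code, for example to log or count what a collected list contains before extraction. The sketch below is an approximation of the behaviour described above, not txtai's own implementation; `classify_source` is a hypothetical name.

```python
import os
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "file", "url", or "text" for a single input string."""
    # file:// URLs are treated as local paths once the prefix is removed
    path = source[len("file://"):] if source.startswith("file://") else source
    if os.path.exists(path):
        return "file"
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    # Anything else is assumed to be raw text or HTML
    return "text"
```

Note that under these rules a path that does not exist on disk is classified as raw text, so a mistyped file path would be processed as content rather than rejected.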

I/O Contract

Inputs

Name | Type | Required | Description
source_directories | list[str] | No | Directory paths to scan for files
file_extensions | list[str] | No | File extensions to include (e.g., [".pdf", ".docx"])
urls | list[str] | No | Explicit URLs to include
raw_texts | list[str] | No | In-memory text strings to include

Outputs

Name | Type | Description
documents | list[str] | Flat list of file paths, URLs, or raw text strings ready for Textractor
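
The contract above could be wrapped in a single helper. The following is a minimal sketch, assuming a hypothetical function named `collect_documents` whose parameters mirror the input names listed; since collection is user-defined, this is one possible shape rather than a prescribed one.

```python
import os
from typing import Optional


def collect_documents(
    source_directories: Optional[list[str]] = None,
    file_extensions: Optional[list[str]] = None,
    urls: Optional[list[str]] = None,
    raw_texts: Optional[list[str]] = None,
) -> list[str]:
    """Build a flat list of file paths, URLs, and raw text strings."""
    documents: list[str] = []
    extensions = {ext.lower() for ext in (file_extensions or [])}

    for directory in source_directories or []:
        for root, _dirs, files in os.walk(directory):
            for filename in sorted(files):
                _, ext = os.path.splitext(filename)
                # An empty extension filter means "accept every file"
                if not extensions or ext.lower() in extensions:
                    documents.append(os.path.join(root, filename))

    # URLs and in-memory text are appended unchanged
    documents.extend(urls or [])
    documents.extend(raw_texts or [])
    return documents
```

All four inputs are optional, matching the table, so calling the helper with no arguments returns an empty list.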

Usage Examples

Basic Example: Scanning a Directory

import glob

# Collect all PDF files under a directory tree.
# glob returns matches in arbitrary order, so sort for a reproducible list.
documents = sorted(glob.glob("/data/reports/**/*.pdf", recursive=True))

# Example result: ["/data/reports/q1.pdf", "/data/reports/q2.pdf", ...]

Collecting URLs

# Collect documents from a list of URLs
documents = [
    "https://example.com/article1.html",
    "https://example.com/article2.html",
    "https://example.com/whitepaper.pdf",
]
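
When the URLs live in a URL list file instead of being hard-coded, they can be read line by line. This is a minimal sketch under assumptions: `read_url_list` is a hypothetical helper, the file holds one URL per line, and lines starting with `#` are treated as comments.

```python
from urllib.parse import urlparse


def read_url_list(path: str) -> list[str]:
    """Return the HTTP/HTTPS URLs found in a one-URL-per-line text file."""
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            candidate = line.strip()
            # Skip blank lines and "#" comments (a convention assumed here)
            if not candidate or candidate.startswith("#"):
                continue
            parsed = urlparse(candidate)
            # Keep only well-formed http(s) URLs with a host part
            if parsed.scheme in ("http", "https") and parsed.netloc:
                urls.append(candidate)
    return urls
```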

Mixed Sources

import glob

# Combine local files, URLs, and raw text
local_files = glob.glob("/data/docs/*.pdf")
urls = ["https://example.com/faq.html"]
raw_text = ["This is an in-memory note about project requirements."]

documents = local_files + urls + raw_text

Filtered Collection with os.walk

import os

# Collect files with filtering by extension and size
allowed_extensions = {".pdf", ".docx", ".html", ".txt"}
max_size_bytes = 50 * 1024 * 1024  # 50 MB

documents = []
for root, dirs, files in os.walk("/data/corpus"):
    for filename in files:
        filepath = os.path.join(root, filename)
        _, ext = os.path.splitext(filename)
        if ext.lower() in allowed_extensions:
            if os.path.getsize(filepath) <= max_size_bytes:
                documents.append(filepath)

Full Pipeline Integration

import glob
from txtai.pipeline import Textractor

# Step 1: Document collection (user-defined)
documents = glob.glob("/data/knowledge_base/**/*.pdf", recursive=True)

# Step 2: Pass collected documents to Textractor
textractor = Textractor(paragraphs=True, minlength=100)
chunks = []
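for doc in documents:
    result = textractor(doc)
    # With paragraphs=True the result is a list of chunks; handle a single
    # string as well so the loop still works if chunking options change.
    if isinstance(result, list):
        chunks.extend(result)
    else:
        chunks.append(result)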
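
In the loop above, a single unreadable file or unreachable URL would stop the whole run. A more tolerant variant is sketched below. `extract_all` is a hypothetical helper that accepts any callable taking one source string, for instance a Textractor instance, so it makes no assumptions about the extractor beyond that.

```python
from typing import Callable, Union


def extract_all(
    documents: list[str],
    extractor: Callable[[str], Union[str, list[str]]],
) -> tuple[list[str], list[tuple[str, str]]]:
    """Run an extractor over every source, collecting chunks and failures."""
    chunks: list[str] = []
    failures: list[tuple[str, str]] = []
    for doc in documents:
        try:
            result = extractor(doc)
        except Exception as error:  # broad on purpose: one bad source should not stop the batch
            failures.append((doc, str(error)))
            continue
        # Normalize to a flat list whether the extractor returned one string or many
        chunks.extend(result if isinstance(result, list) else [result])
    return chunks, failures
```

A call such as `chunks, failures = extract_all(documents, textractor)` would then return the extracted chunks together with a list of sources that could not be processed.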
