Workflow:Neuml Txtai RAG Pipeline

Knowledge Sources	txtai, txtai RAG Docs, txtai Textractor Docs, txtai LLM Docs
Domains	RAG, LLMs, Semantic_Search
Last Updated	2026-02-09 18:00 GMT

Overview

End-to-end process for building a Retrieval Augmented Generation (RAG) system that extracts text from documents, indexes that text into a semantic search database, and answers questions using an LLM grounded in the retrieved context.

Description

This workflow implements the standard RAG pattern: ingest documents, chunk and index their content, then combine an embeddings-based retrieval step with a large language model to generate grounded answers. txtai's RAG pipeline class unifies the retrieval and generation steps into a single callable. The pipeline supports multiple LLM backends (Hugging Face Transformers, llama.cpp, and LiteLLM for API models such as OpenAI or Claude), customizable prompt templates, configurable context windows, and output formatting options including citations. Text extraction is handled by the Textractor pipeline, which supports PDFs, Office documents, HTML, and URLs via multiple backends (docling, BeautifulSoup, Tika).

Usage

Execute this workflow when you have a collection of documents (PDFs, web pages, text files) and need to build a "chat with your data" application. This is the appropriate choice when users need natural language answers derived from specific source material, with reduced hallucination risk compared to using an LLM alone.

Execution Steps

Step 1: Collect Source Documents

Gather the source files that will form the knowledge base. These may be local files (PDFs, DOCX, TXT), URLs, or programmatically fetched data. Organize them into an accessible directory or list.

Key considerations:

Supported formats include PDF, Office documents, HTML, plain text, and URLs
Files can be collected from local directories, cloud storage, or web sources
The Textractor pipeline handles format conversion automatically
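
The collection step can be sketched with the standard library alone. This is a minimal illustration, assuming a local directory plus a hand-maintained list of URLs; the function name `collect_sources` and the extension set are hypothetical and not part of txtai.

```python
from pathlib import Path

# Illustrative extension set (an assumption); adjust it to the formats you need.
SUPPORTED = {".pdf", ".docx", ".html", ".txt"}


def collect_sources(directory, urls=()):
    """Return sorted local file paths with a supported extension, followed by any URLs."""
    root = Path(directory)
    files = sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED
    )
    return files + list(urls)
```

Returning plain strings keeps local paths and URLs interchangeable for the extraction step that follows.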

Step 2: Extract and Chunk Text

Use the Textractor pipeline to convert source documents into text and segment them into appropriately sized chunks. Section-based chunking splits at document structure boundaries (headings, paragraphs). Alternative strategies include sentence-level, semantic, and iterative chunking.

Key considerations:

Section-based chunking (sections=True) preserves document structure
The backend parameter selects the extraction engine (docling, beautifulsoup, tika)
Chunk size affects retrieval quality: too large dilutes relevance, too small loses context
Each chunk should be paired with a source identifier for traceability
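
One way to pair each chunk with its source is sketched below. `build_textractor` uses txtai's `Textractor` with `sections=True` as described above; `chunk_sources` is a hypothetical helper that accepts any callable returning a list of text chunks, so it can be exercised without txtai installed.

```python
def build_textractor():
    """Create a section-based Textractor (requires txtai to be installed)."""
    # Imported lazily so chunk_sources below stays usable without txtai.
    from txtai.pipeline import Textractor

    return Textractor(sections=True)


def chunk_sources(sources, extract):
    """Yield (source, position, text) for every non-empty chunk that extract returns."""
    for source in sources:
        for position, text in enumerate(extract(source)):
            text = text.strip()
            if text:
                yield source, position, text
```

A typical call would be `chunks = list(chunk_sources(files, build_textractor()))`, where `files` is the list gathered in Step 1.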

Step 3: Build the Embeddings Index

Create an Embeddings instance with content storage enabled and index the extracted chunks. The vector model converts text chunks into embeddings for similarity search. Save the index for reuse.

Key considerations:

Enable content=True to store full text alongside vectors for RAG retrieval
Select a vector model appropriate for the domain and language
Set maxlength to match the embedding model's maximum token capacity
Tuples of (source_id, chunk_text) preserve document provenance
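
The indexing step might look like the following sketch. The `source#position` id scheme, the helper names, and the example model are assumptions chosen for illustration; the id scheme exists only so that several chunks from the same document can be told apart.

```python
def to_rows(chunks):
    """Convert (source, position, text) triples into (id, text) rows with distinct ids."""
    return [(f"{source}#{position}", text) for source, position, text in chunks]


def build_index(rows, path="rag-index"):
    """Index rows with content storage enabled and save the index (requires txtai)."""
    from txtai import Embeddings

    embeddings = Embeddings(
        path="sentence-transformers/all-MiniLM-L6-v2",  # example model; pick one suited to your domain
        content=True,  # store chunk text alongside the vectors
    )
    embeddings.index(rows)
    embeddings.save(path)
    return embeddings
```

Keeping the source name inside the id means a retrieved chunk can be traced back to its document without a separate lookup table.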

Step 4: Configure the RAG Pipeline

Instantiate the RAG class with the embeddings index, an LLM model path, a prompt template, and generation parameters. The template must contain {question} and {context} placeholders. Configure the system prompt, output format, and context retrieval parameters (number of results, minimum score).

Key considerations:

The LLM can be a local Hugging Face model, a llama.cpp GGUF model, or a remote API via LiteLLM
The template structures how retrieved context is presented to the LLM
Output format options: "default" returns (name, answer) tuples, "flatten" returns answer strings, "reference" includes source references
The context (number of top matches to include), minscore (minimum score), and mintokens (minimum tokens) parameters control retrieval quality
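
A configuration along these lines is sketched below. The template wording, the system prompt, and the helper names are illustrative assumptions; `preview_prompt` only approximates the prompt locally with `str.format` so the template can be inspected before any model is loaded.

```python
# Prompt template with the two placeholders that get filled in for each query.
TEMPLATE = """Answer the following question using only the context below.
If the context does not contain the answer, say so.

Question: {question}
Context: {context}"""


def build_rag(embeddings, llm_path):
    """Wire an embeddings index and an LLM into a RAG pipeline (requires txtai)."""
    from txtai import RAG

    return RAG(
        embeddings,
        llm_path,
        system="You are a helpful assistant that answers from the provided context.",
        template=TEMPLATE,
        context=3,           # number of retrieved chunks placed in the prompt
        output="reference",  # also return the id of the supporting chunk
    )


def preview_prompt(question, chunks, separator=" "):
    """Render the template locally to inspect roughly what the LLM would receive."""
    return TEMPLATE.format(question=question, context=separator.join(chunks))
```

Previewing the rendered prompt is a cheap way to check that the retrieved chunks fit the model's context window before running generation.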

Step 5: Execute RAG Queries

Call the RAG pipeline with user questions. The pipeline automatically retrieves relevant context from the embeddings index, constructs a prompt using the template, and generates an answer using the LLM. Results can be streamed for real-time output.

Key considerations:

The maxlength parameter controls the maximum generation length
Streaming mode (stream=True) provides token-by-token output for responsive UIs
The stripthink parameter removes reasoning tokens from thinking-enabled models
Multiple questions can be processed in batch
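
Query execution can be wrapped as in the sketch below. Both helpers are hypothetical and accept any RAG-style callable; the streaming variant assumes that calling the pipeline with `stream=True` yields string tokens, which should be verified against the output format you configured.

```python
def _print_token(token):
    """Write a token to stdout without a trailing newline."""
    print(token, end="", flush=True)


def ask(rag, question, maxlength=512):
    """Run one question through a RAG-style callable and return its result."""
    return rag(question, maxlength=maxlength)


def ask_streaming(rag, question, maxlength=512, emit=_print_token):
    """Pass tokens to emit as they arrive and return the assembled answer."""
    tokens = []
    for token in rag(question, maxlength=maxlength, stream=True):
        emit(token)
        tokens.append(token)
    return "".join(tokens)
```

Passing a custom `emit` callback lets the same loop feed a terminal, a web socket, or a UI widget.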

GitHub URL

Workflow Repository