Workflow:Neuml Txtai RAG Pipeline
| Knowledge Sources | |
|---|---|
| Domains | RAG, LLMs, Semantic_Search |
| Last Updated | 2026-02-09 18:00 GMT |
Overview
End-to-end process for building a Retrieval Augmented Generation system that extracts text from documents, indexes them into a semantic search database, and answers questions using an LLM grounded in retrieved context.
Description
This workflow implements the standard RAG pattern: ingest documents, chunk and index their content, then combine an embeddings-based retrieval step with a large language model to generate grounded answers. txtai's RAG pipeline class unifies the retrieval and generation steps into a single callable. The pipeline supports multiple LLM backends (Hugging Face Transformers, llama.cpp, and LiteLLM for hosted API models such as OpenAI and Claude), customizable prompt templates, configurable context windows, and output formatting options including citations. Text extraction is handled by the Textractor pipeline, which supports PDFs, Office documents, HTML, and URLs via multiple backends (docling, BeautifulSoup, Tika).
Usage
Execute this workflow when you have a collection of documents (PDFs, web pages, text files) and need to build a "chat with your data" application. This is the appropriate choice when users need natural language answers derived from specific source material, with reduced hallucination risk compared to using an LLM alone.
Execution Steps
Step 1: Collect Source Documents
Gather the source files that will form the knowledge base. These may be local files (PDFs, DOCX, TXT), URLs, or programmatically fetched data. Organize them into an accessible directory or list.
Key considerations:
- Supported formats include PDF, Office documents, HTML, plain text, and URLs
- Files can be collected from local directories, cloud storage, or web sources
- The Textractor pipeline handles format conversion automatically
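A minimal sketch of the collection step, using only the Python standard library. The extension allow-list and the `collect_sources` helper are hypothetical and not part of txtai; they just produce a flat list of paths and URLs that a later extraction step could consume.

```python
import tempfile
from pathlib import Path

# Hypothetical allow-list; adjust it to the formats your extraction backend handles
SUPPORTED = {".pdf", ".docx", ".txt", ".html"}

def collect_sources(root, urls=()):
    """Return supported local file paths (sorted) followed by the given URLs."""
    files = sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED
    )
    return files + list(urls)

# Demo on a throwaway directory with two supported files and one unsupported file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.pdf").write_bytes(b"%PDF-1.4")
    (Path(tmp) / "notes.txt").write_text("plain text notes")
    (Path(tmp) / "image.png").write_bytes(b"")
    sources = collect_sources(tmp, urls=["https://example.com/page.html"])
    names = [Path(source).name for source in sources]
```

Keeping local paths and URLs in one list means the rest of the workflow can treat every entry as an opaque source identifier.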
Step 2: Extract and Chunk Text
Use the Textractor pipeline to convert source documents into text and segment them into appropriately sized chunks. Section-based chunking splits at document structure boundaries (headings, paragraphs). Alternative strategies include sentence-level, semantic, and iterative chunking.
Key considerations:
- Section-based chunking (sections=True) preserves document structure
- The backend parameter selects the extraction engine (docling, beautifulsoup, tika)
- Chunk size affects retrieval quality: too large dilutes relevance, too small loses context
- Each chunk should be paired with a source identifier for traceability
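The sketch below pairs each chunk with a source identifier. `extract_chunks` assumes txtai is installed with its pipeline extras and that calling a Textractor built with sections=True on a single source returns a list of section strings; any extra keyword arguments (for example a backend choice) are passed through unchanged. `paragraph_chunks` is a hypothetical standard-library stand-in, not txtai's algorithm, included only to show how a size limit changes chunk boundaries.

```python
def extract_chunks(sources, **options):
    """Yield (source_id, chunk_text) pairs via txtai's Textractor (assumes txtai is installed)."""
    from txtai.pipeline import Textractor  # lazy import: the helper below runs without txtai

    textractor = Textractor(sections=True, **options)
    for source in sources:
        for index, chunk in enumerate(textractor(source)):
            # "#index" suffix is an illustrative id scheme, not a txtai convention
            yield f"{source}#{index}", chunk

def paragraph_chunks(source_id, text, maxchars=500):
    """Split on blank lines, merging paragraphs until maxchars would be exceeded."""
    chunks, current = [], ""
    for paragraph in (part.strip() for part in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > maxchars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [(f"{source_id}#{index}", chunk) for index, chunk in enumerate(chunks)]

sample = "Intro paragraph.\n\nSecond paragraph with details.\n\nThird."
pairs = paragraph_chunks("guide.txt", sample, maxchars=40)
```

With maxchars=40 the first paragraph stays on its own, while the second and third merge because together they still fit under the limit.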
Step 3: Build the Embeddings Index
Create an Embeddings instance with content storage enabled and index the extracted chunks. The vector model converts text chunks into embeddings for similarity search. Save the index for reuse.
Key considerations:
- Enable content=True to store full text alongside vectors for RAG retrieval
- Select a vector model appropriate for the domain and language
- Set maxlength to match the embedding model's maximum token capacity
- Tuples of (source_id, chunk_text) preserve document provenance
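A sketch of the indexing step, assuming txtai is installed. The model name, the maxlength of 256, and the "rag-index" output path are illustrative choices rather than recommendations from the source, and `prepare_rows` is a hypothetical helper that removes blank chunks and rejects duplicate ids so provenance stays unambiguous.

```python
def prepare_rows(pairs):
    """Drop blank chunks and reject duplicate ids before indexing."""
    rows, seen = [], set()
    for uid, text in pairs:
        text = text.strip()
        if not text:
            continue
        if uid in seen:
            raise ValueError(f"duplicate id: {uid}")
        seen.add(uid)
        rows.append((uid, text))
    return rows

def build_index(pairs, model="sentence-transformers/all-MiniLM-L6-v2",
                maxlength=256, target="rag-index"):
    """Index (source_id, chunk_text) pairs with content storage and save the result."""
    from txtai import Embeddings  # lazy import: prepare_rows runs without txtai

    # content=True stores the chunk text next to its vector so RAG can retrieve it
    embeddings = Embeddings(content=True, path=model, maxlength=maxlength)
    embeddings.index(prepare_rows(pairs))
    embeddings.save(target)
    return embeddings

rows = prepare_rows([
    ("a.pdf#0", " First chunk. "),
    ("a.pdf#1", "   "),
    ("b.txt#0", "Second chunk."),
])
```

A saved index can later be restored with the load method on a new Embeddings instance instead of re-indexing the documents.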
Step 4: Configure the RAG Pipeline
Instantiate the RAG class with the embeddings index, an LLM model path, a prompt template, and generation parameters. The template must contain {question} and {context} placeholders. Configure the system prompt, output format, and context retrieval parameters (number of results, minimum score).
Key considerations:
- The LLM can be a local Hugging Face model, a llama.cpp GGUF model, or a remote API via LiteLLM
- The template structures how retrieved context is presented to the LLM
- Output format options: "default" returns (name, answer) tuples, "flatten" returns answer strings, "reference" includes source references
- The context (number of top matches passed to the LLM), minscore (minimum score), and mintokens (minimum tokens) parameters control retrieval quality
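The configuration can be sketched as follows, assuming the placeholders are written as `{question}` and `{context}` in Python format-string style. The system prompt, the template wording, and the default values for `topn` and `minscore` are illustrative; `placeholders` and `check_template` are hypothetical helpers that catch a malformed template before any model is loaded, and the LLM path is left as an argument for the caller to supply.

```python
import string

SYSTEM = "You are a helpful assistant. Answer only from the provided context."

TEMPLATE = """Answer the following question using only the context below.

Question:
{question}

Context:
{context}
"""

def placeholders(template):
    """Return the set of named str.format fields found in the template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}

def check_template(template):
    """Raise ValueError unless both required placeholders are present."""
    missing = {"question", "context"} - placeholders(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")
    return template

def build_rag(embeddings, llm, topn=3, minscore=0.3):
    """Create a txtai RAG pipeline (assumes txtai is installed; llm is a model path)."""
    from txtai import RAG  # lazy import: the template helpers run without txtai

    return RAG(embeddings, llm, system=SYSTEM, template=check_template(TEMPLATE),
               output="reference", context=topn, minscore=minscore)

# Preview the prompt shape without loading a model
preview = TEMPLATE.format(question="What is txtai?",
                          context="txtai is an embeddings database.")
```

Rendering the template with sample values is a cheap way to inspect the prompt the LLM would receive.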
Step 5: Execute RAG Queries
Call the RAG pipeline with user questions. The pipeline automatically retrieves relevant context from the embeddings index, constructs a prompt using the template, and generates an answer using the LLM. Results can be streamed for real-time output.
Key considerations:
- The maxlength parameter controls the maximum generation length
- Streaming mode (stream=True) provides token-by-token output for responsive UIs
- The stripthink parameter removes reasoning tokens from thinking-enabled models
- Multiple questions can be processed in batch
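The three call styles described above (single, streamed, batch) might be wrapped as shown here. The helpers accept any callable with the same call shape as the RAG pipeline; `fake_rag` is a hypothetical stand-in used so the sketch runs without a model, and the maxlength default of 2048 is an arbitrary illustrative value.

```python
def answer(rag, question, maxlength=2048):
    """Ask one question; stripthink requests removal of reasoning tokens."""
    return rag(question, maxlength=maxlength, stripthink=True)

def answer_stream(rag, question, maxlength=2048, sink=print):
    """Forward each streamed token to sink and return the joined answer."""
    parts = []
    for token in rag(question, maxlength=maxlength, stream=True):
        sink(token)
        parts.append(token)
    return "".join(parts)

def answer_batch(rag, questions, maxlength=2048):
    """Ask several questions in one call by passing a list."""
    return rag(list(questions), maxlength=maxlength)

def fake_rag(queue, maxlength=512, stream=False, stripthink=False):
    """Hypothetical stand-in that mimics only the call shape used above."""
    if isinstance(queue, list):
        return [f"answer to: {question}" for question in queue]
    text = f"answer to: {queue}"
    # Streaming is simulated by splitting the answer into two pieces
    return iter([text[:6], text[6:]]) if stream else text

single = answer(fake_rag, "What is txtai?")
streamed = []
full = answer_stream(fake_rag, "What is txtai?", sink=streamed.append)
batch = answer_batch(fake_rag, ["First?", "Second?"])
```

In a real application, `sink` would typically write to a UI or response stream, and `fake_rag` would be replaced by the configured RAG pipeline.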