Workflow:Run llama Llama index RAG Query Pipeline

Knowledge Sources	LlamaIndex LlamaIndex Docs
Domains	LLMs, RAG, Information_Retrieval
Last Updated	2026-02-11 19:00 GMT

Overview

End-to-end process for building a Retrieval-Augmented Generation (RAG) pipeline that loads documents, creates a vector index, and answers natural language queries with source-grounded responses.

Description

This workflow implements the core LlamaIndex use case: augmenting Large Language Models with private data. It reads documents from a directory, splits them into manageable chunks (nodes), generates vector embeddings for each chunk, stores them in a vector index, and exposes a query engine that retrieves relevant context and synthesizes an answer using an LLM. The process covers data loading, text splitting, embedding generation, vector storage, retrieval, and response synthesis.

Usage

Execute this workflow when you have a collection of documents (text files, PDFs, or other supported formats) and need to build a question-answering system that retrieves relevant passages and generates grounded answers using an LLM. This is the standard starting point for any LlamaIndex application.

Execution Steps

Step 1: Configure Settings

Set up the global configuration for the LLM, embedding model, and tokenizer. The Settings singleton manages these components centrally so all downstream operations use consistent models. With the OpenAI defaults, only an API key is required (typically supplied through the OPENAI_API_KEY environment variable); for other providers, explicitly set the LLM, embedding model, and tokenizer.

Key considerations:

Choose an embedding model appropriate for your domain
Set the tokenizer to match the LLM being used
Settings is a module-level singleton: any LlamaIndex component that is not given an explicit LLM or embedding model falls back to the values set there
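
A minimal sketch of this step, assuming the llama-index-core package is installed. The helper name configure_settings is hypothetical; Settings.llm, Settings.embed_model, and Settings.tokenizer are the attributes it assigns. The import sits inside the function only so the sketch can be loaded before the package is installed.

```python
from typing import Any, Callable, List, Optional


def configure_settings(
    llm: Any,
    embed_model: Any,
    tokenizer: Optional[Callable[[str], List[Any]]] = None,
) -> None:
    """Set the global LLM, embedding model, and (optionally) tokenizer."""
    # Imported lazily so this sketch only needs llama-index when it is called.
    from llama_index.core import Settings

    Settings.llm = llm
    Settings.embed_model = embed_model
    if tokenizer is not None:
        # The tokenizer should split text the same way the LLM does, so that
        # chunk sizes and prompt budgets are counted consistently.
        Settings.tokenizer = tokenizer
```

Pass in whichever LLM and embedding objects your provider integration supplies. For offline experiments, MockLLM (from llama_index.core.llms) and MockEmbedding (from llama_index.core.embeddings) can stand in for real models.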

Step 2: Load Documents

Read source documents from the filesystem using SimpleDirectoryReader. The reader automatically detects file types and converts them into Document objects with text content and metadata. It supports recursive directory traversal and file filtering by extension.

Key considerations:

Each file becomes one or more Document objects
Metadata (filename, path, creation date) is attached automatically
Custom file extractors can be registered for specialized formats
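
The loading step might look like the sketch below, assuming llama-index-core is installed. load_documents and its default extension list are illustrative choices, not part of the library; recursive and required_exts are the SimpleDirectoryReader arguments for directory traversal and extension filtering.

```python
from typing import Any, List, Sequence


def load_documents(
    input_dir: str,
    extensions: Sequence[str] = (".txt", ".md"),
) -> List[Any]:
    """Read every matching file under input_dir into Document objects."""
    # Imported lazily so this sketch only needs llama-index when it is called.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(
        input_dir=input_dir,
        recursive=True,                  # descend into subdirectories
        required_exts=list(extensions),  # skip files with other extensions
    )
    return reader.load_data()
```

Each returned Document exposes its content as doc.text and its automatically attached metadata as doc.metadata, for example doc.metadata["file_name"]. Specialized formats can be handled by passing a file_extractor mapping from extension to reader.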

Step 3: Build Vector Index

Create a VectorStoreIndex from the loaded documents. This step internally performs text splitting into nodes using the configured node parser, generates embeddings for each node in batches, and stores both the nodes and their embeddings in a vector store. The default vector store is in-memory (SimpleVectorStore).

Key considerations:

The default chunk size is 1024 tokens with 200-token overlap
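Nodes are embedded and inserted into the vector store in batches (default insert_batch_size: 2048); the number of texts sent per embedding request is set separately on the embedding model via embed_batch_size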
For large datasets, consider using a persistent vector store backend
The index can be persisted to disk via StorageContext
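
As a sketch of the indexing step, assuming llama-index-core is installed: the hypothetical helper build_index spells out the defaults listed above instead of relying on them implicitly. SentenceSplitter is used here as the node parser; leaving embed_model as None falls back to the model configured in Settings.

```python
from typing import Any, List, Optional


def build_index(documents: List[Any], embed_model: Optional[Any] = None) -> Any:
    """Split documents into nodes, embed them, and store them in memory."""
    # Imported lazily so this sketch only needs llama-index when it is called.
    from llama_index.core import VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter

    # 1024-token chunks with a 200-token overlap between neighbouring chunks.
    splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)

    return VectorStoreIndex.from_documents(
        documents,
        transformations=[splitter],
        embed_model=embed_model,   # None means "use Settings.embed_model"
        insert_batch_size=2048,    # nodes embedded and inserted per batch
    )
```

Smaller chunks tend to give more precise retrieval at the cost of more embedding calls, so chunk_size and chunk_overlap are worth tuning per corpus.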

Step 4: Create Query Engine

Convert the vector index into a query engine by calling as_query_engine(). This creates a retriever (for finding relevant nodes) and a response synthesizer (for generating answers from context). The response mode controls how context is combined with the LLM prompt.

Key considerations:

Response modes include compact (the default), refine, tree_summarize, and simple_summarize
The similarity_top_k parameter controls how many nodes are retrieved (default: 2)
Streaming can be enabled for real-time token delivery
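
The options above map directly onto keyword arguments of as_query_engine(). The sketch below wraps that call in a hypothetical helper, make_query_engine, whose defaults simply restate the ones named in this step.

```python
from typing import Any


def make_query_engine(
    index: Any,
    top_k: int = 2,
    response_mode: str = "compact",
    streaming: bool = False,
) -> Any:
    """Build a query engine from an existing index."""
    return index.as_query_engine(
        similarity_top_k=top_k,       # how many nodes the retriever returns
        response_mode=response_mode,  # how retrieved context is combined
        streaming=streaming,          # True yields tokens as they arrive
    )
```

For example, make_query_engine(index, top_k=5, response_mode="tree_summarize") retrieves five nodes and summarizes them hierarchically. With streaming=True, the response object returned by a query exposes the tokens through response_gen.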

Step 5: Execute Query

Submit a natural language query to the query engine. The engine embeds the query, performs vector similarity search to retrieve relevant nodes, and feeds those nodes as context to the LLM for answer synthesis. The response includes both the generated answer and references to source nodes.

Key considerations:

The query is embedded using the same model used for documents
Source nodes are accessible via response.source_nodes for attribution
Async queries are supported via aquery()
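
A short sketch of running a query and collecting attribution data. The helper names (summarize_response, ask, ask_async) are hypothetical; query(), aquery(), and response.source_nodes are the parts described in this step.

```python
from typing import Any, Dict, List, Tuple


def summarize_response(response: Any) -> Tuple[str, List[Dict[str, Any]]]:
    """Return the answer text plus one record per retrieved source node."""
    sources = []
    for hit in response.source_nodes:
        sources.append(
            {
                "score": hit.score,                  # similarity score
                "metadata": dict(hit.node.metadata), # e.g. file name, path
                "text": hit.node.get_content(),      # the retrieved chunk
            }
        )
    return str(response), sources


def ask(query_engine: Any, question: str) -> Tuple[str, List[Dict[str, Any]]]:
    """Synchronous query."""
    return summarize_response(query_engine.query(question))


async def ask_async(query_engine: Any, question: str) -> Tuple[str, List[Dict[str, Any]]]:
    """Asynchronous query, for use inside an event loop."""
    return summarize_response(await query_engine.aquery(question))
```

Outside an event loop, the async variant can be driven with asyncio.run(ask_async(query_engine, question)).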

Step 6: Persist and Reload

Save the index to disk for later reuse, avoiding the need to re-embed documents. The StorageContext manages persistence of the document store, index store, and vector store. Reloading reconstructs the full index from persisted data.

Key considerations:

Default persistence directory is ./storage
All three stores (docstore, index store, vector store) are saved
load_index_from_storage() restores the index without re-embedding
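
Persistence and reload can be sketched as two small helpers, assuming llama-index-core is installed. persist_index and load_index are hypothetical names; the calls inside them are the StorageContext and load_index_from_storage() operations described above.

```python
from typing import Any


def persist_index(index: Any, persist_dir: str = "./storage") -> None:
    """Write the docstore, index store, and vector store to persist_dir."""
    index.storage_context.persist(persist_dir=persist_dir)


def load_index(persist_dir: str = "./storage") -> Any:
    """Rebuild the index from persisted data without re-embedding."""
    # Imported lazily so this sketch only needs llama-index when it is called.
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
```

The reloaded index still needs an embedding model to embed new queries, so the same Settings configuration used at build time should be in place before calling load_index.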
