Principle:Run llama Llama index Evaluation Dataset Generation
Overview
Evaluation Dataset Generation is the foundational step in any RAG evaluation pipeline: creating high-quality question-answer pairs that serve as benchmarks for measuring retrieval and generation quality. Rather than manually crafting hundreds of test questions, LlamaIndex provides LLM-based synthetic dataset generation that automatically produces diverse, contextually grounded QA pairs from your source documents.
The core insight is that an LLM can read document chunks and generate natural-language questions that a human might plausibly ask, along with reference answers derived directly from the source material. These generated pairs become ground truth for downstream evaluation, enabling systematic measurement of faithfulness, relevancy, and correctness across an entire RAG pipeline.
RAG Evaluation, Dataset Generation, Synthetic Benchmarks, LLM Testing
Why Synthetic Dataset Generation Matters
Manual evaluation dataset creation is time-consuming, expensive, and difficult to scale. For a production RAG system ingesting thousands of documents, hand-writing evaluation questions for each document chunk is impractical. Synthetic generation addresses this by:
- Scaling coverage — generating questions proportional to corpus size, ensuring broad evaluation across all document sections
- Reducing bias — LLM-generated questions surface query patterns that manual authors might overlook
- Enabling rapid iteration — regenerating evaluation sets when documents change or when testing different chunking strategies
- Grounding in source material — every generated question is tied to a specific document node, providing traceable ground truth
LLM-Based Question Generation from Documents
The generation process follows a structured pipeline:
| Step | Description | Output |
|---|---|---|
| Document Ingestion | Load source documents into the system | List of Document objects |
| Node Parsing | Split documents into chunks using configured transformations | List of TextNode objects |
| Question Synthesis | LLM reads each node and generates questions answerable from that node's content | List of question strings per node |
| Answer Generation | LLM generates reference answers using the source node as context | QA pairs with source attribution |
| Dataset Assembly | Combine all QA pairs into a structured dataset object | QueryResponseDataset or LabelledRagDataset |
The question generation step is controlled by a text_question_template that instructs the LLM how to formulate questions, and a question_gen_query that specifies the style and number of questions. Keywords can be required or excluded to focus generation on specific topics.
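The steps in the table can be sketched end to end in plain Python. This is a minimal illustration of the flow, not LlamaIndex's implementation: QAPair, split_into_chunks, generate_dataset, and fake_llm are hypothetical names, the splitter is a naive fixed-width one, and the prompts are stand-ins for text_question_template and question_gen_query.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    reference_answer: str
    source_chunk: str  # the context that justifies the answer


def split_into_chunks(text: str, chunk_size: int = 200) -> List[str]:
    """Naive fixed-width splitter standing in for a real node parser."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def generate_dataset(
    documents: List[str],
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    num_questions_per_chunk: int = 2,
    chunk_size: int = 200,
) -> List[QAPair]:
    pairs: List[QAPair] = []
    for doc in documents:
        for chunk in split_into_chunks(doc, chunk_size):
            # Question synthesis: ask for questions answerable from this chunk only.
            question_prompt = (
                f"Context:\n{chunk}\n\n"
                f"Write {num_questions_per_chunk} questions answerable "
                "from the context, one per line."
            )
            lines = [q.strip() for q in llm(question_prompt).splitlines()]
            questions = [q for q in lines if q][:num_questions_per_chunk]
            # Answer generation: the same chunk is the only context supplied.
            for question in questions:
                answer = llm(f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:")
                pairs.append(QAPair(question, answer.strip(), chunk))
    return pairs


def fake_llm(prompt: str) -> str:
    """Deterministic stub so the sketch runs without a model."""
    if prompt.rstrip().endswith("Answer:"):
        return "stub answer"
    return "What is described?\nWhy does it matter?"
```

Swapping fake_llm for a real model client is the only change needed to produce actual questions; every pair keeps a reference to the chunk it came from, which is what makes the ground truth traceable.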
Creating Ground Truth QA Pairs
Ground truth in RAG evaluation means pairing each question with a known-correct answer and the source context that justifies it. This three-part structure (question, answer, source context) enables multiple evaluation dimensions:
- Faithfulness — does the RAG system's answer align with the retrieved context, without hallucination?
- Relevancy — is the retrieved context actually relevant to the question?
- Correctness — does the generated answer match the ground truth reference answer?
Without ground truth pairs, correctness can only be judged by manual review, which is subjective and hard to repeat. With them, evaluation becomes automated, reproducible, and quantifiable.
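To make the correctness dimension concrete, the sketch below scores a pipeline's answers against reference answers. Note the substitution: it uses token-overlap F1 as a cheap, deterministic stand-in for an LLM-judged correctness score, so it illustrates the mechanics rather than the evaluator LlamaIndex ships. The names token_f1, score_correctness, and answer_fn are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_correctness(
    examples: List[Tuple[str, str]],   # (question, reference_answer)
    answer_fn: Callable[[str], str],   # hypothetical: the RAG pipeline under test
    threshold: float = 0.5,
) -> Dict[str, float]:
    if not examples:
        raise ValueError("examples must not be empty")
    scores = [token_f1(answer_fn(q), ref) for q, ref in examples]
    passed = sum(1 for s in scores if s >= threshold)
    return {"mean_f1": sum(scores) / len(scores), "pass_rate": passed / len(scores)}
```

Because the score depends only on the dataset and the pipeline output, rerunning it after a chunking or prompt change gives a directly comparable number.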
Synthetic Benchmark Creation Strategies
Per-Chunk Generation
The default approach generates a configurable number of questions per document chunk (controlled by num_questions_per_chunk). This ensures every section of the corpus is represented in the evaluation set.
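A rough sizing calculation helps when choosing num_questions_per_chunk. The sketch assumes one LLM call per chunk to write questions plus one call per question to write its reference answer; that call pattern is an assumption for illustration, and the actual count depends on the generator and its settings. The function name estimate_generation_cost is hypothetical.

```python
def estimate_generation_cost(num_chunks: int, num_questions_per_chunk: int) -> dict:
    """Upper-bound dataset size and LLM call count for per-chunk generation."""
    questions = num_chunks * num_questions_per_chunk
    # Assumption: one question-writing call per chunk, one answer-writing call per question.
    llm_calls = num_chunks + questions
    return {"questions": questions, "llm_calls": llm_calls}
```

Under these assumptions, 500 chunks at 2 questions each gives at most 1,000 QA pairs and about 1,500 LLM calls. The question count is an upper bound because a model may return fewer questions than requested.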
Keyword-Filtered Generation
Using required_keywords and exclude_keywords parameters, generation can be focused on specific topics or domains within the corpus. This is useful for creating targeted evaluation sets for particular use cases.
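The effect of keyword filtering can be illustrated with a small pre-filter over chunks. The matching rule here is an assumption: case-insensitive substring match, keeping a chunk only if it contains every required keyword and none of the excluded ones. The library's own rule may differ, and filter_chunks is a hypothetical helper.

```python
from typing import List, Optional


def filter_chunks(
    chunks: List[str],
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
) -> List[str]:
    """Keep chunks containing all required keywords and no excluded keyword."""
    required = [k.lower() for k in (required_keywords or [])]
    excluded = [k.lower() for k in (exclude_keywords or [])]
    kept = []
    for chunk in chunks:
        text = chunk.lower()
        has_required = all(k in text for k in required)
        has_excluded = any(k in text for k in excluded)
        if has_required and not has_excluded:
            kept.append(chunk)
    return kept
```

With no keywords supplied, every chunk passes through unchanged, so the filter is safe to leave in the pipeline by default.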
Parallel Generation
The newer RagDatasetGenerator supports a workers parameter for parallel generation, which can substantially cut wall-clock time on large corpora. This matters most for production-scale evaluation, where datasets may need to be regenerated frequently.
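The idea behind a workers setting can be shown with a bounded thread pool: LLM calls are I/O-bound, so several chunks can be in flight at once. This sketch makes no claim about how RagDatasetGenerator schedules work internally; generate_in_parallel and generate_for_chunk are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, reference_answer)


def generate_in_parallel(
    chunks: List[str],
    generate_for_chunk: Callable[[str], List[QAPair]],  # hypothetical per-chunk generator
    workers: int = 4,
) -> List[QAPair]:
    """Run per-chunk generation concurrently, with at most `workers` chunks in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with the chunk list.
        per_chunk = list(pool.map(generate_for_chunk, chunks))
    # Flatten the per-chunk lists into one dataset.
    return [pair for pairs in per_chunk for pair in pairs]
```

In practice the useful worker count is usually capped by the model provider's rate limits rather than by local CPU.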
Dataset Versioning
Generated datasets can be serialized and versioned, enabling reproducible evaluation across different pipeline configurations. The LabelledRagDataset format includes metadata linking each QA pair to its source node, supporting detailed failure analysis.
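One way to version a generated dataset is to derive an identifier from its content, so that two evaluation runs can be checked for using the same questions. The sketch below is an assumption-laden illustration with plain JSON, not the LabelledRagDataset serialization format; dataset_version, save_dataset, and the field names in the payload are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def dataset_version(examples: List[Dict[str, Any]]) -> str:
    """Short content hash: identical examples always yield the same version id."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def save_dataset(
    examples: List[Dict[str, Any]],
    path: str,
    pipeline_config: Dict[str, Any],
) -> str:
    """Write examples plus the generating configuration; return the version id."""
    payload = {
        "version": dataset_version(examples),
        "config": pipeline_config,  # e.g. chunk size, questions per chunk
        "examples": examples,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    return payload["version"]
```

Storing the generating configuration next to the examples makes it possible to tell later whether a score change came from the pipeline or from a regenerated dataset.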
Deprecated vs. Preferred APIs
LlamaIndex provides two generation APIs:
| API | Status | Key Difference |
|---|---|---|
| DatasetGenerator | Deprecated | Simpler interface, returns QueryResponseDataset |
| RagDatasetGenerator | Preferred | Supports parallel workers, returns LabelledRagDataset with richer metadata |
New implementations should use RagDatasetGenerator for its parallel workers, the richer LabelledRagDataset output, and continued maintenance.
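A minimal usage sketch of the preferred API follows. It assumes a recent llama-index-core release in which RagDatasetGenerator is importable from llama_index.core.llama_dataset.generator; module paths and defaults have changed between versions, so check the installed version before relying on it. The wrapper build_rag_dataset and its output_path argument are hypothetical conveniences, and llm is any LlamaIndex-compatible LLM object supplied by the caller.

```python
def build_rag_dataset(documents, llm, num_questions_per_chunk=2, workers=4, output_path=None):
    """Generate a LabelledRagDataset from documents and optionally save it as JSON."""
    # Imported lazily so this module still loads where llama-index-core is not installed.
    # Assumption: this import path matches the installed llama-index-core version.
    from llama_index.core.llama_dataset.generator import RagDatasetGenerator

    generator = RagDatasetGenerator.from_documents(
        documents,
        llm=llm,
        num_questions_per_chunk=num_questions_per_chunk,
        show_progress=True,
        workers=workers,
    )
    # Produces questions and reference answers, each linked to its source context.
    rag_dataset = generator.generate_dataset_from_nodes()
    if output_path is not None:
        rag_dataset.save_json(output_path)
    return rag_dataset
```

Calling it requires real Document objects and a configured LLM, so it incurs model usage; starting with a handful of documents is a reasonable way to check question quality before generating at corpus scale.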
Knowledge Sources
- LlamaIndex Evaluation
- LlamaIndex Dataset Generation