Principle:Run llama Llama index Evaluation Dataset Generation

Overview

Evaluation Dataset Generation is the foundational step in any RAG evaluation pipeline: creating high-quality question-answer pairs that serve as benchmarks for measuring retrieval and generation quality. Instead of requiring hundreds of hand-written test questions, LlamaIndex provides LLM-based synthetic dataset generation that automatically produces diverse, contextually grounded QA pairs from your source documents.

The core insight is that an LLM can read document chunks and generate natural-language questions that a human might plausibly ask, along with reference answers derived directly from the source material. These generated pairs become ground truth for downstream evaluation, enabling systematic measurement of faithfulness, relevancy, and correctness across an entire RAG pipeline.


Why Synthetic Dataset Generation Matters

Manual evaluation dataset creation is time-consuming, expensive, and difficult to scale. For a production RAG system ingesting thousands of documents, hand-writing evaluation questions for each document chunk is impractical. Synthetic generation addresses this by:

Scaling coverage: questions are generated in proportion to corpus size, so evaluation spans all document sections
Reducing author bias: LLM-generated questions can surface query patterns that manual authors might overlook
Enabling rapid iteration: evaluation sets can be regenerated when documents change or when testing different chunking strategies
Grounding in source material: every generated question is tied to a specific document node, providing traceable ground truth

LLM-Based Question Generation from Documents

The generation process follows a structured pipeline:

Step	Description	Output
Document Ingestion	Load source documents into the system	List of Document objects
Node Parsing	Split documents into chunks using configured transformations	List of TextNode objects
Question Synthesis	LLM reads each node and generates questions answerable from that node's content	List of question strings per node
Answer Generation	LLM generates reference answers using the source node as context	QA pairs with source attribution
Dataset Assembly	Combine all QA pairs into a structured dataset object	QueryResponseDataset or LabelledRagDataset

The question generation step is controlled by a text_question_template that instructs the LLM how to formulate questions, and a question_gen_query that specifies the style and number of questions. Keywords can be required or excluded to focus generation on specific topics.
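The pipeline stages above can be sketched in plain Python without calling a real model. This is a minimal illustration under stated assumptions, not LlamaIndex's implementation: `QAPair`, `split_into_chunks`, `stub_generate_questions`, `stub_generate_answer`, and `generate_dataset` are hypothetical names, and the two stub functions return canned strings where a real pipeline would prompt an LLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One generated example: question, reference answer, and source chunk."""
    question: str
    reference_answer: str
    source_chunk: str


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Node-parsing stand-in: fixed-size windows of whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def stub_generate_questions(chunk: str, num_questions: int) -> List[str]:
    """Question-synthesis stand-in; a real pipeline would prompt an LLM here."""
    first_word = chunk.split()[0]
    return [
        f"Q{i + 1}: what does the passage starting with '{first_word}' state?"
        for i in range(num_questions)
    ]


def stub_generate_answer(question: str, chunk: str) -> str:
    """Answer-generation stand-in; echoes the chunk as the reference answer."""
    return f"Per the source passage: {chunk}"


def generate_dataset(
    documents: List[str], chunk_size: int = 50, num_questions_per_chunk: int = 2
) -> List[QAPair]:
    """Run ingestion -> chunking -> question synthesis -> answer generation -> assembly."""
    dataset: List[QAPair] = []
    for doc in documents:
        for chunk in split_into_chunks(doc, chunk_size):
            for question in stub_generate_questions(chunk, num_questions_per_chunk):
                answer = stub_generate_answer(question, chunk)
                dataset.append(QAPair(question, answer, chunk))
    return dataset
```

Every `QAPair` keeps the chunk it came from, which mirrors the source attribution described in the table: a failed evaluation can be traced back to the exact passage.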

Creating Ground Truth QA Pairs

Ground truth in RAG evaluation means having a known-correct answer alongside the source context that justifies it. This three-part structure (question, answer, source context) enables multiple evaluation dimensions:

Faithfulness — does the RAG system's answer align with the retrieved context, without hallucination?
Relevancy — is the retrieved context actually relevant to the question?
Correctness — does the generated answer match the ground truth reference answer?

Without ground truth pairs, evaluation is limited to subjective human review. With them, evaluation becomes automated, reproducible, and quantifiable.
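To make "automated, reproducible, and quantifiable" concrete, the sketch below scores a system's answers against reference answers. Token-overlap F1 is used here only because it is deterministic and dependency-free; it is a deliberately simple stand-in for an LLM-based correctness judge, and `token_f1` and `mean_correctness` are hypothetical helper names.

```python
from collections import Counter
from typing import Callable, Dict, List


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_correctness(
    examples: List[Dict[str, str]], answer_fn: Callable[[str], str]
) -> float:
    """Average token F1 of answer_fn over ground-truth examples."""
    if not examples:
        return 0.0
    scores = [
        token_f1(answer_fn(ex["question"]), ex["reference_answer"]) for ex in examples
    ]
    return sum(scores) / len(scores)
```

Because the score depends only on the dataset and the system under test, rerunning it on the same inputs yields the same number, which is what makes regression tracking possible.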

Synthetic Benchmark Creation Strategies

Per-Chunk Generation

The default approach generates a configurable number of questions per document chunk (controlled by num_questions_per_chunk). This ensures every section of the corpus is represented in the evaluation set.

Keyword-Filtered Generation

Using required_keywords and exclude_keywords parameters, generation can be focused on specific topics or domains within the corpus. This is useful for creating targeted evaluation sets for particular use cases.
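The effect of keyword filtering can be illustrated with a small pre-generation filter over chunks. The matching rules here are an assumption for illustration (case-insensitive whole-word match on single-word keywords; keep a chunk if it contains at least one required keyword and no excluded keyword), and the library's own matching may differ. `filter_chunks` is a hypothetical helper.

```python
import re
from typing import List, Optional, Set


def _word_set(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_chunks(
    chunks: List[str],
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
) -> List[str]:
    """Keep chunks that match the keyword constraints (single-word keywords only)."""
    required = {k.lower() for k in (required_keywords or [])}
    excluded = {k.lower() for k in (exclude_keywords or [])}
    kept = []
    for chunk in chunks:
        words = _word_set(chunk)
        # Assumed rule: at least one required keyword must be present.
        if required and not (required & words):
            continue
        # Assumed rule: any excluded keyword disqualifies the chunk.
        if excluded & words:
            continue
        kept.append(chunk)
    return kept
```

Filtering before generation, rather than discarding questions afterwards, avoids spending LLM calls on chunks that would never make it into the targeted evaluation set.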

Parallel Generation

The newer RagDatasetGenerator supports a workers parameter that runs generation for multiple nodes concurrently, which can substantially reduce wall-clock time on large corpora. This matters for production-scale evaluation, where datasets may need to be regenerated frequently.
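The benefit of a worker pool comes from overlapping the time spent waiting on LLM calls. The sketch below uses Python's standard `ThreadPoolExecutor` as a stand-in; it is not how the library implements its workers parameter, and `slow_generate_questions` is a hypothetical stub that sleeps briefly to simulate model latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def slow_generate_questions(chunk: str) -> List[str]:
    """Stub for an LLM call; the sleep simulates network and model latency."""
    time.sleep(0.01)
    first_word = (chunk.split() or [""])[0]
    return [f"What does the passage say about '{first_word}'?"]


def generate_parallel(chunks: List[str], workers: int = 4) -> List[str]:
    """Generate questions for all chunks using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so output stays aligned with chunks.
        per_chunk = list(pool.map(slow_generate_questions, chunks))
    # Flatten the per-chunk lists into a single list of questions.
    return [question for questions in per_chunk for question in questions]
```

Threads suit this workload because the simulated call is I/O-bound; in practice the useful number of workers is usually capped by the LLM provider's rate limits rather than by CPU count.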

Dataset Versioning

Generated datasets can be serialized and versioned, enabling reproducible evaluation across different pipeline configurations. Each example in a LabelledRagDataset keeps the reference contexts it was generated from alongside the query and reference answer, which supports detailed failure analysis.
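One simple way to version a serialized dataset is to derive the file name from a hash of its contents, so that two evaluation runs can confirm they used identical data. This is a sketch of that idea using only the standard library; the file naming scheme and the helpers `dataset_fingerprint`, `save_dataset`, and `load_dataset` are hypothetical and not part of LlamaIndex.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def dataset_fingerprint(examples: List[Dict[str, str]]) -> str:
    """Short content hash; identical examples always give the same value."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def save_dataset(examples: List[Dict[str, str]], directory: str) -> Path:
    """Write the dataset to <directory>/qa_dataset_<fingerprint>.json."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"qa_dataset_{dataset_fingerprint(examples)}.json"
    path.write_text(
        json.dumps(examples, sort_keys=True, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return path


def load_dataset(path: Path) -> List[Dict[str, str]]:
    """Read a dataset previously written by save_dataset."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Recording the fingerprint next to each evaluation result makes it possible to tell whether a score change came from the pipeline or from a regenerated dataset.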

Deprecated vs. Preferred APIs

LlamaIndex provides two generation APIs:

API	Status	Key Difference
DatasetGenerator	Deprecated	Simpler interface, returns QueryResponseDataset
RagDatasetGenerator	Preferred	Supports parallel workers, returns LabelledRagDataset with richer metadata

New implementations should use RagDatasetGenerator for its improved parallelism, better output format, and continued maintenance.
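A minimal sketch of the preferred path might look like the following. It assumes a recent llama-index release in which the class is importable from `llama_index.core.llama_dataset.generator`, and it assumes the caller supplies already-loaded documents and a configured LLM; the wrapper `build_rag_dataset` and the chosen parameter values are illustrative, not a recommended configuration.

```python
from typing import Any, Sequence


def build_rag_dataset(documents: Sequence[Any], llm: Any, output_path: str) -> Any:
    """Generate a LabelledRagDataset from documents and save it as JSON.

    Assumes llama-index is installed and `llm` is a configured LlamaIndex LLM.
    """
    # Imported lazily so this module can be loaded without llama-index present.
    from llama_index.core.llama_dataset.generator import RagDatasetGenerator

    generator = RagDatasetGenerator.from_documents(
        documents,
        llm=llm,
        num_questions_per_chunk=2,  # questions generated per document chunk
        workers=4,                  # number of concurrent generation workers
    )
    # Generates questions and reference answers for every parsed node.
    dataset = generator.generate_dataset_from_nodes()
    # Persist the dataset so later evaluation runs can reuse it.
    dataset.save_json(output_path)
    return dataset
```

Running this makes real LLM calls for every chunk, so cost and runtime grow with corpus size and with `num_questions_per_chunk`.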
