Implementation:Run llama Llama index RagDatasetGenerator
Overview
RagDatasetGenerator uses an LLM to generate question-only or question-answer datasets from documents or nodes. It produces LabelledRagDataset instances suitable for evaluating RAG (Retrieval-Augmented Generation) pipelines. The source code marks the class as a beta feature.
Source file: llama-index-core/llama_index/core/llama_dataset/generator.py (261 lines)
Class Hierarchy
PromptMixin
└── RagDatasetGenerator
RagDatasetGenerator inherits from PromptMixin, giving it prompt management capabilities including prompt retrieval and update methods.
Constructor
def __init__(
    self,
    nodes: List[BaseNode],
    llm: Optional[LLM] = None,
    num_questions_per_chunk: int = 3,
    text_question_template: Optional[BasePromptTemplate] = None,
    text_qa_template: Optional[BasePromptTemplate] = None,
    question_gen_query: Optional[str] = None,
    metadata_mode: MetadataMode = MetadataMode.NONE,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> None:
| Parameter | Type | Default | Description |
|---|---|---|---|
| nodes | List[BaseNode] | required | Pre-processed nodes to generate questions from |
| llm | Optional[LLM] | Settings.llm | Language model for question and answer generation |
| num_questions_per_chunk | int | 3 | Number of questions to generate per chunk |
| text_question_template | Optional[BasePromptTemplate] | DEFAULT_QUESTION_GENERATION_PROMPT | Template for generating questions from context |
| text_qa_template | Optional[BasePromptTemplate] | DEFAULT_TEXT_QA_PROMPT | Template for generating answers from questions |
| question_gen_query | Optional[str] | Auto-generated teacher prompt | Query string used to instruct the LLM to generate questions |
| metadata_mode | MetadataMode | MetadataMode.NONE | Controls how metadata is included in node content |
| show_progress | bool | False | Whether to display progress bars |
| workers | int | DEFAULT_NUM_WORKERS | Number of concurrent workers for async tasks |
Default Question Generation Prompt
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
{query_str}
The default question_gen_query instructs the LLM as a "Teacher/Professor" to create diverse quiz questions restricted to the provided context.
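To make the two placeholders concrete, the sketch below fills the template with plain `str.format`. The query text is a hypothetical paraphrase of the teacher-style instruction (this page does not reproduce the exact default wording), and `build_question_prompt` is an illustrative helper, not a library function.

```python
# Template text copied from the default question generation prompt above.
QUESTION_GEN_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge.\n"
    "generate only questions based on the below query.\n"
    "{query_str}\n"
)


def build_question_prompt(context_str: str, num_questions: int = 3) -> str:
    """Fill the template; illustrative helper, not part of llama_index."""
    # Hypothetical paraphrase of the teacher-style query (an assumption).
    query_str = (
        f"You are a Teacher/Professor. Write {num_questions} diverse quiz "
        "questions using only the context information provided."
    )
    return QUESTION_GEN_TEMPLATE.format(
        context_str=context_str, query_str=query_str
    )


prompt = build_question_prompt("Paris is the capital of France.", num_questions=2)
print(prompt)
```

In the real class the node text plays the role of `context_str` and `question_gen_query` plays the role of `query_str`.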
Factory Method
from_documents
@classmethod
def from_documents(
    cls,
    documents: Sequence[Document],
    llm: Optional[LLM] = None,
    transformations: Optional[List[TransformComponent]] = None,
    num_questions_per_chunk: int = 3,
    text_question_template: Optional[BasePromptTemplate] = None,
    text_qa_template: Optional[BasePromptTemplate] = None,
    question_gen_query: Optional[str] = None,
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> RagDatasetGenerator:
Creates a generator directly from documents by:
- Running transformations (from Settings.transformations by default) to convert documents into nodes.
- Applying KeywordNodePostprocessor to filter nodes by required and excluded keywords.
- Constructing the generator with the filtered nodes.
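The keyword filtering step can be approximated with a few lines of plain Python. This is a simplified stand-in, not the KeywordNodePostprocessor implementation: it works on raw strings instead of nodes, uses case-insensitive substring matching, and assumes that one required keyword is enough to keep a chunk while any excluded keyword drops it. The function name `filter_by_keywords` is hypothetical.

```python
from typing import List, Optional


def filter_by_keywords(
    texts: List[str],
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
) -> List[str]:
    """Keep texts that satisfy the keyword rules (simplified stand-in)."""
    required = [k.lower() for k in (required_keywords or [])]
    excluded = [k.lower() for k in (exclude_keywords or [])]
    kept = []
    for text in texts:
        lowered = text.lower()
        # Assumption: a single required keyword is enough to keep a text.
        if required and not any(k in lowered for k in required):
            continue
        # Assumption: any excluded keyword drops the text.
        if any(k in lowered for k in excluded):
            continue
        kept.append(text)
    return kept


chunks = [
    "RAG combines retrieval and generation.",
    "Draft notes on RAG.",
    "Unrelated text.",
]
filtered = filter_by_keywords(
    chunks, required_keywords=["rag"], exclude_keywords=["draft"]
)
print(filtered)
```

Filtering before generation keeps the LLM from spending calls on chunks that are off-topic for the evaluation set.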
Core Generation Logic
_agenerate_dataset (private async)
async def _agenerate_dataset(
    self,
    nodes: List[BaseNode],
    labelled: bool = False,
) -> LabelledRagDataset:
The central generation method, which operates as follows:
- Question Generation Phase: For each node, creates a SummaryIndex from a document built from the node's content and metadata. A query engine is created from each index with the question generation template, and all questions are generated concurrently using run_jobs.
- Response Parsing: Each LLM response is split by newlines and cleaned (stripping leading numbering like "1)" or "1."). Empty lines are removed, and results are truncated to num_questions_per_chunk. A warning is emitted if fewer questions than requested are generated.
- Answer Generation Phase (labelled=True): If labelled mode is enabled, for each generated question, a new query engine is created using the text_qa_template and the question is queried to generate a reference answer. These tasks also run concurrently.
- Example Construction: For each question (and optionally its answer), a LabelledRagDataExample is created with the query, reference contexts, reference answer, and CreatedBy metadata indicating AI authorship with the model name.
- Dataset Assembly: All examples are collected into a LabelledRagDataset.
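The response parsing step is the easiest part to reproduce in isolation. The following is a minimal sketch under stated assumptions: `parse_questions` is a hypothetical helper, and the regular expression is one plausible way to strip "1)" or "1." prefixes, not necessarily the pattern used in generator.py.

```python
import re
import warnings
from typing import List


def parse_questions(response_text: str, num_questions_per_chunk: int) -> List[str]:
    """Split an LLM response into cleaned questions (illustrative sketch)."""
    lines = response_text.strip().split("\n")
    # Strip leading numbering such as "1)" or "1."; this pattern is an assumption.
    cleaned = [re.sub(r"^\s*\d+[\).]\s*", "", line).strip() for line in lines]
    # Drop empty lines, then truncate to the requested count.
    questions = [q for q in cleaned if q][:num_questions_per_chunk]
    if len(questions) < num_questions_per_chunk:
        warnings.warn(
            f"Requested {num_questions_per_chunk} questions, got {len(questions)}."
        )
    return questions


raw = "1) What is RAG?\n2. Why use retrieval?\n\n3) How are answers graded?\n4) Extra?"
parsed = parse_questions(raw, num_questions_per_chunk=3)
print(parsed)
```

Truncating after cleaning means an over-eager model that returns extra questions never inflates the dataset beyond the requested size per chunk.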
Public Generation Methods
| Method | Async | Description |
|---|---|---|
| generate_questions_from_nodes | No | Generates questions only (no reference answers) |
| agenerate_questions_from_nodes | Yes | Async version of the above |
| generate_dataset_from_nodes | No | Generates full question-answer pairs |
| agenerate_dataset_from_nodes | Yes | Async version of the above |
The synchronous methods use asyncio_run to wrap their async counterparts.
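The sync-over-async pattern, together with a cap on concurrent jobs, can be sketched with the standard library alone. Here `asyncio.run` and `asyncio.Semaphore` stand in for the library's asyncio_run and run_jobs helpers; the function names and the fake "LLM call" are assumptions made for illustration.

```python
import asyncio
from typing import List


async def agenerate_questions(chunks: List[str], workers: int = 4) -> List[str]:
    """Async stand-in: one fake LLM call per chunk, at most `workers` at a time."""
    semaphore = asyncio.Semaphore(workers)

    async def one_job(chunk: str) -> str:
        async with semaphore:
            await asyncio.sleep(0)  # placeholder for an awaited LLM call
            return f"What does this passage say about: {chunk}?"

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one_job(c) for c in chunks))


def generate_questions(chunks: List[str], workers: int = 4) -> List[str]:
    """Sync wrapper that drives the async version to completion."""
    return asyncio.run(agenerate_questions(chunks, workers=workers))


questions = generate_questions(["chunk A", "chunk B"], workers=1)
print(questions)
```

Note that `asyncio.run` raises a RuntimeError when called from inside a running event loop; in such environments, awaiting the async variant directly is the simpler route.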
PromptMixin Implementation
_get_prompts
Returns a dictionary containing:
- "text_question_template" -- the question generation prompt
- "text_qa_template" -- the QA answer generation prompt
_update_prompts
Allows updating either or both prompt templates by key.
_get_prompt_modules
Returns an empty dictionary (no nested prompt modules).
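The behavior of these three hooks can be mirrored by a small stand-in class. `PromptHolder` is hypothetical and stores plain strings rather than BasePromptTemplate objects; it only illustrates the key-based get and update contract described above, not the PromptMixin machinery itself.

```python
from typing import Dict


class PromptHolder:
    """Hypothetical stand-in for a class exposing two named prompt templates."""

    def __init__(self, text_question_template: str, text_qa_template: str) -> None:
        self.text_question_template = text_question_template
        self.text_qa_template = text_qa_template

    def _get_prompts(self) -> Dict[str, str]:
        return {
            "text_question_template": self.text_question_template,
            "text_qa_template": self.text_qa_template,
        }

    def _update_prompts(self, prompts: Dict[str, str]) -> None:
        # Only keys present in the mapping are replaced.
        if "text_question_template" in prompts:
            self.text_question_template = prompts["text_question_template"]
        if "text_qa_template" in prompts:
            self.text_qa_template = prompts["text_qa_template"]

    def _get_prompt_modules(self) -> Dict[str, "PromptHolder"]:
        return {}  # no nested prompt modules


holder = PromptHolder("Ask about {context_str}", "Answer {query_str}")
holder._update_prompts({"text_qa_template": "Answer briefly: {query_str}"})
current = holder._get_prompts()
print(current)
```

Updating by key lets a caller swap the answer prompt while leaving the question prompt untouched, or the other way around.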
Dependencies
- llama_index.core.SummaryIndex -- used to build per-node indices for querying
- llama_index.core.async_utils -- provides run_jobs and asyncio_run for concurrent execution
- llama_index.core.llama_dataset -- provides CreatedBy, CreatedByType, LabelledRagDataExample, LabelledRagDataset
- llama_index.core.postprocessor.node.KeywordNodePostprocessor -- for keyword-based filtering
- llama_index.core.settings.Settings -- provides default LLM and transformations