Implementation:Run llama Llama index RagDatasetGenerator

Overview

RagDatasetGenerator uses an LLM to generate evaluation datasets from nodes, or from documents via the from_documents factory method. It can produce questions only or full question-answer pairs, and in both cases returns a LabelledRagDataset suitable for evaluating RAG (Retrieval-Augmented Generation) pipelines. The source code marks the class as a beta feature.

Source file: llama-index-core/llama_index/core/llama_dataset/generator.py (261 lines)

Class Hierarchy

PromptMixin
  └── RagDatasetGenerator

RagDatasetGenerator inherits from PromptMixin, giving it prompt management capabilities including prompt retrieval and update methods.

Constructor

def __init__(
    self,
    nodes: List[BaseNode],
    llm: Optional[LLM] = None,
    num_questions_per_chunk: int = 3,
    text_question_template: Optional[BasePromptTemplate] = None,
    text_qa_template: Optional[BasePromptTemplate] = None,
    question_gen_query: Optional[str] = None,
    metadata_mode: MetadataMode = MetadataMode.NONE,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> None:

Parameter	Type	Default	Description
`nodes`	`List[BaseNode]`	required	Pre-processed nodes to generate questions from
`llm`	`Optional[LLM]`	`Settings.llm`	Language model for question and answer generation
`num_questions_per_chunk`	`int`	`3`	Number of questions to generate per chunk
`text_question_template`	`Optional[BasePromptTemplate]`	`DEFAULT_QUESTION_GENERATION_PROMPT`	Template for generating questions from context
`text_qa_template`	`Optional[BasePromptTemplate]`	`DEFAULT_TEXT_QA_PROMPT`	Template for generating answers from questions
`question_gen_query`	`Optional[str]`	Auto-generated teacher prompt	Query string used to instruct the LLM to generate questions
`metadata_mode`	`MetadataMode`	`MetadataMode.NONE`	Controls how metadata is included in node content
`show_progress`	`bool`	`False`	Whether to display progress bars
`workers`	`int`	`DEFAULT_NUM_WORKERS`	Number of concurrent workers for async tasks

Default Question Generation Prompt

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
{query_str}

The default question_gen_query casts the LLM as a "Teacher/Professor" and asks it to write num_questions_per_chunk diverse quiz questions restricted to the provided context.
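
To make the two template variables concrete, here is a minimal sketch that fills the default prompt text with plain str.format. It does not use the library's BasePromptTemplate classes; build_question_prompt and example_query are hypothetical names, and the query wording is only an illustration, not the library's exact default string.

```python
# Template text copied from the "Default Question Generation Prompt" section.
QUESTION_GENERATION_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge.\n"
    "generate only questions based on the below query.\n"
    "{query_str}\n"
)


def build_question_prompt(context_str: str, query_str: str) -> str:
    """Fill both template variables (hypothetical helper, not a library call)."""
    return QUESTION_GENERATION_TEMPLATE.format(
        context_str=context_str, query_str=query_str
    )


# Illustrative teacher-style query; the real default wording may differ.
example_query = (
    "You are a Teacher/Professor. Write 3 diverse quiz questions "
    "restricted to the context information provided."
)
prompt = build_question_prompt("Paris is the capital of France.", example_query)
print(prompt)
```

In the generator itself, context_str comes from the node content and query_str from question_gen_query.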

Factory Method

from_documents

@classmethod
def from_documents(
    cls,
    documents: Sequence[Document],
    llm: Optional[LLM] = None,
    transformations: Optional[List[TransformComponent]] = None,
    num_questions_per_chunk: int = 3,
    text_question_template: Optional[BasePromptTemplate] = None,
    text_qa_template: Optional[BasePromptTemplate] = None,
    question_gen_query: Optional[str] = None,
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> RagDatasetGenerator:

Creates a generator directly from documents by:

Running transformations (from Settings.transformations by default) to convert documents into nodes.
Applying KeywordNodePostprocessor to filter nodes by required and excluded keywords.
Constructing the generator with the filtered nodes.
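
The filtering step can be sketched without the library. The snippet below is a simplified stand-in for KeywordNodePostprocessor: it works on plain strings and uses case-insensitive substring matching, which is an assumption of this sketch rather than the library's actual matching logic. It also assumes that a chunk is kept when it contains at least one required keyword and no excluded keyword. filter_texts_by_keywords is a hypothetical name.

```python
from typing import List, Optional


def filter_texts_by_keywords(
    texts: List[str],
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
) -> List[str]:
    """Keep texts with any required keyword and without any excluded keyword."""
    required = [k.lower() for k in (required_keywords or [])]
    excluded = [k.lower() for k in (exclude_keywords or [])]
    kept = []
    for text in texts:
        lowered = text.lower()
        # Skip when required keywords are given but none of them appears.
        if required and not any(k in lowered for k in required):
            continue
        # Skip when any excluded keyword appears.
        if excluded and any(k in lowered for k in excluded):
            continue
        kept.append(text)
    return kept


chunks = [
    "Retrieval quality depends on chunk size.",
    "Copyright notice: all rights reserved.",
    "Embedding models map text to vectors.",
]
filtered = filter_texts_by_keywords(
    chunks,
    required_keywords=["retrieval", "embedding"],
    exclude_keywords=["copyright"],
)
print(filtered)
```

With both keyword lists left as None, the sketch returns every chunk unchanged.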

Core Generation Logic

_agenerate_dataset (private async)

async def _agenerate_dataset(
    self,
    nodes: List[BaseNode],
    labelled: bool = False,
) -> LabelledRagDataset:

The central generation method, which operates as follows:

Question Generation Phase: For each node, creates a SummaryIndex from a document built from the node's content and metadata. A query engine is created from each index with the question generation template, and all questions are generated concurrently using run_jobs.

Response Parsing: Each LLM response is split by newlines and cleaned (stripping leading numbering like "1)" or "1."). Empty lines are removed, and results are truncated to num_questions_per_chunk. A warning is emitted if fewer questions than requested are generated.

Answer Generation Phase (labelled=True): When labelled is True, each generated question is sent to a new query engine built with text_qa_template, which produces the reference answer. These tasks also run concurrently.

Example Construction: For each question (and optionally its answer), a LabelledRagDataExample is created with the query, reference contexts, reference answer, and CreatedBy metadata indicating AI authorship with the model name.

Dataset Assembly: All examples are collected into a LabelledRagDataset.
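
The response parsing step described above can be approximated in a few lines of standard-library Python. This is a sketch of the described behavior, not a copy of the source: parse_questions is a hypothetical name, and the regular expression is one reasonable way to strip prefixes such as "1)" or "1.".

```python
import re
import warnings
from typing import List


def parse_questions(response_text: str, num_questions_per_chunk: int) -> List[str]:
    """Split an LLM response into cleaned questions, truncated to the requested count."""
    lines = response_text.strip().split("\n")
    # Remove a leading "1)" or "1." style prefix, then trim whitespace.
    cleaned = [re.sub(r"^\s*\d+[\).]\s*", "", line).strip() for line in lines]
    # Drop empty lines and keep at most the requested number of questions.
    questions = [q for q in cleaned if q][:num_questions_per_chunk]
    if len(questions) < num_questions_per_chunk:
        warnings.warn(
            f"Fewer questions generated ({len(questions)}) "
            f"than requested ({num_questions_per_chunk})."
        )
    return questions


raw_response = (
    "1) What is RAG?\n"
    "\n"
    "2. Why evaluate retrieval?\n"
    "3. How are answers labelled?\n"
    "4. Is this extra question kept?"
)
questions = parse_questions(raw_response, 3)
print(questions)
```

The fourth question is discarded by the truncation, and a response with fewer than three usable lines would trigger the warning instead.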

Public Generation Methods

Method	Async	Description
`generate_questions_from_nodes`	No	Generates questions only (no reference answers)
`agenerate_questions_from_nodes`	Yes	Async version of the above
`generate_dataset_from_nodes`	No	Generates full question-answer pairs
`agenerate_dataset_from_nodes`	Yes	Async version of the above

The synchronous methods use asyncio_run to wrap their async counterparts.
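
The pairing of a bounded-concurrency async method with a thin synchronous wrapper can be sketched with the standard library alone. The code below is an illustration of the pattern, not the library's implementation: run_bounded stands in for run_jobs, asyncio.run stands in for asyncio_run, and fake_llm_call is a hypothetical placeholder for a real LLM query.

```python
import asyncio
from typing import Awaitable, List, TypeVar

T = TypeVar("T")


async def run_bounded(jobs: List[Awaitable[T]], workers: int) -> List[T]:
    """Await all jobs with at most `workers` in flight, preserving input order."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Awaitable[T]) -> T:
        async with semaphore:
            return await job

    return list(await asyncio.gather(*(guarded(job) for job in jobs)))


async def fake_llm_call(question: str) -> str:
    """Hypothetical placeholder for an LLM query."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


async def agenerate_answers(questions: List[str], workers: int = 2) -> List[str]:
    """Async entry point: one job per question, bounded by `workers`."""
    return await run_bounded([fake_llm_call(q) for q in questions], workers)


def generate_answers(questions: List[str], workers: int = 2) -> List[str]:
    """Synchronous wrapper that drives the async version to completion."""
    return asyncio.run(agenerate_answers(questions, workers))


answers = generate_answers(["q1", "q2", "q3"])
print(answers)
```

Note that plain asyncio.run raises an error when called from inside an already running event loop (for example in a notebook); in that situation the async methods should be awaited directly.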

PromptMixin Implementation

_get_prompts

Returns a dictionary containing:

"text_question_template" -- the question generation prompt
"text_qa_template" -- the QA answer generation prompt

_update_prompts

Allows updating either or both prompt templates by key.

_get_prompt_modules

Returns an empty dictionary (no nested prompt modules).
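
The contract of the three hooks can be illustrated with a small stand-in class. PromptHolder is hypothetical: it does not subclass the library's PromptMixin and stores templates as plain strings rather than BasePromptTemplate objects. It only mirrors the behavior described in this section.

```python
from typing import Dict


class PromptHolder:
    """Hypothetical stand-in mirroring the three hooks described above."""

    def __init__(self, text_question_template: str, text_qa_template: str) -> None:
        self.text_question_template = text_question_template
        self.text_qa_template = text_qa_template

    def _get_prompts(self) -> Dict[str, str]:
        # Expose both templates under the keys named in this section.
        return {
            "text_question_template": self.text_question_template,
            "text_qa_template": self.text_qa_template,
        }

    def _update_prompts(self, prompts: Dict[str, str]) -> None:
        # Update only the keys that are present; leave the other untouched.
        if "text_question_template" in prompts:
            self.text_question_template = prompts["text_question_template"]
        if "text_qa_template" in prompts:
            self.text_qa_template = prompts["text_qa_template"]

    def _get_prompt_modules(self) -> Dict[str, object]:
        # No nested prompt modules.
        return {}


holder = PromptHolder("Ask about: {context_str}", "Answer: {query_str}")
holder._update_prompts({"text_qa_template": "Answer briefly: {query_str}"})
print(holder._get_prompts())
```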

Dependencies

llama_index.core.SummaryIndex -- used to build per-node indices for querying
llama_index.core.async_utils -- provides run_jobs and asyncio_run for concurrent execution
llama_index.core.llama_dataset -- provides CreatedBy, CreatedByType, LabelledRagDataExample, LabelledRagDataset
llama_index.core.postprocessor.node.KeywordNodePostprocessor -- for keyword-based filtering
llama_index.core.settings.Settings -- provides default LLM and transformations
