Implementation:Run llama Llama index RagDatasetGenerator

From Leeroopedia

Overview

RagDatasetGenerator uses an LLM to generate datasets of questions, or of question-answer pairs, from documents or nodes. It produces LabelledRagDataset instances suitable for evaluating RAG (Retrieval-Augmented Generation) pipelines. The source code marks the class as a beta feature.

Source file: llama-index-core/llama_index/core/llama_dataset/generator.py (261 lines)

Class Hierarchy

PromptMixin
  └── RagDatasetGenerator

RagDatasetGenerator inherits from PromptMixin, giving it prompt management capabilities including prompt retrieval and update methods.

Constructor

def __init__(
    self,
    nodes: List[BaseNode],
    llm: Optional[LLM] = None,
    num_questions_per_chunk: int = 3,
    text_question_template: Optional[BasePromptTemplate] = None,
    text_qa_template: Optional[BasePromptTemplate] = None,
    question_gen_query: Optional[str] = None,
    metadata_mode: MetadataMode = MetadataMode.NONE,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> None:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| nodes | List[BaseNode] | required | Pre-processed nodes to generate questions from |
| llm | Optional[LLM] | Settings.llm | Language model for question and answer generation |
| num_questions_per_chunk | int | 3 | Number of questions to generate per chunk |
| text_question_template | Optional[BasePromptTemplate] | DEFAULT_QUESTION_GENERATION_PROMPT | Template for generating questions from context |
| text_qa_template | Optional[BasePromptTemplate] | DEFAULT_TEXT_QA_PROMPT | Template for generating answers from questions |
| question_gen_query | Optional[str] | Auto-generated teacher prompt | Query string used to instruct the LLM to generate questions |
| metadata_mode | MetadataMode | MetadataMode.NONE | Controls how metadata is included in node content |
| show_progress | bool | False | Whether to display progress bars |
| workers | int | DEFAULT_NUM_WORKERS | Number of concurrent workers for async tasks |

Default Question Generation Prompt

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
{query_str}

The default question_gen_query instructs the LLM as a "Teacher/Professor" to create diverse quiz questions restricted to the provided context.
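
As a rough illustration of how the two placeholders are filled, the sketch below formats the template quoted above with plain str.format. The helper name build_question_prompt and the wording of the teacher query are assumptions made for this example; the query is a paraphrase, not the library's verbatim default string.

```python
# Template text as quoted in this section.
QUESTION_GENERATION_PROMPT = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge.\n"
    "generate only questions based on the below query.\n"
    "{query_str}\n"
)


def build_question_prompt(context: str, num_questions: int) -> str:
    """Hypothetical helper: fill the template for one chunk of text."""
    # Illustrative paraphrase of the "Teacher/Professor" instruction;
    # the exact default wording in the library may differ.
    query = (
        f"You are a Teacher/Professor. Write {num_questions} diverse quiz "
        "questions. Restrict the questions to the context provided."
    )
    return QUESTION_GENERATION_PROMPT.format(context_str=context, query_str=query)


prompt = build_question_prompt("Paris is the capital of France.", 3)
print(prompt)
```

Because num_questions_per_chunk is baked into the query string, passing a custom question_gen_query presumably means the caller is responsible for stating the desired question count in that string.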

Factory Method

from_documents

@classmethod
def from_documents(
    cls,
    documents: Sequence[Document],
    llm: Optional[LLM] = None,
    transformations: Optional[List[TransformComponent]] = None,
    num_questions_per_chunk: int = 3,
    text_question_template: Optional[BasePromptTemplate] = None,
    text_qa_template: Optional[BasePromptTemplate] = None,
    question_gen_query: Optional[str] = None,
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> RagDatasetGenerator:

Creates a generator directly from documents by:

  1. Running transformations (from Settings.transformations by default) to convert documents into nodes.
  2. Applying KeywordNodePostprocessor to filter nodes by required and excluded keywords.
  3. Constructing the generator with the filtered nodes.
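
The filtering step can be pictured with a simplified, standard-library stand-in. The function filter_by_keywords below is hypothetical and uses case-insensitive substring matching on plain strings; it assumes a chunk is kept when it contains at least one required keyword and none of the excluded ones. The real KeywordNodePostprocessor operates on nodes and may match keywords differently.

```python
from typing import List, Optional


def filter_by_keywords(
    chunks: List[str],
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
) -> List[str]:
    """Keep chunks that mention a required keyword and no excluded keyword."""
    required = [k.lower() for k in (required_keywords or [])]
    excluded = [k.lower() for k in (exclude_keywords or [])]
    kept = []
    for chunk in chunks:
        text = chunk.lower()
        # Assumed semantics: any required keyword is enough to pass.
        if required and not any(k in text for k in required):
            continue
        # Assumed semantics: any excluded keyword drops the chunk.
        if any(k in text for k in excluded):
            continue
        kept.append(chunk)
    return kept


chunks = [
    "RAG pipelines retrieve context.",
    "Draft: RAG notes.",
    "Unrelated text.",
]
filtered = filter_by_keywords(chunks, required_keywords=["rag"], exclude_keywords=["draft"])
print(filtered)
```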

Core Generation Logic

_agenerate_dataset (private async)

async def _agenerate_dataset(
    self,
    nodes: List[BaseNode],
    labelled: bool = False,
) -> LabelledRagDataset:

The central generation method, which operates as follows:

  1. Question Generation Phase: For each node, creates a SummaryIndex from a document built from the node's content and metadata. A query engine is created from each index with the question generation template, and all questions are generated concurrently using run_jobs.
  2. Response Parsing: Each LLM response is split by newlines and cleaned (stripping leading numbering like "1)" or "1."). Empty lines are removed, and results are truncated to num_questions_per_chunk. A warning is emitted if fewer questions than requested are generated.
  3. Answer Generation Phase (labelled=True): If labelled mode is enabled, for each generated question, a new query engine is created using the text_qa_template and the question is queried to generate a reference answer. These tasks also run concurrently.
  4. Example Construction: For each question (and optionally its answer), a LabelledRagDataExample is created with the query, reference contexts, reference answer, and CreatedBy metadata indicating AI authorship with the model name.
  5. Dataset Assembly: All examples are collected into a LabelledRagDataset.
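
The response-parsing and example-construction steps can be sketched with the standard library alone. The names parse_questions and build_examples, the regular expression, the dictionary keys, and the model name are all assumptions for illustration; the real code builds LabelledRagDataExample objects rather than dictionaries.

```python
import re
import warnings
from typing import Dict, List, Optional


def parse_questions(response_text: str, num_questions_per_chunk: int) -> List[str]:
    """Split an LLM response into cleaned questions, one per non-empty line."""
    lines = response_text.strip().split("\n")
    # Drop a leading marker such as "1)" or "1."; this pattern is an assumption.
    cleaned = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]
    # Remove empty lines, then truncate to the requested count.
    questions = [q for q in cleaned if q][:num_questions_per_chunk]
    if len(questions) < num_questions_per_chunk:
        warnings.warn(
            f"Fewer questions generated ({len(questions)}) "
            f"than requested ({num_questions_per_chunk})."
        )
    return questions


def build_examples(
    questions: List[str],
    context: str,
    model_name: str,
    answers: Optional[List[str]] = None,
) -> List[Dict[str, object]]:
    """Pair each question with its context and, if given, a reference answer."""
    examples = []
    for i, question in enumerate(questions):
        examples.append(
            {
                "query": question,
                "reference_contexts": [context],
                "reference_answer": answers[i] if answers is not None else None,
                "created_by": {"type": "ai", "model_name": model_name},
            }
        )
    return examples


raw = "1. What is RAG?\n\n2) Why evaluate retrieval?\n3. What is a node?\n4. Extra question?"
questions = parse_questions(raw, 3)
examples = build_examples(questions, "Some chunk text.", "hypothetical-model")
print(questions)
```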

Public Generation Methods

| Method | Async | Description |
| --- | --- | --- |
| generate_questions_from_nodes | No | Generates questions only (no reference answers) |
| agenerate_questions_from_nodes | Yes | Async version of the above |
| generate_dataset_from_nodes | No | Generates full question-answer pairs |
| agenerate_dataset_from_nodes | Yes | Async version of the above |

The synchronous methods use asyncio_run to wrap their async counterparts.
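
The sync-over-async pattern, combined with a cap on concurrent jobs, might look like the sketch below. It uses asyncio.run and an asyncio.Semaphore as stand-ins for the library's asyncio_run and run_jobs helpers, whose internals are not described here; run_bounded, fake_llm_call, agenerate, and generate are hypothetical names.

```python
import asyncio
from typing import Awaitable, List, TypeVar

T = TypeVar("T")


async def run_bounded(jobs: List[Awaitable[T]], workers: int) -> List[T]:
    """Await all jobs with at most `workers` running at once, keeping order."""
    semaphore = asyncio.Semaphore(workers)

    async def worker(job: Awaitable[T]) -> T:
        async with semaphore:
            return await job

    return await asyncio.gather(*(worker(job) for job in jobs))


async def fake_llm_call(question_id: int) -> str:
    """Stand-in for an LLM query; sleeps briefly instead of calling a model."""
    await asyncio.sleep(0.01)
    return f"answer-{question_id}"


async def agenerate(n: int, workers: int = 4) -> List[str]:
    """Async entry point: schedule n fake queries with bounded concurrency."""
    return await run_bounded([fake_llm_call(i) for i in range(n)], workers)


def generate(n: int, workers: int = 4) -> List[str]:
    """Synchronous wrapper that drives the async version to completion."""
    return asyncio.run(agenerate(n, workers))


results = generate(5)
print(results)
```

Note that plain asyncio.run raises an error when called from code that is already inside a running event loop (for example, some notebook environments); in such contexts the async methods are the safer choice.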

PromptMixin Implementation

_get_prompts

Returns a dictionary containing:

  • "text_question_template" -- the question generation prompt
  • "text_qa_template" -- the QA answer generation prompt

_update_prompts

Allows updating either or both prompt templates by key.

_get_prompt_modules

Returns an empty dictionary (no nested prompt modules).
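
The get/update contract described in this section can be mimicked with a small stand-alone class. PromptHolder is a hypothetical stand-in that stores templates as plain strings; it is not the PromptMixin API, only a sketch of the dictionary-based lookup and the update-by-key behaviour.

```python
from typing import Dict


class PromptHolder:
    """Hypothetical stand-in for the two-template prompt interface."""

    def __init__(self, text_question_template: str, text_qa_template: str) -> None:
        self.text_question_template = text_question_template
        self.text_qa_template = text_qa_template

    def get_prompts(self) -> Dict[str, str]:
        """Return both templates keyed by name."""
        return {
            "text_question_template": self.text_question_template,
            "text_qa_template": self.text_qa_template,
        }

    def update_prompts(self, prompts: Dict[str, str]) -> None:
        """Replace either or both templates; keys not supplied stay unchanged."""
        if "text_question_template" in prompts:
            self.text_question_template = prompts["text_question_template"]
        if "text_qa_template" in prompts:
            self.text_qa_template = prompts["text_qa_template"]


holder = PromptHolder("Ask about: {context_str}", "Answer: {query_str}")
holder.update_prompts({"text_qa_template": "Answer briefly: {query_str}"})
print(holder.get_prompts())
```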

Dependencies

  • llama_index.core.SummaryIndex -- used to build per-node indices for querying
  • llama_index.core.async_utils -- provides run_jobs and asyncio_run for concurrent execution
  • llama_index.core.llama_dataset -- provides CreatedBy, CreatedByType, LabelledRagDataExample, LabelledRagDataset
  • llama_index.core.postprocessor.node.KeywordNodePostprocessor -- for keyword-based filtering
  • llama_index.core.settings.Settings -- provides default LLM and transformations
