Implementation:Run llama Llama index DatasetGenerator From Documents

From Leeroopedia

Overview

This page covers the two LlamaIndex APIs for generating evaluation datasets from document collections: the deprecated DatasetGenerator and the preferred RagDatasetGenerator. Both provide a from_documents class method that parses raw documents into nodes and returns a configured generator; the generator's instance methods then produce questions or structured QA pairs suitable for RAG evaluation.

Principle:Run_llama_Llama_index_Evaluation_Dataset_Generation

RAG Evaluation Dataset Generation LlamaIndex API

Source Files

  • DatasetGenerator (deprecated): llama-index-core/llama_index/core/evaluation/dataset_generation.py, Lines 117–322
  • RagDatasetGenerator (preferred): llama-index-core/llama_index/core/llama_dataset/generator.py, Lines 48–243

Import Statements

# Deprecated API
from llama_index.core.evaluation import DatasetGenerator

# Preferred API
from llama_index.core.llama_dataset import RagDatasetGenerator

DatasetGenerator (Deprecated)

from_documents Class Method

Parameter | Type | Default | Description
documents | List[Document] | required | Source documents to generate questions from
llm | Optional[LLM] | None | LLM for question and answer generation; uses the Settings default if not provided
transformations | Optional[List[TransformComponent]] | None | Node parsing transformations; defaults to the Settings transformations
num_questions_per_chunk | int | 10 | Number of questions to generate per document chunk
text_question_template | Optional[BasePromptTemplate] | None | Template controlling how questions are formulated
text_qa_template | Optional[BasePromptTemplate] | None | Template for generating reference answers
question_gen_query | Optional[str] | None | Query string guiding question generation style
required_keywords | Optional[List[str]] | None | Keywords a chunk must contain to be used; chunks are filtered before question generation
exclude_keywords | Optional[List[str]] | None | Keywords that cause a chunk to be skipped; chunks are filtered before question generation
show_progress | bool | False | Whether to display a progress bar during generation

Instance Methods

Method | Parameters | Return Type | Description
generate_questions_from_nodes | num (Optional[int]) | List[str] | Generates question strings from parsed nodes; num limits the total number of questions returned
generate_dataset_from_nodes | num (Optional[int]) | QueryResponseDataset | Generates full QA pairs with questions and reference answers

Example: DatasetGenerator (Deprecated)

from llama_index.core import SimpleDirectoryReader
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create generator
llm = OpenAI(model="gpt-4", temperature=0.0)
dataset_generator = DatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=5,
    show_progress=True,
)

# Generate just questions
questions = dataset_generator.generate_questions_from_nodes(num=20)
print(f"Generated {len(questions)} questions")

# Generate full QA dataset
qa_dataset = dataset_generator.generate_dataset_from_nodes(num=20)

RagDatasetGenerator (Preferred)

from_documents Class Method

Parameter | Type | Default | Description
documents | List[Document] | required | Source documents to generate questions from
llm | Optional[LLM] | None | LLM for question and answer generation; uses the Settings default if not provided
transformations | Optional[List[TransformComponent]] | None | Node parsing transformations; defaults to the Settings transformations
num_questions_per_chunk | int | 3 | Number of questions to generate per document chunk (note: lower default than DatasetGenerator)
text_question_template | Optional[BasePromptTemplate] | None | Template controlling how questions are formulated
text_qa_template | Optional[BasePromptTemplate] | None | Template for generating reference answers
question_gen_query | Optional[str] | None | Query string guiding question generation style
required_keywords | Optional[List[str]] | None | Keywords a chunk must contain to be used; chunks are filtered before question generation
exclude_keywords | Optional[List[str]] | None | Keywords that cause a chunk to be skipped; chunks are filtered before question generation
show_progress | bool | False | Whether to display a progress bar during generation
workers | int | 1 | Number of parallel workers for generation (unique to RagDatasetGenerator)

Instance Methods

Method | Parameters | Return Type | Description
generate_questions_from_nodes | none | LabelledRagDataset | Generates questions only, without reference answers
generate_dataset_from_nodes | none | LabelledRagDataset | Generates the full labelled dataset: questions, reference answers, and reference contexts

Example: RagDatasetGenerator (Preferred)

from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import RagDatasetGenerator
from llama_index.llms.openai import OpenAI

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create generator with parallel workers
llm = OpenAI(model="gpt-4", temperature=0.0)
rag_dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=3,
    show_progress=True,
    workers=4,  # Parallel generation
)

# Generate labelled dataset
rag_dataset = rag_dataset_generator.generate_dataset_from_nodes()
print(f"Generated {len(rag_dataset.examples)} labelled examples")

# Each example contains the question, the reference answer, and the source context text
for example in rag_dataset.examples[:3]:
    print(f"Q: {example.query}")
    print(f"A: {example.reference_answer}")
    print(f"Source: {example.reference_contexts}")
    print("---")
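
For manual review it can help to dump the generated examples to a line-oriented file. The sketch below is only an illustration: ExampleStandIn is a hypothetical stand-in dataclass that mirrors the three fields printed above (query, reference_answer, reference_contexts), so the snippet runs without an LLM or LlamaIndex installed. With a real dataset, the same loop would iterate over rag_dataset.examples.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ExampleStandIn:
    """Hypothetical stand-in with the same three fields as a labelled example."""

    query: str
    reference_answer: str
    reference_contexts: List[str]


def to_jsonl(examples: List[ExampleStandIn]) -> str:
    """Serialize examples to JSON Lines, one record per example."""
    lines = []
    for ex in examples:
        record = {
            "query": ex.query,
            "reference_answer": ex.reference_answer,
            "reference_contexts": ex.reference_contexts,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


examples = [
    ExampleStandIn(
        query="What does the cache store?",
        reference_answer="Embeddings.",
        reference_contexts=["The cache stores embeddings on disk."],
    )
]
jsonl = to_jsonl(examples)
print(jsonl)
```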

Keyword Filtering Example

from llama_index.core.llama_dataset import RagDatasetGenerator

# Reuses `documents` and `llm` from the previous example.
# Generate questions focused on specific topics
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=3,
    required_keywords=["performance", "optimization"],
    exclude_keywords=["installation", "setup"],
    show_progress=True,
)

# Questions are generated only from chunks that contain a required keyword
# and none of the excluded keywords
dataset = dataset_generator.generate_dataset_from_nodes()
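
The effect of required_keywords and exclude_keywords can be pictured as a filter over chunk text that runs before any question is generated. The sketch below is a simplified stand-in, not the library's implementation: keep_chunk is a hypothetical helper, it uses case-insensitive substring matching, and it assumes that a chunk passes when it contains at least one required keyword and no excluded keyword.

```python
from typing import List, Optional


def keep_chunk(
    text: str,
    required_keywords: Optional[List[str]] = None,
    exclude_keywords: Optional[List[str]] = None,
) -> bool:
    """Return True if a chunk passes a simplified keyword filter."""
    lowered = text.lower()
    required = [k.lower() for k in (required_keywords or [])]
    excluded = [k.lower() for k in (exclude_keywords or [])]
    # Drop the chunk if any excluded keyword is present.
    if any(k in lowered for k in excluded):
        return False
    # With no required keywords, every remaining chunk passes.
    if not required:
        return True
    # Otherwise keep the chunk only if at least one required keyword is present.
    return any(k in lowered for k in required)


chunks = [
    "Performance tuning tips for the query engine.",
    "Installation and setup guide.",
    "Optimization after installation of the package.",
    "Release notes.",
]
kept = [
    c
    for c in chunks
    if keep_chunk(
        c,
        required_keywords=["performance", "optimization"],
        exclude_keywords=["installation", "setup"],
    )
]
print(kept)
```

Under these assumptions only the first chunk survives: the second and third contain an excluded keyword, and the fourth contains no required keyword. A narrow keyword list can therefore shrink the dataset considerably.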

Custom Question Template Example

from llama_index.core import PromptTemplate
from llama_index.core.llama_dataset import RagDatasetGenerator

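# Custom template for domain-specific question generation.
# Reuses `documents` and `llm` from the earlier example.
# The template is filled with {context_str} (the chunk text) and {query_str}
# (the question_gen_query, which by default states how many questions to write).
custom_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and no prior knowledge, "
    "generate only technical questions that a software engineer "
    "would ask about this content, following the instruction below.\n"
    "{query_str}\n"
)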

dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    text_question_template=custom_template,
    num_questions_per_chunk=5,
)

questions = dataset_generator.generate_questions_from_nodes()
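
To see what the LLM actually receives, it can help to render a question template by hand. The sketch below assumes the template is filled with two variables, context_str (the chunk text) and query_str (the question_gen_query). Plain str.format stands in for PromptTemplate here, and render_prompt is a hypothetical helper, not a LlamaIndex function.

```python
QUESTION_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and no prior knowledge, "
    "generate only questions based on the query below.\n"
    "{query_str}\n"
)


def render_prompt(chunk_text: str, question_gen_query: str) -> str:
    """Fill the template the way a generator is assumed to, once per chunk."""
    return QUESTION_TEMPLATE.format(
        context_str=chunk_text,
        query_str=question_gen_query,
    )


prompt = render_prompt(
    "LlamaIndex caches embeddings on disk.",
    "Write 2 questions a software engineer would ask.",
)
print(prompt)
```

Any placeholder other than these two would have nothing to fill it under this assumption, so instructions such as the number of questions are best carried in the query string.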

Key Differences Between APIs

Feature | DatasetGenerator | RagDatasetGenerator
Status | Deprecated | Preferred
Default questions per chunk | 10 | 3
Parallel workers | Not supported | Supported via workers parameter
Output dataset type | QueryResponseDataset | LabelledRagDataset
Source context in output | Not stored (queries and responses only) | Stored per example as reference_contexts
Import path | llama_index.core.evaluation | llama_index.core.llama_dataset
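
The difference in default questions per chunk has a direct effect on cost. The sketch below gives a rough upper bound on LLM calls, assuming one call per chunk to generate questions and one call per question to generate its reference answer. That call pattern is an assumption for illustration, and estimate_llm_calls is a hypothetical helper; real counts depend on the library version, retries, and how many questions the model returns.

```python
def estimate_llm_calls(
    num_chunks: int,
    num_questions_per_chunk: int,
    with_answers: bool = True,
) -> int:
    """Rough upper bound on LLM calls under the stated assumptions."""
    # Assumed: one question-generation call per chunk.
    question_calls = num_chunks
    # Assumed: one answer-generation call per generated question.
    answer_calls = num_chunks * num_questions_per_chunk if with_answers else 0
    return question_calls + answer_calls


# 100 chunks with each generator's default questions per chunk.
calls_default_10 = estimate_llm_calls(100, 10)
calls_default_3 = estimate_llm_calls(100, 3)
calls_questions_only = estimate_llm_calls(100, 3, with_answers=False)
print(calls_default_10, calls_default_3, calls_questions_only)
```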

Knowledge Sources

LlamaIndex Evaluation LlamaIndex Dataset Generation

2026-02-11 00:00 GMT
