Implementation:Run llama Llama index DatasetGenerator From Documents
Overview
DatasetGenerator_From_Documents covers the two primary APIs for generating evaluation datasets from document collections in LlamaIndex: the deprecated DatasetGenerator and the preferred RagDatasetGenerator. Both provide a from_documents class method that parses raw documents into nodes and returns a generator instance; the instance's generate methods then produce structured QA pairs suitable for RAG evaluation.
Principle:Run_llama_Llama_index_Evaluation_Dataset_Generation
RAG Evaluation Dataset Generation LlamaIndex API
Source Files
- DatasetGenerator (deprecated): llama-index-core/llama_index/core/evaluation/dataset_generation.py, Lines 117–322
- RagDatasetGenerator (preferred): llama-index-core/llama_index/core/llama_dataset/generator.py, Lines 48–243
Import Statements
# Deprecated API
from llama_index.core.evaluation import DatasetGenerator
# Preferred API
from llama_index.core.llama_dataset import RagDatasetGenerator
DatasetGenerator (Deprecated)
from_documents Class Method
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | required | Source documents to generate questions from |
| llm | Optional[LLM] | None | LLM for question and answer generation; uses the Settings default if not provided |
| transformations | Optional[List[TransformComponent]] | None | Node parsing transformations; defaults to the Settings transformations |
| num_questions_per_chunk | int | 10 | Number of questions to generate per document chunk |
| text_question_template | Optional[BasePromptTemplate] | None | Template controlling how questions are formulated |
| text_qa_template | Optional[BasePromptTemplate] | None | Template for generating reference answers |
| question_gen_query | Optional[str] | None | Query string guiding question generation style |
| required_keywords | Optional[List[str]] | None | Keywords a source chunk must match to be used for generation; the filter is applied to chunks, not to the generated questions |
| exclude_keywords | Optional[List[str]] | None | Keywords that cause a matching source chunk to be skipped |
| show_progress | bool | False | Whether to display a progress bar during generation |
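Because num_questions_per_chunk applies to every chunk, the size of a generation run grows with the number of chunks the documents are split into. The sketch below is a rough back-of-envelope estimate only. It assumes one LLM call per chunk to produce questions and one further call per question to produce a reference answer, which is an assumption about the call pattern rather than a documented guarantee. The helper name estimate_generation_load is hypothetical and not part of LlamaIndex.

```python
from dataclasses import dataclass


@dataclass
class GenerationLoad:
    """Rough size of a dataset generation run."""
    questions: int
    llm_calls: int


def estimate_generation_load(
    num_chunks: int,
    num_questions_per_chunk: int,
    with_answers: bool = True,
) -> GenerationLoad:
    """Estimate question count and LLM calls (hypothetical helper).

    Assumes one question-generation call per chunk, plus one
    answer-generation call per question when with_answers is True.
    """
    questions = num_chunks * num_questions_per_chunk
    llm_calls = num_chunks + (questions if with_answers else 0)
    return GenerationLoad(questions=questions, llm_calls=llm_calls)


if __name__ == "__main__":
    # 40 chunks at the DatasetGenerator default of 10 questions per chunk
    print(estimate_generation_load(num_chunks=40, num_questions_per_chunk=10))
```

An estimate like this is a reason to lower num_questions_per_chunk while experimenting and raise it only for the final run.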
Instance Methods
| Method | Parameters | Return Type | Description |
|---|---|---|---|
| generate_questions_from_nodes | num (Optional[int]) | List[str] | Generates question strings from parsed nodes; num limits total questions returned |
| generate_dataset_from_nodes | num (Optional[int]) | QueryResponseDataset | Generates full QA pairs with questions and reference answers |
Example: DatasetGenerator (Deprecated)
from llama_index.core import SimpleDirectoryReader
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI
# Load documents
documents = SimpleDirectoryReader("./data").load_data()
# Create generator
llm = OpenAI(model="gpt-4", temperature=0.0)
dataset_generator = DatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=5,
    show_progress=True,
)
# Generate just questions
questions = dataset_generator.generate_questions_from_nodes(num=20)
print(f"Generated {len(questions)} questions")
# Generate full QA dataset
qa_dataset = dataset_generator.generate_dataset_from_nodes(num=20)
RagDatasetGenerator (Preferred)
from_documents Class Method
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | required | Source documents to generate questions from |
| llm | Optional[LLM] | None | LLM for question and answer generation; uses the Settings default if not provided |
| transformations | Optional[List[TransformComponent]] | None | Node parsing transformations; defaults to the Settings transformations |
| num_questions_per_chunk | int | 3 | Number of questions to generate per document chunk (note: lower default than DatasetGenerator) |
| text_question_template | Optional[BasePromptTemplate] | None | Template controlling how questions are formulated |
| text_qa_template | Optional[BasePromptTemplate] | None | Template for generating reference answers |
| question_gen_query | Optional[str] | None | Query string guiding question generation style |
| required_keywords | Optional[List[str]] | None | Keywords a source chunk must match to be used for generation; the filter is applied to chunks, not to the generated questions |
| exclude_keywords | Optional[List[str]] | None | Keywords that cause a matching source chunk to be skipped |
| show_progress | bool | False | Whether to display a progress bar during generation |
| workers | int | 1 | Number of parallel workers for generation (unique to RagDatasetGenerator) |
Instance Methods
| Method | Parameters | Return Type | Description |
|---|---|---|---|
| generate_questions_from_nodes | none | LabelledRagDataset | Generates questions only; the returned examples carry no reference answers |
| generate_dataset_from_nodes | none | LabelledRagDataset | Generates the full labelled dataset with questions, reference answers, and reference contexts |
Example: RagDatasetGenerator (Preferred)
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import RagDatasetGenerator
from llama_index.llms.openai import OpenAI
# Load documents
documents = SimpleDirectoryReader("./data").load_data()
# Create generator with parallel workers
llm = OpenAI(model="gpt-4", temperature=0.0)
rag_dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=3,
    show_progress=True,
    workers=4,  # Parallel generation
)
# Generate labelled dataset
rag_dataset = rag_dataset_generator.generate_dataset_from_nodes()
print(f"Generated {len(rag_dataset.examples)} labelled examples")
# Each example contains question, reference answer, and source node info
for example in rag_dataset.examples[:3]:
    print(f"Q: {example.query}")
    print(f"A: {example.reference_answer}")
    print(f"Source: {example.reference_contexts}")
    print("---")
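The workers parameter caps how many generation jobs run at the same time. The sketch below illustrates the general bounded-concurrency pattern with asyncio.Semaphore. It is not the library's implementation: run_bounded and fake_generate are hypothetical names, and fake_generate merely sleeps in place of a real LLM call.

```python
import asyncio


async def run_bounded(jobs, workers: int):
    """Await all jobs, allowing at most `workers` to run concurrently."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job):
        async with semaphore:
            return await job

    # gather preserves the input order of the results
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_generate(chunk_id: int, tracker: dict) -> str:
    """Stand-in for an LLM call; records how many calls are in flight."""
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)
    tracker["active"] -= 1
    return f"question for chunk {chunk_id}"


def demo(num_chunks: int = 10, workers: int = 4):
    """Run the fake jobs and report the peak concurrency observed."""
    tracker = {"active": 0, "peak": 0}
    jobs = [fake_generate(i, tracker) for i in range(num_chunks)]
    results = asyncio.run(run_bounded(jobs, workers))
    return results, tracker["peak"]


if __name__ == "__main__":
    results, peak = demo()
    print(len(results), "results, peak concurrency", peak)
```

Raising workers shortens wall-clock time only until the LLM provider's rate limits become the bottleneck, so a small value is a sensible starting point.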
Keyword Filtering Example
from llama_index.core.llama_dataset import RagDatasetGenerator
# Generate questions focused on specific topics
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=3,
    required_keywords=["performance", "optimization"],
    exclude_keywords=["installation", "setup"],
    show_progress=True,
)
# Keyword filtering is applied to source chunks before generation:
# chunks without a required keyword, or with an excluded one, are skipped
dataset = dataset_generator.generate_dataset_from_nodes()
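One way to picture keyword filtering is as a pass over the source chunks before any question is generated. The sketch below is a simplified stand-in rather than the library's code. It uses case-insensitive substring matching, and it assumes that a chunk passes if it contains any one of the required keywords and none of the excluded ones. The function name filter_chunks and the sample chunks are hypothetical.

```python
from typing import Iterable, List, Sequence


def filter_chunks(
    chunks: Iterable[str],
    required_keywords: Sequence[str] = (),
    exclude_keywords: Sequence[str] = (),
) -> List[str]:
    """Keep chunks that mention a required keyword and no excluded keyword.

    Simplified illustration: plain case-insensitive substring matching.
    """
    required = [keyword.lower() for keyword in required_keywords]
    excluded = [keyword.lower() for keyword in exclude_keywords]
    kept = []
    for chunk in chunks:
        text = chunk.lower()
        # Drop the chunk if required keywords were given and none appears
        if required and not any(keyword in text for keyword in required):
            continue
        # Drop the chunk if any excluded keyword appears
        if any(keyword in text for keyword in excluded):
            continue
        kept.append(chunk)
    return kept


if __name__ == "__main__":
    sample_chunks = [
        "Query performance improves with caching.",
        "Installation and setup take five minutes.",
        "Optimization tips are listed after the setup guide.",
    ]
    print(filter_chunks(
        sample_chunks,
        required_keywords=["performance", "optimization"],
        exclude_keywords=["installation", "setup"],
    ))
```

Under these assumptions an aggressive exclude list can remove most of a corpus, so it is worth checking how many chunks survive before starting a long generation run.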
Custom Question Template Example
from llama_index.core import PromptTemplate
from llama_index.core.llama_dataset import RagDatasetGenerator
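NUM_QUESTIONS = 5

# Custom template for domain-specific question generation.
# Assumption: the generator fills {context_str} (and {query_str}, if present)
# but not {num_questions_per_chunk}, so the count is baked in with an f-string.
custom_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and no prior knowledge, "
    f"generate {NUM_QUESTIONS} technical questions "
    "that a software engineer would ask about this content.\n"
)
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    text_question_template=custom_template,
    num_questions_per_chunk=NUM_QUESTIONS,
)
# Returns a LabelledRagDataset without reference answers; extract the query strings
question_dataset = dataset_generator.generate_questions_from_nodes()
questions = [example.query for example in question_dataset.examples]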
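A custom template only works if every placeholder it contains is one the generator actually fills. The sketch below uses the standard library's string.Formatter to list the placeholders in a template string so that unexpected names can be caught before any LLM call is made. The set of expected names, context_str and query_str, is an assumption about what the generator supplies, and placeholder_names is a hypothetical helper.

```python
from string import Formatter
from typing import Set

# Assumption: the placeholders the generator is expected to fill
EXPECTED_PLACEHOLDERS = {"context_str", "query_str"}


def placeholder_names(template: str) -> Set[str]:
    """Return the names of all {placeholders} found in a template string."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name
    }


def unexpected_placeholders(template: str) -> Set[str]:
    """Return placeholders that are not in the expected set."""
    return placeholder_names(template) - EXPECTED_PLACEHOLDERS


if __name__ == "__main__":
    template_text = (
        "Context information is below.\n"
        "{context_str}\n"
        "Generate {num_questions_per_chunk} technical questions.\n"
    )
    # Reports any names the generator would leave unfilled
    print(unexpected_placeholders(template_text))
```

A non-empty result suggests either hard-coding that value into the template text or moving the instruction into question_gen_query.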
Key Differences Between APIs
| Feature | DatasetGenerator | RagDatasetGenerator |
|---|---|---|
| Status | Deprecated | Preferred |
| Default questions per chunk | 10 | 3 |
| Parallel workers | Not supported | Supported via workers parameter |
| Output dataset type | QueryResponseDataset | LabelledRagDataset |
| Source node tracking | Limited | Full source context metadata |
| Import path | llama_index.core.evaluation | llama_index.core.llama_dataset |