Principle:Run llama Llama index QA Pair Generation
Overview
QA Pair Generation is the foundational data preparation step in the LlamaIndex embedding finetuning workflow. It involves using large language models to automatically generate synthetic question-answer pairs from document chunks, creating the labeled dataset needed for contrastive learning of embedding models.
The core idea is positive pair mining: given a document chunk, an LLM generates questions that the chunk answers. Each (question, document) pair becomes a positive training example, teaching the embedding model that certain queries and documents should be close in vector space.
Concept: Synthetic QA Dataset Generation
Embedding finetuning requires pairs of (query, relevant_document) for contrastive learning. Manually labeling such pairs is expensive and time-consuming. LlamaIndex automates this by:
- Chunking documents into TextNodes -- Each node represents a passage that can answer questions
- Prompting an LLM to generate questions -- For each chunk, the LLM produces questions that the chunk can answer
- Creating query-document relevance mappings -- Each generated question is linked to its source chunk as a positive pair
This approach is sometimes called synthetic data generation or distillation, where a powerful LLM's understanding is distilled into training data for a smaller embedding model.
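The three steps above can be sketched in plain Python. This is a minimal illustration rather than LlamaIndex's implementation: `chunk_text`, `generate_pairs`, and `fake_llm` are hypothetical names, the fixed-size word chunking is an assumption, and `fake_llm` stands in for a real LLM call.

```python
import uuid
from typing import Callable, Dict, List, Tuple


def chunk_text(text: str, chunk_size: int = 200) -> Dict[str, str]:
    """Split text into fixed-size word chunks keyed by a generated node ID."""
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return {str(uuid.uuid4()): chunk for chunk in chunks}


def generate_pairs(
    corpus: Dict[str, str],
    ask_llm: Callable[[str, int], List[str]],
    num_questions_per_chunk: int = 2,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Return (queries, relevant_docs) built from generated questions."""
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for node_id, text in corpus.items():
        for question in ask_llm(text, num_questions_per_chunk):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Link the question to its source chunk: one positive pair.
            relevant_docs[query_id] = [node_id]
    return queries, relevant_docs


def fake_llm(text: str, n: int) -> List[str]:
    """Hypothetical stand-in for a real LLM call (assumption)."""
    first_words = " ".join(text.split()[:3])
    return [f"Question {i + 1} about '{first_words}'?" for i in range(n)]
```

Calling `generate_pairs(chunk_text(document), fake_llm)` yields one entry in `relevant_docs` per generated question, each pointing back at the chunk that produced it.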
Concept: Contrastive Learning Data Requirements
Contrastive learning for embeddings requires:
- Positive pairs -- A query and its relevant document should have similar embeddings
- Negative pairs -- A query and irrelevant documents should have dissimilar embeddings (often handled implicitly via in-batch negatives)
The QA pair generation step focuses on producing high-quality positive pairs. The MultipleNegativesRankingLoss used during training constructs negative pairs automatically: for each query, the documents paired with the other queries in the same batch serve as negatives.
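To make the in-batch negatives idea concrete, here is a small pure-Python sketch of a ranking loss in that style. It is an illustration only, not the Sentence Transformers implementation; the cosine similarity, the `scale` value of 20, and the function names are assumptions made for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_negatives_loss(
    query_embs: List[Sequence[float]],
    doc_embs: List[Sequence[float]],
    scale: float = 20.0,  # assumed temperature-like scaling factor
) -> float:
    """Mean cross-entropy where doc i is the positive for query i.

    Every other document in the batch acts as a negative for query i.
    """
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Log-sum-exp computed stably by subtracting the max logit.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each query is most similar to its own document and large when another document in the batch scores higher, which is why only positive pairs need to be generated.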
Concept: Dataset Structure
The generated dataset follows a specific structure optimized for embedding training:
| Field | Type | Description |
|---|---|---|
| queries | Dict[str, str] | Maps query IDs to question strings |
| corpus | Dict[str, str] | Maps document IDs to document text |
| relevant_docs | Dict[str, List[str]] | Maps query IDs to lists of relevant document IDs |
| mode | str | The embedding mode, defaults to "text" |
This structure decouples queries from documents, allowing flexible pairing and evaluation.
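As a rough sketch of how the four fields fit together, the dataclass below mirrors the table. `QADataset` and its methods are hypothetical stand-ins written for this example, not the LlamaIndex class.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass
class QADataset:
    """Hypothetical container mirroring the four fields in the table."""

    queries: Dict[str, str]              # query ID -> question string
    corpus: Dict[str, str]               # document ID -> document text
    relevant_docs: Dict[str, List[str]]  # query ID -> relevant document IDs
    mode: str = "text"

    def positive_pairs(self) -> Iterator[Tuple[str, str]]:
        """Yield (question, document_text) pairs by joining on IDs."""
        for query_id, question in self.queries.items():
            for doc_id in self.relevant_docs[query_id]:
                yield question, self.corpus[doc_id]

    def to_json(self) -> str:
        """Serialize all four fields to a JSON string."""
        return json.dumps(asdict(self), indent=2)


dataset = QADataset(
    queries={"q1": "What does the chunk describe?"},
    corpus={"d1": "The chunk describes synthetic QA generation."},
    relevant_docs={"q1": ["d1"]},
)
```

Because queries and documents are stored separately and joined only through IDs, the same corpus can be reused with a different query set for evaluation.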
Concept: LLM Prompt Design for QA Generation
The default prompt template instructs the LLM to act as a teacher generating quiz questions:
```python
DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and no prior knowledge.
generate only questions based on the below query.
You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""
```
Key design choices:
- Role prompting -- "Teacher/Professor" encourages pedagogically sound questions
- Diversity constraint -- "diverse in nature" avoids repetitive questions
- Grounding constraint -- "no prior knowledge" and "Restrict the questions to the context information provided" ensure questions are answerable from the chunk
- Configurable count -- num_questions_per_chunk controls generation volume
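The sketch below shows how a template with these two placeholders might be filled in and how a numbered LLM response could be turned into a clean list of questions. `PROMPT_TMPL` is an abbreviated stand-in for the default template, and `build_prompt` and `parse_questions` are hypothetical helpers; the parsing rule (strip a leading number, drop blank lines, truncate to the requested count) is an assumption about a reasonable approach.

```python
import re
from typing import List

# Abbreviated stand-in template using the same two placeholders.
PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Set up {num_questions_per_chunk} quiz questions restricted to the "
    "context information provided.\n"
)


def build_prompt(context_str: str, num_questions_per_chunk: int) -> str:
    """Fill the chunk text and the question count into the template."""
    return PROMPT_TMPL.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )


def parse_questions(response: str, num_questions_per_chunk: int) -> List[str]:
    """Strip list numbering and blank lines, then cap the question count."""
    lines = response.strip().split("\n")
    cleaned = [
        re.sub(r"^\s*\d+[\).\s]\s*", "", line).strip() for line in lines
    ]
    return [q for q in cleaned if q][:num_questions_per_chunk]
```

Truncating to `num_questions_per_chunk` guards against an LLM that returns more questions than requested.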
Concept: Robustness and Incremental Saving
Production-scale QA generation over large document corpora requires robustness:
- Retry logic -- LLM API calls can fail; retries with configurable limits handle transient errors
- Failure modes -- Choose between "fail" (halt on error) and "continue" (skip problematic chunks)
- Periodic saving -- Datasets are saved every N nodes to prevent data loss during long runs
- Resume support -- The function loads existing data and continues from where it left off
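The four behaviours above can be combined in one loop. The following is a hedged sketch, not the library's code: `generate_with_checkpoints` is a hypothetical function, the checkpoint layout (including the `done` list) is an assumption, and the parameter names `retry_limit`, `on_failure`, `save_every`, and `output_path` are chosen for this illustration. In this sketch a chunk that exhausts its retries is skipped but not marked done, so a later run tries it again.

```python
import json
import os
from typing import Callable, Dict, List


def generate_with_checkpoints(
    corpus: Dict[str, str],
    ask_llm: Callable[[str], List[str]],
    output_path: str,
    retry_limit: int = 3,
    on_failure: str = "continue",  # "fail" halts, "continue" skips
    save_every: int = 500,
) -> dict:
    """Generate questions per chunk with retries, checkpoints, and resume."""
    # Resume support: reload earlier progress if a checkpoint exists.
    if os.path.exists(output_path):
        with open(output_path, encoding="utf-8") as f:
            state = json.load(f)
    else:
        state = {"queries": {}, "relevant_docs": {}, "done": []}
    done = set(state["done"])

    def save() -> None:
        state["done"] = sorted(done)
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(state, f)

    visited = 0
    for node_id, text in corpus.items():
        if node_id in done:
            continue  # handled in a previous run
        questions = None
        for attempt in range(retry_limit):
            try:
                questions = ask_llm(text)
                break
            except Exception:
                last_attempt = attempt == retry_limit - 1
                if last_attempt and on_failure == "fail":
                    raise  # halt on error
                # Otherwise retry, or skip the chunk after the last attempt.
        if questions is not None:
            for i, question in enumerate(questions):
                query_id = f"{node_id}-q{i}"
                state["queries"][query_id] = question
                state["relevant_docs"][query_id] = [node_id]
            done.add(node_id)
        visited += 1
        if visited % save_every == 0:
            save()  # periodic saving
    save()
    return state
```

Rerunning the function with the same `output_path` processes only the chunks that are not yet recorded as done.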
When to Use
- When finetuning embedding models for domain-specific retrieval tasks
- When you have a corpus of documents but no labeled query-document pairs
- When you want to improve retrieval quality for a specialized knowledge base
- When building custom RAG (Retrieval-Augmented Generation) pipelines
Knowledge Sources
- LlamaIndex Embedding Finetuning Guide
- Sentence Transformers Training Overview
Metadata
Machine Learning, Embeddings, Finetuning, RAG, LlamaIndex
Implementation:Run_llama_Llama_index_Generate_QA_Embedding_Pairs