Principle:Run llama Llama index QA Pair Generation

Overview

QA Pair Generation is the foundational data preparation step in the LlamaIndex embedding finetuning workflow. It involves using large language models to automatically generate synthetic question-answer pairs from document chunks, creating the labeled dataset needed for contrastive learning of embedding models.

The core idea is positive pair mining: given a document chunk, an LLM generates questions that the chunk answers. Each (question, document) pair becomes a positive training example, teaching the embedding model that certain queries and documents should be close in vector space.

Concept: Synthetic QA Dataset Generation

Embedding finetuning requires pairs of (query, relevant_document) for contrastive learning. Manually labeling such pairs is expensive and time-consuming. LlamaIndex automates this by:

Chunking documents into TextNodes -- Each node represents a passage that can answer questions
Prompting an LLM to generate questions -- For each chunk, the LLM produces questions that the chunk can answer
Creating query-document relevance mappings -- Each generated question is linked to its source chunk as a positive pair

This approach is sometimes called synthetic data generation or distillation, where a powerful LLM's understanding is distilled into training data for a smaller embedding model.
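The three steps above can be sketched in plain Python. This is a minimal illustration and not the LlamaIndex implementation: `chunk_text`, `build_positive_pairs`, and `fake_llm` are hypothetical names, the fixed-size character chunker stands in for a real node parser, and the fake LLM returns placeholder questions so the example runs without an API key.

```python
import uuid
from typing import Callable, Dict, List, Tuple


def chunk_text(text: str, chunk_size: int = 200) -> List[str]:
    """Split text into fixed-size character chunks (stand-in for a node parser)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_positive_pairs(
    chunks: List[str],
    ask_llm: Callable[[str, int], List[str]],
    num_questions_per_chunk: int = 2,
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, List[str]]]:
    """Link every generated question to the chunk it was generated from."""
    queries: Dict[str, str] = {}
    corpus: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for chunk in chunks:
        doc_id = str(uuid.uuid4())
        corpus[doc_id] = chunk
        for question in ask_llm(chunk, num_questions_per_chunk):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # The source chunk is the single positive document for this query.
            relevant_docs[query_id] = [doc_id]
    return queries, corpus, relevant_docs


def fake_llm(chunk: str, n: int) -> List[str]:
    """Hypothetical stand-in for an LLM call; returns placeholder questions."""
    return [f"Question {i + 1} about: {chunk[:30]}" for i in range(n)]


text = "LlamaIndex builds indexes over documents. " * 10
queries, corpus, relevant_docs = build_positive_pairs(chunk_text(text), fake_llm)
```

In a real run, `ask_llm` would wrap an actual LLM call and the chunks would come from the document loader and node parser already used in the pipeline.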

Concept: Contrastive Learning Data Requirements

Contrastive learning for embeddings requires:

Positive pairs -- A query and its relevant document should have similar embeddings
Negative pairs -- A query and irrelevant documents should have dissimilar embeddings (often handled implicitly via in-batch negatives)

The QA pair generation step focuses on producing high-quality positive pairs. The MultipleNegativesRankingLoss used during training constructs negative pairs automatically from other examples in the same batch.
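To make the in-batch negatives idea concrete, here is a small pure-Python sketch of a loss in the spirit of MultipleNegativesRankingLoss. It is an illustration and not the Sentence Transformers code: the function names are hypothetical, and the cosine similarity with `scale=20.0` is an assumption about typical defaults. For query i, document i is treated as the positive and every other document in the batch as a negative.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def in_batch_negatives_loss(
    query_vecs: List[List[float]],
    doc_vecs: List[List[float]],
    scale: float = 20.0,  # assumed similarity scaling factor
) -> float:
    """Mean cross-entropy where the correct 'class' for query i is document i."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)


query_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]  # each query is close to its own document
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]  # each query is close to the wrong document

loss_aligned = in_batch_negatives_loss(query_vecs, aligned_docs)
loss_swapped = in_batch_negatives_loss(query_vecs, swapped_docs)
```

The aligned batch yields a much lower loss than the swapped one, which is why only positive pairs need to be generated: the other rows of the batch supply the negatives.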

Concept: Dataset Structure

The generated dataset follows a specific structure optimized for embedding training:

Field	Type	Description
queries	Dict[str, str]	Maps query IDs to question strings
corpus	Dict[str, str]	Maps document IDs to document text
relevant_docs	Dict[str, List[str]]	Maps query IDs to lists of relevant document IDs
mode	str	The embedding mode, defaults to "text"

This structure decouples queries from documents, allowing flexible pairing and evaluation.
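A rough sketch of how such a structure might be represented and used follows. The `QADataset` class and its `positive_pairs` method are hypothetical names chosen for illustration, not the LlamaIndex class; only the four field names and the "text" default come from the table above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass
class QADataset:
    """Hypothetical container mirroring the four fields described above."""
    queries: Dict[str, str]               # query ID -> question string
    corpus: Dict[str, str]                # document ID -> document text
    relevant_docs: Dict[str, List[str]]   # query ID -> relevant document IDs
    mode: str = "text"

    def positive_pairs(self) -> List[Tuple[str, str]]:
        """Resolve IDs into (question, document text) training pairs."""
        return [
            (self.queries[query_id], self.corpus[doc_id])
            for query_id, doc_ids in self.relevant_docs.items()
            for doc_id in doc_ids
        ]


dataset = QADataset(
    queries={
        "q1": "What does a node represent?",
        "q2": "Why are documents chunked?",
    },
    corpus={"d1": "Each node represents a passage that can answer questions."},
    relevant_docs={"q1": ["d1"], "q2": ["d1"]},
)

pairs = dataset.positive_pairs()
serialized = json.dumps(asdict(dataset), indent=2)
```

Because queries and documents are stored once and joined by ID, two questions can point at the same chunk without duplicating its text, and the same mapping can later serve as ground truth for retrieval evaluation.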

Concept: LLM Prompt Design for QA Generation

The default prompt template instructs the LLM to act as a teacher generating quiz questions:

DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and no prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""

Key design choices:

Role prompting -- "Teacher/Professor" encourages pedagogically sound questions
Diversity constraint -- "diverse in nature" avoids repetitive questions
Grounding constraint -- "no prior knowledge" and "Restrict the questions to the context information provided" keep the questions answerable from the chunk alone
Configurable count -- num_questions_per_chunk controls generation volume
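The following sketch shows how a template with these two placeholders might be filled in and how a numbered LLM response could be turned into a list of questions. The template here is an abbreviated, hypothetical variant of the default prompt, and `build_prompt` and `parse_questions` are illustrative helpers; the parsing rule (strip leading numbering, drop empty lines, cap the count) is an assumption about a reasonable post-processing step.

```python
import re
from typing import List

# Abbreviated, hypothetical template; only the two placeholder names
# are taken from the default prompt shown above.
PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Set up {num_questions_per_chunk} questions for an upcoming quiz. "
    "Restrict the questions to the context information provided.\n"
)


def build_prompt(context_str: str, num_questions_per_chunk: int) -> str:
    """Fill the template with one chunk and the requested question count."""
    return PROMPT_TMPL.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )


def parse_questions(response: str, limit: int) -> List[str]:
    """Strip list numbering such as '1.' or '2)' and drop empty lines."""
    questions = []
    for line in response.strip().split("\n"):
        cleaned = re.sub(r"^\s*\d+[\).\s]\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    # Guard against the LLM returning more questions than requested.
    return questions[:limit]


prompt = build_prompt("A TextNode stores one chunk of a document.", 2)
sample_response = (
    "1. What is a TextNode?\n\n2) How are chunks created?\n3. Why chunk documents?"
)
questions = parse_questions(sample_response, limit=2)
```

Capping the parsed list at the requested count keeps the dataset size predictable even when the model ignores the number in the prompt.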

Concept: Robustness and Incremental Saving

Production-scale QA generation over large document corpora requires robustness:

Retry logic -- LLM API calls can fail; retries with configurable limits handle transient errors
Failure modes -- Choose between "fail" (halt on error) and "continue" (skip problematic chunks)
Periodic saving -- Datasets are saved every N nodes to prevent data loss during long runs
Resume support -- The generation function loads any previously saved dataset and continues from where it left off
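The four mechanisms above could fit together roughly as follows. This is a sketch under stated assumptions, not the library's implementation: `generate_with_checkpoints` and `flaky_llm` are hypothetical, the parameter names `retry_limit`, `on_failure`, and `save_every` are chosen to mirror the concepts in the list, and the on-disk format is a plain JSON file.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def generate_with_checkpoints(
    nodes: Dict[str, str],
    ask_llm: Callable[[str], List[str]],
    output_path: str,
    retry_limit: int = 3,
    on_failure: str = "continue",  # "fail" halts, "continue" skips the chunk
    save_every: int = 2,
) -> Dict[str, Dict]:
    # Resume support: start from whatever an earlier run already saved.
    if os.path.exists(output_path):
        with open(output_path) as f:
            data = json.load(f)
    else:
        data = {"queries": {}, "corpus": {}, "relevant_docs": {}}

    def save() -> None:
        with open(output_path, "w") as f:
            json.dump(data, f)

    processed = 0
    for node_id, text in nodes.items():
        if node_id in data["corpus"]:
            continue  # already handled in a previous run
        questions = None
        for attempt in range(retry_limit):
            try:
                questions = ask_llm(text)
                break
            except Exception:
                if attempt == retry_limit - 1 and on_failure == "fail":
                    save()  # keep progress before halting
                    raise
        if questions is None:
            continue  # retries exhausted and on_failure == "continue"
        data["corpus"][node_id] = text
        for i, question in enumerate(questions):
            query_id = f"{node_id}-q{i}"
            data["queries"][query_id] = question
            data["relevant_docs"][query_id] = [node_id]
        processed += 1
        if processed % save_every == 0:
            save()  # periodic checkpoint
    save()
    return data


calls = {"count": 0}


def flaky_llm(text: str) -> List[str]:
    """Hypothetical LLM stub: one transient failure, one permanently bad chunk."""
    calls["count"] += 1
    if "bad" in text:
        raise RuntimeError("simulated permanent failure")
    if calls["count"] == 1:
        raise RuntimeError("simulated transient failure")
    return [f"What does this passage say about {text.split()[0]}?"]


nodes = {
    "n1": "Embeddings map text to vectors.",
    "n2": "bad chunk",
    "n3": "Retrieval ranks documents.",
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "qa_dataset.json")
    first = generate_with_checkpoints(nodes, flaky_llm, path)
    calls_after_first = calls["count"]
    second = generate_with_checkpoints(nodes, flaky_llm, path)  # resumes from file
```

In this sketch a skipped chunk is not recorded in the corpus, so a later run will try it again; whether that is desirable depends on whether failures are expected to be transient.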

When to Use

When finetuning embedding models for domain-specific retrieval tasks
When you have a corpus of documents but no labeled query-document pairs
When you want to improve retrieval quality for a specialized knowledge base
When building custom RAG (Retrieval-Augmented Generation) pipelines

Knowledge Sources

LlamaIndex Embedding Finetuning Guide
Sentence Transformers Training Overview

Metadata

Machine Learning, Embeddings, Finetuning, RAG, LlamaIndex

Implementation:Run_llama_Llama_index_Generate_QA_Embedding_Pairs

2026-02-11 00:00 GMT
