Implementation:Run llama Llama index Generate QA Embedding Pairs

Overview

generate_qa_embedding_pairs is a function in LlamaIndex that generates synthetic question-answer pairs from document nodes using an LLM. It produces an EmbeddingQAFinetuneDataset object containing queries, corpus documents, and relevance mappings suitable for contrastive embedding finetuning.

Source Location

Property	Value
File	`llama-index-finetuning/llama_index/finetuning/embeddings/common.py`
Lines	103-211
Type	Module-level function
Import	`from llama_index.finetuning import generate_qa_embedding_pairs`

Function Signature

def generate_qa_embedding_pairs(
    nodes: List[TextNode],
    llm: LLM,
    qa_generate_prompt_tmpl: str = DEFAULT_QA_GENERATE_PROMPT_TMPL,
    num_questions_per_chunk: int = 2,
    retry_limit: int = 3,
    on_failure: str = "continue",
    save_every: int = 500,
    output_path: str = "qa_finetune_dataset.json",
    verbose: bool = True,
) -> EmbeddingQAFinetuneDataset:

Parameters

Parameter	Type	Default	Description
nodes	`List[TextNode]`	required	List of TextNode objects to generate questions from. Each node represents a document chunk.
llm	`LLM`	required	The LLM instance used to generate questions (e.g., OpenAI GPT-4, Anthropic Claude).
qa_generate_prompt_tmpl	`str`	`DEFAULT_QA_GENERATE_PROMPT_TMPL`	Template for the question generation prompt. Must contain `{context_str}` and `{num_questions_per_chunk}` placeholders.
num_questions_per_chunk	`int`	`2`	Number of questions to generate per document chunk.
retry_limit	`int`	`3`	Maximum number of retries for failed LLM calls.
on_failure	`str`	`"continue"`	Action on repeated failure: `"continue"` skips the node, `"fail"` raises a RuntimeError.
save_every	`int`	`500`	Save intermediate results every N nodes.
output_path	`str`	`"qa_finetune_dataset.json"`	File path for saving the dataset JSON.
verbose	`bool`	`True`	If True, print progress and debug messages.
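
A custom prompt only has to carry the two placeholders listed for qa_generate_prompt_tmpl. The sketch below assumes the template is filled with Python's str.format. CUSTOM_QA_TMPL and build_prompt are hypothetical names with illustrative wording, not the library default.

```python
# Hypothetical custom template; the wording is illustrative, not the library default.
CUSTOM_QA_TMPL = (
    "Context:\n"
    "{context_str}\n\n"
    "Write {num_questions_per_chunk} questions that can be answered "
    "using only the context above. Put one question per line."
)


def build_prompt(template: str, context: str, num_questions: int) -> str:
    """Fill the two required placeholders (assumes str.format-style filling)."""
    return template.format(
        context_str=context,
        num_questions_per_chunk=num_questions,
    )


prompt = build_prompt(CUSTOM_QA_TMPL, "LlamaIndex is a data framework.", 3)
print(prompt)
```

A template of this shape would be passed as qa_generate_prompt_tmpl=CUSTOM_QA_TMPL. Asking for one question per line keeps the response compatible with newline-based parsing.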

Return Value

Returns an EmbeddingQAFinetuneDataset instance with fields:

Field	Type	Description
queries	`Dict[str, str]`	Maps UUID query IDs to generated question strings
corpus	`Dict[str, str]`	Maps node IDs to document text content
relevant_docs	`Dict[str, List[str]]`	Maps query IDs to lists of relevant node IDs
mode	`str`	Defaults to `"text"`
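
To make the three mappings concrete, here is a minimal sketch that builds them by hand with plain dictionaries. The node ID, chunk text, and questions are made-up values for illustration; in a real run they come from the input nodes and the LLM.

```python
import uuid

# Illustrative values only; real IDs and text come from your nodes and the LLM.
node_id = "node-0"  # hypothetical node ID
corpus = {node_id: "The Eiffel Tower is in Paris."}

queries = {}
relevant_docs = {}
for question in ["Where is the Eiffel Tower?", "Which city has the Eiffel Tower?"]:
    query_id = str(uuid.uuid4())  # each question gets its own UUID key
    queries[query_id] = question
    relevant_docs[query_id] = [node_id]  # the chunk the question was written from

# Resolve each query to the text of its relevant chunk.
for query_id, question in queries.items():
    for doc_id in relevant_docs[query_id]:
        print(f"{question!r} -> {corpus[doc_id]!r}")
```

Every key in queries also appears in relevant_docs, and every ID listed there is a key in corpus, which is what lets a trainer pair each question with its positive passage.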

EmbeddingQAFinetuneDataset Class

Defined at lines 12-61 of the same file:

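import json
from typing import Dict, List, Tuple

from llama_index.core.bridge.pydantic import BaseModel


class EmbeddingQAFinetuneDataset(BaseModel):
    queries: Dict[str, str]  # query ID -> question text
    corpus: Dict[str, str]  # node ID -> chunk text
    relevant_docs: Dict[str, List[str]]  # query ID -> relevant node IDs
    mode: str = "text"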

    @property
    def query_docid_pairs(self) -> List[Tuple[str, List[str]]]:
        """Get query, relevant doc ids."""
        return [
            (query, self.relevant_docs[query_id])
            for query_id, query in self.queries.items()
        ]

    def save_json(self, path: str) -> None:
        """Save the dataset to a JSON file."""
        with open(path, "w") as f:
            json.dump(self.model_dump(), f, indent=4)

    @classmethod
    def from_json(cls, path: str) -> "EmbeddingQAFinetuneDataset":
        """Load the dataset from a JSON file."""
        with open(path) as f:
            data = json.load(f)
        return cls(**data)
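
The save and load round trip can be tried without installing the package. The sketch below uses QADatasetSketch, a hypothetical dataclass stand-in that mirrors the fields and methods shown above. It is not the library class and the sample data is invented.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass
class QADatasetSketch:  # hypothetical stand-in, not the library class
    queries: Dict[str, str]
    corpus: Dict[str, str]
    relevant_docs: Dict[str, List[str]]
    mode: str = "text"

    @property
    def query_docid_pairs(self) -> List[Tuple[str, List[str]]]:
        # Pairs the question text (not the query ID) with its relevant node IDs.
        return [(q, self.relevant_docs[qid]) for qid, q in self.queries.items()]

    def save_json(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=4)

    @classmethod
    def from_json(cls, path: str) -> "QADatasetSketch":
        with open(path) as f:
            return cls(**json.load(f))


original = QADatasetSketch(
    queries={"q1": "What is contrastive finetuning?"},
    corpus={"n1": "Contrastive finetuning pulls matching pairs together."},
    relevant_docs={"q1": ["n1"]},
)

# Write to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.json")
    original.save_json(path)
    restored = QADatasetSketch.from_json(path)

print(restored == original)
print(restored.query_docid_pairs)
```

Because all four fields are plain strings, dictionaries, and lists, the JSON file is a direct dump of the fields and can be inspected or edited by hand.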

Internal Behavior

The function proceeds through these steps:

1. Load existing data -- Calls load_existing_data(output_path) to resume from a previous run if the file exists
2. Build node dictionary -- Creates a mapping from node ID to text content (using MetadataMode.NONE)
3. Determine start index -- Skips nodes that were already processed, using the size of the loaded corpus as the offset
4. Iterate over remaining nodes -- For each unprocessed node:
   - Formats the prompt with the node's text and num_questions_per_chunk
   - Calls llm.complete(query) with retry logic
   - Parses the LLM response by splitting on newlines and stripping numbering prefixes
   - Truncates to num_questions_per_chunk questions; warns if fewer were generated
   - Assigns a UUID to each question and maps it to the source node ID
5. Periodic save -- Every save_every nodes, constructs and saves the dataset to output_path
6. Final save -- Saves the complete dataset and returns it
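
The steps above can be approximated with the standard library alone. This is a simplified sketch under stated assumptions, not the library's code: complete_with_retry, parse_questions, generate_pairs, and fake_llm are hypothetical names, the prompt text and the numbering regex are guesses at reasonable behavior, and file loading and periodic saving are left out. Resuming is modeled by passing in a corpus from an earlier run.

```python
import re
import uuid
import warnings
from typing import Callable, Dict, List, Optional, Tuple


def complete_with_retry(
    complete: Callable[[str], str], prompt: str, retry_limit: int = 3
) -> Optional[str]:
    """Call `complete` up to `retry_limit` times; return None if every attempt fails."""
    for attempt in range(1, retry_limit + 1):
        try:
            return complete(prompt)
        except Exception as exc:
            print(f"Error querying LLM: {exc}. Attempt {attempt}/{retry_limit}")
    return None


def parse_questions(response: str, limit: int) -> List[str]:
    """Split on newlines, strip prefixes such as '1.' or '2)', drop blanks, truncate."""
    lines = response.strip().split("\n")
    cleaned = [re.sub(r"^\s*\d+[\).\s]\s*", "", line).strip() for line in lines]
    return [q for q in cleaned if q][:limit]


def generate_pairs(
    node_texts: Dict[str, str],
    complete: Callable[[str], str],
    corpus: Optional[Dict[str, str]] = None,
    num_questions: int = 2,
    retry_limit: int = 3,
    on_failure: str = "continue",
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, List[str]]]:
    corpus = dict(corpus or {})
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    start_index = len(corpus)  # resume: skip as many nodes as are already stored
    for node_id, text in list(node_texts.items())[start_index:]:
        prompt = f"Context: {text}\nWrite {num_questions} questions."  # illustrative
        response = complete_with_retry(complete, prompt, retry_limit)
        if response is None:
            if on_failure == "fail":
                raise RuntimeError(f"Failed to query LLM after {retry_limit} attempts.")
            continue  # on_failure == "continue": skip this node
        questions = parse_questions(response, num_questions)
        if len(questions) < num_questions:
            warnings.warn(
                f"Fewer questions generated ({len(questions)}) "
                f"than requested ({num_questions})."
            )
        for question in questions:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            relevant_docs[query_id] = [node_id]
        corpus[node_id] = text
    return queries, corpus, relevant_docs


def fake_llm(prompt: str) -> str:
    """Stand-in for llm.complete that returns a fixed, numbered response."""
    return "1. What is described?\n2) Why does it matter?\n3. Extra question?"


queries, corpus, relevant_docs = generate_pairs(
    {"n1": "First chunk.", "n2": "Second chunk."},
    fake_llm,
    corpus={"n1": "First chunk."},  # pretend n1 was processed in an earlier run
)
print(sorted(queries.values()))
print(relevant_docs)
```

In this sketch, resuming is positional: it skips the first len(corpus) nodes, so the node list should be passed in the same order on every run.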

Usage Example

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

# Load and chunk documents
documents = SimpleDirectoryReader("data/").load_data()
splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)

# Generate QA pairs using an LLM
llm = OpenAI(model="gpt-4")
qa_dataset = generate_qa_embedding_pairs(
    nodes=nodes,
    llm=llm,
    num_questions_per_chunk=2,
    output_path="train_dataset.json",
)

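# The function has already written the dataset to output_path, so this
# explicit save is optional; it is kept here to show the save/load API
qa_dataset.save_json("train_dataset.json")
loaded_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")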

Dependencies

llama_index.core.schema.TextNode -- Input node type
llama_index.core.llms.utils.LLM -- LLM interface for question generation
llama_index.core.bridge.pydantic.BaseModel -- Base class for the dataset model
tqdm -- Progress bar display

Knowledge Sources

LlamaIndex Finetuning Source
LlamaIndex Embedding Finetuning Guide

Metadata

Machine Learning, Embeddings, Finetuning, LlamaIndex

Principle:Run_llama_Llama_index_QA_Pair_Generation

2026-02-11 00:00 GMT
