Implementation:Run llama Llama index Generate QA Embedding Pairs

From Leeroopedia

Overview

generate_qa_embedding_pairs is a function in the LlamaIndex finetuning package that uses an LLM to generate synthetic questions for each document node and pairs every question with the node it was generated from. It returns an EmbeddingQAFinetuneDataset containing the queries, the corpus documents, and the query-to-document relevance mappings, which makes the output suitable for contrastive embedding finetuning.

Source Location

Property | Value
File | llama-index-finetuning/llama_index/finetuning/embeddings/common.py
Lines | 103-211
Type | Module-level function
Import | from llama_index.finetuning import generate_qa_embedding_pairs

Function Signature

def generate_qa_embedding_pairs(
    nodes: List[TextNode],
    llm: LLM,
    qa_generate_prompt_tmpl: str = DEFAULT_QA_GENERATE_PROMPT_TMPL,
    num_questions_per_chunk: int = 2,
    retry_limit: int = 3,
    on_failure: str = "continue",
    save_every: int = 500,
    output_path: str = "qa_finetune_dataset.json",
    verbose: bool = True,
) -> EmbeddingQAFinetuneDataset:

Parameters

Parameter | Type | Default | Description
nodes | List[TextNode] | required | List of TextNode objects to generate questions from. Each node represents a document chunk.
llm | LLM | required | The LLM instance used to generate questions (e.g., OpenAI GPT-4, Anthropic Claude).
qa_generate_prompt_tmpl | str | DEFAULT_QA_GENERATE_PROMPT_TMPL | Template for the question generation prompt. Must contain {context_str} and {num_questions_per_chunk} placeholders.
num_questions_per_chunk | int | 2 | Number of questions to generate per document chunk.
retry_limit | int | 3 | Maximum number of retries for failed LLM calls.
on_failure | str | "continue" | Action on repeated failure: "continue" skips the node, "fail" raises a RuntimeError.
save_every | int | 500 | Save intermediate results every N nodes.
output_path | str | "qa_finetune_dataset.json" | File path for saving the dataset JSON.
verbose | bool | True | If True, print progress and debug messages.
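
The placeholder requirement on qa_generate_prompt_tmpl can be checked before any LLM calls are made. The following is a minimal sketch, assuming the template is filled with Python's str.format using the two placeholder names listed above. CUSTOM_QA_TMPL and missing_placeholders are hypothetical names introduced for illustration, and the template text is not the library default.

```python
from string import Formatter

# Hypothetical custom template; this is NOT DEFAULT_QA_GENERATE_PROMPT_TMPL.
CUSTOM_QA_TMPL = (
    "Context:\n"
    "{context_str}\n\n"
    "Write {num_questions_per_chunk} questions that can be answered "
    "using only the context above. Put one question on each line."
)

REQUIRED_FIELDS = {"context_str", "num_questions_per_chunk"}


def missing_placeholders(template: str) -> set:
    """Return the required placeholder names that the template lacks."""
    found = {name for _, name, _, _ in Formatter().parse(template) if name}
    return REQUIRED_FIELDS - found


# Fill the template the way a str.format-based caller would (assumption).
prompt = CUSTOM_QA_TMPL.format(
    context_str="Paris is the capital of France.",
    num_questions_per_chunk=2,
)
```

If missing_placeholders returns an empty set, the template can be passed as qa_generate_prompt_tmpl.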

Return Value

Returns an EmbeddingQAFinetuneDataset instance with fields:

Field | Type | Description
queries | Dict[str, str] | Maps UUID query IDs to generated question strings
corpus | Dict[str, str] | Maps node IDs to document text content
relevant_docs | Dict[str, List[str]] | Maps query IDs to lists of relevant node IDs
mode | str | Defaults to "text"
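
To make the relationship between the three mappings concrete, here is a small sketch built from plain dictionaries. The node ID, the texts, and the questions are made-up illustrative values, not output from the function.

```python
import uuid

# Illustrative values only; a real corpus is keyed by the nodes' own IDs.
node_id = "node-0"
corpus = {node_id: "Paris is the capital of France."}

queries = {}
relevant_docs = {}
for question in [
    "What is the capital of France?",
    "Which country has Paris as its capital?",
]:
    query_id = str(uuid.uuid4())  # each question gets its own UUID key
    queries[query_id] = question
    relevant_docs[query_id] = [node_id]  # link the question to its source node

# Resolve every question to the text of its relevant documents.
resolved = [
    (queries[query_id], [corpus[doc_id] for doc_id in doc_ids])
    for query_id, doc_ids in relevant_docs.items()
]
```

Every key of relevant_docs is also a key of queries, and every node ID it lists is a key of corpus.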

EmbeddingQAFinetuneDataset Class

Defined at lines 12-61 of the same file:

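# Imports required by this excerpt
import json
from typing import Dict, List, Tuple

from llama_index.core.bridge.pydantic import BaseModel


class EmbeddingQAFinetuneDataset(BaseModel):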
    queries: Dict[str, str]
    corpus: Dict[str, str]
    relevant_docs: Dict[str, List[str]]
    mode: str = "text"

    @property
    def query_docid_pairs(self) -> List[Tuple[str, List[str]]]:
        """Get query, relevant doc ids."""
        return [
            (query, self.relevant_docs[query_id])
            for query_id, query in self.queries.items()
        ]

    def save_json(self, path: str) -> None:
        """Save the dataset to a JSON file."""
        with open(path, "w") as f:
            json.dump(self.model_dump(), f, indent=4)

    @classmethod
    def from_json(cls, path: str) -> "EmbeddingQAFinetuneDataset":
        """Load the dataset from a JSON file."""
        with open(path) as f:
            data = json.load(f)
        return cls(**data)
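
The save and load behavior can be tried without installing LlamaIndex. The sketch below uses a standard-library dataclass as a stand-in; QADatasetSketch is a hypothetical class that mirrors the fields and methods shown above and is not part of the library.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass
class QADatasetSketch:
    """Hypothetical stand-in mirroring EmbeddingQAFinetuneDataset."""

    queries: Dict[str, str]
    corpus: Dict[str, str]
    relevant_docs: Dict[str, List[str]]
    mode: str = "text"

    @property
    def query_docid_pairs(self) -> List[Tuple[str, List[str]]]:
        """Pair each question string with its relevant node IDs."""
        return [(q, self.relevant_docs[qid]) for qid, q in self.queries.items()]

    def save_json(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=4)

    @classmethod
    def from_json(cls, path: str) -> "QADatasetSketch":
        with open(path) as f:
            return cls(**json.load(f))


original = QADatasetSketch(
    queries={"q1": "What is the capital of France?"},
    corpus={"n1": "Paris is the capital of France."},
    relevant_docs={"q1": ["n1"]},
)

# Round-trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.json")
    original.save_json(path)
    restored = QADatasetSketch.from_json(path)
```

Note that query_docid_pairs returns the question text, not the query ID, as the first element of each pair.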

Internal Behavior

The function proceeds through these steps:

  1. Load existing data -- Calls load_existing_data(output_path) to resume from a previous run if the file exists
  2. Build node dictionary -- Creates a mapping from node ID to text content (using MetadataMode.NONE)
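  3. Determine start index -- Treats the number of entries in the loaded corpus as the count of nodes already processed and skips that many nodes, so a resumed run should receive the nodes in the same order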
  4. Iterate over remaining nodes -- For each unprocessed node:
    • Formats the prompt with the node's text and num_questions_per_chunk
    • Calls llm.complete(query) with retry logic
    • Parses the LLM response by splitting on newlines and stripping numbering prefixes
    • Truncates to num_questions_per_chunk questions; warns if fewer were generated
    • Assigns a UUID to each question and maps it to the source node ID
  5. Periodic save -- Every save_every nodes, constructs and saves the dataset to output_path
  6. Final save -- Saves the complete dataset and returns it
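
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the names parse_questions, complete_with_retry, generate_pairs_sketch, and flaky_llm are hypothetical, the prompt is hard-coded instead of using a template, the numbering regex is an assumed pattern, and the progress bar, file saves, warnings, and any delay between retries are left out.

```python
import re
import uuid
from typing import Callable, Dict, List, Optional


def parse_questions(response_text: str, limit: int) -> List[str]:
    """Split on newlines, strip prefixes like "1." or "2)", drop blanks, truncate."""
    lines = response_text.strip().split("\n")
    cleaned = [re.sub(r"^\s*\d+[\).\s]\s*", "", line).strip() for line in lines]
    return [q for q in cleaned if q][:limit]


def complete_with_retry(
    complete: Callable[[str], str], prompt: str, retry_limit: int
) -> Optional[str]:
    """Try complete(prompt) up to retry_limit times; None if every attempt fails."""
    for _ in range(retry_limit):
        try:
            return complete(prompt)
        except Exception:
            continue
    return None


def generate_pairs_sketch(
    node_texts: Dict[str, str],
    complete: Callable[[str], str],
    existing=None,
    num_questions_per_chunk: int = 2,
    retry_limit: int = 3,
    on_failure: str = "continue",
):
    queries, corpus, relevant_docs = existing or ({}, {}, {})
    start_index = len(corpus)  # resume: the first len(corpus) nodes count as done
    for node_id, text in list(node_texts.items())[start_index:]:
        prompt = f"Context: {text}\nWrite {num_questions_per_chunk} questions."
        response = complete_with_retry(complete, prompt, retry_limit)
        if response is None:
            if on_failure == "fail":
                raise RuntimeError(f"LLM call failed for node {node_id}")
            continue  # "continue": skip this node
        for question in parse_questions(response, num_questions_per_chunk):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            relevant_docs[query_id] = [node_id]
        corpus[node_id] = text
    return queries, corpus, relevant_docs


calls = {"count": 0}


def flaky_llm(prompt: str) -> str:
    """Hypothetical stand-in for llm.complete: fails once, then answers."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("simulated transient failure")
    return "1. What is described?\n2) Why does it matter?\n3. An extra question?"


nodes = {"n1": "First chunk.", "n2": "Second chunk."}
queries, corpus, relevant_docs = generate_pairs_sketch(nodes, flaky_llm)
```

In this run the first call fails and is retried, and the third question in each response is dropped by the truncation step. Passing a previously built (queries, corpus, relevant_docs) tuple as existing skips the nodes it already covers.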

Usage Example

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

# Load and chunk documents
documents = SimpleDirectoryReader("data/").load_data()
splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)

# Generate QA pairs using an LLM (resumes from output_path if that file already exists)
llm = OpenAI(model="gpt-4")
qa_dataset = generate_qa_embedding_pairs(
    nodes=nodes,
    llm=llm,
    num_questions_per_chunk=2,
    output_path="train_dataset.json",
)

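# generate_qa_embedding_pairs has already written the dataset to output_path,
# so this explicit save is redundant; it is shown to illustrate save_json
qa_dataset.save_json("train_dataset.json")

# Reload the dataset from disk
loaded_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")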

Dependencies

  • llama_index.core.schema.TextNode -- Input node type
  • llama_index.core.llms.utils.LLM -- LLM interface for question generation
  • llama_index.core.bridge.pydantic.BaseModel -- Base class for the dataset model
  • tqdm -- Progress bar display

Knowledge Sources

  • LlamaIndex Finetuning Source
  • LlamaIndex Embedding Finetuning Guide

Metadata

Machine Learning Embeddings Finetuning LlamaIndex

Principle:Run_llama_Llama_index_QA_Pair_Generation

2026-02-11 00:00 GMT
