Implementation:Run llama Llama index Generate QA Embedding Pairs
Overview
generate_qa_embedding_pairs is a function in LlamaIndex that generates synthetic question-answer pairs from document nodes using an LLM. It produces an EmbeddingQAFinetuneDataset object containing queries, corpus documents, and relevance mappings suitable for contrastive embedding finetuning.
Source Location
| Property | Value |
|---|---|
| File | llama-index-finetuning/llama_index/finetuning/embeddings/common.py |
| Lines | 103-211 |
| Type | Module-level function |
| Import | from llama_index.finetuning import generate_qa_embedding_pairs |
Function Signature
def generate_qa_embedding_pairs(
    nodes: List[TextNode],
    llm: LLM,
    qa_generate_prompt_tmpl: str = DEFAULT_QA_GENERATE_PROMPT_TMPL,
    num_questions_per_chunk: int = 2,
    retry_limit: int = 3,
    on_failure: str = "continue",
    save_every: int = 500,
    output_path: str = "qa_finetune_dataset.json",
    verbose: bool = True,
) -> EmbeddingQAFinetuneDataset:
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| nodes | List[TextNode] | required | List of TextNode objects to generate questions from. Each node represents a document chunk. |
| llm | LLM | required | The LLM instance used to generate questions (e.g., OpenAI GPT-4, Anthropic Claude). |
| qa_generate_prompt_tmpl | str | DEFAULT_QA_GENERATE_PROMPT_TMPL | Template for the question generation prompt. Must contain {context_str} and {num_questions_per_chunk} placeholders. |
| num_questions_per_chunk | int | 2 | Number of questions to generate per document chunk. |
| retry_limit | int | 3 | Maximum number of retries for failed LLM calls. |
| on_failure | str | "continue" | Action on repeated failure: "continue" skips the node, "fail" raises a RuntimeError. |
| save_every | int | 500 | Save intermediate results every N nodes. |
| output_path | str | "qa_finetune_dataset.json" | File path for saving the dataset JSON. |
| verbose | bool | True | If True, print progress and debug messages. |
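A custom value for qa_generate_prompt_tmpl can be checked before any LLM call is made. The sketch below assumes the template is filled with Python's str.format using the two placeholder names from the table; the template wording and the render_prompt helper are hypothetical and are not part of LlamaIndex.

```python
# Hypothetical custom template. Only the two placeholder names come from the
# parameter table; the surrounding wording is illustrative.
CUSTOM_QA_PROMPT_TMPL = """\
Context information is below.
---------------------
{context_str}
---------------------
Using only the context above, write {num_questions_per_chunk} questions
that a reader could answer from this passage. Put one question per line.
"""


def render_prompt(template: str, context_str: str, num_questions_per_chunk: int) -> str:
    """Fill both placeholders with str.format (assumed formatting mechanism)."""
    return template.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )


prompt = render_prompt(CUSTOM_QA_PROMPT_TMPL, "LlamaIndex is a data framework.", 2)
print(prompt)
```

With str.format, any literal braces in a template (for example a JSON snippet) have to be doubled as {{ and }}, otherwise formatting raises a KeyError or ValueError.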
Return Value
Returns an EmbeddingQAFinetuneDataset instance with fields:
| Field | Type | Description |
|---|---|---|
| queries | Dict[str, str] | Maps UUID query IDs to generated question strings |
| corpus | Dict[str, str] | Maps node IDs to document text content |
| relevant_docs | Dict[str, List[str]] | Maps query IDs to lists of relevant node IDs |
| mode | str | Defaults to "text" |
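The three mappings are linked by ID: a query ID leads to node IDs through relevant_docs, and each node ID leads to text through corpus. The sketch below uses plain dictionaries with made-up IDs and text to show how (question, positive passage) pairs could be assembled for contrastive training; it does not use the LlamaIndex class, and the positives_for helper is hypothetical.

```python
# Illustrative values only. Real query IDs are UUID strings and node IDs
# come from the input nodes.
queries = {
    "q1": "What does the splitter do?",
    "q2": "Which component writes the questions?",
}
corpus = {
    "node-a": "The splitter breaks documents into chunks.",
    "node-b": "An LLM writes the questions.",
}
relevant_docs = {"q1": ["node-a"], "q2": ["node-b"]}


def positives_for(query_id: str) -> list:
    """Resolve a query ID to the text of every node marked relevant to it."""
    return [corpus[node_id] for node_id in relevant_docs[query_id]]


# One (question text, passage text) pair per relevant node.
training_pairs = [
    (queries[query_id], passage)
    for query_id in queries
    for passage in positives_for(query_id)
]
print(training_pairs)
```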
EmbeddingQAFinetuneDataset Class
Defined at lines 12-61 of the same file:
class EmbeddingQAFinetuneDataset(BaseModel):
    queries: Dict[str, str]
    corpus: Dict[str, str]
    relevant_docs: Dict[str, List[str]]
    mode: str = "text"

    @property
    def query_docid_pairs(self) -> List[Tuple[str, List[str]]]:
        """Get query, relevant doc ids."""
        return [
            (query, self.relevant_docs[query_id])
            for query_id, query in self.queries.items()
        ]

    def save_json(self, path: str) -> None:
        """Save the dataset to a JSON file."""
        with open(path, "w") as f:
            json.dump(self.model_dump(), f, indent=4)

    @classmethod
    def from_json(cls, path: str) -> "EmbeddingQAFinetuneDataset":
        """Load the dataset from a JSON file."""
        with open(path) as f:
            data = json.load(f)
        return cls(**data)
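The persistence behavior can be tried without installing LlamaIndex by using a small stand-in. The sketch below assumes a dataclass named QADatasetSketch (hypothetical, standard library only) that mirrors the fields and methods listed above; it is not the real Pydantic model and skips validation.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass
class QADatasetSketch:
    """Stand-in for EmbeddingQAFinetuneDataset; not the real Pydantic model."""

    queries: Dict[str, str]
    corpus: Dict[str, str]
    relevant_docs: Dict[str, List[str]]
    mode: str = "text"

    @property
    def query_docid_pairs(self) -> List[Tuple[str, List[str]]]:
        # The first element is the question text, not the query ID.
        return [(text, self.relevant_docs[qid]) for qid, text in self.queries.items()]

    def save_json(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=4)

    @classmethod
    def from_json(cls, path: str) -> "QADatasetSketch":
        with open(path) as f:
            return cls(**json.load(f))


original = QADatasetSketch(
    queries={"q1": "What is a node?"},
    corpus={"node-a": "A node is a chunk of a document."},
    relevant_docs={"q1": ["node-a"]},
)

# Round-trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "qa_dataset.json")
    original.save_json(path)
    restored = QADatasetSketch.from_json(path)

print(restored.query_docid_pairs)
```

The saved file is a single JSON object with four top-level keys (queries, corpus, relevant_docs, mode), which is why from_json can pass the loaded dictionary straight to the constructor.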
Internal Behavior
The function proceeds through these steps:
- Load existing data -- Calls load_existing_data(output_path) to resume from a previous run if the file exists
- Build node dictionary -- Creates a mapping from node ID to text content (using MetadataMode.NONE)
- Determine start index -- Skips nodes already processed based on corpus size
- Iterate over remaining nodes -- For each unprocessed node:
  - Formats the prompt with the node's text and num_questions_per_chunk
  - Calls llm.complete(query) with retry logic
  - Parses the LLM response by splitting on newlines and stripping numbering prefixes
  - Truncates to num_questions_per_chunk questions; warns if fewer were generated
  - Assigns a UUID to each question and maps it to the source node ID
- Periodic save -- Every save_every nodes, constructs and saves the dataset to output_path
- Final save -- Saves the complete dataset and returns it
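The per-node steps above can be sketched with the standard library alone. This is a simplified illustration, not the library's code: the LLM is replaced by a plain callable, resuming and periodic saving are left out, and the numbering regex, the prompt text, and the reading of retry_limit as a total attempt count are all assumptions. The helper names (parse_questions, call_with_retry, build_pairs, fake_llm) are hypothetical.

```python
import re
import uuid
from typing import Callable, Dict, List, Optional, Tuple


def parse_questions(response_text: str, limit: int) -> List[str]:
    """Split on newlines, strip '1.' or '1)' prefixes, drop blanks, keep at most `limit`."""
    cleaned = [
        re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        for line in response_text.strip().split("\n")
    ]
    return [question for question in cleaned if question][:limit]


def call_with_retry(
    complete: Callable[[str], str], prompt: str, retry_limit: int
) -> Optional[str]:
    """Return the response text, or None once `retry_limit` attempts have failed."""
    for _ in range(retry_limit):
        try:
            return complete(prompt)
        except Exception:
            continue
    return None


def build_pairs(
    node_dict: Dict[str, str],
    complete: Callable[[str], str],
    num_questions_per_chunk: int = 2,
    retry_limit: int = 3,
    on_failure: str = "continue",
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for node_id, text in node_dict.items():
        # Illustrative prompt; the real function formats qa_generate_prompt_tmpl.
        prompt = f"Context:\n{text}\nWrite {num_questions_per_chunk} questions."
        response = call_with_retry(complete, prompt, retry_limit)
        if response is None:
            if on_failure == "fail":
                raise RuntimeError(f"Question generation failed for node {node_id}")
            continue  # on_failure == "continue": skip this node
        for question in parse_questions(response, num_questions_per_chunk):
            question_id = str(uuid.uuid4())
            queries[question_id] = question
            relevant_docs[question_id] = [node_id]
    return queries, relevant_docs


def fake_llm(prompt: str) -> str:
    """Stand-in for llm.complete that returns one more question than requested."""
    return "1. What is covered here?\n2) Why does it matter?\n3. Extra question?"


queries, relevant_docs = build_pairs({"node-a": "Some chunk text."}, fake_llm)
print(sorted(queries.values()))
```

Because each question gets a fresh UUID, two runs over the same nodes produce different query IDs even when the LLM returns identical text.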
Usage Example
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

# Load and chunk documents
documents = SimpleDirectoryReader("data/").load_data()
splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)

# Generate QA pairs using an LLM
llm = OpenAI(model="gpt-4")
qa_dataset = generate_qa_embedding_pairs(
    nodes=nodes,
    llm=llm,
    num_questions_per_chunk=2,
    output_path="train_dataset.json",
)

# Save and reload
qa_dataset.save_json("train_dataset.json")
loaded_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
Dependencies
- llama_index.core.schema.TextNode -- Input node type
- llama_index.core.llms.utils.LLM -- LLM interface for question generation
- llama_index.core.bridge.pydantic.BaseModel -- Base class for the dataset model
- tqdm -- Progress bar display
Knowledge Sources
- LlamaIndex Finetuning Source
- LlamaIndex Embedding Finetuning Guide
Metadata
Machine Learning Embeddings Finetuning LlamaIndex