Implementation:Run llama Llama index Reranker Dataset Gen

Overview

This module provides dataset generation utilities for Cohere reranker fine-tuning. It includes functions for generating hard negatives (via random sampling or cosine similarity), extracting query-context pairs from an existing embedding QA dataset, and producing a complete JSONL training file for Cohere's custom reranker model API.

Source file: llama-index-finetuning/llama_index/finetuning/rerankers/dataset_gen.py (128 lines)

Dependencies

Dependency	Purpose
`random`	Random sampling for hard negative generation
`llama_index.core.bridge.pydantic.BaseModel`	Base class for the `CohereRerankerFinetuneDataset` model
`llama_index.core.indices.query.embedding_utils.get_top_k_embeddings`	Retrieval of top-k similar embeddings for cosine-similarity-based hard negatives
`llama_index.finetuning.EmbeddingQAFinetuneDataset`	Input dataset type containing queries, corpus, and relevance mappings

Class: CohereRerankerFinetuneDataset

Inherits from: pydantic.BaseModel

A Pydantic model representing a single training entry for Cohere reranker fine-tuning.

class CohereRerankerFinetuneDataset(BaseModel):
    query: str
    relevant_passages: List[str]
    hard_negatives: Any

Field	Type	Description
`query`	`str`	The query text
`relevant_passages`	`List[str]`	List of passages relevant to the query (typically a single passage)
`hard_negatives`	`Any`	List of hard negative passages (similar but not relevant)

Method: to_jsonl

def to_jsonl(self) -> str

Serializes the instance to a JSON string followed by a newline character, suitable for JSONL file output. Uses Pydantic's model_dump_json() for serialization.
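
The JSONL convention described above (one JSON object per line, newline-terminated) can be illustrated with the standard library alone. This is a minimal sketch, not the module's code: `RerankerEntrySketch` is a hypothetical dataclass standing in for the Pydantic model, and `json.dumps` stands in for `model_dump_json()`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List


@dataclass
class RerankerEntrySketch:
    """Hypothetical stand-in for CohereRerankerFinetuneDataset (not the real class)."""

    query: str
    relevant_passages: List[str]
    hard_negatives: Any

    def to_jsonl(self) -> str:
        # Serialize to a single-line JSON object, then terminate the line.
        return json.dumps(asdict(self)) + "\n"


entry = RerankerEntrySketch(
    query="What is a reranker?",
    relevant_passages=["A reranker reorders retrieved passages by relevance."],
    hard_negatives=["A retriever fetches candidate passages."],
)
line = entry.to_jsonl()
```

Because each call returns exactly one newline-terminated line, the results can be written to a file one after another without any extra separator.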

Function: generate_embeddings

def generate_embeddings(embed_model: Any, text: str) -> List[float]

A thin wrapper around an embedding model's get_text_embedding() method. Takes a single text string and returns its embedding vector.
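
Since `embed_model` is typed `Any`, any object exposing `get_text_embedding()` should satisfy the wrapper. The sketch below re-states the wrapper as described and pairs it with `ToyEmbedModel`, a hypothetical model invented here for illustration; its three-number "embedding" has no semantic meaning.

```python
from typing import Any, List


class ToyEmbedModel:
    """Hypothetical embedding model; only get_text_embedding() is needed."""

    def get_text_embedding(self, text: str) -> List[float]:
        # Toy 3-dimensional vector: length, vowel count, space count.
        vowels = sum(ch in "aeiou" for ch in text.lower())
        return [float(len(text)), float(vowels), float(text.count(" "))]


def generate_embeddings(embed_model: Any, text: str) -> List[float]:
    # Delegate to the model, as the description above states.
    return embed_model.get_text_embedding(text)


vector = generate_embeddings(ToyEmbedModel(), "hard negatives")
```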

Function: generate_hard_negatives

def generate_hard_negatives(
    queries: List[str],
    relevant_contexts: List[str],
    embed_model: Optional[Any],
    num_negatives: int = 5,
    method: str = "random",
) -> Any

Parameter	Type	Default	Description
`queries`	`List[str]`	required	List of query strings
`relevant_contexts`	`List[str]`	required	List of relevant context strings (one per query, aligned by index)
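`embed_model`	`Optional[Any]`	required	Embedding model used by the `"cosine_similarity"` method. The argument has no default and must be passed explicitly, but it may be `None` for the `"random"` method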
`num_negatives`	`int`	`5`	Number of hard negatives to generate per query
`method`	`str`	`"random"`	Hard negative generation strategy

Supported methods:

Method	Description
`"random"`	Randomly samples `num_negatives` contexts from the relevant_contexts list, excluding the context at the same index as the query. Uses `random.sample()` for selection without replacement.
`"cosine_similarity"`	Computes embeddings for all queries and contexts, then for each query selects the `num_negatives` most similar contexts (by embedding cosine similarity) that are not the correct relevant context. Uses `get_top_k_embeddings` to rank all contexts and then filters out the correct one.
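
The `"random"` strategy can be sketched as follows. This is an illustrative re-implementation, not the module's code: the function name `random_negatives_sketch` is hypothetical, and capping the sample size at the number of available candidates is an assumption made here so that `random.sample()` does not raise `ValueError` on small datasets.

```python
import random
from typing import List


def random_negatives_sketch(
    relevant_contexts: List[str], num_negatives: int = 5
) -> List[List[str]]:
    negatives = []
    for query_index in range(len(relevant_contexts)):
        # Every context except the one aligned with this query is a candidate.
        candidates = (
            relevant_contexts[:query_index] + relevant_contexts[query_index + 1 :]
        )
        # Assumption: never request more samples than there are candidates.
        k = min(num_negatives, len(candidates))
        # random.sample() selects without replacement.
        negatives.append(random.sample(candidates, k))
    return negatives


contexts = ["ctx A", "ctx B", "ctx C", "ctx D"]
sampled = random_negatives_sketch(contexts, num_negatives=2)
```

In this sketch the exclusion is by list index, so if the same passage text appears at two positions, the duplicate can still be drawn as a negative.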

Design note: When method="cosine_similarity", the embeddings for all queries and contexts are computed with the provided embed_model before the per-query loop rather than inside it. The embed_model parameter is not used when method="random".
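
The `"cosine_similarity"` strategy can be sketched in plain Python. Here a hand-written ranking stands in for `get_top_k_embeddings`, whose signature is not shown on this page; the names `cosine` and `cosine_negatives_sketch` and the 2-dimensional vectors are hypothetical. Embeddings are passed in already computed, mirroring the pre-computation described in the design note.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_negatives_sketch(
    query_embeddings: List[List[float]],
    context_embeddings: List[List[float]],
    contexts: List[str],
    num_negatives: int = 5,
) -> List[List[str]]:
    negatives = []
    for query_index, query_vec in enumerate(query_embeddings):
        # Rank every context index by similarity to the query, best first.
        ranked = sorted(
            range(len(contexts)),
            key=lambda i: cosine(query_vec, context_embeddings[i]),
            reverse=True,
        )
        # Drop the correct context, then keep the closest remaining ones.
        hard = [i for i in ranked if i != query_index][:num_negatives]
        negatives.append([contexts[i] for i in hard])
    return negatives


contexts = ["cats", "kittens", "stock markets"]
context_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_vecs = [[1.0, 0.05], [0.8, 0.2], [0.1, 1.0]]
hardest = cosine_negatives_sketch(query_vecs, context_vecs, contexts, num_negatives=1)
```

With these toy vectors, the hardest negative for the "cats" query is "kittens", the passage closest to the query that is not its own context.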

Function: get_query_context_lists

def get_query_context_lists(
    query_context_pairs: EmbeddingQAFinetuneDataset,
) -> Tuple[List[str], List[str]]

Extracts parallel lists of queries and their corresponding relevant contexts from an EmbeddingQAFinetuneDataset.

Workflow:

1. Iterates over all query IDs and query texts in query_context_pairs.queries.
2. For each query, looks up the first relevant document ID from query_context_pairs.relevant_docs.
3. Retrieves the corresponding document text from query_context_pairs.corpus.
4. Returns two aligned lists: (queries, relevant_contexts).
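
The workflow above can be sketched with plain dictionaries standing in for the dataset. The assumed shapes are labeled here rather than taken from the library: `queries` maps query ID to text, `corpus` maps document ID to text, and `relevant_docs` maps query ID to a list of document IDs. The function name `query_context_lists_sketch` is hypothetical.

```python
from typing import Dict, List, Tuple


def query_context_lists_sketch(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    query_texts: List[str] = []
    contexts: List[str] = []
    for query_id, query_text in queries.items():
        # Only the first relevant document is used for each query.
        first_doc_id = relevant_docs[query_id][0]
        query_texts.append(query_text)
        contexts.append(corpus[first_doc_id])
    return query_texts, contexts


queries = {"q1": "What is a reranker?", "q2": "What is JSONL?"}
corpus = {
    "d1": "A reranker reorders passages.",
    "d2": "JSONL stores one JSON object per line.",
    "d3": "Unused passage.",
}
relevant_docs = {"q1": ["d1", "d3"], "q2": ["d2"]}
query_list, context_list = query_context_lists_sketch(queries, corpus, relevant_docs)
```

Note that "d3" is listed as relevant to "q1" but never reaches the output, because only the first relevant document per query is kept.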

Function: generate_cohere_reranker_finetuning_dataset

def generate_cohere_reranker_finetuning_dataset(
    query_context_pairs: EmbeddingQAFinetuneDataset,
    num_negatives: int = 0,
    top_k_dissimilar: int = 100,
    hard_negatives_gen_method: str = "random",
    finetune_dataset_file_name: str = "train.jsonl",
    embed_model: Optional[Any] = None,
) -> Any

Parameter	Type	Default	Description
`query_context_pairs`	`EmbeddingQAFinetuneDataset`	required	Dataset containing queries, corpus, and relevance mappings
`num_negatives`	`int`	`0`	Number of hard negatives per query; if 0, no hard negatives are generated
`top_k_dissimilar`	`int`	`100`	Parameter accepted but not currently used in the implementation
`hard_negatives_gen_method`	`str`	`"random"`	Method for generating hard negatives (`"random"` or `"cosine_similarity"`)
`finetune_dataset_file_name`	`str`	`"train.jsonl"`	Output file path for the JSONL dataset
`embed_model`	`Optional[Any]`	`None`	Embedding model for cosine-similarity-based hard negative generation

Workflow:

1. Extracts query and context lists via get_query_context_lists().
2. If num_negatives > 0, generates hard negatives using the specified method. Otherwise, assigns empty lists as hard negatives for all queries.
3. Opens the output file in write mode and iterates over the zipped queries, contexts, and hard negatives.
4. For each entry, creates a CohereRerankerFinetuneDataset instance with:
   - query -- the query string
   - relevant_passages -- a single-element list containing the relevant context
   - hard_negatives -- the list of hard negative passages
5. Writes each entry as a JSONL line via entry.to_jsonl().
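
The file-writing part of this workflow can be sketched with the standard library. This is a simplified illustration under stated assumptions: plain dicts and `json.dumps` replace the Pydantic entry class, the helper name `write_reranker_jsonl_sketch` is hypothetical, and the example takes the `num_negatives = 0` branch so every query receives an empty list of hard negatives.

```python
import json
import os
import tempfile
from typing import List


def write_reranker_jsonl_sketch(
    queries: List[str],
    contexts: List[str],
    hard_negatives: List[List[str]],
    path: str,
) -> None:
    # Write mode replaces any existing file at this path.
    with open(path, "w") as outfile:
        for query, context, negatives in zip(queries, contexts, hard_negatives):
            entry = {
                "query": query,
                "relevant_passages": [context],  # single-element list
                "hard_negatives": negatives,
            }
            outfile.write(json.dumps(entry) + "\n")


queries = ["What is a reranker?", "What is JSONL?"]
contexts = ["A reranker reorders passages.", "JSONL stores one JSON object per line."]
# num_negatives == 0 case: one empty list per query.
hard_negatives = [[] for _ in queries]

output_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_reranker_jsonl_sketch(queries, contexts, hard_negatives, output_path)

# Read the file back to confirm each line parses as one JSON object.
with open(output_path) as infile:
    rows = [json.loads(line) for line in infile]
```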

Typical Usage Pipeline

EmbeddingQAFinetuneDataset (queries + corpus + relevance mappings)
    |
    v
generate_cohere_reranker_finetuning_dataset()
    |
    +--> get_query_context_lists()  -->  (queries, contexts)
    |
    +--> generate_hard_negatives()  -->  hard_negatives per query
    |
    v
train.jsonl (CohereRerankerFinetuneDataset entries)
    |
    v
CohereRerankerFinetuneEngine (training)
