Implementation:Run llama Llama index Reranker Dataset Gen

From Leeroopedia

Overview

This module provides dataset generation utilities for Cohere reranker fine-tuning. It includes functions for generating hard negatives (via random sampling or cosine similarity), extracting query-context pairs from an existing embedding QA dataset, and producing a complete JSONL training file for Cohere's custom reranker model API.

Source file: llama-index-finetuning/llama_index/finetuning/rerankers/dataset_gen.py (128 lines)

Dependencies

| Dependency | Purpose |
|---|---|
| random | Random sampling for hard negative generation |
| llama_index.core.bridge.pydantic.BaseModel | Base class for the CohereRerankerFinetuneDataset model |
| llama_index.core.indices.query.embedding_utils.get_top_k_embeddings | Retrieval of top-k similar embeddings for cosine-similarity-based hard negatives |
| llama_index.finetuning.EmbeddingQAFinetuneDataset | Input dataset type containing queries, corpus, and relevance mappings |

Class: CohereRerankerFinetuneDataset

Inherits from: pydantic.BaseModel

A Pydantic model representing a single training entry for Cohere reranker fine-tuning.

class CohereRerankerFinetuneDataset(BaseModel):
    query: str
    relevant_passages: List[str]
    hard_negatives: Any

| Field | Type | Description |
|---|---|---|
| query | str | The query text |
| relevant_passages | List[str] | List of passages relevant to the query (typically a single passage) |
| hard_negatives | Any | List of hard negative passages (similar but not relevant) |

Method: to_jsonl

def to_jsonl(self) -> str

Serializes the instance to a JSON string followed by a newline character, suitable for JSONL file output. Uses Pydantic's model_dump_json() for serialization.
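The JSONL convention described above (one JSON object per line, newline-terminated) can be illustrated with a minimal sketch. It deliberately avoids Pydantic so that it runs with the standard library only: RerankerEntry is a hypothetical stand-in for CohereRerankerFinetuneDataset, and json.dumps() stands in for model_dump_json().

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List


@dataclass
class RerankerEntry:
    """Hypothetical, dependency-free stand-in for CohereRerankerFinetuneDataset."""

    query: str
    relevant_passages: List[str]
    hard_negatives: Any = field(default_factory=list)

    def to_jsonl(self) -> str:
        # One JSON object per line: serialize, then terminate with a newline.
        return json.dumps(asdict(self)) + "\n"


entry = RerankerEntry(
    query="What is a reranker?",
    relevant_passages=["A reranker reorders retrieved passages by relevance."],
    hard_negatives=["A retriever fetches candidate passages."],
)
line = entry.to_jsonl()
```

Because each call returns exactly one newline-terminated record, the lines can be written to a file one after another without any extra separator.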

Function: generate_embeddings

def generate_embeddings(embed_model: Any, text: str) -> List[float]

A thin wrapper around an embedding model's get_text_embedding() method. Takes a single text string and returns its embedding vector.
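As a rough illustration of the wrapper, the sketch below pairs it with a toy model. ToyEmbedModel is a hypothetical stand-in that only mimics the get_text_embedding() interface; its output is not a meaningful embedding.

```python
from typing import Any, List


class ToyEmbedModel:
    """Hypothetical stand-in exposing get_text_embedding(); not a real model."""

    def get_text_embedding(self, text: str) -> List[float]:
        # Toy features only: character count, vowel count, word count.
        vowels = sum(ch in "aeiou" for ch in text.lower())
        return [float(len(text)), float(vowels), float(len(text.split()))]


def generate_embeddings(embed_model: Any, text: str) -> List[float]:
    # Delegate to the model; the wrapper adds no logic of its own.
    return embed_model.get_text_embedding(text)


vector = generate_embeddings(ToyEmbedModel(), "hello world")
```

Any object with a compatible get_text_embedding() method can be passed, which is why the parameter is typed as Any.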

Function: generate_hard_negatives

def generate_hard_negatives(
    queries: List[str],
    relevant_contexts: List[str],
    embed_model: Optional[Any],
    num_negatives: int = 5,
    method: str = "random",
) -> Any

| Parameter | Type | Default | Description |
|---|---|---|---|
| queries | List[str] | required | List of query strings |
| relevant_contexts | List[str] | required | List of relevant context strings (one per query, aligned by index) |
| embed_model | Optional[Any] | required | Embedding model (required for "cosine_similarity" method) |
| num_negatives | int | 5 | Number of hard negatives to generate per query |
| method | str | "random" | Hard negative generation strategy |

Supported methods:

| Method | Description |
|---|---|
| "random" | Randomly samples num_negatives contexts from the relevant_contexts list, excluding the context at the same index as the query. Uses random.sample() for selection without replacement. |
| "cosine_similarity" | Computes embeddings for all queries and contexts, then for each query selects the num_negatives most similar contexts (by embedding cosine similarity) that are not the correct relevant context. Uses get_top_k_embeddings to rank all contexts and then filters out the correct one. |
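The "random" strategy can be sketched roughly as follows. The function name random_hard_negatives, the seed parameter, and the capping of the sample size at the number of available candidates are assumptions made for this illustration, not details taken from the source file.

```python
import random
from typing import List


def random_hard_negatives(
    contexts: List[str], num_negatives: int, seed: int = 0
) -> List[List[str]]:
    """Sample negatives for each index from every context except its own."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    result = []
    for i in range(len(contexts)):
        # Exclude the context aligned with query i.
        candidates = contexts[:i] + contexts[i + 1:]
        # Assumption: cap k so random.sample() never asks for more than exists.
        k = min(num_negatives, len(candidates))
        result.append(rng.sample(candidates, k))
    return result


contexts = ["ctx A", "ctx B", "ctx C", "ctx D"]
negatives = random_hard_negatives(contexts, num_negatives=2)
```

Randomly drawn negatives are cheap to produce but are often easy for the model to reject, which is the motivation for the similarity-based alternative.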

Design note: When method="cosine_similarity", embeddings for all queries and all contexts are computed once with the provided embed_model before the per-query loop, rather than being recomputed on each iteration. When method="random", the embed_model parameter is not used.
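The similarity-based selection can be approximated with the sketch below, which takes already-computed embeddings as input to mirror the pre-computation step. It is a pure-Python stand-in: the cosine() helper and the sort replace get_top_k_embeddings, and the function names and toy 2-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_hard_negatives(
    query_embs: List[List[float]],
    context_embs: List[List[float]],
    contexts: List[str],
    num_negatives: int,
) -> List[List[str]]:
    result = []
    for q_idx, q_emb in enumerate(query_embs):
        # Rank every context index by similarity to the query, best first.
        ranked = sorted(
            range(len(contexts)),
            key=lambda c: cosine(q_emb, context_embs[c]),
            reverse=True,
        )
        # Drop the aligned (correct) context, then keep the closest remainder.
        keep = [c for c in ranked if c != q_idx][:num_negatives]
        result.append([contexts[c] for c in keep])
    return result


contexts = ["cats", "kittens", "stock markets"]
query_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]      # toy vectors
context_embs = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0]]    # toy vectors
negatives = cosine_hard_negatives(query_embs, context_embs, contexts, 1)
```

With these toy vectors, the negative chosen for the "cats" query is "kittens": the most similar context that is not the aligned one.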

Function: get_query_context_lists

def get_query_context_lists(
    query_context_pairs: EmbeddingQAFinetuneDataset,
) -> Tuple[List[str], List[str]]

Extracts parallel lists of queries and their corresponding relevant contexts from an EmbeddingQAFinetuneDataset.

Workflow:

  1. Iterates over all query IDs and query texts in query_context_pairs.queries.
  2. For each query, looks up the first relevant document ID from query_context_pairs.relevant_docs.
  3. Retrieves the corresponding document text from query_context_pairs.corpus.
  4. Returns two aligned lists: (queries, relevant_contexts).
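The workflow above can be sketched with plain dictionaries. This assumes the queries, corpus, and relevant_docs attributes behave like dicts keyed by ID; the function name query_context_lists and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def query_context_lists(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    query_texts: List[str] = []
    contexts: List[str] = []
    for query_id, query_text in queries.items():
        # Only the first relevant document is used for each query.
        first_doc_id = relevant_docs[query_id][0]
        query_texts.append(query_text)
        contexts.append(corpus[first_doc_id])
    return query_texts, contexts


queries = {"q1": "What is RAG?", "q2": "What is a reranker?"}
corpus = {
    "d1": "RAG combines retrieval with generation.",
    "d2": "A reranker reorders retrieved passages.",
}
relevant_docs = {"q1": ["d1"], "q2": ["d2", "d1"]}
query_texts, contexts = query_context_lists(queries, corpus, relevant_docs)
```

Note that any additional relevant documents (such as "d1" for "q2" here) are ignored, so each query contributes exactly one relevant passage.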

Function: generate_cohere_reranker_finetuning_dataset

def generate_cohere_reranker_finetuning_dataset(
    query_context_pairs: EmbeddingQAFinetuneDataset,
    num_negatives: int = 0,
    top_k_dissimilar: int = 100,
    hard_negatives_gen_method: str = "random",
    finetune_dataset_file_name: str = "train.jsonl",
    embed_model: Optional[Any] = None,
) -> Any

| Parameter | Type | Default | Description |
|---|---|---|---|
| query_context_pairs | EmbeddingQAFinetuneDataset | required | Dataset containing queries, corpus, and relevance mappings |
| num_negatives | int | 0 | Number of hard negatives per query; if 0, no hard negatives are generated |
| top_k_dissimilar | int | 100 | Parameter accepted but not currently used in the implementation |
| hard_negatives_gen_method | str | "random" | Method for generating hard negatives ("random" or "cosine_similarity") |
| finetune_dataset_file_name | str | "train.jsonl" | Output file path for the JSONL dataset |
| embed_model | Optional[Any] | None | Embedding model for cosine-similarity-based hard negative generation |

Workflow:

  1. Extracts query and context lists via get_query_context_lists().
  2. If num_negatives > 0, generates hard negatives using the specified method. Otherwise, assigns empty lists as hard negatives for all queries.
  3. Opens the output file in write mode and iterates over the zipped queries, contexts, and hard negatives.
  4. For each entry, creates a CohereRerankerFinetuneDataset instance with:
    • query -- the query string
    • relevant_passages -- a single-element list containing the relevant context
    • hard_negatives -- the list of hard negative passages
  5. Writes each entry as a JSONL line via entry.to_jsonl().
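The file-writing steps above can be sketched with the standard library alone. Here write_reranker_jsonl is a hypothetical helper, plain dicts stand in for CohereRerankerFinetuneDataset instances, and the output goes to a temporary directory so the example leaves nothing behind.

```python
import json
import os
import tempfile
from typing import List, Optional


def write_reranker_jsonl(
    path: str,
    queries: List[str],
    contexts: List[str],
    hard_negatives: Optional[List[List[str]]] = None,
) -> None:
    if hard_negatives is None:
        # Mirrors the num_negatives == 0 case: an empty list per query.
        hard_negatives = [[] for _ in queries]
    with open(path, "w", encoding="utf-8") as f:
        for query, context, negs in zip(queries, contexts, hard_negatives):
            record = {
                "query": query,
                "relevant_passages": [context],  # single-element list
                "hard_negatives": negs,
            }
            f.write(json.dumps(record) + "\n")


queries = ["What is RAG?", "What is a reranker?"]
contexts = ["RAG combines retrieval with generation.",
            "A reranker reorders retrieved passages."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.jsonl")
    write_reranker_jsonl(path, queries, contexts)
    # Read the file back to confirm each line parses as one JSON object.
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
```

Reading the file back line by line is a cheap sanity check before handing the dataset to a fine-tuning job.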

Typical Usage Pipeline

EmbeddingQAFinetuneDataset (queries + corpus + relevance mappings)
    |
    v
generate_cohere_reranker_finetuning_dataset()
    |
    +--> get_query_context_lists()  -->  (queries, contexts)
    |
    +--> generate_hard_negatives()  -->  hard_negatives per query
    |
    v
train.jsonl (CohereRerankerFinetuneDataset entries)
    |
    v
CohereRerankerFinetuneEngine (training)
