Implementation:Run llama Llama index Reranker Dataset Gen
Overview
This module provides dataset generation utilities for Cohere reranker fine-tuning. It includes functions for generating hard negatives (via random sampling or cosine similarity), extracting query-context pairs from an existing embedding QA dataset, and producing a complete JSONL training file for Cohere's custom reranker model API.
Source file: llama-index-finetuning/llama_index/finetuning/rerankers/dataset_gen.py (128 lines)
Dependencies
| Dependency | Purpose |
|---|---|
| random | Random sampling for hard negative generation |
| llama_index.core.bridge.pydantic.BaseModel | Base class for the CohereRerankerFinetuneDataset model |
| llama_index.core.indices.query.embedding_utils.get_top_k_embeddings | Retrieval of top-k similar embeddings for cosine-similarity-based hard negatives |
| llama_index.finetuning.EmbeddingQAFinetuneDataset | Input dataset type containing queries, corpus, and relevance mappings |
Class: CohereRerankerFinetuneDataset
Inherits from: pydantic.BaseModel
A Pydantic model representing a single training entry for Cohere reranker fine-tuning.
class CohereRerankerFinetuneDataset(BaseModel):
    query: str
    relevant_passages: List[str]
    hard_negatives: Any
| Field | Type | Description |
|---|---|---|
| query | str | The query text |
| relevant_passages | List[str] | List of passages relevant to the query (typically a single passage) |
| hard_negatives | Any | List of hard negative passages (similar but not relevant) |
Method: to_jsonl
def to_jsonl(self) -> str
Serializes the instance to a JSON string followed by a newline character, suitable for JSONL file output. Uses Pydantic's model_dump_json() for serialization.
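The serialization contract described above can be illustrated with a minimal stdlib-only sketch. RerankerEntrySketch is a hypothetical stand-in built on dataclasses and json.dumps, not the actual Pydantic model, so the exact whitespace of its output may differ from what model_dump_json() produces.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List


@dataclass
class RerankerEntrySketch:
    """Hypothetical stdlib stand-in for CohereRerankerFinetuneDataset."""

    query: str
    relevant_passages: List[str]
    hard_negatives: Any

    def to_jsonl(self) -> str:
        # One JSON object per line: serialize, then terminate with a newline.
        return json.dumps(asdict(self)) + "\n"


entry = RerankerEntrySketch(
    query="What is a reranker?",
    relevant_passages=["A reranker reorders retrieved passages by relevance."],
    hard_negatives=["A retriever fetches candidate passages."],
)
line = entry.to_jsonl()
```

Because each call returns exactly one newline-terminated JSON object, the results can be written to a file one after another to form a JSONL document.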
Function: generate_embeddings
def generate_embeddings(embed_model: Any, text: str) -> List[float]
A thin wrapper around an embedding model's get_text_embedding() method. Takes a single text string and returns its embedding vector.
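As a rough illustration of the wrapper pattern, the sketch below delegates to any object that exposes get_text_embedding(). ToyEmbedModel and generate_embeddings_sketch are hypothetical names introduced for this example, and the two-number "embedding" is a toy that carries no semantic meaning.

```python
from typing import Any, List


class ToyEmbedModel:
    """Hypothetical embedding model exposing get_text_embedding()."""

    def get_text_embedding(self, text: str) -> List[float]:
        # Toy 2-dimensional "embedding": character count and word count.
        return [float(len(text)), float(len(text.split()))]


def generate_embeddings_sketch(embed_model: Any, text: str) -> List[float]:
    # Delegate straight to the model; no batching or caching is attempted.
    return embed_model.get_text_embedding(text)


vector = generate_embeddings_sketch(ToyEmbedModel(), "hello world")
```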
Function: generate_hard_negatives
def generate_hard_negatives(
    queries: List[str],
    relevant_contexts: List[str],
    embed_model: Optional[Any],
    num_negatives: int = 5,
    method: str = "random",
) -> Any
| Parameter | Type | Default | Description |
|---|---|---|---|
| queries | List[str] | required | List of query strings |
| relevant_contexts | List[str] | required | List of relevant context strings (one per query, aligned by index) |
| embed_model | Optional[Any] | required | Embedding model (required for "cosine_similarity" method) |
| num_negatives | int | 5 | Number of hard negatives to generate per query |
| method | str | "random" | Hard negative generation strategy |
Supported methods:
| Method | Description |
|---|---|
| "random" | Randomly samples num_negatives contexts from the relevant_contexts list, excluding the context at the same index as the query. Uses random.sample() for selection without replacement. |
| "cosine_similarity" | Computes embeddings for all queries and contexts, then for each query selects the num_negatives most similar contexts (by embedding cosine similarity) that are not the correct relevant context. Uses get_top_k_embeddings to rank all contexts and then filters out the correct one. |
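The "random" strategy can be sketched in a few lines. random_hard_negatives_sketch is a hypothetical name, and capping the sample size at the pool size is an assumption of this sketch, made because random.sample() raises ValueError when asked for more items than the pool contains.

```python
import random
from typing import List


def random_hard_negatives_sketch(
    relevant_contexts: List[str], num_negatives: int
) -> List[List[str]]:
    hard_negatives = []
    for query_index in range(len(relevant_contexts)):
        # Candidate pool: every context except the one aligned with this query.
        pool = relevant_contexts[:query_index] + relevant_contexts[query_index + 1:]
        # Assumption: cap the sample size so small corpora do not raise ValueError.
        sample_size = min(num_negatives, len(pool))
        hard_negatives.append(random.sample(pool, sample_size))
    return hard_negatives


random.seed(0)  # seeded only to make the example repeatable
contexts = ["ctx A", "ctx B", "ctx C", "ctx D"]
negatives = random_hard_negatives_sketch(contexts, num_negatives=2)
```

Exclusion here is by list index rather than by text, so if the same passage appears at two positions it can still be drawn as a negative for either query.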
Design note: When method="cosine_similarity", embeddings for all queries and all contexts are computed once with the provided embed_model before the per-query loop, rather than being recomputed inside it. The embed_model parameter is not used when method="random".
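The "cosine_similarity" strategy can be approximated as follows. This sketch takes pre-computed embeddings as input and ranks contexts with a hand-written cosine function instead of calling get_top_k_embeddings; cosine and cosine_hard_negatives_sketch are hypothetical names, and the 2-dimensional vectors are made-up illustration data.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_hard_negatives_sketch(
    query_embeddings: List[List[float]],
    context_embeddings: List[List[float]],
    contexts: List[str],
    num_negatives: int,
) -> List[List[str]]:
    hard_negatives = []
    for query_index, query_vec in enumerate(query_embeddings):
        # Rank every context index from most to least similar to the query.
        ranked = sorted(
            range(len(contexts)),
            key=lambda i: cosine(query_vec, context_embeddings[i]),
            reverse=True,
        )
        # Drop the aligned (correct) context, then keep the closest remainder.
        picked = [i for i in ranked if i != query_index][:num_negatives]
        hard_negatives.append([contexts[i] for i in picked])
    return hard_negatives


contexts = ["c0", "c1", "c2"]
context_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_embeddings = [[1.0, 0.0], [0.8, 0.2], [0.1, 1.0]]
result = cosine_hard_negatives_sketch(
    query_embeddings, context_embeddings, contexts, num_negatives=1
)
# result == [["c1"], ["c0"], ["c1"]]
```

Filtering after ranking, rather than removing the correct context first, keeps the context indices aligned with the original list so they can be mapped back to text directly.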
Function: get_query_context_lists
def get_query_context_lists(
    query_context_pairs: EmbeddingQAFinetuneDataset,
) -> Tuple[List[str], List[str]]
Extracts parallel lists of queries and their corresponding relevant contexts from an EmbeddingQAFinetuneDataset.
Workflow:
- Iterates over all query IDs and query texts in query_context_pairs.queries.
- For each query, looks up the first relevant document ID from query_context_pairs.relevant_docs.
- Retrieves the corresponding document text from query_context_pairs.corpus.
- Returns two aligned lists: (queries, relevant_contexts).
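The steps above can be sketched with plain dictionaries. This assumes that queries maps query IDs to text, corpus maps document IDs to text, and relevant_docs maps each query ID to a list of document IDs; query_context_lists_sketch and the sample data are hypothetical and stand in for an EmbeddingQAFinetuneDataset.

```python
from typing import Dict, List, Tuple


def query_context_lists_sketch(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    query_texts: List[str] = []
    relevant_contexts: List[str] = []
    for query_id, query_text in queries.items():
        # Only the first relevant document is used, even if several are listed.
        first_doc_id = relevant_docs[query_id][0]
        query_texts.append(query_text)
        relevant_contexts.append(corpus[first_doc_id])
    return query_texts, relevant_contexts


sample_queries = {"q1": "What is RAG?", "q2": "What is a reranker?"}
sample_corpus = {
    "d1": "RAG combines retrieval with generation.",
    "d2": "A reranker reorders retrieved passages.",
    "d3": "An unrelated passage.",
}
sample_relevant_docs = {"q1": ["d1", "d3"], "q2": ["d2"]}

query_list, context_list = query_context_lists_sketch(
    sample_queries, sample_corpus, sample_relevant_docs
)
```

Both lists are appended to in the same loop iteration, which is what keeps position i of one list aligned with position i of the other.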
Function: generate_cohere_reranker_finetuning_dataset
def generate_cohere_reranker_finetuning_dataset(
    query_context_pairs: EmbeddingQAFinetuneDataset,
    num_negatives: int = 0,
    top_k_dissimilar: int = 100,
    hard_negatives_gen_method: str = "random",
    finetune_dataset_file_name: str = "train.jsonl",
    embed_model: Optional[Any] = None,
) -> Any
| Parameter | Type | Default | Description |
|---|---|---|---|
| query_context_pairs | EmbeddingQAFinetuneDataset | required | Dataset containing queries, corpus, and relevance mappings |
| num_negatives | int | 0 | Number of hard negatives per query; if 0, no hard negatives are generated |
| top_k_dissimilar | int | 100 | Parameter accepted but not currently used in the implementation |
| hard_negatives_gen_method | str | "random" | Method for generating hard negatives ("random" or "cosine_similarity") |
| finetune_dataset_file_name | str | "train.jsonl" | Output file path for the JSONL dataset |
| embed_model | Optional[Any] | None | Embedding model for cosine-similarity-based hard negative generation |
Workflow:
- Extracts query and context lists via get_query_context_lists().
- If num_negatives > 0, generates hard negatives using the specified method. Otherwise, assigns empty lists as hard negatives for all queries.
- Opens the output file in write mode and iterates over the zipped queries, contexts, and hard negatives.
- For each entry, creates a CohereRerankerFinetuneDataset instance with:
  - query -- the query string
  - relevant_passages -- a single-element list containing the relevant context
  - hard_negatives -- the list of hard negative passages
- Writes each entry as a JSONL line via entry.to_jsonl().
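The write loop in this workflow can be sketched with the standard library alone. write_reranker_jsonl_sketch is a hypothetical helper that builds plain dictionaries instead of CohereRerankerFinetuneDataset instances, and the example passes empty hard-negative lists to mirror the num_negatives = 0 default.

```python
import json
import os
import tempfile
from typing import List


def write_reranker_jsonl_sketch(
    queries: List[str],
    relevant_contexts: List[str],
    hard_negatives: List[List[str]],
    path: str,
) -> None:
    # Write mode truncates any existing file at this path.
    with open(path, "w") as outfile:
        for query, context, negatives in zip(queries, relevant_contexts, hard_negatives):
            record = {
                "query": query,
                "relevant_passages": [context],  # single-element list
                "hard_negatives": negatives,
            }
            outfile.write(json.dumps(record) + "\n")


demo_queries = ["What is RAG?", "What is a reranker?"]
demo_contexts = [
    "RAG combines retrieval with generation.",
    "A reranker reorders retrieved passages.",
]
# Mirrors num_negatives = 0: one empty list per query.
no_negatives = [[] for _ in demo_queries]

out_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_reranker_jsonl_sketch(demo_queries, demo_contexts, no_negatives, out_path)

with open(out_path) as infile:
    records = [json.loads(line) for line in infile]
```

Because the file is opened in write mode, rerunning the generation with the same file name replaces the previous dataset rather than appending to it.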
Typical Usage Pipeline
EmbeddingQAFinetuneDataset (queries + corpus + relevance mappings)
|
v
generate_cohere_reranker_finetuning_dataset()
|
+--> get_query_context_lists() --> (queries, contexts)
|
+--> generate_hard_negatives() --> hard_negatives per query
|
v
train.jsonl (CohereRerankerFinetuneDataset entries)
|
v
CohereRerankerFinetuneEngine (training)
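Before handing train.jsonl to the training step, it can be useful to confirm that every line parses and carries the three expected fields. The following is a minimal sketch of such a check; find_malformed_lines is a hypothetical helper that is not part of the module, and it verifies only the key names, not Cohere's own validation rules.

```python
import json
from typing import Iterable, List

REQUIRED_KEYS = {"query", "relevant_passages", "hard_negatives"}


def find_malformed_lines(lines: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of lines that are not valid entries."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        # Each entry must be an object with exactly the three expected keys.
        if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
            bad.append(number)
    return bad


sample_lines = [
    '{"query": "q1", "relevant_passages": ["p1"], "hard_negatives": []}\n',
    '{"query": "q2", "relevant_passages": ["p2"]}\n',
    "not json\n",
]
malformed = find_malformed_lines(sample_lines)
# malformed == [2, 3]
```

The function accepts any iterable of strings, so an open file handle can be passed in place of the sample list.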
See Also
- Run_llama_Llama_index_CohereRerankerFinetuneEngine -- Cohere reranker fine-tuning engine that consumes the generated JSONL files
- Run_llama_Llama_index_CrossEncoder_Dataset_Gen -- Dataset generation for cross-encoder fine-tuning