Implementation:FlagOpen FlagEmbedding LLM Embedder Retrieval Data

From Leeroopedia
Revision as of 14:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FlagOpen_FlagEmbedding_LLM_Embedder_Retrieval_Data.md)


Knowledge Sources
Domains: Machine Learning, Information Retrieval, Data Processing
Last Updated: 2026-02-09 00:00 GMT

Overview

Data processing module for preparing retrieval training and evaluation datasets with support for multiple task types and contrastive learning.

Description

This module handles retrieval datasets in the LLM Embedder framework. It provides the RetrievalDataset class, whose static methods prepare training data, evaluation data, and the retrieval corpus; the SameDatasetTrainDataset class, which organizes batches so that each batch is drawn from a single task; and the RetrievalDataCollator, which batches examples dynamically. The module supports several retrieval tasks (QA, conversational search, chat, long-range language modeling (LRLM), in-context learning (ICL), and tool retrieval) with configurable instruction templates and training strategies.

Key features include:

  • Flexible positive/negative selection strategies (first, random, teacher-based)
  • Teacher score filtering and knowledge distillation support
  • Task-specific instruction templates for different retrieval scenarios
  • Multiple data organization methods (random, epoch-based, epoch-random)
  • Cross-device negative sampling support
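
As a rough illustration of the positive/negative selection strategies listed above, the sketch below builds one training group of size train_group_size (one positive followed by negatives). It is not the module's code: the helper name build_group, its arguments, and the fallback to sampling with replacement are assumptions made for illustration. Only the "first" and "random" strategy names and train_group_size come from this page; the teacher-based strategies are omitted because their exact behavior is not described here.

```python
import random

def build_group(positives, negatives, train_group_size=8,
                select_positive="first", select_negative="random", seed=0):
    """Hypothetical helper: one positive followed by train_group_size - 1 negatives."""
    rng = random.Random(seed)
    num_neg = train_group_size - 1

    if select_positive == "first":
        pos = positives[0]
    elif select_positive == "random":
        pos = rng.choice(positives)
    else:
        raise ValueError(f"unsupported select_positive: {select_positive}")

    if select_negative == "first":
        neg = negatives[:num_neg]
    elif select_negative == "random":
        if len(negatives) >= num_neg:
            neg = rng.sample(negatives, num_neg)
        else:
            # Assumption: sample with replacement when too few negatives exist.
            neg = rng.choices(negatives, k=num_neg)
    else:
        raise ValueError(f"unsupported select_negative: {select_negative}")

    return [pos] + neg

group = build_group(["p0", "p1"], [f"n{i}" for i in range(10)],
                    train_group_size=4, select_negative="first")
print(group)  # ['p0', 'n0', 'n1', 'n2']
```

With train_group_size=8, a group built this way holds one positive and seven negatives.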

Usage

Use this module to train retrieval models with contrastive learning, to prepare evaluation datasets for information retrieval tasks, or to apply task-specific instruction templates during data processing.

Code Reference

Source Location

Signature

class RetrievalDataset:
    @staticmethod
    def get_train_process_fn(train_group_size=8, select_positive="first",
                             select_negative="random", teacher_scores_margin=None,
                             teacher_scores_min=None, stable_distill=False,
                             instruction=None)

    @staticmethod
    def prepare_train_dataset(data_file=None, cache_dir=None, config=None,
                              train_group_size=8, select_positive="first",
                              select_negative="random", max_sample_num=None,
                              teacher_scores_margin=None, teacher_scores_min=None,
                              stable_distill=False, add_instruction=False,
                              instruction=None, use_train_config=False)

    @staticmethod
    def prepare_eval_dataset(data_file=None, cache_dir=None,
                             instruction=None, eval_method="retrieve")

    @staticmethod
    def prepare_corpus(data_file, key_template: str, cache_dir=None,
                       instruction=None)

class SameDatasetTrainDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, dataset_indices_range, batch_size, seed,
                 organize_method, process_index=0, num_processes=1)

@dataclass
class RetrievalDataCollator:
    tokenizer: PreTrainedTokenizer = None
    query_max_length: int = 256
    key_max_length: int = 256
    inbatch_same_dataset: bool = False
    cross: bool = False
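
To make the idea behind SameDatasetTrainDataset more concrete, the following sketch groups example indices into batches that never mix datasets. It is a simplified stand-in, not the class's implementation: the helper name same_dataset_batches is hypothetical, and it assumes dataset_indices_range maps each dataset name to a half-open (start, end) index pair, which this page does not confirm. Multi-process sharding (process_index, num_processes) and the organize_method options are left out.

```python
import random

def same_dataset_batches(dataset_indices_range, batch_size, seed=42):
    """Hypothetical helper: return (name, indices) batches, one dataset per batch."""
    rng = random.Random(seed)
    batches = []
    for name, (start, end) in dataset_indices_range.items():
        indices = list(range(start, end))
        rng.shuffle(indices)  # shuffle examples within the dataset
        # Chunk into batches; the last batch of a dataset may be smaller.
        for i in range(0, len(indices), batch_size):
            batches.append((name, indices[i:i + batch_size]))
    rng.shuffle(batches)  # interleave datasets across the epoch
    return batches

# Assumed layout: "qa" owns indices 0..4, "tool" owns indices 5..8.
ranges = {"qa": (0, 5), "tool": (5, 9)}
batches = same_dataset_batches(ranges, batch_size=2)
for name, batch in batches:
    start, end = ranges[name]
    assert all(start <= i < end for i in batch)  # no cross-dataset mixing
```

Keeping each batch within one dataset means in-batch negatives come from the same task as the query, which is presumably the purpose of the inbatch_same_dataset flag on the collator.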

Import

from retrieval.data import RetrievalDataset, SameDatasetTrainDataset, RetrievalDataCollator, TASK_CONFIG

I/O Contract

Inputs

Name              Type              Required  Description
data_file         str or List[str]  Yes       Path(s) to JSON training/eval data files
train_group_size  int               No        Number of samples per training group (1 pos + n neg)
select_positive   str               No        Strategy for selecting positives: "first", "random", "teacher", "teacher-pos"
select_negative   str               No        Strategy for selecting negatives: "random", "first", "teacher+", "teacher-"
teacher_scores    List[float]       No        Teacher scores for distillation
instruction       Dict              No        Query and key instruction templates
eval_method       str               No        Evaluation method: "retrieve" or "rerank"
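
The instruction input above can be pictured as a pair of text prefixes, one for queries and one for keys. The sketch below shows one plausible way such templates could be applied. The helper name add_instruction, the "query" and "key" dictionary keys, and the example strings are all assumptions for illustration; the actual structure of the instruction dictionary in TASK_CONFIG may differ.

```python
def add_instruction(query, keys, instruction=None):
    """Hypothetical helper: prepend instruction text to a query and its keys."""
    if instruction is None:
        return query, keys
    query = instruction.get("query", "") + query
    keys = [instruction.get("key", "") + key for key in keys]
    return query, keys

# Made-up example strings, not the templates shipped with the module.
instruction = {"query": "Query: ", "key": "Document: "}
q, ks = add_instruction("capital of France", ["Paris is ...", "Lyon is ..."], instruction)
print(q)   # Query: capital of France
print(ks)  # ['Document: Paris is ...', 'Document: Lyon is ...']
```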

Outputs

Name                   Type              Description
dataset                datasets.Dataset  Processed HuggingFace dataset
dataset_indices_range  Dict              Mapping of dataset names to index ranges
collated_batch         Dict              Tokenized batch with query, key, teacher_scores, etc.

Usage Examples

# Prepare training dataset with teacher distillation
from retrieval.data import RetrievalDataset, RetrievalDataCollator, TASK_CONFIG

config = TASK_CONFIG["llm-embedder"]
dataset, indices_range = RetrievalDataset.prepare_train_dataset(
    data_file="train_data/*.json",
    train_group_size=8,
    select_positive="teacher",
    select_negative="random",
    teacher_scores_min=0.5,
    stable_distill=True,
    add_instruction=True,
    config=config,
    use_train_config=True
)

# Create data collator for batching
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/llm-embedder")
collator = RetrievalDataCollator(
    tokenizer=tokenizer,
    query_max_length=512,
    key_max_length=512
)

# Prepare evaluation dataset
eval_dataset = RetrievalDataset.prepare_eval_dataset(
    data_file="eval_data.json",
    instruction=config["instruction"]["qa"],
    eval_method="retrieve"
)

# Prepare corpus for retrieval
corpus = RetrievalDataset.prepare_corpus(
    data_file="corpus.json",
    key_template="{title}: {text}",
    instruction=config["instruction"]["qa"]
)
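
The key_template argument in the corpus example reads like a Python format string. The sketch below shows how such a template could turn a corpus record into a single key string. It assumes records are dictionaries whose field names match the placeholders; the helper name render_key and the sample record are hypothetical, and how prepare_corpus applies the template internally is not shown on this page.

```python
def render_key(record, key_template):
    """Hypothetical helper: fill the template's placeholders from a record."""
    # Extra fields in the record (such as "id") are simply ignored by str.format.
    return key_template.format(**record)

record = {"id": 7, "title": "Paris", "text": "Capital of France."}
key = render_key(record, "{title}: {text}")
print(key)  # Paris: Capital of France.
```

A record that lacks one of the placeholder fields raises KeyError, so it is worth checking that every corpus entry carries the fields the template names.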
