Implementation:FlagOpen FlagEmbedding LLM Embedder Retrieval Data
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Information Retrieval, Data Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Data processing module for preparing retrieval training and evaluation datasets with support for multiple task types and contrastive learning.
Description
This module handles retrieval datasets in the LLM Embedder framework. It defines three components: the RetrievalDataset class, whose static methods prepare training data, evaluation data, and corpora; the SameDatasetTrainDataset class, which builds each batch from a single task's data; and the RetrievalDataCollator, which tokenizes and batches samples dynamically. Supported retrieval tasks are QA, conversational search, chat, long-range language modeling (LRLM), in-context learning (ICL), and tool retrieval, each with configurable instruction templates and training strategies.
Key features include:
- Flexible positive/negative selection strategies (first, random, teacher-based)
- Teacher score filtering and knowledge distillation support
- Task-specific instruction templates for different retrieval scenarios
- Multiple data organization methods (random, epoch-based, epoch-random)
- Cross-device negative sampling support
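The positive/negative selection strategies listed above can be illustrated with a minimal sketch. This is not the module's code: build_train_group and its pos_scores argument are hypothetical names, only the "first", "random", and "teacher" strategies are covered, and the "teacher" branch assumes that the highest-scoring positive is chosen. Repeating negatives when too few are available is also an assumption.

```python
import random
from typing import List, Optional, Sequence


def build_train_group(
    pos: Sequence[str],
    neg: Sequence[str],
    train_group_size: int = 8,
    select_positive: str = "first",
    select_negative: str = "random",
    pos_scores: Optional[Sequence[float]] = None,  # hypothetical teacher scores
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return [positive, negative_1, ..., negative_(train_group_size - 1)]."""
    rng = rng or random.Random()
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")

    # Pick exactly one positive.
    if select_positive == "first":
        positive = pos[0]
    elif select_positive == "random":
        positive = rng.choice(list(pos))
    elif select_positive == "teacher":
        # Assumption: "teacher" means the positive the teacher scores highest.
        if pos_scores is None:
            raise ValueError("'teacher' selection needs pos_scores")
        best = max(range(len(pos)), key=lambda i: pos_scores[i])
        positive = pos[best]
    else:
        raise ValueError(f"unsupported select_positive: {select_positive}")

    # Order the negative pool according to the strategy.
    pool = list(neg)
    if select_negative == "random":
        rng.shuffle(pool)
    elif select_negative != "first":
        raise ValueError(f"unsupported select_negative: {select_negative}")

    # Fill the remaining slots, cycling through the pool if it is too small.
    num_neg = train_group_size - 1
    negatives = [pool[i % len(pool)] for i in range(num_neg)]
    return [positive] + negatives
```

Passing a seeded random.Random makes the sampled group reproducible, which helps when debugging a training run.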
Usage
Use this module to train retrieval models with contrastive learning, to prepare evaluation datasets for information retrieval tasks, or to apply task-specific instruction templates during data processing.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/src/retrieval/data.py
Signature
class RetrievalDataset:
    @staticmethod
    def get_train_process_fn(train_group_size=8, select_positive="first",
                             select_negative="random", teacher_scores_margin=None,
                             teacher_scores_min=None, stable_distill=False,
                             instruction=None)

    @staticmethod
    def prepare_train_dataset(data_file=None, cache_dir=None, config=None,
                              train_group_size=8, select_positive="first",
                              select_negative="random", max_sample_num=None,
                              teacher_scores_margin=None, teacher_scores_min=None,
                              stable_distill=False, add_instruction=False,
                              instruction=None, use_train_config=False)

    @staticmethod
    def prepare_eval_dataset(data_file=None, cache_dir=None,
                             instruction=None, eval_method="retrieve")

    @staticmethod
    def prepare_corpus(data_file, key_template: str, cache_dir=None,
                       instruction=None)


class SameDatasetTrainDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, dataset_indices_range, batch_size, seed,
                 organize_method, process_index=0, num_processes=1)


@dataclass
class RetrievalDataCollator:
    tokenizer: PreTrainedTokenizer = None
    query_max_length: int = 256
    key_max_length: int = 256
    inbatch_same_dataset: bool = False
    cross: bool = False
Import
from retrieval.data import RetrievalDataset, SameDatasetTrainDataset, RetrievalDataCollator, TASK_CONFIG
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_file | str or List[str] | Yes | Path(s) to JSON training/eval data files |
| train_group_size | int | No | Number of samples per training group (1 positive plus train_group_size - 1 negatives) |
| select_positive | str | No | Strategy for selecting positives: "first", "random", "teacher", "teacher-pos" |
| select_negative | str | No | Strategy for selecting negatives: "random", "first", "teacher+", "teacher-" |
| teacher_scores | List[float] | No | Teacher scores for distillation |
| instruction | Dict | No | Query and key instruction templates |
| eval_method | str | No | Evaluation method: "retrieve" or "rerank" |
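As a rough illustration of the instruction input, the sketch below prepends a query-side or key-side template to a piece of text. The EXAMPLE_INSTRUCTION dictionary, its "query" and "key" keys, and the apply_instruction helper are assumptions made for this example; the actual TASK_CONFIG entries and the module's internal handling may differ.

```python
from typing import Dict, Optional

# Hypothetical templates for illustration only; not copied from TASK_CONFIG.
EXAMPLE_INSTRUCTION: Dict[str, str] = {
    "query": "Represent this query for retrieving relevant documents: ",
    "key": "Represent this document for retrieval: ",
}


def apply_instruction(
    text: str,
    side: str,
    instruction: Optional[Dict[str, str]],
) -> str:
    """Prepend the template for `side` ("query" or "key") when one is given."""
    if instruction is None:
        # No instruction configured: leave the text unchanged.
        return text
    return instruction.get(side, "") + text


query = apply_instruction("what is BM25?", "query", EXAMPLE_INSTRUCTION)
key = apply_instruction("BM25 is a ranking function.", "key", EXAMPLE_INSTRUCTION)
```

Keeping the query and key templates separate lets the same encoder distinguish the two roles for each task.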
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | Processed HuggingFace dataset |
| dataset_indices_range | Dict | Mapping of dataset names to index ranges |
| collated_batch | Dict | Tokenized batch with query, key, teacher_scores, etc. |
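To show how a dataset_indices_range mapping could drive same-task batching, here is a minimal sketch. It is not the SameDatasetTrainDataset implementation: the same_dataset_batches function is hypothetical, the ranges are assumed to be half-open (start, end) pairs, and dropping incomplete batches is a simplifying choice.

```python
import random
from typing import Dict, List, Tuple


def same_dataset_batches(
    dataset_indices_range: Dict[str, Tuple[int, int]],
    batch_size: int,
    seed: int,
) -> List[List[int]]:
    """Build batches whose indices all come from a single dataset."""
    rng = random.Random(seed)
    batches: List[List[int]] = []
    for _name, (start, end) in dataset_indices_range.items():
        # Assumption: each range is half-open, covering start .. end - 1.
        indices = list(range(start, end))
        rng.shuffle(indices)
        for i in range(0, len(indices), batch_size):
            batch = indices[i:i + batch_size]
            if len(batch) == batch_size:  # drop the incomplete tail batch
                batches.append(batch)
    # Shuffle batch order so that tasks are interleaved during training.
    rng.shuffle(batches)
    return batches


example_ranges = {"qa": (0, 6), "chat": (6, 10)}
example_batches = same_dataset_batches(example_ranges, batch_size=2, seed=0)
```

Because every batch stays within one dataset, in-batch negatives come from the same task as the query.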
Usage Examples
# Prepare training dataset with teacher distillation
from retrieval.data import RetrievalDataset, RetrievalDataCollator, TASK_CONFIG

config = TASK_CONFIG["llm-embedder"]
dataset, indices_range = RetrievalDataset.prepare_train_dataset(
    data_file="train_data/*.json",
    train_group_size=8,
    select_positive="teacher",
    select_negative="random",
    teacher_scores_min=0.5,
    stable_distill=True,
    add_instruction=True,
    config=config,
    use_train_config=True
)

# Create data collator for batching
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/llm-embedder")
collator = RetrievalDataCollator(
    tokenizer=tokenizer,
    query_max_length=512,
    key_max_length=512
)

# Prepare evaluation dataset
eval_dataset = RetrievalDataset.prepare_eval_dataset(
    data_file="eval_data.json",
    instruction=config["instruction"]["qa"],
    eval_method="retrieve"
)

# Prepare corpus for retrieval
corpus = RetrievalDataset.prepare_corpus(
    data_file="corpus.json",
    key_template="{title}: {text}",
    instruction=config["instruction"]["qa"]
)
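
The key_template value in the corpus example reads like a Python format string. As a sketch under that assumption (render_key is a hypothetical helper, and the module may build keys differently), a corpus record could be rendered as follows:

```python
from typing import Dict


def render_key(record: Dict[str, str], key_template: str) -> str:
    """Fill the template's named fields from a corpus record."""
    # Assumption: the template uses str.format-style placeholders whose
    # names match the record's fields; extra fields are ignored.
    return key_template.format(**record)


example_record = {"title": "BM25", "text": "A ranking function.", "id": "doc-1"}
rendered = render_key(example_record, "{title}: {text}")
```

A template that names a field missing from the record raises KeyError, so it is worth checking the corpus schema before processing a large file.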