Heuristic:FlagOpen FlagEmbedding Same Dataset Batching Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Contrastive_Learning, Data_Engineering |
| Last Updated | 2026-02-09 21:00 GMT |
Overview
Enable same-dataset batching when training with multiple datasets to prevent cross-dataset in-batch negatives that harm contrastive learning quality.
Description
When fine-tuning embedders with data from multiple datasets (e.g., NLI + QA + retrieval), the `same_dataset_within_batch` flag ensures all samples in a single batch come from the same dataset. This prevents samples from different domains or task types from being used as in-batch negatives for each other, which would create noise in the contrastive learning signal.
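When enabled, this mode requires `per_device_train_batch_size=1` and `dataloader_num_workers=0`. Each item that reaches the collator is already a complete batch drawn from a single dataset rather than an individual sample, so the data loader must pass through exactly one item per step, and worker multiprocessing is turned off.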
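The mechanism described above can be illustrated with a small, library-free sketch. It is not FlagEmbedding's implementation: `SameDatasetBatches`, `unwrap_single_batch`, and the toy data are hypothetical names introduced only for illustration. The sketch assumes that a data loader hands its collate function a list with as many items as the loader batch size, which is why that size has to be 1 when every item is already a full batch.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]          # (query, positive passage)
Batch = Tuple[str, List[Pair]]  # (dataset name, pairs taken from that dataset)


class SameDatasetBatches:
    """Hypothetical map-style dataset whose items are whole batches."""

    def __init__(self, datasets: Dict[str, List[Pair]], batch_size: int, seed: int = 0):
        rng = random.Random(seed)
        self.batches: List[Batch] = []
        for name, pairs in datasets.items():
            shuffled = pairs[:]
            rng.shuffle(shuffled)
            # Slice each dataset separately so no batch spans two datasets.
            for start in range(0, len(shuffled), batch_size):
                self.batches.append((name, shuffled[start:start + batch_size]))
        # Shuffle the batch order so the datasets are interleaved over training.
        rng.shuffle(self.batches)

    def __len__(self) -> int:
        return len(self.batches)

    def __getitem__(self, index: int) -> Batch:
        return self.batches[index]


def unwrap_single_batch(items: List[Batch]) -> Batch:
    """Hypothetical collate function for a loader running with batch size 1."""
    if len(items) != 1:
        raise ValueError("expected exactly one pre-assembled batch per step")
    return items[0]


toy = {
    "nli": [(f"nli-q{i}", f"nli-p{i}") for i in range(5)],
    "qa": [(f"qa-q{i}", f"qa-p{i}") for i in range(4)],
}
dataset = SameDatasetBatches(toy, batch_size=2)
name, pairs = unwrap_single_batch([dataset[0]])
print(name, pairs)
```

With a loader batch size above 1, the collate function would receive several pre-assembled batches at once, possibly from different datasets, which is exactly what this mode is meant to avoid.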
Usage
Use this heuristic when:
- Multi-dataset training: Training with data from multiple sources that have different distributions
- ICL (In-Context Learning) embedder training: The ICL dataset collator specifically requires this mode
- Heterogeneous tasks: Mixing symmetric tasks (STS, clustering) with asymmetric tasks (retrieval)
The Insight (Rule of Thumb)
- Action: Set `--same_dataset_within_batch True` with `--per_device_train_batch_size 1` and `--dataloader_num_workers 0`.
- Value: Prevents cross-dataset contamination of in-batch negatives.
- Trade-off: Reduces batch diversity and may slightly slow convergence for single-dataset setups. Requires specific batch size and worker settings.
- Companion: Use `--small_threshold N` to merge small datasets below N examples together, and `--drop_threshold M` to remove merged groups still below M examples.
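The companion thresholds can be pictured with the following sketch. It is an illustration under stated assumptions, not the library's code: `merge_small_datasets` and the `<merged>` key are hypothetical, datasets are assumed to be keyed by file path, and "below" is read as a strict comparison, with small datasets that share a directory merged together.

```python
import posixpath
from collections import defaultdict
from typing import Dict, List


def merge_small_datasets(
    datasets: Dict[str, List[str]],
    small_threshold: int,
    drop_threshold: int,
) -> Dict[str, List[str]]:
    """Merge small datasets per directory, then drop merged groups that stay too small."""
    kept: Dict[str, List[str]] = {}
    small_by_dir: Dict[str, List[str]] = defaultdict(list)
    for path, rows in datasets.items():
        if len(rows) < small_threshold:
            # Assumption: "small" means strictly fewer rows than the threshold.
            small_by_dir[posixpath.dirname(path)].extend(rows)
        else:
            kept[path] = rows
    for directory, rows in small_by_dir.items():
        # Assumption: a merged group is dropped only if it is still below drop_threshold.
        if len(rows) >= drop_threshold:
            kept[directory + "/<merged>"] = rows
    return kept


toy_sets = {
    "qa/big.jsonl": [f"row{i}" for i in range(10)],
    "qa/a.jsonl": ["a0", "a1"],
    "qa/b.jsonl": ["b0", "b1", "b2"],
    "nli/tiny.jsonl": ["t0"],
}
merged = merge_small_datasets(toy_sets, small_threshold=5, drop_threshold=4)
print({name: len(rows) for name, rows in merged.items()})
```

In this toy run the two small `qa` files combine into one group of five rows and survive, while the single-row `nli` group stays below the drop threshold and is removed. With both thresholds left at 0, nothing counts as small and every dataset is kept unchanged.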
Reasoning
In contrastive learning, all non-positive examples in a batch serve as implicit negatives. When batch samples come from different datasets (e.g., a medical QA pair and a general web retrieval pair), the cross-dataset negatives are usually so unrelated to the query that the model separates them trivially, so they contribute little useful learning signal. Worse, they can teach the model that domain-specific terminology is universally dissimilar, harming generalization.
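To make the argument concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss computed from a query-passage similarity matrix. The function name, the temperature of 0.05, and the similarity values are made-up toy choices, not taken from FlagEmbedding. Row `i` treats column `i` as its positive and every other column as a negative.

```python
import math
from typing import List


def in_batch_contrastive_loss(sim: List[List[float]], temperature: float = 1.0) -> float:
    """Mean cross-entropy where row i's positive is column i; other columns are negatives."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max before exponentiating, for stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(sim)


# Toy similarities: negatives from the same dataset are plausible confusions,
# negatives from an unrelated dataset score near zero.
same_dataset = [[0.9, 0.7], [0.7, 0.9]]
cross_dataset = [[0.9, 0.0], [0.0, 0.9]]

hard_loss = in_batch_contrastive_loss(same_dataset, temperature=0.05)
easy_loss = in_batch_contrastive_loss(cross_dataset, temperature=0.05)
print(hard_loss, easy_loss)
```

In this toy setup the loss from the unrelated negatives is orders of magnitude smaller than the loss from the same-dataset negatives, so those batch slots produce almost no gradient.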
```python
# From FlagEmbedding/abc/finetune/embedder/AbsArguments.py:108-118
same_dataset_within_batch: bool = field(
    default=False,
    metadata={"help": "All samples in the same batch comes from the same dataset."}
)
small_threshold: int = field(
    default=0,
    metadata={"help": "The threshold of small dataset. All small dataset in the same directory will be merged into one dataset."}
)
drop_threshold: int = field(
    default=0,
    metadata={"help": "The threshold for dropping merged small dataset."}
)
```
Batch size constraint from `FlagEmbedding/finetune/embedder/decoder_only/icl/dataset.py:204-208`:
```python
"""
EmbedCollator for SameDataset.
Note that after using this collator, the training_args should be set as:
    training_args.per_device_train_batch_size = 1
    training_args.dataloader_num_workers = 0    # avoid multi-processing
"""
```