Heuristic:FlagOpen FlagEmbedding Same Dataset Batching Tip

Knowledge Sources	FlagOpen/FlagEmbedding
Domains	Optimization, Contrastive_Learning, Data_Engineering
Last Updated	2026-02-09 21:00 GMT

Overview

Enable same-dataset batching when training with multiple datasets to prevent cross-dataset in-batch negatives that harm contrastive learning quality.

Description

When fine-tuning embedders with data from multiple datasets (e.g., NLI + QA + retrieval), the `same_dataset_within_batch` flag ensures all samples in a single batch come from the same dataset. This prevents samples from different domains or task types from being used as in-batch negatives for each other, which would create noise in the contrastive learning signal.

When enabled, this mode requires `per_device_train_batch_size=1` and `dataloader_num_workers=0` because the collator returns a pre-assembled batch from a single dataset rather than individual samples.
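The batch-size constraint is easier to see in a small sketch. The code below is not FlagEmbedding's implementation: `SameDatasetBatches`, `collate`, and the `Sample` type are hypothetical names used only for illustration. It assumes a dataset whose items are already full batches drawn from one source, so the loader only needs to hand over one item at a time.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (dataset name, text); a hypothetical sample type


class SameDatasetBatches:
    """Hypothetical dataset whose items are whole batches from one source."""

    def __init__(self, datasets: Dict[str, List[str]], batch_size: int, seed: int = 0):
        self.batches: List[List[Sample]] = []
        for name, texts in datasets.items():
            # Slice each dataset separately so no batch spans two datasets.
            for start in range(0, len(texts), batch_size):
                chunk = texts[start:start + batch_size]
                self.batches.append([(name, text) for text in chunk])
        # Shuffle batch order so training still alternates between datasets.
        random.Random(seed).shuffle(self.batches)

    def __len__(self) -> int:
        return len(self.batches)

    def __getitem__(self, index: int) -> List[Sample]:
        return self.batches[index]


def collate(features: List[List[Sample]]) -> List[Sample]:
    """Unwrap the single pre-assembled batch the data loader hands over."""
    # With a loader batch size of 1, `features` holds exactly one item,
    # and that item is already the full training batch.
    assert len(features) == 1, "expects per_device_train_batch_size=1"
    return features[0]


dataset = SameDatasetBatches(
    {"nli": ["n1", "n2", "n3", "n4"], "qa": ["q1", "q2", "q3"]},
    batch_size=2,
)
batch = collate([dataset[0]])
```

In this sketch the effective batch size is decided when the dataset is built, which is why the loader-level batch size has to stay at 1: a larger value would pass several pre-assembled batches to the collator at once.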

Usage

Use this heuristic when:

Multi-dataset training: Training with data from multiple sources that have different distributions
ICL (In-Context Learning) embedder training: The ICL dataset collator specifically requires this mode
Heterogeneous tasks: Mixing symmetric tasks (STS, clustering) with asymmetric tasks (retrieval)

The Insight (Rule of Thumb)

Action: Set `--same_dataset_within_batch True` with `--per_device_train_batch_size 1` and `--dataloader_num_workers 0`.
Value: Prevents cross-dataset contamination of in-batch negatives.
Trade-off: Reduces batch diversity and may slightly slow convergence. A single-dataset setup gains nothing from it, because its batches already come from one dataset. Requires the specific batch size and worker settings above.
Companion: Use `--small_threshold N` to merge small datasets below N examples together, and `--drop_threshold M` to remove merged groups still below M examples.
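The interaction between the two thresholds can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `group_datasets` is a hypothetical helper, "below" is assumed to mean strictly less than, and merging is assumed to happen per directory, as the `small_threshold` help text suggests.

```python
from typing import Dict, List


def group_datasets(
    sizes_by_dir: Dict[str, Dict[str, int]],
    small_threshold: int,
    drop_threshold: int,
) -> List[Dict[str, int]]:
    """Return dataset groups; each group would be batched as one dataset."""
    groups: List[Dict[str, int]] = []
    for sizes in sizes_by_dir.values():
        small: Dict[str, int] = {}
        for name, size in sizes.items():
            if size < small_threshold:
                small[name] = size           # candidate for merging
            else:
                groups.append({name: size})  # large enough to stand alone
        # Assumption: a merged group is kept only if it reaches drop_threshold.
        if small and sum(small.values()) >= drop_threshold:
            groups.append(small)
    return groups


groups = group_datasets(
    {"dir_a": {"big": 5000, "tiny1": 40, "tiny2": 70}, "dir_b": {"tiny3": 30}},
    small_threshold=100,
    drop_threshold=100,
)
```

Under these assumptions, `tiny1` and `tiny2` are merged into one group of 110 examples and kept, while `tiny3` is dropped because its merged group stays below the drop threshold. With both thresholds at their default of 0, nothing is merged or dropped.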

Reasoning

In contrastive learning with in-batch negatives, every other example in the batch serves as an implicit negative for a given query. When batch samples come from different datasets (e.g., a medical QA pair and a general web retrieval pair), the cross-dataset negatives are usually semantically unrelated to the query, so they are trivially easy to reject and contribute little learning signal. Worse, they can teach the model that domain-specific terminology is universally dissimilar, harming generalization.
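To make the effect concrete, the sketch below counts how many in-batch negatives cross a dataset boundary. The function name `count_cross_dataset_negatives` and the dataset labels are illustrative assumptions; the batch is modelled only by the source dataset of each (query, positive) pair.

```python
from typing import List


def count_cross_dataset_negatives(batch_sources: List[str]) -> int:
    """Count (query, negative) pairs whose two sides come from different datasets."""
    count = 0
    for i, query_source in enumerate(batch_sources):
        for j, passage_source in enumerate(batch_sources):
            # Position i == j is the query's own positive, not a negative.
            if i != j and query_source != passage_source:
                count += 1
    return count


mixed_batch = ["medical_qa", "web_retrieval", "medical_qa", "web_retrieval"]
same_dataset_batch = ["medical_qa"] * 4

print(count_cross_dataset_negatives(mixed_batch))         # 8 of 12 negatives
print(count_cross_dataset_negatives(same_dataset_batch))  # 0 of 12 negatives
```

In the mixed batch, two thirds of the negatives come from the other dataset; with same-dataset batching, every negative shares the query's domain.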

# From FlagEmbedding/abc/finetune/embedder/AbsArguments.py:108-118
same_dataset_within_batch: bool = field(
    default=False,
    metadata={"help": "All samples in the same batch comes from the same dataset."}
)
small_threshold: int = field(
    default=0,
    metadata={"help": "The threshold of small dataset. All small dataset in the same directory will be merged into one dataset."}
)
drop_threshold: int = field(
    default=0,
    metadata={"help": "The threshold for dropping merged small dataset."}
)

Batch size constraint from `FlagEmbedding/finetune/embedder/decoder_only/icl/dataset.py:204-208`:

"""
EmbedCollator for SameDataset.
Note that after using this collator, the training_args should be set as:
    training_args.per_device_train_batch_size = 1
    training_args.dataloader_num_workers = 0    # avoid multi-processing
"""
