Heuristic:FlagOpen FlagEmbedding Temperature Scaling Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Contrastive_Learning |
| Last Updated | 2026-02-09 21:00 GMT |
Overview
Use `temperature=0.02` (the FlagEmbedding default) for contrastive training of embedding models. The low value sharpens the softmax over similarity scores, which pushes the model to separate positives from negatives more strongly.
Description
The temperature parameter scales the similarity scores before the cross-entropy loss is applied in contrastive learning. In FlagEmbedding, the default temperature is 0.02, which is very low compared to common defaults in other frameworks (0.05-0.1). This produces a very sharp probability distribution over the contrastive candidates: most of the probability mass goes to the highest-scoring candidates, so the loss is small only when the positive clearly outscores every negative, and hard negatives are penalized strongly.
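To make the sharpening effect concrete, here is a minimal sketch in plain Python. The similarity values are hypothetical and chosen only for illustration; the helper names `softmax` and `positive_probability` are not part of FlagEmbedding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positive_probability(similarities, temperature):
    """Probability assigned to index 0 (the positive) after temperature scaling."""
    scaled = [s / temperature for s in similarities]
    return softmax(scaled)[0]


# Hypothetical cosine similarities: one positive followed by seven negatives.
similarities = [0.80, 0.70, 0.60, 0.55, 0.50, 0.40, 0.30, 0.20]

for temperature in (0.02, 0.05, 0.1, 1.0):
    p = positive_probability(similarities, temperature)
    print(f"temperature={temperature:<4} P(positive)={p:.4f}")
```

With these assumed scores, the probability on the positive grows as the temperature shrinks, even though the raw similarities never change.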
Usage
Use this heuristic when fine-tuning embedding models with the FlagEmbedding training pipeline. The temperature 0.02 is the default and works well for most contrastive learning scenarios with normalized embeddings. Consider adjusting if:
- Training diverges (loss becomes NaN): Increase temperature to soften gradients
- Model produces embeddings with poor discrimination: Decrease temperature further
- Using unnormalized embeddings: Similarity scores are no longer bounded to [-1, 1], so retune the temperature to match the actual score range
The Insight (Rule of Thumb)
- Action: Set `--temperature 0.02` in training arguments.
- Value: 0.02 (default in `AbsEmbedderTrainingArguments`).
- Trade-off: Very low temperature makes training sensitive to hard negatives, which can cause instability if negatives are too similar to positives. Higher temperature (e.g., 0.05) is more forgiving but may produce less discriminative embeddings.
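The sensitivity to hard negatives can be seen from the gradient of the loss. For the loss `-log softmax(s / T)[0]`, the gradient with respect to the raw similarity of negative `j` is `p_j / T`, and for the positive it is `(p_0 - 1) / T`, so a smaller temperature amplifies the push on any negative that still holds probability mass. The sketch below, in plain Python with hypothetical similarity values and illustrative helper names, compares two temperatures.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_gradients(similarities, temperature):
    """Gradient of -log softmax(s / T)[0] with respect to each raw similarity.

    Index 0 is the positive: its gradient is (p_0 - 1) / T.
    Every other index j is a negative: its gradient is p_j / T.
    """
    probs = softmax([s / temperature for s in similarities])
    return [
        (p - (1.0 if j == 0 else 0.0)) / temperature
        for j, p in enumerate(probs)
    ]


# Hypothetical similarities: positive, one hard negative, one easy negative.
similarities = [0.80, 0.79, 0.30]

for temperature in (0.02, 0.05):
    grads = similarity_gradients(similarities, temperature)
    print(temperature, [round(g, 3) for g in grads])
```

In this assumed setup the hard negative receives a noticeably larger gradient at 0.02 than at 0.05, while the easy negative receives almost none at either setting.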
Reasoning
The similarity score computation divides by temperature: `scores = similarity(q, p) / temperature`. With normalized embeddings, cosine similarity ranges from -1 to 1. Dividing by 0.02 maps this to [-50, 50], creating a very peaked softmax distribution.
This is effective because:
- Normalized embeddings produce bounded similarity scores
- The `train_group_size` default of 8 (1 positive + 7 negatives) benefits from sharp distributions
- Hard negative mining ensures the negatives are challenging, requiring strong discrimination
```python
# From FlagEmbedding/abc/finetune/embedder/AbsArguments.py:136
temperature: Optional[float] = field(
    default=0.02,
    metadata={"help": "temperature used for similarity score"}
)
```
Score computation from `FlagEmbedding/finetune/embedder/encoder_only/base/modeling.py:138`:
```python
scores = self._compute_similarity(q_reps, p_reps) / self.temperature
```
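As a rough illustration of how the scaled scores feed a contrastive loss, the following is a minimal pure-Python sketch for a single query. It is not the FlagEmbedding implementation: the function names, the use of cosine similarity, and the toy vectors are all assumptions made for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(query, passages):
    """Cosine similarity between one query and each passage."""
    q = normalize(query)
    return [sum(a * b for a, b in zip(q, normalize(p))) for p in passages]


def contrastive_loss(query, passages, temperature=0.02, positive_index=0):
    """Cross-entropy over temperature-scaled similarities for one query."""
    scores = [s / temperature for s in cosine_scores(query, passages)]
    # Log-sum-exp with the max subtracted, so scores near 50 do not overflow.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]


# Toy 2-D vectors (hypothetical): the first passage is the positive.
query = [1.0, 0.0]
passages = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]

for temperature in (0.02, 0.05, 1.0):
    print(temperature, contrastive_loss(query, passages, temperature))
```

Subtracting the maximum before exponentiating matters more at low temperatures, because the scaled scores can reach magnitudes around 50 with normalized embeddings.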