Heuristic:FlagOpen FlagEmbedding Temperature Scaling Tip

Knowledge Sources	FlagOpen/FlagEmbedding
Domains	Optimization, Contrastive_Learning
Last Updated	2026-02-09 21:00 GMT

Overview

Use `temperature=0.02` for contrastive training of embedding models. The low value produces a sharp softmax over similarity scores, which pushes the model to separate positives from negatives by a wide margin.

Description

The temperature parameter scales the similarity scores before the cross-entropy loss is applied in contrastive learning. FlagEmbedding sets the default to 0.02, which is low compared with values commonly used elsewhere (roughly 0.05 to 0.1). The result is a very sharp probability distribution over the contrastive candidates: small differences in similarity become large differences in probability, so the loss is small only when the positive pair clearly outscores every negative, and a hard negative that scores close to the positive is penalized strongly.
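As a rough illustration of this scaling, the sketch below computes the cross-entropy contrastive loss for a single query in plain Python. The function name `contrastive_loss` and the similarity values are assumptions chosen for illustration; they are not taken from the FlagEmbedding code base.

```python
import math


def contrastive_loss(similarities, positive_index, temperature=0.02):
    """Cross-entropy over temperature-scaled similarity scores for one query.

    Hypothetical helper for illustration only.
    """
    logits = [s / temperature for s in similarities]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_sum_exp - logits[positive_index]


# Made-up cosine similarities: the positive first, then three negatives.
sims = [0.80, 0.70, 0.55, 0.30]

loss_sharp = contrastive_loss(sims, 0, temperature=0.02)
loss_soft = contrastive_loss(sims, 0, temperature=0.1)
print(f"T=0.02: {loss_sharp:.4f}   T=0.1: {loss_soft:.4f}")
```

With the same similarities, the lower temperature gives a much smaller loss once the positive is ranked first, because the scaled gap between the positive and each negative is five times larger.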

Usage

Use this heuristic when fine-tuning embedding models with the FlagEmbedding training pipeline. A temperature of 0.02 is the default and works well for most contrastive learning scenarios with normalized embeddings. Consider adjusting it if:

Training diverges (loss becomes NaN): Increase temperature to soften gradients
Model produces embeddings with poor discrimination: Decrease temperature further
Using unnormalized embeddings: Temperature may need adjustment based on score scale
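The last point can be made concrete with a small sketch. It compares the scaled score of a raw dot product with that of a cosine similarity for the same pair of vectors. The vectors and the helper names `dot` and `l2_normalize` are made up for illustration and do not come from FlagEmbedding.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Hypothetical unnormalized query and passage embeddings.
q = [3.0, 4.0, 12.0]
p = [2.0, 6.0, 9.0]
temperature = 0.02

raw_score = dot(q, p)                              # unbounded dot product
cos_score = dot(l2_normalize(q), l2_normalize(p))  # bounded in [-1, 1]

raw_logit = raw_score / temperature
cos_logit = cos_score / temperature
print(f"raw logit: {raw_logit:.1f}, cosine logit: {cos_logit:.2f}")
```

The cosine logit can never leave [-50, 50] at this temperature, whereas the raw dot product grows with the vector norms, so the same temperature yields far larger logits and a different effective sharpness.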

The Insight (Rule of Thumb)

Action: Set `--temperature 0.02` in training arguments.
Value: 0.02 (default in `AbsEmbedderTrainingArguments`).
Trade-off: Very low temperature makes training sensitive to hard negatives, which can cause instability if negatives are too similar to positives. Higher temperature (e.g., 0.05) is more forgiving but may produce less discriminative embeddings.
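To see why low temperature is sensitive to hard negatives, note that for the softmax cross-entropy loss the gradient with respect to a negative's similarity equals that negative's softmax probability divided by the temperature. The sketch below evaluates this for one hard and one easy negative; the function name and the similarity values are illustrative assumptions.

```python
import math


def negative_gradients(similarities, temperature):
    """Return d(loss)/d(similarity_j) for each negative.

    Index 0 is assumed to be the positive. For cross-entropy over
    logits s_j / T, this gradient is softmax_j / T.
    """
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [prob / temperature for prob in probs[1:]]


# Made-up similarities: positive, hard negative, easy negative.
sims = [0.80, 0.79, 0.40]

grads_low = negative_gradients(sims, temperature=0.02)
grads_high = negative_gradients(sims, temperature=0.05)
print("T=0.02:", grads_low)
print("T=0.05:", grads_high)
```

In this example the hard negative receives a larger gradient at 0.02 than at 0.05, while the easy negative receives almost none, so the update is dominated by whichever negatives sit closest to the positive.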

Reasoning

The similarity score computation divides by temperature: `scores = similarity(q, p) / temperature`. With normalized embeddings, cosine similarity ranges from -1 to 1. Dividing by 0.02 maps this to [-50, 50], creating a very peaked softmax distribution.
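The effect of the [-50, 50] range can be checked numerically. The sketch below applies a softmax to one group of eight made-up cosine similarities (one positive and seven negatives) at several temperatures. The values and function names are assumptions for illustration only.

```python
import math


def softmax(logits):
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_probs(cosines, temperature):
    """Softmax over cosine similarities divided by the temperature."""
    return softmax([c / temperature for c in cosines])


# Hypothetical group: the positive first, then seven negatives.
cosines = [0.75, 0.65, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

positive_prob = {}
for t in (0.02, 0.05, 0.1):
    probs = scaled_probs(cosines, t)
    positive_prob[t] = probs[0]
    print(f"T={t}: max |logit| = {1 / t:.0f}, p(positive) = {probs[0]:.3f}")
```

The probability assigned to the positive rises as the temperature falls, which is the peaked distribution described above.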

This is effective because:

Normalized embeddings produce bounded similarity scores
The `train_group_size` default of 8 (1 positive + 7 negatives) benefits from sharp distributions
Hard negative mining ensures the negatives are challenging, requiring strong discrimination

# From FlagEmbedding/abc/finetune/embedder/AbsArguments.py:136
temperature: Optional[float] = field(
    default=0.02,
    metadata={"help": "temperature used for similarity score"}
)

Score computation from `FlagEmbedding/finetune/embedder/encoder_only/base/modeling.py:138`:

scores = self._compute_similarity(q_reps, p_reps) / self.temperature
