Heuristic:FlagOpen FlagEmbedding Temperature Scaling Tip

From Leeroopedia



Knowledge Sources
Domains Optimization, Contrastive_Learning
Last Updated 2026-02-09 21:00 GMT

Overview

Use temperature=0.02 for contrastive training of embedding models to produce sharp similarity distributions that maximize separation between positives and negatives.

Description

The temperature parameter divides the similarity scores before the cross-entropy loss is applied in contrastive learning. In FlagEmbedding the default temperature is 0.02, which is low compared with common defaults in other frameworks (0.05-0.1). The result is a sharp probability distribution over the contrastive candidates: most of the probability mass lands on the highest-scoring candidate, so a hard negative that scores close to (or above) the positive is penalized strongly.

Usage

Use this heuristic when fine-tuning embedding models with the FlagEmbedding training pipeline. The temperature 0.02 is the default and works well for most contrastive learning scenarios with normalized embeddings. Consider adjusting if:

  • Training diverges (loss becomes NaN): Increase temperature to soften gradients
  • Model produces embeddings with poor discrimination: Decrease temperature further
  • Using unnormalized embeddings: Scores are no longer bounded to [-1, 1], so the temperature may need adjusting to match the score scale

The Insight (Rule of Thumb)

  • Action: Set `--temperature 0.02` in training arguments.
  • Value: 0.02 (default in `AbsEmbedderTrainingArguments`).
  • Trade-off: Very low temperature makes training sensitive to hard negatives, which can cause instability if negatives are too similar to positives. Higher temperature (e.g., 0.05) is more forgiving but may produce less discriminative embeddings.
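
The trade-off can be made concrete with a minimal sketch for one positive and one negative. The similarity values and the helper name `loss_and_positive_grad` are illustrative assumptions, not part of FlagEmbedding. For a cross-entropy loss over temperature-scaled scores, the gradient with respect to the positive similarity is `(p_pos - 1) / temperature`, so a lower temperature amplifies the signal from hard negatives and suppresses it for easy ones.

```python
import math

def loss_and_positive_grad(sim_pos, sim_neg, temperature):
    """Cross-entropy loss for one positive and one negative, plus the
    gradient of that loss with respect to the positive similarity."""
    # Logit gap between negative and positive after temperature scaling.
    diff = (sim_neg - sim_pos) / temperature
    # Softmax probability of the positive: 1 / (1 + exp(diff)).
    p_pos = 1.0 / (1.0 + math.exp(diff))
    loss = -math.log(p_pos)
    # d(loss)/d(sim_pos) = (p_pos - 1) / temperature
    grad = (p_pos - 1.0) / temperature
    return loss, grad

# Hypothetical similarities: a hard negative sits just below the positive,
# an easy negative sits far below it.
cases = {"hard": (0.80, 0.78), "easy": (0.80, 0.30)}

for name, (sim_pos, sim_neg) in cases.items():
    for t in (0.02, 0.05):
        loss, grad = loss_and_positive_grad(sim_pos, sim_neg, t)
        print(f"{name} negative, temperature={t}: loss={loss:.6f}, grad={grad:.6f}")
```

In this toy setting the hard negative yields a larger gradient magnitude at 0.02 than at 0.05, while the easy negative contributes almost nothing at 0.02.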

Reasoning

The similarity score computation divides by temperature: `scores = similarity(q, p) / temperature`. With normalized embeddings, cosine similarity ranges from -1 to 1. Dividing by 0.02 maps this to [-50, 50], creating a very peaked softmax distribution.
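
As a rough illustration of how peaked the distribution becomes, the sketch below applies a softmax to the same cosine similarities at several temperatures. The similarity values and the helper names `softmax` and `scaled_softmax` are assumptions chosen for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_softmax(similarities, temperature):
    """Divide cosine similarities by the temperature, then apply softmax."""
    return softmax([s / temperature for s in similarities])

# Hypothetical cosine similarities: one positive followed by two negatives.
sims = [0.9, 0.7, 0.5]

for t in (0.02, 0.05, 0.1):
    probs = scaled_softmax(sims, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.4f}" for p in probs))
```

At 0.02 nearly all of the probability mass goes to the top-scoring candidate, whereas at 0.1 the two negatives keep a visible share.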

This is effective because:

  • Normalized embeddings produce bounded similarity scores
  • With the default `train_group_size` of 8 (1 positive + 7 negatives), a sharp distribution concentrates the loss on the negatives that score closest to the positive
  • Hard negative mining ensures the negatives are challenging, requiring strong discrimination

Default from `FlagEmbedding/abc/finetune/embedder/AbsArguments.py:136`:

```python
temperature: Optional[float] = field(
    default=0.02,
    metadata={"help": "temperature used for similarity score"}
)
```

Score computation from `FlagEmbedding/finetune/embedder/encoder_only/base/modeling.py:138`:

```python
scores = self._compute_similarity(q_reps, p_reps) / self.temperature
```
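
To show where this scaling sits in a contrastive loss, here is a minimal, dependency-free sketch for a single query and its candidate group. It is not the FlagEmbedding implementation: the function names `l2_normalize` and `group_contrastive_loss`, the toy vectors, and the convention that the positive sits at index 0 are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def group_contrastive_loss(query, passages, temperature=0.02, positive_index=0):
    """Cross-entropy over temperature-scaled cosine similarities for one query.

    `passages` holds one positive and several negatives; placing the positive
    at index 0 is an assumption of this sketch.
    """
    q = l2_normalize(query)
    scores = []
    for passage in passages:
        p = l2_normalize(passage)
        similarity = sum(a * b for a, b in zip(q, p))
        scores.append(similarity / temperature)  # temperature scaling step
    # Stable log-sum-exp, then negative log-likelihood of the positive.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]

# Toy 3-dimensional embeddings (hypothetical values).
query = [1.0, 0.2, 0.0]
passages = [
    [0.9, 0.3, 0.1],  # positive: points in nearly the same direction as the query
    [0.1, 1.0, 0.0],  # negative
    [0.0, 0.1, 1.0],  # negative
]

loss = group_contrastive_loss(query, passages)
print(f"loss at temperature=0.02: {loss:.6g}")
print(f"loss at temperature=0.1:  {group_contrastive_loss(query, passages, 0.1):.6g}")
```

Because the positive is far more similar to the query than either negative, the loss at 0.02 is close to zero; raising the temperature leaves a small residual loss for the same embeddings.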
