Heuristic: SpeechBrain Score Normalization Tips
| Knowledge Sources | |
|---|---|
| Domains | Speaker_Verification, Optimization |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Score normalization techniques (z-norm, t-norm, s-norm) for calibrating cosine similarity scores in speaker verification, with optional top-k cohort selection.
Description
Raw cosine similarity scores in speaker verification are not calibrated: they vary depending on the speaker's embedding characteristics. Score normalization uses an impostor cohort from the training set to standardize scores. Three methods are supported: z-norm (normalize by enrollment impostor stats), t-norm (normalize by test impostor stats), and s-norm (average of z-norm and t-norm). An optional `cohort_size` parameter restricts normalization to only the top-k most similar impostors, reducing computation while focusing on the most discriminative comparisons.
Usage
Apply when performing speaker verification evaluation and EER/minDCF needs to be minimized. Set `score_norm: "s-norm"` in the YAML config for the best results. Use `cohort_size` (e.g., 200-400) to reduce computation on large cohorts.
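A minimal config fragment might look like the following; the `score_norm` and `cohort_size` keys appear in the recipe code below, while the comment annotations are illustrative:

```yaml
score_norm: "s-norm"   # one of "z-norm" | "t-norm" | "s-norm"
cohort_size: 200       # keep only the top-k most similar impostors; omit to use the full cohort
```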
The Insight (Rule of Thumb)
- Action: Set `score_norm: "s-norm"` and optionally `cohort_size: 200` in speaker verification YAML configs.
- Value: s-norm (recommended), z-norm or t-norm as alternatives. Cohort size 200-400 typical.
- Trade-off: s-norm gives best EER/minDCF but is 2x more expensive than z-norm or t-norm alone (computes both enrollment and test impostor statistics). Top-k cohort reduces cost but may miss some informative impostors.
Reasoning
Cosine similarity scores are speaker-dependent: some speakers naturally produce higher similarity scores with everyone. Without normalization, a single global threshold performs suboptimally. Score normalization converts raw scores to z-scores relative to the impostor distribution, making the threshold more speaker-independent. S-norm is the most robust because it accounts for both enrollment and test speaker biases.
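The speaker-dependent bias can be seen in a small NumPy sketch (synthetic scores, not taken from the recipe): two enrollment speakers whose raw cosine scores sit at different levels become directly comparable after z-norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Speaker A scores ~0.45 against impostors, speaker B only ~0.25.
impostor_a = rng.normal(0.45, 0.05, size=1000)  # A's impostor cosine scores
impostor_b = rng.normal(0.25, 0.05, size=1000)  # B's impostor cosine scores

raw_a, raw_b = 0.60, 0.40  # raw target-trial scores for A and B

# A single global threshold (say 0.5) would accept A's target trial and
# reject B's, even though both sit equally far above their own impostor
# distributions. Z-normalizing each score by its impostor stats fixes this:
z_a = (raw_a - impostor_a.mean()) / impostor_a.std()
z_b = (raw_b - impostor_b.mean()) / impostor_b.std()
# Both z-scores now land at roughly the same level, so one threshold works.
```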
Code from `recipes/VoxCeleb/SpeakerRec/speaker_verification_cosine.py:94-143`:
```python
if "score_norm" in params:
    # Impostor stats for the enrollment side (used by z-norm and s-norm)
    enrol_rep = enrol.repeat(train_cohort.shape[0], 1, 1)
    score_e_c = similarity(enrol_rep, train_cohort)

    if "cohort_size" in params:
        # Keep only the top-k most similar impostors
        score_e_c = torch.topk(score_e_c, k=params["cohort_size"], dim=0)[0]

    mean_e_c = torch.mean(score_e_c, dim=0)
    std_e_c = torch.std(score_e_c, dim=0)

    # Impostor stats for the test side (used by t-norm and s-norm),
    # computed symmetrically
    test_rep = test.repeat(train_cohort.shape[0], 1, 1)
    score_t_c = similarity(test_rep, train_cohort)

    if "cohort_size" in params:
        score_t_c = torch.topk(score_t_c, k=params["cohort_size"], dim=0)[0]

    mean_t_c = torch.mean(score_t_c, dim=0)
    std_t_c = torch.std(score_t_c, dim=0)

    # Apply the selected normalization to the raw trial score
    if params["score_norm"] == "z-norm":
        score = (score - mean_e_c) / std_e_c
    elif params["score_norm"] == "t-norm":
        score = (score - mean_t_c) / std_t_c
    elif params["score_norm"] == "s-norm":
        score_e = (score - mean_e_c) / std_e_c
        score_t = (score - mean_t_c) / std_t_c
        score = 0.5 * (score_e + score_t)
```
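As a self-contained illustration of the same pipeline, the sketch below mirrors the top-k cohort selection and s-norm in plain NumPy. Everything here is synthetic: the embedding dimension, cohort size, and the `unit`/`norm_stats` helpers are hypothetical, not part of the recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_cohort, k = 192, 50, 20  # illustrative sizes, not from the recipe

def unit(x):
    """L2-normalize so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

cohort = unit(rng.normal(size=(n_cohort, dim)))  # impostor embeddings
enrol = unit(rng.normal(size=dim))               # enrollment embedding
test = unit(rng.normal(size=dim))                # test embedding

score = float(enrol @ test)                      # raw cosine trial score

def norm_stats(emb):
    """Mean/std of cosine scores against the top-k most similar impostors."""
    s = cohort @ emb
    s = np.sort(s)[-k:]                          # top-k (the cohort_size trick)
    return s.mean(), s.std()

mean_e_c, std_e_c = norm_stats(enrol)
mean_t_c, std_t_c = norm_stats(test)

score_e = (score - mean_e_c) / std_e_c           # z-norm
score_t = (score - mean_t_c) / std_t_c           # t-norm
s_norm = 0.5 * (score_e + score_t)               # s-norm
```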