
Heuristic:Speechbrain Score Normalization Tips

From Leeroopedia



Knowledge Sources
Domains Speaker_Verification, Optimization
Last Updated 2026-02-09 20:00 GMT

Overview

Score normalization techniques (z-norm, t-norm, s-norm) for calibrating cosine similarity scores in speaker verification, with optional top-k cohort selection.

Description

Raw cosine similarity scores in speaker verification are not calibrated: they vary depending on the speaker's embedding characteristics. Score normalization uses an impostor cohort from the training set to standardize scores. Three methods are supported: z-norm (normalize by enrollment impostor stats), t-norm (normalize by test impostor stats), and s-norm (average of z-norm and t-norm). An optional `cohort_size` parameter restricts normalization to only the top-k most similar impostors, reducing computation while focusing on the most discriminative comparisons.
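The three variants can be sketched in plain Python; the trial score and impostor-cohort scores below are made-up illustrative numbers, not values from the recipe:

```python
from statistics import mean, stdev

# Hypothetical raw cosine score for one trial (enrolment vs. test)
score = 0.62

# Hypothetical impostor-cohort scores against the enrolment and test embeddings
enrol_cohort = [0.10, 0.25, 0.18, 0.30, 0.22]
test_cohort = [0.05, 0.15, 0.12, 0.20, 0.08]

mean_e, std_e = mean(enrol_cohort), stdev(enrol_cohort)
mean_t, std_t = mean(test_cohort), stdev(test_cohort)

z_norm = (score - mean_e) / std_e    # z-norm: enrolment-side impostor stats
t_norm = (score - mean_t) / std_t    # t-norm: test-side impostor stats
s_norm = 0.5 * (z_norm + t_norm)     # s-norm: average of the two
```

Note that s-norm requires both sets of impostor statistics, which is the source of its extra cost.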

Usage

Apply during speaker verification evaluation when the equal error rate (EER) or minimum detection cost function (minDCF) needs to be minimized. Set `score_norm: "s-norm"` in the YAML config for best results. Use `cohort_size` (e.g., 200-400) to reduce computation on large cohorts.
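A minimal sketch of the relevant settings as they would look once the recipe YAML is loaded into a params dict; only `score_norm` and `cohort_size` come from the source, the surrounding recipe config is assumed:

```python
# Hypothetical params dict as produced by loading the recipe YAML;
# only the two keys discussed on this page are shown.
params = {
    "score_norm": "s-norm",  # "z-norm" | "t-norm" | "s-norm"
    "cohort_size": 200,      # top-k impostors; typical range 200-400
}

use_snorm = params.get("score_norm") == "s-norm"
```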

The Insight (Rule of Thumb)

  • Action: Set `score_norm: "s-norm"` and optionally `cohort_size: 200` in speaker verification YAML configs.
  • Value: s-norm (recommended), z-norm or t-norm as alternatives. Cohort size 200-400 typical.
  • Trade-off: s-norm gives best EER/minDCF but is 2x more expensive than z-norm or t-norm alone (computes both enrollment and test impostor statistics). Top-k cohort reduces cost but may miss some informative impostors.

Reasoning

Cosine similarity scores are speaker-dependent: some speakers naturally produce higher similarity scores with everyone. Without normalization, a single global threshold performs suboptimally. Score normalization converts raw scores to z-scores relative to the impostor distribution, making the threshold more speaker-independent. S-norm is the most robust because it accounts for both enrollment and test speaker biases.
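A toy illustration of that speaker dependence, with made-up numbers: two enrolment speakers whose raw scores sit at different levels become directly comparable after z-norm against their own impostor cohorts.

```python
from statistics import mean, stdev

# Speaker A tends to score high with everyone; speaker B scores low.
cohort_a = [0.40, 0.45, 0.50, 0.55, 0.60]  # A's impostor scores (high baseline)
cohort_b = [0.00, 0.05, 0.10, 0.15, 0.20]  # B's impostor scores (low baseline)

target_a = 0.80  # genuine trial score for A
target_b = 0.40  # genuine trial score for B (below A's *impostor* mean!)

z_a = (target_a - mean(cohort_a)) / stdev(cohort_a)
z_b = (target_b - mean(cohort_b)) / stdev(cohort_b)
# After z-norm, both genuine trials sit the same distance above their
# impostor distributions, so a single global threshold separates them.
```

No single raw threshold handles both speakers here: B's genuine score (0.40) is below A's impostor mean (0.50), yet both normalized scores coincide.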

Code from `recipes/VoxCeleb/SpeakerRec/speaker_verification_cosine.py:94-143`:

# enrol, test: embeddings for the trial; train_cohort: impostor embeddings
# from the training set; similarity: cosine similarity.
if "score_norm" in params:
    # Impostor stats for the enrolment side (used by z-norm and s-norm)
    enrol_rep = enrol.repeat(train_cohort.shape[0], 1, 1)
    score_e_c = similarity(enrol_rep, train_cohort)
    if "cohort_size" in params:
        score_e_c = torch.topk(score_e_c, k=params["cohort_size"], dim=0)[0]
    mean_e_c = torch.mean(score_e_c, dim=0)
    std_e_c = torch.std(score_e_c, dim=0)

    # Impostor stats for the test side (used by t-norm and s-norm)
    test_rep = test.repeat(train_cohort.shape[0], 1, 1)
    score_t_c = similarity(test_rep, train_cohort)
    if "cohort_size" in params:
        score_t_c = torch.topk(score_t_c, k=params["cohort_size"], dim=0)[0]
    mean_t_c = torch.mean(score_t_c, dim=0)
    std_t_c = torch.std(score_t_c, dim=0)

    # Apply normalization
    if params["score_norm"] == "z-norm":
        score = (score - mean_e_c) / std_e_c
    elif params["score_norm"] == "t-norm":
        score = (score - mean_t_c) / std_t_c
    elif params["score_norm"] == "s-norm":
        score_e = (score - mean_e_c) / std_e_c
        score_t = (score - mean_t_c) / std_t_c
        score = 0.5 * (score_e + score_t)
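The top-k cohort selection can be sketched without torch for a single trial; the scores are illustrative, and `sorted(...)[:k]` here stands in for `torch.topk(..., dim=0)[0]`:

```python
from statistics import mean, stdev

# Hypothetical cohort scores for one enrolment embedding
score_e_c = [0.05, 0.31, 0.12, 0.27, 0.40, 0.08, 0.22, 0.35]

k = 4  # params["cohort_size"]
topk = sorted(score_e_c, reverse=True)[:k]  # keep the k most similar impostors

mean_e_c, std_e_c = mean(topk), stdev(topk)
score = 0.62
z_norm = (score - mean_e_c) / std_e_c
```

Restricting to the top-k raises the impostor mean relative to using the full cohort, so the normalized score reflects the hardest (most similar) impostors rather than easy ones.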
