
Principle:Speechbrain Speechbrain Speaker Verification Scoring

From Leeroopedia


Property Value
Principle Name Speaker Verification Scoring
Domains Speaker_Recognition, Biometrics
Related Implementation Implementation:Speechbrain_Speechbrain_Get_Verification_Scores
Repository speechbrain/speechbrain
Source Context recipes/VoxCeleb/SpeakerRec/speaker_verification_cosine.py

Overview

Computing similarity scores between speaker embeddings for identity verification decisions. Speaker verification is the task of determining whether two utterances are spoken by the same person. Given pre-computed enrollment and test embeddings, a similarity metric produces a scalar score indicating how likely the two utterances share the same speaker identity.

Theoretical Foundations

Cosine Similarity

The primary scoring metric for modern speaker verification is cosine similarity, which measures the angle between two embedding vectors:

score = (e1 · e2) / (||e1|| × ||e2||)

where:

  • e1 is the enrollment embedding
  • e2 is the test embedding
  • The score ranges from -1 (opposite directions) to +1 (identical directions)

Higher scores indicate greater likelihood of the same speaker.
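The formula above can be sketched directly. This is a minimal standalone helper (a hypothetical function name; the SpeechBrain recipe itself computes the same quantity with torch.nn.CosineSimilarity on batched tensors):

```python
import numpy as np

def cosine_score(e1, e2):
    """Cosine similarity between an enrollment and a test embedding.

    Hypothetical standalone helper; equivalent to the normalized dot
    product score = (e1 · e2) / (||e1|| × ||e2||)."""
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

# Identical directions score +1, opposite directions score -1,
# and scaling either vector leaves the score unchanged.
print(cosine_score([1.0, 0.0], [2.0, 0.0]))   # -> 1.0
print(cosine_score([1.0, 0.0], [-1.0, 0.0]))  # -> -1.0
```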

Cosine similarity is preferred over Euclidean distance because:

  • It is invariant to embedding magnitude, focusing purely on directional similarity.
  • Embeddings trained with softmax-based losses naturally encode speaker identity in the vector's direction and confidence in its magnitude.
  • It requires no additional training (unlike PLDA or learned scoring backends).

Verification Trial Protocol

A verification trial consists of a triplet:

label enrol_utterance_id test_utterance_id

where:

  • label is 1 (target, same speaker) or 0 (non-target, different speaker)
  • enrol_utterance_id identifies the enrollment utterance
  • test_utterance_id identifies the test utterance

For each trial, the system:

  1. Retrieves the pre-computed enrollment embedding
  2. Retrieves the pre-computed test embedding
  3. Computes the cosine similarity score
  4. Records the score as positive (if label=1) or negative (if label=0)
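The four steps above can be sketched as a small scoring loop. The names `trials`, `enrol_emb`, and `test_emb` are hypothetical inputs (a list of label/ID triplets and two dictionaries mapping utterance IDs to pre-computed embeddings), not identifiers from the SpeechBrain recipe:

```python
def score_trials(trials, enrol_emb, test_emb, score_fn):
    """Score each (label, enrol_id, test_id) trial and split the results
    into positive (label=1) and negative (label=0) score lists.

    score_fn is any scalar similarity, e.g. cosine similarity."""
    positive, negative = [], []
    for label, enrol_id, test_id in trials:
        # Steps 1-3: look up both cached embeddings, then score them.
        score = score_fn(enrol_emb[enrol_id], test_emb[test_id])
        # Step 4: record the score under the trial's label.
        (positive if label == 1 else negative).append(score)
    return positive, negative
```

Keeping positive and negative scores separate is what later enables metrics such as the equal error rate (EER) to be computed from the two score distributions.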

Score Normalization

Raw cosine scores can be poorly calibrated across different speakers. Score normalization techniques improve calibration by comparing each score against a distribution of impostor scores from a cohort:

Z-Norm (Zero Normalization)

Normalizes using the enrollment speaker's impostor distribution:

score_znorm = (score - mean_e_c) / std_e_c

where mean_e_c and std_e_c are the mean and standard deviation of cosine scores between the enrollment embedding and all cohort embeddings.

T-Norm (Test Normalization)

Normalizes using the test utterance's impostor distribution:

score_tnorm = (score - mean_t_c) / std_t_c

S-Norm (Symmetric Normalization)

Averages z-norm and t-norm:

score_snorm = 0.5 * (score_znorm + score_tnorm)

S-norm generally provides the best performance by accounting for both enrollment and test side variability.

Cohort Selection

The impostor cohort is typically drawn from the training set. An optional cohort_size parameter selects only the top-K most similar impostors for normalization, reducing computational cost and focusing on the most informative comparisons.
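A minimal sketch of s-norm with the optional top-K cohort selection described above. The function signature and argument names are hypothetical; the inputs are assumed to be the raw trial score plus the enrollment and test embeddings' cosine scores against every cohort embedding:

```python
import numpy as np

def snorm(raw_score, enrol_cohort_scores, test_cohort_scores, cohort_size=None):
    """Symmetric score normalization (s-norm) of a raw cosine score.

    enrol_cohort_scores / test_cohort_scores: scores of the enrollment /
    test embedding against all cohort embeddings (hypothetical inputs).
    If cohort_size is set, only the top-K most similar impostors are used."""
    e = np.sort(np.asarray(enrol_cohort_scores, dtype=float))[::-1]
    t = np.sort(np.asarray(test_cohort_scores, dtype=float))[::-1]
    if cohort_size is not None:
        e, t = e[:cohort_size], t[:cohort_size]
    # Z-norm: standardize against the enrollment side's impostor statistics.
    z = (raw_score - e.mean()) / e.std()
    # T-norm: standardize against the test side's impostor statistics.
    tn = (raw_score - t.mean()) / t.std()
    # S-norm: average the two.
    return 0.5 * (z + tn)
```

When the two cohort score lists are identical, z-norm, t-norm, and s-norm coincide, which is a convenient sanity check for an implementation.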

Decision Making

Given a score and a threshold theta:

if score > theta:
    decision = "same speaker" (accept)
else:
    decision = "different speaker" (reject)

The threshold is set based on the desired operating point (e.g., EER threshold, or application-specific cost function).
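The decision rule, together with one way of estimating an EER operating point from held-out scores, can be sketched as follows. This brute-force threshold sweep is an illustration only (SpeechBrain ships its own EER utilities):

```python
import numpy as np

def verify(score, theta):
    """Accept the trial as 'same speaker' when the score exceeds theta."""
    return score > theta

def eer_threshold(positive, negative):
    """Rough EER threshold: sweep every observed score as a candidate and
    keep the one where false-accept and false-reject rates are closest."""
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    best, best_gap = None, float("inf")
    for th in np.sort(np.concatenate([positive, negative])):
        far = np.mean(negative > th)    # impostors wrongly accepted
        frr = np.mean(positive <= th)   # targets wrongly rejected
        if abs(far - frr) < best_gap:
            best, best_gap = th, abs(far - frr)
    return best
```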

Embedding Computation Loop

Before scoring, embeddings must be computed for all enrollment and test utterances. The compute_embedding_loop function:

  1. Iterates through a DataLoader of utterances
  2. Computes embeddings using the trained model (under torch.no_grad())
  3. Stores embeddings in a dictionary keyed by segment ID
  4. Skips utterances already in the dictionary (for efficiency with duplicate IDs)
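The caching loop above can be sketched like this, with `embed_fn` standing in for the trained model's forward pass (which the SpeechBrain recipe runs under torch.no_grad()); the data-loader shape here is a simplified assumption:

```python
def compute_embedding_loop(data_loader, embed_fn):
    """Cache one embedding per segment ID across an iterable of batches.

    data_loader is assumed to yield (segment_ids, utterances) pairs;
    embed_fn maps one utterance to its embedding."""
    embeddings = {}
    for seg_ids, batch in data_loader:
        for seg_id, utt in zip(seg_ids, batch):
            if seg_id in embeddings:
                # Duplicate IDs (utterances reused across trials) are
                # computed only once.
                continue
            embeddings[seg_id] = embed_fn(utt)
    return embeddings
```

Because the dictionary is keyed by segment ID, the later scoring stage becomes a pure lookup, no matter how many trials reuse the same utterance.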

Key Design Decisions

  • Pre-computed embeddings: All embeddings are computed once and cached in memory before scoring, avoiding redundant computation for utterances that appear in multiple trials.
  • Cosine similarity over learned backends: Cosine scoring requires no additional training parameters and generalizes well when combined with strong embedding models.
  • Optional score normalization: The system supports z-norm, t-norm, and s-norm as optional post-processing, controlled by a single configuration parameter.
  • Score file output: All trial scores are written to a text file for post-hoc analysis and reproducibility.
