Principle:Speechbrain Speechbrain Speaker Verification Scoring
| Property | Value |
|---|---|
| Principle Name | Speaker Verification Scoring |
| Domains | Speaker_Recognition, Biometrics |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Get_Verification_Scores |
| Repository | speechbrain/speechbrain |
| Source Context | recipes/VoxCeleb/SpeakerRec/speaker_verification_cosine.py |
Overview
Computing similarity scores between speaker embeddings for identity verification decisions. Speaker verification is the task of determining whether two utterances are spoken by the same person. Given pre-computed enrollment and test embeddings, a similarity metric produces a scalar score indicating how likely the two utterances share the same speaker identity.
Theoretical Foundations
Cosine Similarity
The primary scoring metric for modern speaker verification is cosine similarity, which measures the angle between two embedding vectors:
score = (e1 . e2) / (||e1|| x ||e2||)
where:
- e1 is the enrollment embedding
- e2 is the test embedding
- The score ranges from -1 (opposite directions) to +1 (identical directions)
Higher scores indicate greater likelihood of the same speaker.
Cosine similarity is preferred over Euclidean distance because:
- It is invariant to embedding magnitude, focusing purely on directional similarity.
- Speaker embeddings trained with softmax-based losses naturally produce embeddings where direction encodes identity and magnitude encodes confidence.
- It requires no additional training (unlike PLDA or learned scoring backends).
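The scoring formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the SpeechBrain implementation (which operates on batched torch tensors); the function name `cosine_score` is chosen here for clarity.

```python
import numpy as np

def cosine_score(e1: np.ndarray, e2: np.ndarray) -> float:
    """Cosine similarity between an enrollment and a test embedding."""
    # Normalize each embedding to unit length, then take the dot product;
    # the result depends only on the angle between the two vectors.
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    return float(np.dot(e1, e2))

# Scaling an embedding does not change the score (magnitude invariance):
same = cosine_score(np.array([1.0, 2.0]), np.array([2.0, 4.0]))      # +1.0
opposite = cosine_score(np.array([1.0, 0.0]), np.array([-1.0, 0.0])) # -1.0
```

Because both vectors are normalized first, the dot product of the unit vectors is exactly the cosine of the angle between them, bounded in [-1, +1].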
Verification Trial Protocol
A verification trial consists of a triplet:
label enrol_utterance_id test_utterance_id
where:
- label is 1 (target, same speaker) or 0 (non-target, different speaker)
- enrol_utterance_id identifies the enrollment utterance
- test_utterance_id identifies the test utterance
For each trial, the system:
- Retrieves the pre-computed enrollment embedding
- Retrieves the pre-computed test embedding
- Computes the cosine similarity score
- Appends the score to the positive list (target trials, label=1) or the negative list (non-target trials, label=0)
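The per-trial steps above can be sketched as a small scoring loop. This is a hypothetical helper (the names `score_trials`, `trials`, and `embeddings` are illustrative), assuming embeddings have already been cached in a dictionary keyed by utterance ID:

```python
import numpy as np

def score_trials(trials, embeddings):
    """Score verification trials against pre-computed embeddings.

    trials: iterable of (label, enrol_id, test_id) triplets.
    embeddings: dict mapping utterance ID -> numpy embedding vector.
    Returns two lists: positive (target) and negative (non-target) scores.
    """
    positive, negative = [], []
    for label, enrol_id, test_id in trials:
        e1 = embeddings[enrol_id]  # enrollment embedding
        e2 = embeddings[test_id]   # test embedding
        score = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
        (positive if label == 1 else negative).append(score)
    return positive, negative

# Toy example: two utterances from speaker A, one from speaker B.
emb = {
    "spkA_utt1": np.array([1.0, 0.0]),
    "spkA_utt2": np.array([0.9, 0.1]),
    "spkB_utt1": np.array([0.0, 1.0]),
}
trials = [(1, "spkA_utt1", "spkA_utt2"), (0, "spkA_utt1", "spkB_utt1")]
pos, neg = score_trials(trials, emb)
```

Separating target and non-target scores at this stage makes downstream metric computation (EER, minDCF) a matter of comparing the two score distributions.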
Score Normalization
Raw cosine scores can be poorly calibrated across different speakers. Score normalization techniques improve calibration by comparing each score against a distribution of impostor scores from a cohort:
Z-Norm (Zero Normalization)
Normalizes using the enrollment speaker's impostor distribution:
score_znorm = (score - mean_e_c) / std_e_c
where mean_e_c and std_e_c are the mean and standard deviation of cosine scores between the enrollment embedding and all cohort embeddings.
T-Norm (Test Normalization)
Normalizes using the test utterance's impostor distribution:
score_tnorm = (score - mean_t_c) / std_t_c
S-Norm (Symmetric Normalization)
Averages z-norm and t-norm:
score_snorm = 0.5 * (score_znorm + score_tnorm)
S-norm generally provides the best performance by accounting for both enrollment and test side variability.
Cohort Selection
The impostor cohort is typically drawn from the training set. An optional cohort_size parameter selects only the top-K most similar impostors for normalization, reducing computational cost and focusing on the most informative comparisons.
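The three normalization variants and the top-K cohort selection can be combined in one function. The sketch below is an assumption-laden simplification (single trial at a time, cohort embeddings pre-normalized to unit length); the real recipe vectorizes this over all trials:

```python
import numpy as np

def snorm(raw_score, e_enrol, e_test, cohort, cohort_size=None):
    """Adaptive s-norm of a single raw cosine score.

    cohort: (N, D) matrix of unit-length impostor embeddings.
    cohort_size: if set, keep only the top-K most similar cohort scores
    on each side before computing the normalization statistics.
    """
    # Impostor scores: enrollment vs. cohort, and test vs. cohort.
    e_scores = cohort @ e_enrol
    t_scores = cohort @ e_test
    if cohort_size is not None:
        e_scores = np.sort(e_scores)[-cohort_size:]
        t_scores = np.sort(t_scores)[-cohort_size:]
    # z-norm uses the enrollment side, t-norm the test side.
    z = (raw_score - e_scores.mean()) / e_scores.std()
    t = (raw_score - t_scores.mean()) / t_scores.std()
    return 0.5 * (z + t)  # s-norm: symmetric average

# Toy cohort of three unit-length impostor embeddings.
cohort = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
e_enrol = np.array([1.0, 0.0])
e_test = np.array([0.8, 0.6])
s = snorm(0.8, e_enrol, e_test, cohort)
s_top = snorm(0.8, e_enrol, e_test, cohort, cohort_size=2)
```

After normalization, scores are expressed in units of standard deviations above the impostor mean, which makes a single global threshold far more consistent across speakers.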
Decision Making
Given a score and a threshold theta:
if score > theta:
    decision = "same speaker" (accept)
else:
    decision = "different speaker" (reject)
The threshold is set based on the desired operating point (e.g., EER threshold, or application-specific cost function).
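A common way to pick the operating point is the EER threshold, where the false-accept and false-reject rates are equal. The sketch below finds it by a simple sweep over candidate thresholds (a hypothetical helper; production code typically interpolates the ROC instead):

```python
import numpy as np

def eer_threshold(positive, negative):
    """Threshold where false-accept rate ~= false-reject rate,
    found by sweeping every observed score as a candidate."""
    positive = np.asarray(positive)
    negative = np.asarray(negative)
    candidates = np.sort(np.concatenate([positive, negative]))
    best_thr, best_gap = float(candidates[0]), float("inf")
    for thr in candidates:
        far = np.mean(negative > thr)    # impostors wrongly accepted
        frr = np.mean(positive <= thr)   # targets wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best_thr = abs(far - frr), float(thr)
    return best_thr

def verify(score, threshold):
    """Apply the accept/reject rule from the section above."""
    return "same speaker" if score > threshold else "different speaker"

# Toy, well-separated score distributions.
thr = eer_threshold([0.8, 0.7, 0.9], [0.1, 0.2, 0.3])
decision = verify(0.75, thr)
```

For cost-sensitive applications, the same sweep can instead minimize a detection cost function that weights false accepts and false rejects differently.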
Embedding Computation Loop
Before scoring, embeddings must be computed for all enrollment and test utterances. The compute_embedding_loop function:
- Iterates through a DataLoader of utterances
- Computes embeddings using the trained model (under torch.no_grad())
- Stores embeddings in a dictionary keyed by segment ID
- Skips utterances already in the dictionary (for efficiency with duplicate IDs)
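The caching pattern above can be sketched as follows. This is a simplified stand-in, not the actual SpeechBrain `compute_embedding_loop`: the `DummyEmbedder` model and the batch layout `(seg_ids, wavs, lens)` are assumptions for illustration, and a real loop would also handle devices and variable-length padding:

```python
import torch

class DummyEmbedder(torch.nn.Module):
    """Hypothetical stand-in for a trained speaker embedding model."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16000, 8)

    def forward(self, wavs, lens):
        # lens is unused here; a real model uses it to mask padding.
        return self.proj(wavs)

def compute_embedding_loop(model, dataloader):
    """Compute and cache one embedding per segment ID."""
    embeddings = {}
    model.eval()
    with torch.no_grad():  # inference only: no gradient bookkeeping
        for seg_ids, wavs, lens in dataloader:
            embs = model(wavs, lens)
            for seg_id, emb in zip(seg_ids, embs):
                if seg_id in embeddings:  # skip duplicate segment IDs
                    continue
                embeddings[seg_id] = emb
    return embeddings

# Two batches with one overlapping segment ID ("utt2").
batches = [
    (["utt1", "utt2"], torch.randn(2, 16000), torch.ones(2)),
    (["utt2", "utt3"], torch.randn(2, 16000), torch.ones(2)),
]
cache = compute_embedding_loop(DummyEmbedder(), batches)
```

Because every embedding is computed exactly once, trial scoring afterward reduces to dictionary lookups plus dot products, regardless of how many trials reuse the same utterance.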
Key Design Decisions
- Pre-computed embeddings: All embeddings are computed once and cached in memory before scoring, avoiding redundant computation for utterances that appear in multiple trials.
- Cosine similarity over learned backends: Cosine scoring requires no additional training parameters and generalizes well when combined with strong embedding models.
- Optional score normalization: The system supports z-norm, t-norm, and s-norm as optional post-processing, controlled by a single configuration parameter.
- Score file output: All trial scores are written to a text file for post-hoc analysis and reproducibility.