Principle:Speechbrain Speechbrain Verification Metrics
| Property | Value |
|---|---|
| Principle Name | Verification Metrics |
| Domains | Evaluation_Metrics, Speaker_Recognition |
| Related Implementation | Implementation:Speechbrain_Speechbrain_EER_And_MinDCF |
| Repository | speechbrain/speechbrain |
| Source Context | speechbrain/utils/metric_stats.py |
| Knowledge Sources | NIST Speaker Recognition Evaluation guidelines |
Overview
Speaker verification systems are evaluated using the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF). These are the standard metrics adopted by the NIST Speaker Recognition Evaluations and the VoxCeleb Speaker Recognition Challenges for benchmarking speaker verification performance.
Theoretical Foundations
Binary Decision Framework
Speaker verification is a binary decision problem. For each trial, the system produces a scalar score. A threshold theta converts this score into a binary decision:
- Accept (same speaker): score > theta
- Reject (different speaker): score <= theta
Two types of errors arise:
- False Acceptance (FA): A different-speaker trial is incorrectly accepted (score > theta when label = 0).
- False Rejection (FR): A same-speaker trial is incorrectly rejected (score <= theta when label = 1).
The rates of these errors are:
FAR(theta) = |{negative trials with score > theta}| / |{all negative trials}|
FRR(theta) = |{positive trials with score <= theta}| / |{all positive trials}|
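These two definitions can be expressed directly in a minimal pure-Python sketch (the function name and argument layout are illustrative, not SpeechBrain's API):

```python
def far_frr(positive_scores, negative_scores, theta):
    """Compute FAR and FRR at a single threshold theta.

    positive_scores: scores for same-speaker (label = 1) trials
    negative_scores: scores for different-speaker (label = 0) trials
    """
    # False acceptance rate: negative trials scored above the threshold.
    far = sum(s > theta for s in negative_scores) / len(negative_scores)
    # False rejection rate: positive trials scored at or below the threshold.
    frr = sum(s <= theta for s in positive_scores) / len(positive_scores)
    return far, frr
```

Note that the strict inequality for acceptance (score > theta) matches the decision rule stated above; flipping it to >= shifts which trials count as errors at thresholds that coincide with observed scores.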
Equal Error Rate (EER)
The EER is the operating point where the false acceptance rate equals the false rejection rate:
EER = FAR(theta*) = FRR(theta*), where theta* is the threshold at which the two error rates coincide.
Properties:
- Provides a single-number summary of system accuracy that is independent of any prior or cost assumptions.
- Lower EER indicates better performance.
- Typical state-of-the-art values on VoxCeleb1-O are below 1%.
Computation:
- All positive and negative scores are combined and sorted to form candidate thresholds.
- Intermediate thresholds (midpoints between consecutive scores) are added for finer resolution.
- For each threshold, FAR and FRR are computed.
- The threshold that minimizes |FAR - FRR| is selected.
- EER is reported as (FAR + FRR) / 2 at that threshold (since exact equality may not occur with discrete scores).
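The steps above can be sketched in pure Python as follows (illustrative only; SpeechBrain's implementation in metric_stats.py operates on torch tensors):

```python
def compute_eer(positive_scores, negative_scores):
    """Return (EER, threshold) via a discrete threshold search."""
    # Candidate thresholds: all observed scores, plus midpoints
    # between consecutive sorted scores for finer resolution.
    scores = sorted(positive_scores + negative_scores)
    thresholds = scores + [(a + b) / 2 for a, b in zip(scores, scores[1:])]

    best = None  # (far, frr, theta) minimizing |FAR - FRR|
    for theta in thresholds:
        far = sum(s > theta for s in negative_scores) / len(negative_scores)
        frr = sum(s <= theta for s in positive_scores) / len(positive_scores)
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr, theta)

    far, frr, theta = best
    # Exact FAR = FRR may not occur with discrete scores,
    # so report the average at the closest crossing point.
    return (far + frr) / 2, theta
```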
Minimum Detection Cost Function (minDCF)
The minDCF is a cost-weighted metric that accounts for the relative costs of different error types and the prior probability of encountering a target speaker:
DCF(theta) = C_miss * P_miss(theta) * P_target + C_fa * P_fa(theta) * (1 - P_target)
where:
- C_miss = cost of a missed detection (default: 1.0)
- C_fa = cost of a false alarm (default: 1.0)
- P_target = prior probability of the target speaker (default: 0.01)
- P_miss(theta) = miss probability = FRR(theta)
- P_fa(theta) = false alarm probability = FAR(theta)
The minimum DCF over all possible thresholds is reported:
minDCF = min_theta DCF(theta)
Properties:
- Reflects application-specific requirements through the cost parameters.
- With P_target = 0.01, the metric emphasizes performance in scenarios where target speakers are rare (e.g., surveillance, access control).
- The "minimum" indicates the best possible performance the system could achieve if the optimal threshold were known (i.e., an oracle threshold).
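A minimal sketch of the minDCF search, reusing the same discrete threshold grid as the EER computation (function name is illustrative; the defaults mirror the NIST SRE 2010 parameters quoted above):

```python
def compute_min_dcf(positive_scores, negative_scores,
                    c_miss=1.0, c_fa=1.0, p_target=0.01):
    """Return (minDCF, threshold) by minimizing the DCF over thresholds."""
    scores = sorted(positive_scores + negative_scores)
    thresholds = scores + [(a + b) / 2 for a, b in zip(scores, scores[1:])]

    best_cost, best_theta = float("inf"), None
    for theta in thresholds:
        p_fa = sum(s > theta for s in negative_scores) / len(negative_scores)
        p_miss = sum(s <= theta for s in positive_scores) / len(positive_scores)
        # DCF(theta) = C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target)
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        if cost < best_cost:
            best_cost, best_theta = cost, theta
    return best_cost, best_theta
```

With p_target = 0.01, a false alarm contributes with weight 0.99 while a miss contributes with weight 0.01, so the optimal threshold is typically pushed high to suppress false alarms even at the cost of more misses.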
Relationship Between EER and minDCF
- EER assumes equal costs for both error types and a flat prior -- it is threshold-agnostic and application-neutral.
- minDCF weights errors according to their real-world cost -- it is application-aware.
- A system may have excellent EER but poor minDCF if its score distributions are poorly calibrated for the specific cost/prior regime.
- Both metrics should be reported for a complete evaluation.
Threshold Selection
Both EER and minDCF report the optimal threshold along with the metric value:
- EER threshold: The threshold where FAR equals FRR. Used as a reasonable operating point when no application-specific requirements are known.
- minDCF threshold: The threshold that minimizes the detection cost function. This is the optimal operating point for the specific cost/prior configuration.
Standard Evaluation Protocols
The VoxCeleb benchmark defines three standard evaluation protocols:
- VoxCeleb1-O (Original): The original test set with 37,611 trial pairs.
- VoxCeleb1-E (Extended): An extended test set with approximately 579,818 trial pairs.
- VoxCeleb1-H (Hard): A hard subset where enrollment and test utterances share the same nationality and gender.
Key Design Decisions
- Score-based evaluation: Metrics operate on continuous scores rather than binary decisions, allowing the evaluation of the full operating range.
- Discrete threshold search: Both EER and minDCF are computed by searching over all possible thresholds derived from the score distribution, with interpolated intermediate thresholds for finer granularity.
- Default DCF parameters: The defaults (C_miss=1.0, C_fa=1.0, P_target=0.01) follow the NIST SRE 2010 evaluation plan.