Principle:Speechbrain Speechbrain Speaker Diarization Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Speaker_Diarization, Speaker_Recognition |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Speaker diarization segments an audio recording by speaker identity, answering the question "who spoke when". The pipeline described here does so by combining voice activity detection, speaker embedding extraction, and spectral clustering.
Description
Speaker diarization is the task of partitioning a multi-speaker audio stream into homogeneous segments, each attributed to a single speaker. The problem is challenging because the number of speakers is often unknown in advance, speakers may overlap, and turn-taking patterns vary widely across domains (meetings, broadcast, telephone). The embedding-based diarization pipeline addresses this by first detecting speech regions, then extracting fixed-dimensional speaker embeddings from short uniform segments, and finally grouping these embeddings into speaker clusters using spectral clustering.
Usage
Apply this pipeline when you need to determine speaker turn boundaries and per-segment speaker labels in multi-speaker recordings such as meetings, interviews, or broadcast content. The labels are relative (speaker 1, speaker 2, and so on), not named identities. The approach takes oracle or predicted voice activity detection boundaries as input and works best when each speaker produces enough speech for reliable embedding extraction.
Theoretical Basis
Pipeline Architecture
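The diarization pipeline consists of three sequential stages (voice activity detection, speaker embedding extraction, and spectral clustering), which expand into the following steps: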
Audio Recording
-> Voice Activity Detection (VAD)
-> Uniform Segmentation (e.g., 1.5s windows with 0.75s shift)
-> Speaker Embedding Extraction (ECAPA-TDNN or x-vector)
-> Affinity Matrix Construction (cosine similarity)
-> Spectral Clustering
-> Speaker Labels per Segment
Stage 1: Voice Activity Detection
VAD identifies the speech regions in the recording, discarding silence and non-speech noise. In the oracle setting, ground-truth speech boundaries from reference annotations are used. In practical systems, a neural VAD or energy-based detector provides these boundaries. The VAD output is a set of time intervals marking speech activity.
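As an illustration of the energy-based option mentioned above, the following is a minimal sketch, not the detector used by any particular toolkit. The function name `energy_vad`, the 20 ms frame length, and the fixed energy threshold are all assumptions chosen for the example; practical detectors usually add smoothing and adaptive thresholds.

```python
def energy_vad(samples, sample_rate, frame_sec=0.02, threshold=0.01):
    """Return (start_sec, end_sec) intervals whose mean frame energy exceeds threshold.

    Hypothetical helper: a fixed threshold on raw energy is an assumption
    made for illustration only.
    """
    frame_len = max(1, int(round(frame_sec * sample_rate)))
    n_frames = len(samples) // frame_len
    intervals = []
    start = None  # start time of the speech run currently open, if any
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = f * frame_len / sample_rate
        if energy > threshold and start is None:
            start = t                      # speech run begins
        elif energy <= threshold and start is not None:
            intervals.append((start, t))   # speech run ends
            start = None
    if start is not None:                  # close a run that reaches the end
        intervals.append((start, n_frames * frame_len / sample_rate))
    return intervals
```

The returned list of time intervals has the same shape as the VAD output described above, so it could feed the subsegmentation step of Stage 2.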
Stage 2: Speaker Embedding Extraction
Each speech region is divided into fixed-length subsegments (typically 1.5-second windows with a 0.75-second shift, so consecutive windows overlap by 0.75 seconds). For each subsegment, a pretrained speaker embedding model (such as ECAPA-TDNN) extracts a fixed-dimensional vector:
For each subsegment s_i:
    features = compute_features(s_i)       # Fbank features
    features = mean_var_norm(features)     # Instance normalization
    emb_i = embedding_model(features)      # e.g., 192-dim vector
    emb_i = mean_var_norm_emb(emb_i)       # Embedding normalization
The embeddings encode speaker identity information and are designed to be similar for segments from the same speaker and dissimilar for different speakers.
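The uniform segmentation that precedes embedding extraction can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `uniform_subsegments` is hypothetical, and keeping a shorter tail subsegment at the end of each speech region is one possible policy that the text above does not specify.

```python
def uniform_subsegments(speech_intervals, win=1.5, shift=0.75):
    """Split each (start, end) speech interval into windows of `win` seconds
    taken every `shift` seconds.

    Assumption: any speech left uncovered at the end of an interval is kept
    as one shorter tail subsegment rather than dropped.
    """
    subsegments = []
    for start, end in speech_intervals:
        t = start
        covered = start  # end time of the last full window in this interval
        while t + win <= end:
            subsegments.append((t, t + win))
            covered = t + win
            t += shift
        if covered < end:
            subsegments.append((t, end))  # shorter tail
    return subsegments


# A 4-second speech region yields four full windows plus a 1-second tail.
for seg in uniform_subsegments([(0.0, 4.0)]):
    print(seg)
```

Each returned interval would then be passed through the feature and embedding steps listed above.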
Stage 3: Spectral Clustering
Given N embeddings, spectral clustering determines the speaker assignment:
1. Compute affinity matrix: A[i,j] = cosine_similarity(emb_i, emb_j)
2. Apply p-percentile thresholding to prune weak connections
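3. Compute the graph Laplacian: L = D - A, where D is the diagonal degree matrix with D[i,i] = sum_j A[i,j]
4. Extract the eigenvectors of L corresponding to the k smallest eigenvalues
5. Apply k-means on the rows of the resulting N x k eigenvector matrix to obtain cluster assignments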
The number of speakers k can be specified in advance or estimated automatically with the eigengap heuristic: the eigenvalues of the Laplacian are sorted in ascending order, and if the largest gap between consecutive eigenvalues falls between the k-th and (k+1)-th, k is taken as the number of clusters.
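The clustering steps can be sketched with NumPy as below. This is an illustrative sketch, not SpeechBrain's implementation. The row-wise pruning rule, the clipping of negative similarities, the symmetrization after pruning, the farthest-point k-means initialization, and all function names are assumptions made for the example.

```python
import numpy as np


def cosine_affinity(embs):
    """Pairwise cosine similarity between the rows of `embs`."""
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return unit @ unit.T


def prune_rows(affinity, percentile):
    """Zero entries below each row's percentile, then symmetrize (assumed scheme)."""
    pruned = np.maximum(affinity, 0.0)  # assumption: negative similarity = no edge
    thresholds = np.percentile(pruned, percentile, axis=1, keepdims=True)
    pruned = np.where(pruned < thresholds, 0.0, pruned)
    return 0.5 * (pruned + pruned.T)


def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_cluster(embs, k=None, percentile=50.0, max_speakers=10):
    """Return (labels, k); k is estimated with the eigengap heuristic if None."""
    affinity = prune_rows(cosine_affinity(embs), percentile)
    laplacian = np.diag(affinity.sum(axis=1)) - affinity  # L = D - A
    eigvals, eigvecs = np.linalg.eigh(laplacian)          # ascending eigenvalues
    if k is None:
        limit = min(max_speakers, len(eigvals) - 1)
        gaps = np.diff(eigvals[:limit + 1])
        k = int(np.argmax(gaps)) + 1                      # largest gap after the k-th
    labels = kmeans(eigvecs[:, :k], k)                    # rows = spectral embeddings
    return labels, k


# Toy input: three 3-dim "embeddings" per speaker, pointing in two directions.
toy = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]])
labels, k = spectral_cluster(toy, percentile=50.0, max_speakers=4)
print(k, labels)
```

On the toy input the pruned affinity matrix is block diagonal, so the Laplacian has two near-zero eigenvalues and the largest eigengap follows the second one. Real embeddings are noisier, and the pruning percentile and `max_speakers` bound typically need tuning on held-out data.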
Evaluation: Diarization Error Rate (DER)
The standard metric is the Diarization Error Rate:
DER = (False Alarm + Missed Speech + Speaker Confusion) / Total Reference Speech Duration
- False Alarm: Non-speech classified as speech.
- Missed Speech: Speech classified as non-speech.
- Speaker Confusion: Speech attributed to the wrong speaker.
A collar tolerance (typically 0.25 s) is commonly applied on each side of every reference boundary: audio inside the collar is excluded from scoring, which forgives minor boundary imprecision.
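To make the formula concrete, the sketch below computes a frame-level DER. It is a simplified illustration under several assumptions: frames have equal duration, `None` marks non-speech, hypothesis labels have already been mapped to reference labels, and neither a collar nor overlapping speech is handled. Standard scoring tools also search for the optimal speaker mapping, which this sketch omits. The function name `frame_der` is hypothetical.

```python
def frame_der(reference, hypothesis):
    """Frame-level DER for two equal-length label sequences.

    Assumptions: one label per frame, None = non-speech, hypothesis labels
    already mapped to reference labels, no collar, no overlapped speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    false_alarm = missed = confusion = total_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            total_speech += 1
            if hyp is None:
                missed += 1          # speech classified as non-speech
            elif hyp != ref:
                confusion += 1       # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # non-speech classified as speech
    if total_speech == 0:
        raise ValueError("reference contains no speech")
    return (false_alarm + missed + confusion) / total_speech


ref = ["A", "A", "A", "A", None, "B", "B", "B", "B", None]
hyp = ["A", "A", "B", None, "A", "B", "B", "B", "B", None]
print(frame_der(ref, hyp))  # 1 confusion + 1 miss + 1 false alarm over 8 speech frames = 0.375
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 1.0 when a system hypothesizes much more speech than the reference contains.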