Principle:Speechbrain Speechbrain Speaker Diarization Pipeline

From Leeroopedia


Knowledge Sources
Domains Speaker_Diarization, Speaker_Recognition
Last Updated 2026-02-09 00:00 GMT

Overview

Speaker diarization segments an audio recording by speaker identity, answering the question "who spoke when" by combining voice activity detection, speaker embedding extraction, and spectral clustering.

Description

Speaker diarization is the task of partitioning a multi-speaker audio stream into homogeneous segments, each attributed to a single speaker. The problem is challenging because the number of speakers is often unknown in advance, speakers may overlap, and turn-taking patterns vary widely across domains (meetings, broadcast, telephone). The embedding-based diarization pipeline addresses this by first detecting speech regions, then extracting fixed-dimensional speaker embeddings from short uniform segments, and finally grouping these embeddings into speaker clusters using spectral clustering.

Usage

Apply this pipeline when you need to determine speaker identities and turn boundaries in multi-speaker recordings such as meetings, interviews, or broadcast content. This approach assumes oracle or predicted voice activity detection boundaries and works best when speakers produce enough speech for reliable embedding extraction.

Theoretical Basis

Pipeline Architecture

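The diarization pipeline consists of three sequential stages (voice activity detection, speaker embedding extraction, and spectral clustering), which expand into the following processing steps: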

Audio Recording
  -> Voice Activity Detection (VAD)
    -> Uniform Segmentation (e.g., 1.5s windows with 0.75s shift)
      -> Speaker Embedding Extraction (ECAPA-TDNN or x-vector)
        -> Affinity Matrix Construction (cosine similarity)
          -> Spectral Clustering
            -> Speaker Labels per Segment

Stage 1: Voice Activity Detection

VAD identifies the speech regions in the recording, discarding silence and non-speech noise. In the oracle setting, ground-truth speech boundaries from reference annotations are used. In practical systems, a neural VAD or energy-based detector provides these boundaries. The VAD output is a set of time intervals marking speech activity.
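As an illustration of the energy-based option mentioned above, the following is a minimal sketch of a frame-level energy detector. The function name `energy_vad`, the 25 ms frame, the 10 ms hop, and the -35 dB threshold are all assumptions chosen for the example; they are not taken from SpeechBrain, and a practical system would typically add smoothing or use a neural VAD instead.

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_s=0.025, hop_s=0.010, threshold_db=-35.0):
    """Return (start, end) speech intervals in seconds from frame log-energy.

    Hypothetical helper: all parameter defaults are illustrative assumptions.
    """
    frame = int(round(frame_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    intervals = []
    start = None  # start time of the speech run currently being tracked
    for i in range(n_frames):
        chunk = signal[i * hop : i * hop + frame]
        # Small constant avoids log(0) on digital silence
        energy_db = 10.0 * np.log10(np.mean(chunk ** 2) + 1e-12)
        is_speech = energy_db > threshold_db
        if is_speech and start is None:
            start = i * hop / sample_rate
        elif not is_speech and start is not None:
            intervals.append((start, i * hop / sample_rate))
            start = None
    if start is not None:
        # Recording ended while speech was still active
        intervals.append((start, len(signal) / sample_rate))
    return intervals
```

The returned list of intervals has the same shape as the VAD output described above: a set of time intervals marking speech activity.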

Stage 2: Speaker Embedding Extraction

Each speech region is divided into fixed-length subsegments (typically 1.5-second windows shifted by 0.75 seconds, so that consecutive windows overlap by 0.75 seconds). For each subsegment, a pretrained speaker embedding model (such as ECAPA-TDNN) extracts a fixed-dimensional vector:

For each subsegment s_i:
  features = compute_features(s_i)        # Fbank features
  features = mean_var_norm(features)       # Instance normalization
  emb_i = embedding_model(features)        # e.g., 192-dim vector
  emb_i = mean_var_norm_emb(emb_i)         # Embedding normalization

The embeddings encode speaker identity information and are designed to be similar for segments from the same speaker and dissimilar for different speakers.
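The uniform subsegmentation that precedes embedding extraction can be sketched as follows. The function name `uniform_subsegments` is hypothetical, and keeping a shorter final subsegment for any leftover tail is an assumption of this sketch; an implementation may instead drop or merge short tails.

```python
def uniform_subsegments(start, end, window=1.5, shift=0.75):
    """Split one speech region [start, end] into overlapping windows (seconds).

    Hypothetical helper: tail handling is an assumption of this sketch.
    """
    subsegments = []
    i = 0
    # Index-based stepping avoids accumulating floating-point drift
    while start + i * shift + window <= end:
        seg_start = start + i * shift
        subsegments.append((seg_start, seg_start + window))
        i += 1
    # Keep a leftover tail (or a region shorter than one window) as a shorter subsegment
    if not subsegments or subsegments[-1][1] < end:
        subsegments.append((start + i * shift, end))
    return subsegments
```

For example, a 3-second speech region yields three full windows starting at 0.0, 0.75, and 1.5 seconds, while a 2-second region yields one full window plus a shorter tail from 0.75 to 2.0 seconds.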

Stage 3: Spectral Clustering

Given N embeddings, spectral clustering determines the speaker assignment:

1. Compute affinity matrix: A[i,j] = cosine_similarity(emb_i, emb_j)
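2. Apply p-percentile thresholding to prune weak connections (in each row of A, entries below that row's p-th percentile are set to zero)
3. Compute the graph Laplacian: L = D - A, where D is the diagonal degree matrix with D[i,i] = sum_j A[i,j]
4. Extract the eigenvectors of L that correspond to the k smallest eigenvalues
5. Apply k-means on the rows of the resulting N x k eigenvector matrix to obtain cluster assignments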

The number of speakers k can be specified in advance or estimated automatically using eigenvalue analysis (the eigengap heuristic): with the eigenvalues of the Laplacian sorted in ascending order, k is chosen so that the gap between the k-th and (k+1)-th eigenvalues is the largest.
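The clustering steps and the eigengap estimate can be sketched with NumPy as below. This is an illustrative sketch, not SpeechBrain's implementation: every function name, the `keep_fraction` and `max_speakers` parameters, the symmetrization and clipping of the pruned affinity, and the small deterministic k-means are assumptions made to keep the example self-contained.

```python
import numpy as np

def cosine_affinity(embeddings):
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def prune_affinity(affinity, keep_fraction=0.2):
    """Zero all but the largest `keep_fraction` of entries in each row."""
    n_drop = int((1.0 - keep_fraction) * affinity.shape[0])
    pruned = affinity.copy()
    for row in pruned:  # each row is a view, so assignment edits `pruned`
        row[np.argsort(row)[:n_drop]] = 0.0
    # Assumption: symmetrize and clip negatives to get a valid undirected graph
    return np.maximum(0.5 * (pruned + pruned.T), 0.0)

def laplacian(affinity):
    a = affinity.copy()
    np.fill_diagonal(a, 0.0)  # ignore self-similarity
    return np.diag(a.sum(axis=1)) - a  # L = D - A

def estimate_num_speakers(eigenvalues, max_speakers=10):
    """Eigengap heuristic on eigenvalues sorted in ascending order."""
    gaps = np.diff(eigenvalues[: max_speakers + 1])
    return int(np.argmax(gaps)) + 1

def kmeans(points, k, n_iter=50):
    # Farthest-point initialization keeps this sketch deterministic
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_cluster(embeddings, num_speakers=None, keep_fraction=0.2, max_speakers=10):
    affinity = prune_affinity(cosine_affinity(embeddings), keep_fraction)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian(affinity))
    k = num_speakers or estimate_num_speakers(eigenvalues, max_speakers)
    return kmeans(eigenvectors[:, :k], k)
```

Calling `spectral_cluster(embeddings)` with an (N, d) array returns one integer label per embedding; passing `num_speakers` skips the eigengap estimate. Note that overly aggressive pruning can split one speaker's graph into several disconnected components, which inflates the estimated speaker count.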

Evaluation: Diarization Error Rate (DER)

The standard metric is the Diarization Error Rate:

DER = (False Alarm + Missed Speech + Speaker Confusion) / Total Speech Duration
  • False Alarm: Non-speech classified as speech.
  • Missed Speech: Speech classified as non-speech.
  • Speaker Confusion: Speech attributed to the wrong speaker.

A collar tolerance (typically 0.25s) on each side of every reference boundary is commonly excluded from scoring to forgive minor boundary imprecision.
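To make the three error terms concrete, here is a simplified frame-level sketch of the DER computation. It assumes at most one speaker per frame, ignores the collar, and finds the best mapping between hypothesis and reference speakers by brute force, which is only practical for a handful of speakers. The function name `frame_der` and the label encoding (a string per frame, or `None` for non-speech) are assumptions of this example; standard scoring tools operate on timed segments rather than frames.

```python
from itertools import permutations

def frame_der(reference, hypothesis):
    """Frame-level DER; each input holds one speaker label (or None) per frame.

    Hypothetical helper: no collar, no overlapping speech, and the reference
    is assumed to contain at least one speech frame.
    """
    pairs = list(zip(reference, hypothesis))
    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})

    total_speech = sum(r is not None for r in reference)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    missed = sum(r is not None and h is None for r, h in pairs)

    # Frames where both reference and hypothesis contain speech
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    # Pad with None so a hypothesis speaker may stay unmapped
    targets = ref_speakers + [None] * len(hyp_speakers)
    best_correct = 0
    for perm in permutations(targets, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (false_alarm + missed + confusion) / total_speech
```

Because the speaker mapping is optimized before counting confusion, the arbitrary cluster indices produced by the clustering stage do not need to match the reference speaker names.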
