Implementation: Speechbrain Compute Embeddings
| Property | Value |
|---|---|
| Implementation Name | Compute Embeddings |
| Type | API Doc |
| Repository | speechbrain/speechbrain |
| Source File | recipes/VoxCeleb/SpeakerRec/extract_speaker_embeddings.py:L45-67 (single), L70-101 (batch) |
| Import | Recipe-specific |
| Related Principle | Principle:Speechbrain_Speechbrain_Embedding_Extraction |
API Signatures
compute_embeddings_single

```python
def compute_embeddings_single(wavs, wav_lens, params):
    """Compute speaker embeddings.

    Arguments
    ---------
    wavs : torch.Tensor
        Waveform tensor of shape (batch, time). Sample rate must be 16000 Hz.
    wav_lens : torch.Tensor
        Relative lengths for each sentence (e.g., [0.8, 0.6, 1.0]).
    params : dict
        Parameter dictionary with model components.

    Returns
    -------
    embeddings : torch.Tensor
        Speaker embeddings of shape (batch, embedding_dim).
    """
```
compute_embeddings
```python
def compute_embeddings(params, wav_scp, outdir):
    """Compute speaker embeddings for all utterances in a file list.

    Arguments
    ---------
    params : dict
        Parameter dictionary with model components.
    wav_scp : str
        Path to a Kaldi-style wav.scp file (format: "utt_id wav_path").
    outdir : str
        Output directory for storing per-utterance .npy embedding files.
    """
```
Description
These functions extract speaker embeddings from a trained model for downstream tasks such as speaker verification and clustering. `compute_embeddings_single` processes a single batch of waveforms and returns an embedding tensor; `compute_embeddings` reads a full file list and saves each utterance's embedding to disk as a NumPy file.
Parameters
compute_embeddings_single
| Parameter | Type | Description |
|---|---|---|
| wavs | torch.Tensor | Speech waveform tensor of shape (batch, time). Must be 16 kHz. |
| wav_lens | torch.Tensor | Relative length of each waveform in the batch (values between 0 and 1, or absolute sample counts). |
| params | dict | Dictionary containing the model components: compute_features, mean_var_norm, embedding_model. |
compute_embeddings
| Parameter | Type | Description |
|---|---|---|
| params | dict | Dictionary containing model components (same as above). |
| wav_scp | str | Path to a text file where each line contains utterance_id wav_file_path. |
| outdir | str | Output directory where .npy embedding files are saved. |
Inputs
- Waveform tensors (for `compute_embeddings_single`): raw audio at 16 kHz.
- wav.scp file (for `compute_embeddings`): a text file in Kaldi format listing utterance IDs and their corresponding wav file paths.
- Trained model parameters: the `params` dict must contain pre-loaded and evaluated model components.
Outputs
- `compute_embeddings_single`: returns a `torch.Tensor` of shape `(batch, embedding_dim)`.
- `compute_embeddings`: writes one `.npy` file per utterance to the output directory. Each file contains a NumPy array of shape `(embedding_dim,)`.
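The on-disk output format can be sketched with plain NumPy; the 192-dimensional embedding and the `utt001` ID below are illustrative, not mandated by the recipe:

```python
import os
import tempfile
import numpy as np

# One .npy file per utterance, each holding a 1-D array of shape
# (embedding_dim,). The dimension 192 here is only an example.
outdir = tempfile.mkdtemp()
embedding = np.random.randn(192).astype(np.float32)
out_file = os.path.join(outdir, "utt001.npy")
np.save(out_file, embedding)

loaded = np.load(out_file)
print(loaded.shape)  # (192,)
```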
Implementation Details
Single Utterance Pipeline
```python
def compute_embeddings_single(wavs, wav_lens, params):
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
        return embeddings.squeeze(1)
```
The pipeline under `torch.no_grad()`:
- Feature extraction: computes acoustic features (e.g., Fbank) from raw waveforms.
- Normalization: applies mean-variance normalization to the features.
- Embedding model: passes normalized features through the trained encoder (e.g., ECAPA-TDNN) to produce fixed-dimensional embeddings.
- Squeeze: removes the singleton dimension, returning shape `(batch, embedding_dim)`.
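The pipeline's shape contract can be exercised end to end with stand-in components. In the real recipe, `params` holds Fbank, InputNormalization, and a trained ECAPA-TDNN; the fakes below only mimic the shapes and call signatures (80 feature bins and 192-dim embeddings are illustrative choices, not requirements):

```python
import torch

def compute_embeddings_single(wavs, wav_lens, params):
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
        return embeddings.squeeze(1)

params = {
    # (batch, time) -> (batch, frames, 80): 400-sample window, 160-sample hop
    "compute_features": lambda wavs: wavs.unfold(1, 400, 160)
                                         .mean(-1, keepdim=True)
                                         .expand(-1, -1, 80),
    # identity stand-in for mean-variance normalization
    "mean_var_norm": lambda feats, lens: feats,
    # encoder must return (batch, 1, embedding_dim) for .squeeze(1) to work
    "embedding_model": lambda feats, lens: feats.mean(1, keepdim=True)
                                                .repeat(1, 1, 3)[:, :, :192],
}

wavs = torch.randn(2, 16000)        # two 1-second clips at 16 kHz
wav_lens = torch.tensor([1.0, 0.8])
emb = compute_embeddings_single(wavs, wav_lens, params)
print(emb.shape)  # torch.Size([2, 192])
```

The key contract to note: the encoder output carries a singleton time dimension, which `squeeze(1)` removes to yield `(batch, embedding_dim)`.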
Batch File Processing
```python
def compute_embeddings(params, wav_scp, outdir):
    with torch.no_grad():
        with open(wav_scp, "r") as wavscp:
            for line in wavscp:
                utt, wav_path = line.split()
                out_file = "{}/{}.npy".format(outdir, utt)
                wav, _ = torchaudio.load(wav_path)
                data = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
                lens = torch.Tensor([data.shape[1]])
                data, lens = data.to(device), lens.to(device)
                embedding = compute_embeddings_single(
                    data, lens, params
                ).squeeze()
                out_embedding = embedding.detach().cpu().numpy()
                np.save(out_file, out_embedding)
                del out_embedding, wav, data
```
For each line in the wav.scp file, the function:
- Parses the utterance ID and wav file path.
- Loads the audio using `torchaudio.load`.
- Reshapes to `(1, time)` batch format.
- Moves tensors to the target device (GPU/CPU).
- Calls `compute_embeddings_single` to get the embedding.
- Saves the embedding as a NumPy `.npy` file.
- Frees memory for the processed utterance.
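The first step, parsing the Kaldi-style file list, can be sketched with plain Python; the utterance IDs and file paths below are made up for illustration:

```python
import os
import tempfile

# Sketch of the wav.scp format the batch function consumes:
# one "utt_id wav_path" pair per line, whitespace-separated.
tmpdir = tempfile.mkdtemp()
scp_path = os.path.join(tmpdir, "wav.scp")
with open(scp_path, "w") as f:
    f.write("spk1-utt001 /data/audio/spk1/utt001.wav\n")  # hypothetical paths
    f.write("spk2-utt007 /data/audio/spk2/utt007.wav\n")

with open(scp_path) as wavscp:
    pairs = [line.split() for line in wavscp]
print(pairs[0])  # ['spk1-utt001', '/data/audio/spk1/utt001.wav']
```

Note that `line.split()` tolerates trailing newlines and repeated spaces but will fail on wav paths that themselves contain spaces.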
Usage Example
```python
import sys
import os
import torch
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main

# Setup
in_list = "data/wav.scp"       # Kaldi-style file list
out_dir = "embeddings/output"  # Output directory
os.makedirs(out_dir, exist_ok=True)

# Load hyperparameters and pretrained model
params_file, run_opts, overrides = sb.core.parse_arguments(sys.argv[1:])
with open(params_file) as fin:
    params = load_hyperpyyaml(fin, overrides)

# Load pretrained weights
run_on_main(params["pretrainer"].collect_files)
params["pretrainer"].load_collected(run_opts["device"])
params["embedding_model"].eval()
params["embedding_model"].to(run_opts["device"])

# Extract embeddings for all utterances
compute_embeddings(params, in_list, out_dir)

# Or extract a single embedding
import torchaudio
wav, sr = torchaudio.load("test_utterance.wav")
wav = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
lens = torch.Tensor([wav.shape[1]])
embedding = compute_embeddings_single(wav, lens, params)
print(embedding.shape)  # e.g., torch.Size([1, 192])
```
Command-Line Usage
```shell
python extract_speaker_embeddings.py \
    data/wav.scp \
    embeddings/output \
    hparams/verification_ecapa.yaml \
    --device cuda:0
```
Arguments:
- Path to the wav.scp file
- Output directory for embeddings
- Hyperparameter YAML file
- Additional overrides (e.g., device)