
Implementation:Speechbrain Speechbrain Compute Embeddings

From Leeroopedia


Property Value
Implementation Name Compute Embeddings
Type API Doc
Repository speechbrain/speechbrain
Source File recipes/VoxCeleb/SpeakerRec/extract_speaker_embeddings.py:L45-67 (single), L70-101 (batch)
Import Recipe-specific
Related Principle Principle:Speechbrain_Speechbrain_Embedding_Extraction

API Signatures

compute_embeddings_single

def compute_embeddings_single(wavs, wav_lens, params):
    """Compute speaker embeddings.

    Arguments
    ---------
    wavs : torch.Tensor
        Waveform tensor of shape (batch, time). Sample rate must be 16000 Hz.
    wav_lens : torch.Tensor
        Relative lengths for each sentence (e.g., [0.8, 0.6, 1.0]).
    params : dict
        Parameter dictionary with model components.

    Returns
    -------
    embeddings : torch.Tensor
        Speaker embeddings of shape (batch, embedding_dim).
    """

compute_embeddings

def compute_embeddings(params, wav_scp, outdir):
    """Compute speaker embeddings for all utterances in a file list.

    Arguments
    ---------
    params : dict
        Parameter dictionary with model components.
    wav_scp : str
        Path to a Kaldi-style wav.scp file (format: "utt_id wav_path").
    outdir : str
        Output directory for storing per-utterance .npy embedding files.
    """

Description

These functions extract speaker embeddings from a trained model for downstream tasks such as speaker verification and clustering. compute_embeddings_single processes a single batch of waveforms and returns embedding tensors. compute_embeddings reads a full file list and saves each embedding as a NumPy file to disk.

Parameters

compute_embeddings_single

Parameter Type Description
wavs torch.Tensor Speech waveform tensor of shape (batch, time). Must be 16 kHz.
wav_lens torch.Tensor Relative length of each waveform in the batch (values between 0 and 1, or absolute sample counts).
params dict Dictionary containing the model components: compute_features, mean_var_norm, embedding_model.

compute_embeddings

Parameter Type Description
params dict Dictionary containing model components (same as above).
wav_scp str Path to a text file where each line contains utterance_id wav_file_path.
outdir str Output directory where .npy embedding files are saved.

Inputs

  • Waveform tensors (for compute_embeddings_single): Raw audio at 16 kHz.
  • wav.scp file (for compute_embeddings): A text file in Kaldi format listing utterance IDs and their corresponding wav file paths.
  • Trained model parameters: The params dict must contain pre-loaded and evaluated model components.
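The wav.scp input described above can be sketched as follows; the utterance IDs and paths are illustrative assumptions, not files from the recipe:

```python
# Sketch: writing and re-reading a minimal Kaldi-style wav.scp file.
# The "utt_id wav_path" format is taken from the source; the utterance
# IDs and paths below are illustrative.
import tempfile

entries = {
    "id10001_utt1": "audio/id10001_utt1.wav",
    "id10002_utt1": "audio/id10002_utt1.wav",
}
with tempfile.NamedTemporaryFile("w", suffix=".scp", delete=False) as f:
    for utt_id, wav_path in entries.items():
        f.write(f"{utt_id} {wav_path}\n")
    scp_path = f.name

# Parse it back the same way compute_embeddings does (line.split()):
with open(scp_path) as wavscp:
    parsed = dict(line.split() for line in wavscp)
print(parsed == entries)  # True
```

Note that `line.split()` assumes neither field contains whitespace, which matches the recipe's parsing.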

Outputs

  • compute_embeddings_single: Returns a torch.Tensor of shape (batch, embedding_dim).
  • compute_embeddings: Writes one .npy file per utterance to the output directory. Each file contains a NumPy array of shape (embedding_dim,).
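A typical consumer of these outputs is cosine-similarity scoring for speaker verification. In this sketch, random vectors stand in for `np.load`-ed embedding files, and the 192-dimension size is an assumption that depends on the trained model:

```python
# Sketch: scoring two speaker embeddings with cosine similarity.
# Random vectors stand in for np.load-ed per-utterance .npy files;
# the 192-dim size is an assumption that depends on the trained model.
import numpy as np

def cosine_score(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_a = rng.standard_normal(192)
emb_b = rng.standard_normal(192)

print(round(cosine_score(emb_a, emb_a), 6))  # 1.0 for identical embeddings
```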

Implementation Details

Single Utterance Pipeline

def compute_embeddings_single(wavs, wav_lens, params):
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
    return embeddings.squeeze(1)

The pipeline runs under torch.no_grad(), since no gradients are needed at inference:

  1. Feature extraction: Computes acoustic features (e.g., Fbank) from raw waveforms.
  2. Normalization: Applies mean-variance normalization to the features.
  3. Embedding model: Passes normalized features through the trained encoder (e.g., ECAPA-TDNN) to produce fixed-dimensional embeddings.
  4. Squeeze: Removes the singleton dimension, returning shape (batch, embedding_dim).
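The four steps above can be traced end to end with stand-in callables. All three "modules" below are illustrative assumptions, not the recipe's actual Fbank / normalization / ECAPA-TDNN components; the point is only to show the tensor shapes at each stage:

```python
# Sketch: the four pipeline steps with stand-in callables, to trace
# tensor shapes. The three "modules" are illustrative assumptions,
# not the recipe's actual feature/normalization/encoder components.
import torch

def fake_features(wavs):
    # (batch, time) -> (batch, frames, 80): 25 ms windows, 10 ms hop at 16 kHz
    return wavs.unfold(1, 400, 160).mean(dim=2, keepdim=True).repeat(1, 1, 80)

def fake_norm(feats, lens):
    # per-utterance mean/variance normalization over the frame axis
    return (feats - feats.mean(1, keepdim=True)) / (feats.std(1, keepdim=True) + 1e-8)

def fake_embedding_model(feats, lens):
    # (batch, frames, 80) -> (batch, 1, 192) via temporal pooling + projection
    return feats.mean(dim=1, keepdim=True) @ torch.ones(80, 192)

params = {
    "compute_features": fake_features,
    "mean_var_norm": fake_norm,
    "embedding_model": fake_embedding_model,
}

wavs = torch.randn(2, 16000)            # two 1-second utterances at 16 kHz
wav_lens = torch.tensor([1.0, 0.8])     # relative lengths

with torch.no_grad():
    feats = params["compute_features"](wavs)
    feats = params["mean_var_norm"](feats, wav_lens)
    embeddings = params["embedding_model"](feats, wav_lens).squeeze(1)

print(embeddings.shape)  # torch.Size([2, 192])
```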

Batch File Processing

def compute_embeddings(params, wav_scp, outdir):
    with torch.no_grad():
        with open(wav_scp, "r") as wavscp:
            for line in wavscp:
                utt, wav_path = line.split()
                out_file = "{}/{}.npy".format(outdir, utt)
                wav, _ = torchaudio.load(wav_path)
                data = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
                lens = torch.Tensor([data.shape[1]])
                data, lens = data.to(device), lens.to(device)
                embedding = compute_embeddings_single(
                    data, lens, params
                ).squeeze()
                out_embedding = embedding.detach().cpu().numpy()
                np.save(out_file, out_embedding)
                del out_embedding, wav, data

For each line in the wav.scp file:

  1. Parses the utterance ID and wav file path.
  2. Loads the audio using torchaudio.load.
  3. Reshapes to (1, time) batch format.
  4. Moves tensors to the target device (GPU/CPU).
  5. Calls compute_embeddings_single to get the embedding.
  6. Saves the embedding as a NumPy .npy file.
  7. Frees memory for the processed utterance.
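The loop above can be sketched with a stub extractor to show the on-disk layout the function produces. The stub, file names, and 192-dim size are illustrative; no audio is actually loaded here:

```python
# Sketch: the batch loop with a stub extractor, showing the per-utterance
# .npy output layout. The stub and file names are illustrative.
import os
import tempfile
import numpy as np

def stub_embedding(wav_path):
    return np.zeros(192, dtype=np.float32)  # stand-in for the real model

outdir = tempfile.mkdtemp()
scp_lines = ["utt1 audio/utt1.wav", "utt2 audio/utt2.wav"]

for line in scp_lines:
    utt, wav_path = line.split()
    np.save(os.path.join(outdir, f"{utt}.npy"), stub_embedding(wav_path))

print(sorted(os.listdir(outdir)))  # ['utt1.npy', 'utt2.npy']
```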

Usage Example

import sys
import os
import torch
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main

# Setup
in_list = "data/wav.scp"       # Kaldi-style file list
out_dir = "embeddings/output"   # Output directory
os.makedirs(out_dir, exist_ok=True)

# Load hyperparameters and pretrained model
params_file, run_opts, overrides = sb.core.parse_arguments(sys.argv[1:])
with open(params_file) as fin:
    params = load_hyperpyyaml(fin, overrides)

# Load pretrained weights
run_on_main(params["pretrainer"].collect_files)
params["pretrainer"].load_collected(run_opts["device"])
params["embedding_model"].eval()
params["embedding_model"].to(run_opts["device"])

# Extract embeddings for all utterances
compute_embeddings(params, in_list, out_dir)

# Or extract a single embedding
import torchaudio
wav, sr = torchaudio.load("test_utterance.wav")  # must be 16 kHz audio
wav = wav.transpose(0, 1).squeeze(1).unsqueeze(0).to(run_opts["device"])
lens = torch.Tensor([wav.shape[1]]).to(run_opts["device"])
embedding = compute_embeddings_single(wav, lens, params)
print(embedding.shape)  # e.g., torch.Size([1, 192])

Command-Line Usage

python extract_speaker_embeddings.py \
    data/wav.scp \
    embeddings/output \
    hparams/verification_ecapa.yaml \
    --device cuda:0

Arguments:

  1. Path to the wav.scp file
  2. Output directory for embeddings
  3. Hyperparameter YAML file
  4. Additional overrides (e.g., device)

See Also

  • Principle:Speechbrain_Speechbrain_Embedding_Extraction