Implementation: Speechbrain Compute Embeddings
| Property | Value |
|---|---|
| Implementation Name | Compute Embeddings |
| Type | API Doc |
| Repository | speechbrain/speechbrain |
| Source File | recipes/VoxCeleb/SpeakerRec/extract_speaker_embeddings.py:L45-67 (single), L70-101 (batch) |
| Import | Recipe-specific |
| Related Principle | Principle:Speechbrain_Speechbrain_Embedding_Extraction |
API Signatures
compute_embeddings_single

```python
def compute_embeddings_single(wavs, wav_lens, params):
    """Compute speaker embeddings.

    Arguments
    ---------
    wavs : torch.Tensor
        Waveform tensor of shape (batch, time). Sample rate must be 16000 Hz.
    wav_lens : torch.Tensor
        Relative lengths for each sentence (e.g., [0.8, 0.6, 1.0]).
    params : dict
        Parameter dictionary with model components.

    Returns
    -------
    embeddings : torch.Tensor
        Speaker embeddings of shape (batch, embedding_dim).
    """
```
compute_embeddings
```python
def compute_embeddings(params, wav_scp, outdir):
    """Compute speaker embeddings for all utterances in a file list.

    Arguments
    ---------
    params : dict
        Parameter dictionary with model components.
    wav_scp : str
        Path to a Kaldi-style wav.scp file (format: "utt_id wav_path").
    outdir : str
        Output directory for storing per-utterance .npy embedding files.
    """
```
Description
These functions extract speaker embeddings from a trained model for downstream tasks such as speaker verification and clustering. `compute_embeddings_single` processes a single batch of waveforms and returns an embedding tensor; `compute_embeddings` reads a full file list and saves each utterance's embedding to disk as a NumPy file.
Parameters
compute_embeddings_single
| Parameter | Type | Description |
|---|---|---|
| wavs | torch.Tensor | Speech waveform tensor of shape (batch, time). Must be 16 kHz. |
| wav_lens | torch.Tensor | Relative length of each waveform in the batch (values between 0 and 1, or absolute sample counts). |
| params | dict | Dictionary containing the model components: compute_features, mean_var_norm, embedding_model. |
compute_embeddings
| Parameter | Type | Description |
|---|---|---|
| params | dict | Dictionary containing model components (same as above). |
| wav_scp | str | Path to a text file where each line contains utterance_id wav_file_path. |
| outdir | str | Output directory where .npy embedding files are saved. |
Inputs
- Waveform tensors (for `compute_embeddings_single`): raw audio at 16 kHz.
- wav.scp file (for `compute_embeddings`): a text file in Kaldi format listing utterance IDs and their corresponding wav file paths.
- Trained model parameters: the `params` dict must contain pre-loaded and evaluated model components.
Outputs
- `compute_embeddings_single`: returns a `torch.Tensor` of shape `(batch, embedding_dim)`.
- `compute_embeddings`: writes one `.npy` file per utterance to the output directory. Each file contains a NumPy array of shape `(embedding_dim,)`.
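The on-disk output format can be sketched with plain NumPy; the 192-dimensional embedding and the `utt001` ID below are illustrative, not mandated by the recipe:

```python
import os
import tempfile
import numpy as np

# One .npy file per utterance, each holding a 1-D array of shape
# (embedding_dim,). The dimension 192 here is only an example.
outdir = tempfile.mkdtemp()
embedding = np.random.randn(192).astype(np.float32)
out_file = os.path.join(outdir, "utt001.npy")
np.save(out_file, embedding)

loaded = np.load(out_file)
print(loaded.shape)  # (192,)
```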
Implementation Details
Single Utterance Pipeline
```python
def compute_embeddings_single(wavs, wav_lens, params):
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
        return embeddings.squeeze(1)
```
The pipeline under `torch.no_grad()`:
- Feature extraction: computes acoustic features (e.g., Fbank) from raw waveforms.
- Normalization: applies mean-variance normalization to the features.
- Embedding model: passes normalized features through the trained encoder (e.g., ECAPA-TDNN) to produce fixed-dimensional embeddings.
- Squeeze: removes the singleton dimension, returning shape `(batch, embedding_dim)`.
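The pipeline's shape contract can be exercised end to end with stand-in components. In the real recipe, `params` holds Fbank, InputNormalization, and a trained ECAPA-TDNN; the fakes below only mimic the shapes and call signatures (80 feature bins and 192-dim embeddings are illustrative choices, not requirements):

```python
import torch

def compute_embeddings_single(wavs, wav_lens, params):
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
        return embeddings.squeeze(1)

params = {
    # (batch, time) -> (batch, frames, 80): 400-sample window, 160-sample hop
    "compute_features": lambda wavs: wavs.unfold(1, 400, 160)
                                         .mean(-1, keepdim=True)
                                         .expand(-1, -1, 80),
    # identity stand-in for mean-variance normalization
    "mean_var_norm": lambda feats, lens: feats,
    # encoder must return (batch, 1, embedding_dim) for .squeeze(1) to work
    "embedding_model": lambda feats, lens: feats.mean(1, keepdim=True)
                                                .repeat(1, 1, 3)[:, :, :192],
}

wavs = torch.randn(2, 16000)        # two 1-second clips at 16 kHz
wav_lens = torch.tensor([1.0, 0.8])
emb = compute_embeddings_single(wavs, wav_lens, params)
print(emb.shape)  # torch.Size([2, 192])
```

The key contract to note: the encoder output carries a singleton time dimension, which `squeeze(1)` removes to yield `(batch, embedding_dim)`.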
Batch File Processing
```python
def compute_embeddings(params, wav_scp, outdir):
    with torch.no_grad():
        with open(wav_scp, "r") as wavscp:
            for line in wavscp:
                utt, wav_path = line.split()
                out_file = "{}/{}.npy".format(outdir, utt)
                wav, _ = torchaudio.load(wav_path)
                data = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
                lens = torch.Tensor([data.shape[1]])
                data, lens = data.to(device), lens.to(device)
                embedding = compute_embeddings_single(
                    data, lens, params
                ).squeeze()
                out_embedding = embedding.detach().cpu().numpy()
                np.save(out_file, out_embedding)
                del out_embedding, wav, data
```
For each line in the wav.scp file, the function:
- Parses the utterance ID and wav file path.
- Loads the audio using `torchaudio.load`.
- Reshapes to `(1, time)` batch format.
- Moves tensors to the target device (GPU/CPU).
- Calls `compute_embeddings_single` to get the embedding.
- Saves the embedding as a NumPy `.npy` file.
- Frees memory for the processed utterance.
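The first step, parsing the Kaldi-style file list, can be sketched with plain Python; the utterance IDs and file paths below are made up for illustration:

```python
import os
import tempfile

# Sketch of the wav.scp format the batch function consumes:
# one "utt_id wav_path" pair per line, whitespace-separated.
tmpdir = tempfile.mkdtemp()
scp_path = os.path.join(tmpdir, "wav.scp")
with open(scp_path, "w") as f:
    f.write("spk1-utt001 /data/audio/spk1/utt001.wav\n")  # hypothetical paths
    f.write("spk2-utt007 /data/audio/spk2/utt007.wav\n")

with open(scp_path) as wavscp:
    pairs = [line.split() for line in wavscp]
print(pairs[0])  # ['spk1-utt001', '/data/audio/spk1/utt001.wav']
```

Note that `line.split()` tolerates trailing newlines and repeated spaces but will fail on wav paths that themselves contain spaces.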
Usage Example
```python
import sys
import os
import torch
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main

# Setup
in_list = "data/wav.scp"       # Kaldi-style file list
out_dir = "embeddings/output"  # Output directory
os.makedirs(out_dir, exist_ok=True)

# Load hyperparameters and pretrained model
params_file, run_opts, overrides = sb.core.parse_arguments(sys.argv[1:])
with open(params_file) as fin:
    params = load_hyperpyyaml(fin, overrides)

# Load pretrained weights
run_on_main(params["pretrainer"].collect_files)
params["pretrainer"].load_collected(run_opts["device"])
params["embedding_model"].eval()
params["embedding_model"].to(run_opts["device"])

# Extract embeddings for all utterances
compute_embeddings(params, in_list, out_dir)

# Or extract a single embedding
import torchaudio
wav, sr = torchaudio.load("test_utterance.wav")
wav = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
lens = torch.Tensor([wav.shape[1]])
embedding = compute_embeddings_single(wav, lens, params)
print(embedding.shape)  # e.g., torch.Size([1, 192])
```
Command-Line Usage
```shell
python extract_speaker_embeddings.py \
    data/wav.scp \
    embeddings/output \
    hparams/verification_ecapa.yaml \
    --device cuda:0
```
Arguments:
- Path to the wav.scp file
- Output directory for embeddings
- Hyperparameter YAML file
- Additional overrides (e.g., device)