Implementation:Speechbrain Speechbrain EncoderClassifier Encode Batch

From Leeroopedia


Property | Value
Type | Wrapper Doc
Repository | speechbrain/speechbrain
Source File | recipes/LibriTTS/TTS/mstacotron2/compute_speaker_embeddings.py:L1-129, speechbrain/inference/classifiers.py:L26-117
Import | from speechbrain.inference.classifiers import EncoderClassifier
Related Principle | Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation

API Signatures

EncoderClassifier.from_hparams

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="tmpdir_spk_emb",
    run_opts={"device": "cuda:0"},
)

Loads a pretrained speaker encoder from HuggingFace Hub or a local directory.
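
The same call also accepts a local directory as source. A minimal sketch, assuming the directory already contains the model's hyperparams.yaml and checkpoint files (the path below is hypothetical):

from speechbrain.inference.classifiers import EncoderClassifier

# Local source: files are read from this directory instead of being downloaded
classifier = EncoderClassifier.from_hparams(
    source="path/to/local/spkrec-ecapa-voxceleb",
    run_opts={"device": "cpu"},
)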

EncoderClassifier.encode_batch

def encode_batch(self, wavs, wav_lens=None, normalize=False):

Encodes input audio waveforms into fixed-dimensional speaker embedding vectors.

compute_speaker_embeddings

def compute_speaker_embeddings(
    input_filepaths,
    output_file_paths,
    data_folder,
    spk_emb_encoder_path,
    spk_emb_sr,
    mel_spec_params,
    device,
):

Wrapper function that processes JSON manifest files and computes speaker embeddings for every utterance, saving results to pickle files.

Description

This implementation provides two layers of functionality:

  1. Low-level API (EncoderClassifier.encode_batch): Takes raw waveform tensors and returns speaker embedding vectors through a pipeline of feature extraction, normalization, and ECAPA-TDNN encoding.
  2. High-level wrapper (compute_speaker_embeddings): Orchestrates bulk embedding computation for an entire dataset by iterating over JSON manifests, loading audio files, calling encode_batch, and saving the results.

Parameters

encode_batch Parameters

Parameter | Type | Default | Description
wavs | torch.Tensor | required | Batch of waveforms with shape [batch, time] or [batch, time, channels]. Expected sample rate: 16000 Hz
wav_lens | torch.Tensor | None | Relative lengths of waveforms in the batch (values between 0 and 1). If None, all set to 1.0
normalize | bool | False | If True, normalizes embeddings using stored mean-variance statistics

compute_speaker_embeddings Parameters

Parameter | Type | Default | Description
input_filepaths | list | required | List of paths to JSON manifest files (e.g., train.json, valid.json, test.json)
output_file_paths | list | required | List of output pickle file paths, one per input manifest
data_folder | str | required | Root path to the LibriTTS data folder (used to resolve {data_root} placeholders)
spk_emb_encoder_path | str | required | HuggingFace model ID or local path for the speaker encoder (e.g., "speechbrain/spkrec-ecapa-voxceleb")
spk_emb_sr | int | required | Sample rate expected by the speaker encoder (typically 16000)
mel_spec_params | dict | required | Dictionary with mel-spectrogram parameters and custom_mel_spec_encoder flag
device | str | required | Compute device (e.g., "cuda:0" or "cpu")

Returns

encode_batch

Returns a torch.Tensor of shape [batch, 1, 192] containing the speaker embedding for each input waveform.

compute_speaker_embeddings

Returns None. Writes pickle files to the specified output_file_paths.
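
Each pickle holds a plain dictionary mapping utterance IDs to CPU tensors of shape [192]. A minimal sketch for reading one of the output files back (the path mirrors the bulk example below and is illustrative):

import pickle

# Load precomputed embeddings: {utt_id: 192-dim torch.Tensor on CPU}
with open("results/save/train_speaker_embeddings.pickle", "rb") as f:
    speaker_embeddings = pickle.load(f)

utt_id, emb = next(iter(speaker_embeddings.items()))
print(utt_id, emb.shape)  # e.g. torch.Size([192])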

Internal Pipeline

The encode_batch method implements a three-stage pipeline:

# Stage 1: Compute filterbank features
feats = self.mods.compute_features(wavs)

# Stage 2: Mean-variance normalization
feats = self.mods.mean_var_norm(feats, wav_lens)

# Stage 3: Extract embeddings via ECAPA-TDNN
embeddings = self.mods.embedding_model(feats, wav_lens)

The required modules are declared in the MODULES_NEEDED class attribute:

MODULES_NEEDED = [
    "compute_features",
    "mean_var_norm",
    "embedding_model",
    "classifier",
]
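
For inspection or debugging, the same three stages can be run by hand on a loaded classifier. A minimal sketch, assuming the default CPU device and dummy 16 kHz waveforms (placeholder input, not real speech); the result should match encode_batch without normalization:

import torch
from speechbrain.inference.classifiers import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Two dummy 3-second waveforms at 16 kHz; full relative lengths
wavs = torch.randn(2, 48000)
wav_lens = torch.ones(2)

# Stage 1: filterbank features
feats = classifier.mods.compute_features(wavs)
# Stage 2: mean-variance normalization
feats = classifier.mods.mean_var_norm(feats, wav_lens)
# Stage 3: ECAPA-TDNN embedding extraction
embeddings = classifier.mods.embedding_model(feats, wav_lens)
print(embeddings.shape)  # torch.Size([2, 1, 192])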

Usage Examples

Direct Embedding Extraction

import torchaudio
from speechbrain.inference.classifiers import EncoderClassifier

# Load pretrained ECAPA-TDNN speaker encoder
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
    run_opts={"device": "cuda:0"},
)

# Load an audio file
signal, fs = torchaudio.load("utterance.wav")

# Extract speaker embedding (192-dim)
embedding = classifier.encode_batch(signal)
print(embedding.shape)  # torch.Size([1, 1, 192])

# Squeeze to get a flat vector
embedding = embedding.squeeze()
print(embedding.shape)  # torch.Size([192])

Batch Embedding Extraction with Length Handling

import torch
import torchaudio
from speechbrain.inference.classifiers import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Load multiple files and pad to same length
signals = []
lengths = []
for wav_path in ["utt1.wav", "utt2.wav", "utt3.wav"]:
    sig, sr = torchaudio.load(wav_path)
    signals.append(sig.squeeze())
    lengths.append(sig.shape[1])

max_len = max(lengths)
padded = torch.zeros(len(signals), max_len)
for i, sig in enumerate(signals):
    padded[i, :sig.shape[0]] = sig

wav_lens = torch.tensor([l / max_len for l in lengths])

# Extract batch of embeddings
embeddings = classifier.encode_batch(padded, wav_lens)
print(embeddings.shape)  # torch.Size([3, 1, 192])

Bulk Dataset Embedding Computation

from compute_speaker_embeddings import compute_speaker_embeddings

compute_speaker_embeddings(
    input_filepaths=[
        "results/save/train.json",
        "results/save/valid.json",
        "results/save/test.json",
    ],
    output_file_paths=[
        "results/save/train_speaker_embeddings.pickle",
        "results/save/valid_speaker_embeddings.pickle",
        "results/save/test_speaker_embeddings.pickle",
    ],
    data_folder="/data/LibriTTS",
    spk_emb_encoder_path="speechbrain/spkrec-ecapa-voxceleb",
    spk_emb_sr=16000,
    mel_spec_params={
        "custom_mel_spec_encoder": False,
        "sample_rate": 16000,
        "hop_length": 256,
        "win_length": 1024,
        "n_mel_channels": 80,
        "n_fft": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "mel_normalized": False,
        "power": 1,
        "norm": "slaney",
        "mel_scale": "slaney",
        "dynamic_range_compression": True,
    },
    device="cuda:0",
)

Integration in Training Recipe

import speechbrain as sb
from compute_speaker_embeddings import compute_speaker_embeddings

sb.utils.distributed.run_on_main(
    compute_speaker_embeddings,
    kwargs={
        "input_filepaths": [
            hparams["train_json"],
            hparams["valid_json"],
            hparams["test_json"],
        ],
        "output_file_paths": [
            hparams["train_speaker_embeddings_pickle"],
            hparams["valid_speaker_embeddings_pickle"],
            hparams["test_speaker_embeddings_pickle"],
        ],
        "data_folder": hparams["data_folder"],
        "spk_emb_encoder_path": hparams["spk_emb_encoder"],
        "spk_emb_sr": hparams["spk_emb_sample_rate"],
        "mel_spec_params": {
            "custom_mel_spec_encoder": hparams["custom_mel_spec_encoder"],
            ...
        },
        "device": run_opts["device"],
    },
)

Wrapper Implementation Detail

The compute_speaker_embeddings function processes each utterance individually:

for utt_id, utt_data in tqdm(json_data.items()):
    utt_wav_path = utt_data["wav"]
    utt_wav_path = utt_wav_path.replace("{data_root}", data_folder)

    # Load and resample if needed
    signal, sig_sr = torchaudio.load(utt_wav_path)
    if sig_sr != spk_emb_sr:
        signal = torchaudio.functional.resample(signal, sig_sr, spk_emb_sr)
    signal = signal.to(device)

    # Compute embedding
    spk_emb = spk_emb_encoder.encode_batch(signal)
    spk_emb = spk_emb.squeeze().detach()
    speaker_embeddings[utt_id] = spk_emb.cpu()

# Save to pickle
with open(output_file_path, "wb") as output_file:
    pickle.dump(speaker_embeddings, output_file, protocol=pickle.HIGHEST_PROTOCOL)

Key implementation details:

  • Audio is resampled to the speaker encoder's expected sample rate if they differ
  • Embeddings are detached from the computation graph and moved to CPU before storage
  • The {data_root} placeholder in wav paths is replaced with the actual data folder path (see the illustrative manifest entry sketched below)
  • Results are serialized using Python's pickle with the highest protocol for efficiency
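
For reference, the loop above expects each JSON manifest to map an utterance ID to a record whose wav field may contain the {data_root} placeholder. The entry below is a hypothetical sketch (the ID, path, and any additional fields produced by data preparation are assumptions):

# Illustrative shape of json_data after json.load (real manifests carry more fields)
json_data = {
    "1034_121119_000001_000001": {
        "wav": "{data_root}/train-clean-100/1034/121119/1034_121119_000001_000001.wav",
    },
}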

Idempotency

The skip() function checks if all output pickle files already exist. If so, embedding computation is skipped entirely:

def skip(filepaths):
    for filepath in filepaths:
        if not os.path.isfile(filepath):
            return False
    return True
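
A sketch of how this guard can be wired up (the helper name and logging below are hypothetical; the recipe's exact wording differs):

def maybe_compute_speaker_embeddings(**kwargs):
    # Hypothetical helper: recompute only when some output pickle is missing
    if skip(kwargs["output_file_paths"]):
        print("Speaker embeddings already computed, skipping.")
        return
    compute_speaker_embeddings(**kwargs)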

Alternative Encoder

When mel_spec_params["custom_mel_spec_encoder"] is True, the MelSpectrogramEncoder is used instead of EncoderClassifier:

if mel_spec_params["custom_mel_spec_encoder"]:
    spk_emb_encoder = MelSpectrogramEncoder.from_hparams(
        source=spk_emb_encoder_path, run_opts={"device": device}
    )
    spk_emb = spk_emb_encoder.encode_waveform(signal)
else:
    spk_emb_encoder = EncoderClassifier.from_hparams(
        source=spk_emb_encoder_path, run_opts={"device": device}
    )
    spk_emb = spk_emb_encoder.encode_batch(signal)

See Also

  • Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation