Implementation:Speechbrain Speechbrain EncoderClassifier Encode Batch

From Leeroopedia


Property | Value
Type | Wrapper Doc
Repository | speechbrain/speechbrain
Source File | recipes/LibriTTS/TTS/mstacotron2/compute_speaker_embeddings.py:L1-129, speechbrain/inference/classifiers.py:L26-117
Import | from speechbrain.inference.classifiers import EncoderClassifier
Related Principle | Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation

API Signatures

EncoderClassifier.from_hparams

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="tmpdir_spk_emb",
    run_opts={"device": "cuda:0"},
)

Loads a pretrained speaker encoder from HuggingFace Hub or a local directory.
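
The same call also accepts a local directory as source. A minimal sketch, assuming the directory already contains the model's hyperparams.yaml and checkpoint files (the path below is hypothetical):

from speechbrain.inference.classifiers import EncoderClassifier

# Local source: files are read from this directory instead of being downloaded
classifier = EncoderClassifier.from_hparams(
    source="path/to/local/spkrec-ecapa-voxceleb",
    run_opts={"device": "cpu"},
)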

EncoderClassifier.encode_batch

def encode_batch(self, wavs, wav_lens=None, normalize=False):

Encodes input audio waveforms into fixed-dimensional speaker embedding vectors.

compute_speaker_embeddings

def compute_speaker_embeddings(
    input_filepaths,
    output_file_paths,
    data_folder,
    spk_emb_encoder_path,
    spk_emb_sr,
    mel_spec_params,
    device,
):

Wrapper function that processes JSON manifest files and computes speaker embeddings for every utterance, saving results to pickle files.

Description

This implementation provides two layers of functionality:

  1. Low-level API (EncoderClassifier.encode_batch): Takes raw waveform tensors and returns speaker embedding vectors through a pipeline of feature extraction, normalization, and ECAPA-TDNN encoding.
  2. High-level wrapper (compute_speaker_embeddings): Orchestrates bulk embedding computation for an entire dataset by iterating over JSON manifests, loading audio files, calling encode_batch, and saving the results.

Parameters

encode_batch Parameters

Parameter | Type | Default | Description
wavs | torch.Tensor | required | Batch of waveforms with shape [batch, time] or [batch, time, channels]. Expected sample rate: 16000 Hz
wav_lens | torch.Tensor | None | Relative lengths of waveforms in the batch (values between 0 and 1). If None, all set to 1.0
normalize | bool | False | If True, normalizes embeddings using stored mean-variance statistics

compute_speaker_embeddings Parameters

Parameter | Type | Default | Description
input_filepaths | list | required | List of paths to JSON manifest files (e.g., train.json, valid.json, test.json)
output_file_paths | list | required | List of output pickle file paths, one per input manifest
data_folder | str | required | Root path to the LibriTTS data folder (used to resolve {data_root} placeholders)
spk_emb_encoder_path | str | required | HuggingFace model ID or local path for the speaker encoder (e.g., "speechbrain/spkrec-ecapa-voxceleb")
spk_emb_sr | int | required | Sample rate expected by the speaker encoder (typically 16000)
mel_spec_params | dict | required | Dictionary with mel-spectrogram parameters and custom_mel_spec_encoder flag
device | str | required | Compute device (e.g., "cuda:0" or "cpu")

Returns

encode_batch

Returns a torch.Tensor of shape [batch, 1, 192] containing the speaker embedding for each input waveform.

compute_speaker_embeddings

Returns None. Writes pickle files to the specified output_file_paths.
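
Each pickle holds a plain dictionary mapping utterance IDs to CPU tensors of shape [192]. A minimal sketch for reading one of the output files back (the path mirrors the bulk example below and is illustrative):

import pickle

# Load precomputed embeddings: {utt_id: 192-dim torch.Tensor on CPU}
with open("results/save/train_speaker_embeddings.pickle", "rb") as f:
    speaker_embeddings = pickle.load(f)

utt_id, emb = next(iter(speaker_embeddings.items()))
print(utt_id, emb.shape)  # e.g. torch.Size([192])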

Internal Pipeline

The encode_batch method implements a three-stage pipeline:

# Stage 1: Compute filterbank features
feats = self.mods.compute_features(wavs)

# Stage 2: Mean-variance normalization
feats = self.mods.mean_var_norm(feats, wav_lens)

# Stage 3: Extract embeddings via ECAPA-TDNN
embeddings = self.mods.embedding_model(feats, wav_lens)

The required modules are declared in the MODULES_NEEDED class attribute:

MODULES_NEEDED = [
    "compute_features",
    "mean_var_norm",
    "embedding_model",
    "classifier",
]
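
For inspection or debugging, the same three stages can be run by hand on a loaded classifier. A minimal sketch, assuming the default CPU device and dummy 16 kHz waveforms (placeholder input, not real speech); the result should match encode_batch without normalization:

import torch
from speechbrain.inference.classifiers import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Two dummy 3-second waveforms at 16 kHz; full relative lengths
wavs = torch.randn(2, 48000)
wav_lens = torch.ones(2)

# Stage 1: filterbank features
feats = classifier.mods.compute_features(wavs)
# Stage 2: mean-variance normalization
feats = classifier.mods.mean_var_norm(feats, wav_lens)
# Stage 3: ECAPA-TDNN embedding extraction
embeddings = classifier.mods.embedding_model(feats, wav_lens)
print(embeddings.shape)  # torch.Size([2, 1, 192])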

Usage Examples

Direct Embedding Extraction

import torchaudio
from speechbrain.inference.classifiers import EncoderClassifier

# Load pretrained ECAPA-TDNN speaker encoder
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
    run_opts={"device": "cuda:0"},
)

# Load an audio file
signal, fs = torchaudio.load("utterance.wav")

# Extract speaker embedding (192-dim)
embedding = classifier.encode_batch(signal)
print(embedding.shape)  # torch.Size([1, 1, 192])

# Squeeze to get a flat vector
embedding = embedding.squeeze()
print(embedding.shape)  # torch.Size([192])

Batch Embedding Extraction with Length Handling

import torch
import torchaudio
from speechbrain.inference.classifiers import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Load multiple files and pad to same length
signals = []
lengths = []
for wav_path in ["utt1.wav", "utt2.wav", "utt3.wav"]:
    sig, sr = torchaudio.load(wav_path)
    signals.append(sig.squeeze())
    lengths.append(sig.shape[1])

max_len = max(lengths)
padded = torch.zeros(len(signals), max_len)
for i, sig in enumerate(signals):
    padded[i, :sig.shape[0]] = sig

wav_lens = torch.tensor([l / max_len for l in lengths])

# Extract batch of embeddings
embeddings = classifier.encode_batch(padded, wav_lens)
print(embeddings.shape)  # torch.Size([3, 1, 192])

Bulk Dataset Embedding Computation

from compute_speaker_embeddings import compute_speaker_embeddings

compute_speaker_embeddings(
    input_filepaths=[
        "results/save/train.json",
        "results/save/valid.json",
        "results/save/test.json",
    ],
    output_file_paths=[
        "results/save/train_speaker_embeddings.pickle",
        "results/save/valid_speaker_embeddings.pickle",
        "results/save/test_speaker_embeddings.pickle",
    ],
    data_folder="/data/LibriTTS",
    spk_emb_encoder_path="speechbrain/spkrec-ecapa-voxceleb",
    spk_emb_sr=16000,
    mel_spec_params={
        "custom_mel_spec_encoder": False,
        "sample_rate": 16000,
        "hop_length": 256,
        "win_length": 1024,
        "n_mel_channels": 80,
        "n_fft": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "mel_normalized": False,
        "power": 1,
        "norm": "slaney",
        "mel_scale": "slaney",
        "dynamic_range_compression": True,
    },
    device="cuda:0",
)

Integration in Training Recipe

import speechbrain as sb
from compute_speaker_embeddings import compute_speaker_embeddings

sb.utils.distributed.run_on_main(
    compute_speaker_embeddings,
    kwargs={
        "input_filepaths": [
            hparams["train_json"],
            hparams["valid_json"],
            hparams["test_json"],
        ],
        "output_file_paths": [
            hparams["train_speaker_embeddings_pickle"],
            hparams["valid_speaker_embeddings_pickle"],
            hparams["test_speaker_embeddings_pickle"],
        ],
        "data_folder": hparams["data_folder"],
        "spk_emb_encoder_path": hparams["spk_emb_encoder"],
        "spk_emb_sr": hparams["spk_emb_sample_rate"],
        "mel_spec_params": {
            "custom_mel_spec_encoder": hparams["custom_mel_spec_encoder"],
            ...
        },
        "device": run_opts["device"],
    },
)

Wrapper Implementation Detail

The compute_speaker_embeddings function processes each utterance individually:

for utt_id, utt_data in tqdm(json_data.items()):
    utt_wav_path = utt_data["wav"]
    utt_wav_path = utt_wav_path.replace("{data_root}", data_folder)

    # Load and resample if needed
    signal, sig_sr = torchaudio.load(utt_wav_path)
    if sig_sr != spk_emb_sr:
        signal = torchaudio.functional.resample(signal, sig_sr, spk_emb_sr)
    signal = signal.to(device)

    # Compute embedding
    spk_emb = spk_emb_encoder.encode_batch(signal)
    spk_emb = spk_emb.squeeze().detach()
    speaker_embeddings[utt_id] = spk_emb.cpu()

# Save to pickle
with open(output_file_path, "wb") as output_file:
    pickle.dump(speaker_embeddings, output_file, protocol=pickle.HIGHEST_PROTOCOL)

Key implementation details:

  • Audio is resampled to the speaker encoder's expected sample rate if they differ
  • Embeddings are detached from the computation graph and moved to CPU before storage
  • The {data_root} placeholder in wav paths is replaced with the actual data folder path (see the illustrative manifest entry sketched below)
  • Results are serialized using Python's pickle with the highest protocol for efficiency
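
For reference, the loop above expects each JSON manifest to map an utterance ID to a record whose wav field may contain the {data_root} placeholder. The entry below is a hypothetical sketch (the ID, path, and any additional fields produced by data preparation are assumptions):

# Illustrative shape of json_data after json.load (real manifests carry more fields)
json_data = {
    "1034_121119_000001_000001": {
        "wav": "{data_root}/train-clean-100/1034/121119/1034_121119_000001_000001.wav",
    },
}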

Idempotency

The skip() function checks if all output pickle files already exist. If so, embedding computation is skipped entirely:

def skip(filepaths):
    for filepath in filepaths:
        if not os.path.isfile(filepath):
            return False
    return True
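
A sketch of how this guard can be wired up (the helper name and logging below are hypothetical; the recipe's exact wording differs):

def maybe_compute_speaker_embeddings(**kwargs):
    # Hypothetical helper: recompute only when some output pickle is missing
    if skip(kwargs["output_file_paths"]):
        print("Speaker embeddings already computed, skipping.")
        return
    compute_speaker_embeddings(**kwargs)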

Alternative Encoder

When mel_spec_params["custom_mel_spec_encoder"] is True, the MelSpectrogramEncoder is used instead of EncoderClassifier:

if mel_spec_params["custom_mel_spec_encoder"]:
    spk_emb_encoder = MelSpectrogramEncoder.from_hparams(
        source=spk_emb_encoder_path, run_opts={"device": device}
    )
    spk_emb = spk_emb_encoder.encode_waveform(signal)
else:
    spk_emb_encoder = EncoderClassifier.from_hparams(
        source=spk_emb_encoder_path, run_opts={"device": device}
    )
    spk_emb = spk_emb_encoder.encode_batch(signal)

See Also

  • Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation