# Implementation:Speechbrain Speechbrain EncoderClassifier Encode Batch
| Property | Value |
|---|---|
| Type | Wrapper Doc |
| Repository | speechbrain/speechbrain |
| Source File | recipes/LibriTTS/TTS/mstacotron2/compute_speaker_embeddings.py:L1-129, speechbrain/inference/classifiers.py:L26-117 |
| Import | `from speechbrain.inference.classifiers import EncoderClassifier` |
| Related Principle | Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation |
## API Signatures

### EncoderClassifier.from_hparams

```python
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="tmpdir_spk_emb",
    run_opts={"device": "cuda:0"},
)
```

Loads a pretrained speaker encoder from the HuggingFace Hub or a local directory.

### EncoderClassifier.encode_batch

```python
def encode_batch(self, wavs, wav_lens=None, normalize=False):
```

Encodes input audio waveforms into fixed-dimensional speaker embedding vectors.

### compute_speaker_embeddings

```python
def compute_speaker_embeddings(
    input_filepaths,
    output_file_paths,
    data_folder,
    spk_emb_encoder_path,
    spk_emb_sr,
    mel_spec_params,
    device,
):
```

Wrapper function that processes JSON manifest files, computes a speaker embedding for every utterance, and saves the results to pickle files.
## Description

This implementation provides two layers of functionality:

- **Low-level API** (`EncoderClassifier.encode_batch`): takes raw waveform tensors and returns speaker embedding vectors through a pipeline of feature extraction, normalization, and ECAPA-TDNN encoding.
- **High-level wrapper** (`compute_speaker_embeddings`): orchestrates bulk embedding computation for an entire dataset by iterating over JSON manifests, loading audio files, calling `encode_batch`, and saving the results.
## Parameters

### encode_batch Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| wavs | torch.Tensor | required | Batch of waveforms with shape [batch, time] or [batch, time, channels]. Expected sample rate: 16000 Hz |
| wav_lens | torch.Tensor | None | Relative lengths of waveforms in the batch (values between 0 and 1). If None, all are set to 1.0 |
| normalize | bool | False | If True, normalizes embeddings using stored mean-variance statistics |
### compute_speaker_embeddings Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| input_filepaths | list | required | List of paths to JSON manifest files (e.g., train.json, valid.json, test.json) |
| output_file_paths | list | required | List of output pickle file paths, one per input manifest |
| data_folder | str | required | Root path to the LibriTTS data folder (used to resolve {data_root} placeholders) |
| spk_emb_encoder_path | str | required | HuggingFace model ID or local path for the speaker encoder (e.g., "speechbrain/spkrec-ecapa-voxceleb") |
| spk_emb_sr | int | required | Sample rate expected by the speaker encoder (typically 16000) |
| mel_spec_params | dict | required | Dictionary with mel-spectrogram parameters and the custom_mel_spec_encoder flag |
| device | str | required | Compute device (e.g., "cuda:0" or "cpu") |
## Returns

### encode_batch

Returns a torch.Tensor of shape [batch, 1, 192] containing the speaker embedding for each input waveform.

### compute_speaker_embeddings

Returns None. Writes pickle files to the specified output_file_paths.
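The written pickle files hold a plain dictionary mapping utterance IDs to embeddings, so any downstream consumer can load them with the standard library. A minimal round-trip sketch, using plain Python lists and hypothetical utterance IDs as stand-ins for the CPU torch tensors the real files contain, and a temporary directory in place of the real output path:

```python
import os
import pickle
import tempfile

# Synthetic stand-in for a file written by compute_speaker_embeddings:
# a dict mapping utterance IDs to 192-dim embeddings (plain lists here;
# the real files store torch tensors moved to the CPU).
speaker_embeddings = {
    "utt_0001": [0.1] * 192,
    "utt_0002": [0.2] * 192,
}

path = os.path.join(tempfile.mkdtemp(), "train_speaker_embeddings.pickle")
with open(path, "wb") as f:
    pickle.dump(speaker_embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)

# A downstream consumer (e.g., a dataio pipeline) loads the whole dict once
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded.keys()))     # ['utt_0001', 'utt_0002']
print(len(loaded["utt_0001"]))   # 192
```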
## Internal Pipeline

The encode_batch method implements a three-stage pipeline:

```python
# Stage 1: Compute filterbank features
feats = self.mods.compute_features(wavs)
# Stage 2: Mean-variance normalization
feats = self.mods.mean_var_norm(feats, wav_lens)
# Stage 3: Extract embeddings via ECAPA-TDNN
embeddings = self.mods.embedding_model(feats, wav_lens)
```

The required modules are declared in the MODULES_NEEDED class attribute:

```python
MODULES_NEEDED = [
    "compute_features",
    "mean_var_norm",
    "embedding_model",
    "classifier",
]
```
## Usage Examples

### Direct Embedding Extraction

```python
import torchaudio

from speechbrain.inference.classifiers import EncoderClassifier

# Load pretrained ECAPA-TDNN speaker encoder
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
    run_opts={"device": "cuda:0"},
)

# Load an audio file
signal, fs = torchaudio.load("utterance.wav")

# Extract speaker embedding (192-dim)
embedding = classifier.encode_batch(signal)
print(embedding.shape)  # torch.Size([1, 1, 192])

# Squeeze to get a flat vector
embedding = embedding.squeeze()
print(embedding.shape)  # torch.Size([192])
```
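A common downstream use of these embeddings is comparing two utterances for speaker similarity. The sketch below computes cosine similarity in pure Python, with short plain lists standing in for the 192-dim embedding tensors; with real embeddings you would typically call torch.nn.functional.cosine_similarity on the squeezed vectors instead.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors (plain lists here)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0
emb_a = [1.0, 0.0, 2.0]
emb_b = [2.0, 0.0, 4.0]   # same direction as emb_a
emb_c = [0.0, 3.0, 0.0]   # orthogonal to emb_a

print(round(cosine_similarity(emb_a, emb_b), 4))  # 1.0
print(round(cosine_similarity(emb_a, emb_c), 4))  # 0.0
```

In speaker verification, a threshold on this score decides whether two utterances come from the same speaker.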
### Batch Embedding Extraction with Length Handling

```python
import torch
import torchaudio

from speechbrain.inference.classifiers import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Load multiple files and pad to the same length
signals = []
lengths = []
for wav_path in ["utt1.wav", "utt2.wav", "utt3.wav"]:
    sig, sr = torchaudio.load(wav_path)
    signals.append(sig.squeeze())
    lengths.append(sig.shape[1])

max_len = max(lengths)
padded = torch.zeros(len(signals), max_len)
for i, sig in enumerate(signals):
    padded[i, : sig.shape[0]] = sig

# Relative lengths in (0, 1], as expected by encode_batch
wav_lens = torch.tensor([l / max_len for l in lengths])

# Extract a batch of embeddings
embeddings = classifier.encode_batch(padded, wav_lens)
print(embeddings.shape)  # torch.Size([3, 1, 192])
```
### Bulk Dataset Embedding Computation

```python
from compute_speaker_embeddings import compute_speaker_embeddings

compute_speaker_embeddings(
    input_filepaths=[
        "results/save/train.json",
        "results/save/valid.json",
        "results/save/test.json",
    ],
    output_file_paths=[
        "results/save/train_speaker_embeddings.pickle",
        "results/save/valid_speaker_embeddings.pickle",
        "results/save/test_speaker_embeddings.pickle",
    ],
    data_folder="/data/LibriTTS",
    spk_emb_encoder_path="speechbrain/spkrec-ecapa-voxceleb",
    spk_emb_sr=16000,
    mel_spec_params={
        "custom_mel_spec_encoder": False,
        "sample_rate": 16000,
        "hop_length": 256,
        "win_length": 1024,
        "n_mel_channels": 80,
        "n_fft": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "mel_normalized": False,
        "power": 1,
        "norm": "slaney",
        "mel_scale": "slaney",
        "dynamic_range_compression": True,
    },
    device="cuda:0",
)
```
### Integration in Training Recipe

```python
import speechbrain as sb

from compute_speaker_embeddings import compute_speaker_embeddings

# run_on_main ensures the embeddings are computed only once in
# distributed (multi-process) training
sb.utils.distributed.run_on_main(
    compute_speaker_embeddings,
    kwargs={
        "input_filepaths": [
            hparams["train_json"],
            hparams["valid_json"],
            hparams["test_json"],
        ],
        "output_file_paths": [
            hparams["train_speaker_embeddings_pickle"],
            hparams["valid_speaker_embeddings_pickle"],
            hparams["test_speaker_embeddings_pickle"],
        ],
        "data_folder": hparams["data_folder"],
        "spk_emb_encoder_path": hparams["spk_emb_encoder"],
        "spk_emb_sr": hparams["spk_emb_sample_rate"],
        "mel_spec_params": {
            "custom_mel_spec_encoder": hparams["custom_mel_spec_encoder"],
            ...
        },
        "device": run_opts["device"],
    },
)
```
## Wrapper Implementation Detail

The compute_speaker_embeddings function processes each utterance individually:

```python
for utt_id, utt_data in tqdm(json_data.items()):
    utt_wav_path = utt_data["wav"]
    utt_wav_path = utt_wav_path.replace("{data_root}", data_folder)

    # Load and resample if needed
    signal, sig_sr = torchaudio.load(utt_wav_path)
    if sig_sr != spk_emb_sr:
        signal = torchaudio.functional.resample(signal, sig_sr, spk_emb_sr)
    signal = signal.to(device)

    # Compute embedding
    spk_emb = spk_emb_encoder.encode_batch(signal)
    spk_emb = spk_emb.squeeze().detach()
    speaker_embeddings[utt_id] = spk_emb.cpu()

# Save to pickle
with open(output_file_path, "wb") as output_file:
    pickle.dump(speaker_embeddings, output_file, protocol=pickle.HIGHEST_PROTOCOL)
```

Key implementation details:

- Audio is resampled to the speaker encoder's expected sample rate if the two differ
- Embeddings are detached from the computation graph and moved to the CPU before storage
- The `{data_root}` placeholder in wav paths is replaced with the actual data folder path
- Results are serialized with Python's pickle using the highest protocol for efficiency
## Idempotency

The skip() function checks whether all output pickle files already exist. If so, embedding computation is skipped entirely:

```python
def skip(filepaths):
    for filepath in filepaths:
        if not os.path.isfile(filepath):
            return False
    return True
```
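The all-or-nothing semantics can be exercised with a quick self-contained sketch, where temporary files stand in for the real pickle outputs:

```python
import os
import tempfile

def skip(filepaths):
    # Returns True only when every output file already exists
    for filepath in filepaths:
        if not os.path.isfile(filepath):
            return False
    return True

tmpdir = tempfile.mkdtemp()
existing = os.path.join(tmpdir, "train_speaker_embeddings.pickle")
missing = os.path.join(tmpdir, "valid_speaker_embeddings.pickle")
open(existing, "wb").close()  # create one output, leave the other absent

print(skip([existing]))           # True: all outputs present, skip recomputation
print(skip([existing, missing]))  # False: at least one output missing, recompute
```

Note that a single missing file triggers recomputation of every manifest, not just the missing one.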
## Alternative Encoder

When mel_spec_params["custom_mel_spec_encoder"] is True, MelSpectrogramEncoder is used instead of EncoderClassifier:

```python
if mel_spec_params["custom_mel_spec_encoder"]:
    spk_emb_encoder = MelSpectrogramEncoder.from_hparams(
        source=spk_emb_encoder_path, run_opts={"device": device}
    )
    spk_emb = spk_emb_encoder.encode_waveform(signal)
else:
    spk_emb_encoder = EncoderClassifier.from_hparams(
        source=spk_emb_encoder_path, run_opts={"device": device}
    )
    spk_emb = spk_emb_encoder.encode_batch(signal)
```
## See Also

- Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation - Theoretical foundations of speaker embedding precomputation
- Implementation:Speechbrain_Speechbrain_Prepare_Libritts - Data preparation that produces the JSON manifests consumed by this module
- Implementation:Speechbrain_Speechbrain_Tacotron2Brain_Compute_Forward - Training recipe that consumes the precomputed embeddings