
Implementation:Speechbrain Speechbrain Speaker Dataio Prep

From Leeroopedia


Property Value
Implementation Name Speaker Dataio Prep
Type API Doc
Repository speechbrain/speechbrain
Source File recipes/VoxCeleb/SpeakerRec/train_speaker_embeddings.py:L114-177
Import Recipe-specific. Uses speechbrain.dataio.dataset.DynamicItemDataset, speechbrain.dataio.encoder.CategoricalEncoder
Related Principle Principle:Speechbrain_Speechbrain_Speaker_Feature_Pipeline

API Signature

def dataio_prep(hparams):
    """Creates the datasets and their data processing pipelines."""
    # Returns:
    #   train_data: DynamicItemDataset
    #   valid_data: DynamicItemDataset
    #   label_encoder: CategoricalEncoder

Description

Creates the training and validation datasets with their dynamic item pipelines for speaker embedding training. The function sets up two parallel pipelines: an audio pipeline that loads waveform segments from disk, and a label pipeline that encodes speaker identities as integer indices. It returns fully configured DynamicItemDataset objects ready for use with SpeechBrain's Brain.fit() method.

Parameters

Parameter Type Description
hparams dict Hyperparameters dictionary (loaded from YAML) containing the keys described below.

Required hparams Keys

Key Type Description
data_folder str Root path to VoxCeleb wav data.
train_annotation str Path to the training CSV file (e.g., train.csv).
valid_annotation str Path to the validation CSV file (e.g., dev.csv).
save_folder str Path to folder for saving the label encoder file.
sample_rate int Audio sample rate (typically 16000).
sentence_len float Target segment length in seconds (e.g., 3.0).
random_chunk bool Whether to randomly sample segment start positions.
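
A minimal sketch of an hparams dictionary carrying these keys (the paths and values below are hypothetical placeholders, not recipe defaults):

```python
# Sketch of the hparams entries dataio_prep expects.
# All paths are hypothetical placeholders.
hparams = {
    "data_folder": "/data/voxceleb/wav",
    "train_annotation": "/results/save/train.csv",
    "valid_annotation": "/results/save/dev.csv",
    "save_folder": "/results/save",
    "sample_rate": 16000,
    "sentence_len": 3.0,
    "random_chunk": True,
}

# Target segment length in samples, as derived inside dataio_prep.
snt_len_sample = int(hparams["sample_rate"] * hparams["sentence_len"])
print(snt_len_sample)  # 48000
```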

Inputs

  • CSV files: train.csv and dev.csv produced by prepare_voxceleb, with columns: ID, duration, wav, start, stop, spk_id.
  • hparams dictionary: Configuration specifying data paths, sample rate, and segment length.
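
To make the expected CSV schema concrete, the sketch below writes a single row with the columns listed above (the field values are hypothetical, not taken from a real VoxCeleb manifest):

```python
import csv
import io

# Illustrative row in the column layout produced by prepare_voxceleb.
# The ID, path, and speaker values are hypothetical.
fieldnames = ["ID", "duration", "wav", "start", "stop", "spk_id"]
row = {
    "ID": "id10001--1zcIwhmdeo4--00001_0",
    "duration": "3.0",
    "wav": "$data_root/id10001/1zcIwhmdeo4/00001.wav",
    "start": "0",
    "stop": "48000",
    "spk_id": "id10001",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```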

Outputs

Output Type Description
train_data DynamicItemDataset Training dataset with output keys ["id", "sig", "spk_id_encoded"]
valid_data DynamicItemDataset Validation dataset with output keys ["id", "sig", "spk_id_encoded"]
label_encoder CategoricalEncoder Fitted encoder mapping speaker IDs to integer indices

Audio Pipeline

# Computed earlier in dataio_prep: target segment length in samples.
snt_len_sample = int(hparams["sample_rate"] * hparams["sentence_len"])

@sb.utils.data_pipeline.takes("wav", "start", "stop", "duration")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav, start, stop, duration):
    if hparams["random_chunk"]:
        duration_sample = int(duration * hparams["sample_rate"])
        start = random.randint(0, duration_sample - snt_len_sample)
        stop = start + snt_len_sample
    else:
        start = int(start)
        stop = int(stop)
    num_frames = stop - start
    sig, fs = torchaudio.load(
        wav, num_frames=num_frames, frame_offset=start
    )
    sig = sig.transpose(0, 1).squeeze(1)
    return sig

Behavior:

  • When random_chunk=True: Ignores the pre-computed start/stop and randomly selects a segment of length sentence_len * sample_rate within the full utterance.
  • When random_chunk=False: Uses the exact start/stop sample indices from the CSV.
  • Uses torchaudio.load with frame_offset and num_frames for efficient partial file reading.
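
The random-chunk index arithmetic can be sketched in isolation without any audio I/O (the utterance duration below is hypothetical):

```python
import random

# Index arithmetic behind random_chunk=True, with hypothetical values.
sample_rate = 16000
sentence_len = 3.0
duration = 7.2  # full utterance length in seconds (hypothetical)

snt_len_sample = int(sample_rate * sentence_len)  # 48000 samples
duration_sample = int(duration * sample_rate)     # 115200 samples

# Pick a start so the chunk fits entirely inside the utterance.
start = random.randint(0, duration_sample - snt_len_sample)
stop = start + snt_len_sample

assert 0 <= start and stop <= duration_sample
assert stop - start == snt_len_sample  # every chunk is exactly 3 s long
```

Because `stop - start` is constant, every training example has the same number of samples, which keeps batches uniformly shaped without padding.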

Label Pipeline

@sb.utils.data_pipeline.takes("spk_id")
@sb.utils.data_pipeline.provides("spk_id", "spk_id_encoded")
def label_pipeline(spk_id):
    yield spk_id
    spk_id_encoded = label_encoder.encode_sequence_torch([spk_id])
    yield spk_id_encoded

Behavior:

  • Passes through the original string spk_id as-is.
  • Encodes the speaker ID as an integer tensor using the CategoricalEncoder.
  • Uses a Python generator (yield), so each yield supplies one of the declared outputs in order.
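
The two-stage yield behavior can be mimicked without SpeechBrain. The dict-based encoder below is a simplified stand-in for CategoricalEncoder: it shows only the label-to-index mapping, not persistence or tensor output.

```python
# Stand-in for CategoricalEncoder's core behavior, using a plain dict.
lab2ind = {}

def encode(spk_id):
    # Assign the next free integer index on first sight of a label.
    if spk_id not in lab2ind:
        lab2ind[spk_id] = len(lab2ind)
    return lab2ind[spk_id]

def label_pipeline(spk_id):
    yield spk_id            # first output: the raw string spk_id
    yield [encode(spk_id)]  # second output: encoded index (a list stands in for the tensor)

outputs = list(label_pipeline("id10001"))
print(outputs)  # ['id10001', [0]]
```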

Label Encoder Setup

label_encoder = sb.dataio.encoder.CategoricalEncoder()

lab_enc_file = os.path.join(hparams["save_folder"], "label_encoder.txt")
label_encoder.load_or_create(
    path=lab_enc_file,
    from_didatasets=[train_data],
    output_key="spk_id",
)

Behavior:

  • If label_encoder.txt exists, loads the existing mapping.
  • If not, scans all spk_id values from the training dataset, assigns integer indices, and saves to file.
  • Only fitted on training data to prevent label leakage from validation.
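
The load-or-create pattern can be sketched with a plain text file. This is a simplified stand-in: the real CategoricalEncoder persists its mapping in SpeechBrain's own file format, and the function name below is hypothetical.

```python
import os
import tempfile

def load_or_create(path, train_spk_ids):
    """Simplified stand-in for CategoricalEncoder.load_or_create."""
    if os.path.exists(path):
        # Existing mapping: load it and do not refit.
        with open(path) as f:
            return {ln.split()[0]: int(ln.split()[1]) for ln in f}
    # No file yet: fit on the training labels and save.
    mapping = {spk: i for i, spk in enumerate(sorted(set(train_spk_ids)))}
    with open(path, "w") as f:
        for spk, idx in mapping.items():
            f.write(f"{spk} {idx}\n")
    return mapping

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "label_encoder.txt")
    first = load_or_create(path, ["id10003", "id10001", "id10001"])
    second = load_or_create(path, ["ignored"])  # loads; refit is skipped
    assert first == second == {"id10001": 0, "id10003": 1}
```

The second call ignores its labels entirely, which mirrors why the on-disk encoder keeps validation data from influencing the index mapping.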

Usage Example

import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

# Load hyperparameters
with open("hyperparams/train_ecapa_tdnn.yaml") as fin:
    hparams = load_hyperpyyaml(fin)

# Create datasets and encoder
train_data, valid_data, label_encoder = dataio_prep(hparams)

# Inspect a training sample
sample = train_data[0]
print(sample["id"])              # e.g., "id10001--sess01--utt01_0.0_3.0"
print(sample["sig"].shape)       # e.g., torch.Size([48000]) for 3s at 16kHz
print(sample["spk_id_encoded"])  # e.g., tensor([42])

# Use in Brain.fit() (SpeakerBrain and run_opts are defined in the recipe script)
speaker_brain = SpeakerBrain(
    modules=hparams["modules"],
    opt_class=hparams["opt_class"],
    hparams=hparams,
    run_opts=run_opts,
    checkpointer=hparams["checkpointer"],
)
speaker_brain.fit(
    speaker_brain.hparams.epoch_counter,
    train_data,
    valid_data,
    train_loader_kwargs=hparams["dataloader_options"],
    valid_loader_kwargs=hparams["dataloader_options"],
)

Implementation Notes

  • The snt_len_sample variable is computed as int(hparams["sample_rate"] * hparams["sentence_len"]) and represents the target number of audio samples per segment.
  • DynamicItemDataset.from_csv supports a replacements dictionary for path substitution, allowing CSV paths to use a placeholder (e.g., data_root) that is replaced at runtime.
  • set_output_keys is called on both datasets simultaneously to ensure consistent batch structure.
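
The substitution performed by the replacements dictionary can be sketched as follows, assuming a $-prefixed placeholder convention in the CSV (the placeholder name and paths are illustrative):

```python
import re

# Sketch of the path substitution applied to each CSV field.
# The $data_root placeholder name and the paths are illustrative.
replacements = {"data_root": "/data/voxceleb/wav"}
csv_value = "$data_root/id10001/1zcIwhmdeo4/00001.wav"

resolved = re.sub(
    r"\$([\w.]+)",
    lambda m: replacements[m.group(1)],
    csv_value,
)
print(resolved)  # /data/voxceleb/wav/id10001/1zcIwhmdeo4/00001.wav
```

Keeping a placeholder in the CSV lets the same manifest be reused across machines; only the replacements entry changes at runtime.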
