
Implementation:Speechbrain Speechbrain Speaker Dataio Prep

From Leeroopedia


Property Value
Implementation Name Speaker Dataio Prep
Type API Doc
Repository speechbrain/speechbrain
Source File recipes/VoxCeleb/SpeakerRec/train_speaker_embeddings.py:L114-177
Import Recipe-specific. Uses speechbrain.dataio.dataset.DynamicItemDataset, speechbrain.dataio.encoder.CategoricalEncoder
Related Principle Principle:Speechbrain_Speechbrain_Speaker_Feature_Pipeline

API Signature

def dataio_prep(hparams):
    """Creates the datasets and their data processing pipelines."""
    # Returns:
    #   train_data: DynamicItemDataset
    #   valid_data: DynamicItemDataset
    #   label_encoder: CategoricalEncoder

Description

Creates the training and validation datasets with their dynamic item pipelines for speaker embedding training. The function sets up two parallel pipelines: an audio pipeline that loads waveform segments from disk, and a label pipeline that encodes speaker identities as integer indices. It returns fully configured DynamicItemDataset objects ready for use with SpeechBrain's Brain.fit() method.

Parameters

Parameter Type Description
hparams dict Hyperparameters dictionary (loaded from YAML) containing the keys described below.

Required hparams Keys

Key Type Description
data_folder str Root path to VoxCeleb wav data.
train_annotation str Path to the training CSV file (e.g., train.csv).
valid_annotation str Path to the validation CSV file (e.g., dev.csv).
save_folder str Path to folder for saving the label encoder file.
sample_rate int Audio sample rate (typically 16000).
sentence_len float Target segment length in seconds (e.g., 3.0).
random_chunk bool Whether to randomly sample segment start positions.
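
A minimal sketch of an hparams dictionary carrying these keys (the paths and values below are hypothetical placeholders, not recipe defaults):

```python
# Sketch of the hparams entries dataio_prep expects.
# All paths are hypothetical placeholders.
hparams = {
    "data_folder": "/data/voxceleb/wav",
    "train_annotation": "/results/save/train.csv",
    "valid_annotation": "/results/save/dev.csv",
    "save_folder": "/results/save",
    "sample_rate": 16000,
    "sentence_len": 3.0,
    "random_chunk": True,
}

# Target segment length in samples, as derived inside dataio_prep.
snt_len_sample = int(hparams["sample_rate"] * hparams["sentence_len"])
print(snt_len_sample)  # 48000
```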

Inputs

  • CSV files: train.csv and dev.csv produced by prepare_voxceleb, with columns: ID, duration, wav, start, stop, spk_id.
  • hparams dictionary: Configuration specifying data paths, sample rate, and segment length.
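
To make the expected CSV schema concrete, the sketch below writes a single row with the columns listed above (the field values are hypothetical, not taken from a real VoxCeleb manifest):

```python
import csv
import io

# Illustrative row in the column layout produced by prepare_voxceleb.
# The ID, path, and speaker values are hypothetical.
fieldnames = ["ID", "duration", "wav", "start", "stop", "spk_id"]
row = {
    "ID": "id10001--1zcIwhmdeo4--00001_0",
    "duration": "3.0",
    "wav": "$data_root/id10001/1zcIwhmdeo4/00001.wav",
    "start": "0",
    "stop": "48000",
    "spk_id": "id10001",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```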

Outputs

Output Type Description
train_data DynamicItemDataset Training dataset with output keys ["id", "sig", "spk_id_encoded"]
valid_data DynamicItemDataset Validation dataset with output keys ["id", "sig", "spk_id_encoded"]
label_encoder CategoricalEncoder Fitted encoder mapping speaker IDs to integer indices

Audio Pipeline

# Computed earlier in dataio_prep: target segment length in samples.
snt_len_sample = int(hparams["sample_rate"] * hparams["sentence_len"])

@sb.utils.data_pipeline.takes("wav", "start", "stop", "duration")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav, start, stop, duration):
    if hparams["random_chunk"]:
        duration_sample = int(duration * hparams["sample_rate"])
        start = random.randint(0, duration_sample - snt_len_sample)
        stop = start + snt_len_sample
    else:
        start = int(start)
        stop = int(stop)
    num_frames = stop - start
    sig, fs = torchaudio.load(
        wav, num_frames=num_frames, frame_offset=start
    )
    sig = sig.transpose(0, 1).squeeze(1)
    return sig

Behavior:

  • When random_chunk=True: Ignores the pre-computed start/stop and randomly selects a segment of length sentence_len * sample_rate within the full utterance.
  • When random_chunk=False: Uses the exact start/stop sample indices from the CSV.
  • Uses torchaudio.load with frame_offset and num_frames for efficient partial file reading.
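
The random-chunk index arithmetic can be sketched in isolation without any audio I/O (the utterance duration below is hypothetical):

```python
import random

# Index arithmetic behind random_chunk=True, with hypothetical values.
sample_rate = 16000
sentence_len = 3.0
duration = 7.2  # full utterance length in seconds (hypothetical)

snt_len_sample = int(sample_rate * sentence_len)  # 48000 samples
duration_sample = int(duration * sample_rate)     # 115200 samples

# Pick a start so the chunk fits entirely inside the utterance.
start = random.randint(0, duration_sample - snt_len_sample)
stop = start + snt_len_sample

assert 0 <= start and stop <= duration_sample
assert stop - start == snt_len_sample  # every chunk is exactly 3 s long
```

Because `stop - start` is constant, every training example has the same number of samples, which keeps batches uniformly shaped without padding.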

Label Pipeline

@sb.utils.data_pipeline.takes("spk_id")
@sb.utils.data_pipeline.provides("spk_id", "spk_id_encoded")
def label_pipeline(spk_id):
    yield spk_id
    spk_id_encoded = label_encoder.encode_sequence_torch([spk_id])
    yield spk_id_encoded

Behavior:

  • Passes through the original string spk_id as-is.
  • Encodes the speaker ID as an integer tensor using the CategoricalEncoder.
  • Uses a Python generator (yield), so each yield supplies one of the declared outputs in order.
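
The two-stage yield behavior can be mimicked without SpeechBrain. The dict-based encoder below is a simplified stand-in for CategoricalEncoder: it shows only the label-to-index mapping, not persistence or tensor output.

```python
# Stand-in for CategoricalEncoder's core behavior, using a plain dict.
lab2ind = {}

def encode(spk_id):
    # Assign the next free integer index on first sight of a label.
    if spk_id not in lab2ind:
        lab2ind[spk_id] = len(lab2ind)
    return lab2ind[spk_id]

def label_pipeline(spk_id):
    yield spk_id            # first output: the raw string spk_id
    yield [encode(spk_id)]  # second output: encoded index (a list stands in for the tensor)

outputs = list(label_pipeline("id10001"))
print(outputs)  # ['id10001', [0]]
```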

Label Encoder Setup

label_encoder = sb.dataio.encoder.CategoricalEncoder()

lab_enc_file = os.path.join(hparams["save_folder"], "label_encoder.txt")
label_encoder.load_or_create(
    path=lab_enc_file,
    from_didatasets=[train_data],
    output_key="spk_id",
)

Behavior:

  • If label_encoder.txt exists, loads the existing mapping.
  • If not, scans all spk_id values from the training dataset, assigns integer indices, and saves to file.
  • Only fitted on training data to prevent label leakage from validation.
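
The load-or-create pattern can be sketched with a plain text file. This is a simplified stand-in: the real CategoricalEncoder persists its mapping in SpeechBrain's own file format, and the function name below is hypothetical.

```python
import os
import tempfile

def load_or_create(path, train_spk_ids):
    """Simplified stand-in for CategoricalEncoder.load_or_create."""
    if os.path.exists(path):
        # Existing mapping: load it and do not refit.
        with open(path) as f:
            return {ln.split()[0]: int(ln.split()[1]) for ln in f}
    # No file yet: fit on the training labels and save.
    mapping = {spk: i for i, spk in enumerate(sorted(set(train_spk_ids)))}
    with open(path, "w") as f:
        for spk, idx in mapping.items():
            f.write(f"{spk} {idx}\n")
    return mapping

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "label_encoder.txt")
    first = load_or_create(path, ["id10003", "id10001", "id10001"])
    second = load_or_create(path, ["ignored"])  # loads; refit is skipped
    assert first == second == {"id10001": 0, "id10003": 1}
```

The second call ignores its labels entirely, which mirrors why the on-disk encoder keeps validation data from influencing the index mapping.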

Usage Example

import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

# Load hyperparameters
with open("hyperparams/train_ecapa_tdnn.yaml") as fin:
    hparams = load_hyperpyyaml(fin)

# Create datasets and encoder
train_data, valid_data, label_encoder = dataio_prep(hparams)

# Inspect a training sample
sample = train_data[0]
print(sample["id"])              # e.g., "id10001--sess01--utt01_0.0_3.0"
print(sample["sig"].shape)       # e.g., torch.Size([48000]) for 3s at 16kHz
print(sample["spk_id_encoded"])  # e.g., tensor([42])

# Use in Brain.fit() (SpeakerBrain and run_opts are defined in the recipe script)
speaker_brain = SpeakerBrain(
    modules=hparams["modules"],
    opt_class=hparams["opt_class"],
    hparams=hparams,
    run_opts=run_opts,
    checkpointer=hparams["checkpointer"],
)
speaker_brain.fit(
    speaker_brain.hparams.epoch_counter,
    train_data,
    valid_data,
    train_loader_kwargs=hparams["dataloader_options"],
    valid_loader_kwargs=hparams["dataloader_options"],
)

Implementation Notes

  • The snt_len_sample variable is computed as int(hparams["sample_rate"] * hparams["sentence_len"]) and represents the target number of audio samples per segment.
  • DynamicItemDataset.from_csv supports a replacements dictionary for path substitution, allowing CSV paths to use a placeholder (e.g., data_root) that is replaced at runtime.
  • set_output_keys is called on both datasets simultaneously to ensure consistent batch structure.
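
The substitution performed by the replacements dictionary can be sketched as follows, assuming a $-prefixed placeholder convention in the CSV (the placeholder name and paths are illustrative):

```python
import re

# Sketch of the path substitution applied to each CSV field.
# The $data_root placeholder name and the paths are illustrative.
replacements = {"data_root": "/data/voxceleb/wav"}
csv_value = "$data_root/id10001/1zcIwhmdeo4/00001.wav"

resolved = re.sub(
    r"\$([\w.]+)",
    lambda m: replacements[m.group(1)],
    csv_value,
)
print(resolved)  # /data/voxceleb/wav/id10001/1zcIwhmdeo4/00001.wav
```

Keeping a placeholder in the CSV lets the same manifest be reused across machines; only the replacements entry changes at runtime.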
