Implementation:Speechbrain Speechbrain Speaker Dataio Prep
| Property | Value |
|---|---|
| Implementation Name | Speaker Dataio Prep |
| Type | API Doc |
| Repository | speechbrain/speechbrain |
| Source File | recipes/VoxCeleb/SpeakerRec/train_speaker_embeddings.py:L114-177 |
| Import | Recipe-specific. Uses `speechbrain.dataio.dataset.DynamicItemDataset`, `speechbrain.dataio.encoder.CategoricalEncoder` |
| Related Principle | Principle:Speechbrain_Speechbrain_Speaker_Feature_Pipeline |
API Signature
```python
def dataio_prep(hparams):
    """Creates the datasets and their data processing pipelines."""
    # Returns:
    #   train_data: DynamicItemDataset
    #   valid_data: DynamicItemDataset
    #   label_encoder: CategoricalEncoder
```
Description
Creates the training and validation datasets with their dynamic item pipelines for speaker embedding training. This function sets up two parallel pipelines -- an audio pipeline that loads waveform segments from disk, and a label pipeline that encodes speaker identities as integer indices. It returns fully configured DynamicItemDataset objects ready for use with SpeechBrain's Brain.fit() method.
Parameters
| Parameter | Type | Description |
|---|---|---|
| hparams | dict | Hyperparameters dictionary (loaded from YAML) containing the keys described below. |
Required hparams Keys
| Key | Type | Description |
|---|---|---|
| data_folder | str | Root path to VoxCeleb wav data. |
| train_annotation | str | Path to the training CSV file (e.g., train.csv). |
| valid_annotation | str | Path to the validation CSV file (e.g., dev.csv). |
| save_folder | str | Path to folder for saving the label encoder file. |
| sample_rate | int | Audio sample rate (typically 16000). |
| sentence_len | float | Target segment length in seconds (e.g., 3.0). |
| random_chunk | bool | Whether to randomly sample segment start positions. |
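As a point of reference, a minimal `hparams` dictionary covering these keys might look like the sketch below. The paths are placeholders, not the recipe's actual defaults; in the real recipe this dictionary is loaded from the YAML file via `load_hyperpyyaml`.

```python
# Hypothetical minimal hparams dict with the keys dataio_prep reads.
# Paths are illustrative placeholders only.
hparams = {
    "data_folder": "/data/voxceleb/wav",
    "train_annotation": "/data/voxceleb/train.csv",
    "valid_annotation": "/data/voxceleb/dev.csv",
    "save_folder": "results/save",
    "sample_rate": 16000,
    "sentence_len": 3.0,
    "random_chunk": True,
}

# Derived quantity used by the audio pipeline (see Implementation Notes):
snt_len_sample = int(hparams["sample_rate"] * hparams["sentence_len"])
```

With the typical values above, `snt_len_sample` is 48000 samples, i.e., a 3-second segment at 16 kHz.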
Inputs
- CSV files: train.csv and dev.csv produced by `prepare_voxceleb`, with columns: ID, duration, wav, start, stop, spk_id.
- hparams dictionary: Configuration specifying data paths, sample rate, and segment length.
Outputs
| Output | Type | Description |
|---|---|---|
| train_data | DynamicItemDataset | Training dataset with output keys `["id", "sig", "spk_id_encoded"]` |
| valid_data | DynamicItemDataset | Validation dataset with output keys `["id", "sig", "spk_id_encoded"]` |
| label_encoder | CategoricalEncoder | Fitted encoder mapping speaker IDs to integer indices |
Audio Pipeline
```python
@sb.utils.data_pipeline.takes("wav", "start", "stop", "duration")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav, start, stop, duration):
    if hparams["random_chunk"]:
        duration_sample = int(duration * hparams["sample_rate"])
        start = random.randint(0, duration_sample - snt_len_sample)
        stop = start + snt_len_sample
    else:
        start = int(start)
        stop = int(stop)
    num_frames = stop - start
    sig, fs = torchaudio.load(
        wav, num_frames=num_frames, frame_offset=start
    )
    sig = sig.transpose(0, 1).squeeze(1)
    return sig
```
Behavior:
- When `random_chunk=True`: ignores the pre-computed start/stop and randomly selects a segment of length `sentence_len * sample_rate` samples within the full utterance.
- When `random_chunk=False`: uses the exact start/stop sample indices from the CSV.
- Uses `torchaudio.load` with `frame_offset` and `num_frames` for efficient partial file reading.
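The chunk-selection arithmetic can be isolated in plain Python, with no torchaudio dependency. `pick_chunk` below is a hypothetical helper that mirrors the branch in `audio_pipeline`, not a function from the recipe:

```python
import random

def pick_chunk(duration, start, stop, sample_rate=16000,
               sentence_len=3.0, random_chunk=True):
    """Return (start, stop) sample indices for a segment.

    With random_chunk=True the CSV start/stop are ignored and a random
    window of sentence_len * sample_rate samples is drawn from within
    the utterance; otherwise the CSV indices are used as-is.
    """
    snt_len_sample = int(sample_rate * sentence_len)
    if random_chunk:
        duration_sample = int(duration * sample_rate)
        start = random.randint(0, duration_sample - snt_len_sample)
        stop = start + snt_len_sample
    else:
        start, stop = int(start), int(stop)
    return start, stop

# A 7.2 s utterance always yields an exactly 3 s (48000-sample) chunk:
start, stop = pick_chunk(duration=7.2, start=0, stop=115200)
assert stop - start == 48000
```

Note that utterances shorter than `sentence_len` would make the `randint` range invalid; the recipe's data preparation is expected to provide segments at least that long.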
Label Pipeline
```python
@sb.utils.data_pipeline.takes("spk_id")
@sb.utils.data_pipeline.provides("spk_id", "spk_id_encoded")
def label_pipeline(spk_id):
    yield spk_id
    spk_id_encoded = label_encoder.encode_sequence_torch([spk_id])
    yield spk_id_encoded
```
Behavior:
- Passes through the original string `spk_id` as-is.
- Encodes the speaker ID as an integer tensor using the `CategoricalEncoder`.
- Uses a Python generator (`yield`) for lazy multi-output production.
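The two-`yield` pattern means each `yield` corresponds, in order, to one of the names in `provides`. A dependency-free sketch of the same generator shape, using a plain dict in place of the fitted encoder (the mapping and the second argument are hypothetical, and the real pipeline yields a torch tensor rather than an int):

```python
def label_pipeline_sketch(spk_id, lab2ind):
    # First yield: the raw speaker-ID string, provided as "spk_id".
    yield spk_id
    # Second yield: its integer index, provided as "spk_id_encoded".
    yield lab2ind[spk_id]

gen = label_pipeline_sketch("id10042", {"id10001": 0, "id10042": 1})
assert next(gen) == "id10042"
assert next(gen) == 1
```

The dataset machinery advances the generator only as far as the output keys requested, which is what makes the multi-output production lazy.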
Label Encoder Setup
```python
label_encoder = sb.dataio.encoder.CategoricalEncoder()
lab_enc_file = os.path.join(hparams["save_folder"], "label_encoder.txt")
label_encoder.load_or_create(
    path=lab_enc_file,
    from_didatasets=[train_data],
    output_key="spk_id",
)
```
Behavior:
- If `label_encoder.txt` exists, loads the existing mapping.
- If not, scans all `spk_id` values from the training dataset, assigns integer indices, and saves the mapping to file.
- Only fitted on training data to prevent label leakage from validation.
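The load-or-create behavior can be sketched without SpeechBrain. The helper below is hypothetical and the one-pair-per-line file format is only an approximation of the encoder's actual on-disk format; it illustrates the control flow, not the real serialization:

```python
import os
import tempfile

def load_or_create_sketch(path, spk_ids):
    """Load a label->index mapping from `path`, or fit one from
    `spk_ids` (training data only) and save it."""
    if os.path.exists(path):
        mapping = {}
        with open(path) as f:
            for line in f:
                label, idx = line.rsplit(" ", 1)
                mapping[label] = int(idx)
        return mapping
    # Fit: assign a stable integer index to each distinct speaker ID.
    mapping = {s: i for i, s in enumerate(sorted(set(spk_ids)))}
    with open(path, "w") as f:
        for label, idx in mapping.items():
            f.write(f"{label} {idx}\n")
    return mapping

path = os.path.join(tempfile.mkdtemp(), "label_encoder_demo.txt")
enc = load_or_create_sketch(path, ["id10002", "id10001", "id10002"])
assert enc == {"id10001": 0, "id10002": 1}
# A second call loads the saved file instead of refitting:
assert load_or_create_sketch(path, []) == enc
```

Persisting the mapping is what keeps speaker indices stable across resumed or repeated training runs.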
Usage Example
```python
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

# Load hyperparameters
with open("hyperparams/train_ecapa_tdnn.yaml") as fin:
    hparams = load_hyperpyyaml(fin)

# Create datasets and encoder
train_data, valid_data, label_encoder = dataio_prep(hparams)

# Inspect a training sample
sample = train_data[0]
print(sample["id"])              # e.g., "id10001--sess01--utt01_0.0_3.0"
print(sample["sig"].shape)       # e.g., torch.Size([48000]) for 3 s at 16 kHz
print(sample["spk_id_encoded"])  # e.g., tensor([42])

# Use in Brain.fit() (SpeakerBrain and run_opts are defined elsewhere
# in the recipe script)
speaker_brain = SpeakerBrain(
    modules=hparams["modules"],
    opt_class=hparams["opt_class"],
    hparams=hparams,
    run_opts=run_opts,
    checkpointer=hparams["checkpointer"],
)
speaker_brain.fit(
    speaker_brain.hparams.epoch_counter,
    train_data,
    valid_data,
    train_loader_kwargs=hparams["dataloader_options"],
    valid_loader_kwargs=hparams["dataloader_options"],
)
```
Implementation Notes
- The `snt_len_sample` variable is computed as `int(hparams["sample_rate"] * hparams["sentence_len"])` and represents the target number of audio samples per segment.
- `DynamicItemDataset.from_csv` supports a `replacements` dictionary for path substitution, allowing CSV paths to use a placeholder (e.g., `data_root`) that is replaced at runtime.
- `set_output_keys` is called on both datasets simultaneously to ensure a consistent batch structure.
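The `replacements` mechanism substitutes `$key` tokens in CSV string fields at load time. The helper below is a hypothetical sketch of that substitution (the example wav path is illustrative; consult the recipe's CSVs for the actual placeholder form):

```python
import re

def apply_replacements(path, replacements):
    # Approximates SpeechBrain's CSV loading, which substitutes
    # "$key" tokens in string fields using the replacements dict.
    return re.sub(
        r"\$(\w+)",
        lambda m: str(replacements[m.group(1)]),
        path,
    )

wav = apply_replacements(
    "$data_root/id10001/1zcIwhmdeo4/00001.wav",
    {"data_root": "/data/voxceleb/wav"},
)
assert wav == "/data/voxceleb/wav/id10001/1zcIwhmdeo4/00001.wav"
```

This indirection lets the same CSV files work on any machine: only `data_folder` in the YAML needs to change, not the annotation files themselves.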
See Also
- Principle:Speechbrain_Speechbrain_Speaker_Feature_Pipeline
- Implementation:Speechbrain_Speechbrain_Prepare_Voxceleb
- Implementation:Speechbrain_Speechbrain_SpeakerBrain_Compute_Forward