Implementation:Speechbrain Speechbrain DynamicItemDataset From Csv
| Field | Value |
|---|---|
| Implementation Name | DynamicItemDataset_From_Csv |
| API Signature | DynamicItemDataset.from_csv(csv_path, replacements={}, dynamic_items=[], output_keys=[]) |
| Source File | speechbrain/dataio/dataset.py:L23 (class), L413-418 (from_csv classmethod) |
| Import | from speechbrain.dataio.dataset import DynamicItemDataset |
| Type | API Doc |
| Related Principle | Principle:Speechbrain_Speechbrain_Dataset_Pipeline_Construction |
Description
DynamicItemDataset.from_csv() is a classmethod that creates a DynamicItemDataset from a CSV manifest file. The dataset provides a lazy computation pipeline: static data is loaded from CSV columns, and dynamic items (audio loading, tokenization, etc.) are computed on-the-fly based on declarative dependency specifications. Only the computations needed to produce the requested output keys are executed.
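The lazy evaluation described above can be illustrated with a small plain-Python sketch. It is a toy model rather than SpeechBrain's DataPipeline: the names fetch, DYNAMIC_ITEMS, load_audio, and tokenize are hypothetical, and the "audio loading" is a placeholder string. The point is that a dynamic item runs only when a requested output key depends on it.

```python
# Toy illustration of lazy, dependency-driven evaluation.
# NOT SpeechBrain's DataPipeline; every name here is hypothetical.

calls = []  # records which dynamic items actually ran


def load_audio(wav):
    calls.append("load_audio")
    return f"signal<{wav}>"  # placeholder for a real waveform


def tokenize(wrd):
    calls.append("tokenize")
    return wrd.lower().split()


# provided key -> (function, keys it takes)
DYNAMIC_ITEMS = {
    "sig": (load_audio, ["wav"]),
    "tokens": (tokenize, ["wrd"]),
}


def fetch(static_row, output_keys):
    """Return only output_keys, computing dynamic items on demand."""
    cache = dict(static_row)  # static CSV values are available immediately

    def resolve(key):
        if key not in cache:
            func, takes = DYNAMIC_ITEMS[key]
            cache[key] = func(*(resolve(k) for k in takes))
        return cache[key]

    return {key: resolve(key) for key in output_keys}


row = {"id": "utt1", "wav": "/data/utt1.wav", "wrd": "HELLO WORLD"}
result = fetch(row, ["id", "tokens"])
print(result)
print(calls)  # "sig" was not requested, so load_audio never ran
```

Because "sig" is absent from the requested keys, the audio function is never called, which is the behavior the description attributes to the real dataset.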
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| csv_path | str | (required) | Path to a CSV file produced by data preparation (e.g., the output of prepare_common_voice). The file must contain an ID column with a unique value per row; that value becomes the data point ID. Expected columns for this recipe: ID, duration, wav, spk_id, wrd. |
| replacements | dict | {} | Placeholder substitutions applied to every string value in the CSV. Each key names a placeholder written as $key in the CSV, and the value is the string substituted for it. Commonly used to resolve path placeholders (e.g., {"data_root": "/actual/data/path"} expands $data_root). |
| dynamic_items | list | [] | List of dynamic item configurations. Each can be a DynamicItem object or a dict with keys "func", "takes", "provides". In practice, dynamic items are typically added after construction via add_dynamic_item(). |
| output_keys | list or dict | [] | Keys to include in the output when data points are fetched. Can be a list of strings or a dict mapping output names to internal keys. Typically set after construction via set_output_keys(). |
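To make the role of replacements concrete, here is a standard-library sketch of manifest loading. It assumes placeholders appear in the CSV as $data_root and that the duration column is converted to a float; load_manifest and the manifest contents are hypothetical stand-ins, not SpeechBrain's loader.

```python
import csv
import io
import re

# Hypothetical manifest contents; a real manifest would be read from csv_path.
MANIFEST = """ID,duration,wav,spk_id,wrd
utt1,2.5,$data_root/clips/utt1.wav,spk01,HELLO WORLD
utt2,1.2,$data_root/clips/utt2.wav,spk02,GOOD MORNING
"""


def load_manifest(text, replacements):
    """Parse CSV text into {ID: row}, expanding $name placeholders."""
    placeholder = re.compile(r"\$([\w.]+)")
    data = {}
    for row in csv.DictReader(io.StringIO(text), skipinitialspace=True):
        data_id = row.pop("ID")  # the ID becomes the dictionary key
        for key, value in row.items():
            # Replace each $name with replacements["name"]
            row[key] = placeholder.sub(
                lambda m: str(replacements[m[1]]), value
            )
        row["duration"] = float(row["duration"])
        data[data_id] = row
    return data


data = load_manifest(MANIFEST, {"data_root": "/actual/data/path"})
print(data["utt1"]["wav"])
```

Keeping the data root as a placeholder lets the same manifest be reused on machines where the corpus lives in different directories.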
Outputs
Returns a DynamicItemDataset instance that:
- Implements the PyTorch Dataset interface (__len__, __getitem__)
- When indexed, returns a dict containing only the requested output keys
- Lazily evaluates the computation pipeline to produce dynamic items on demand
Core Methods
add_dynamic_item(func, takes=None, provides=None)
Registers a new dynamic item (computation step) with the dataset pipeline.
| Parameter | Type | Description |
|---|---|---|
| func | callable | A function or generator function that computes the dynamic item. If a generator, it yields multiple values corresponding to multiple provided keys. |
| takes | list or str | Key(s) of existing items (static CSV columns or other dynamic items) passed as positional arguments to func. |
| provides | str or list | Key(s) that this dynamic item produces. If a list, func must be a generator yielding values in order. |
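The pairing between a generator's yields and its provided keys can be sketched in plain Python. This is a simplified model under stated assumptions: run_generator_item is a hypothetical helper, the character codes stand in for a real tokenizer, and the BOS/EOS indices 1 and 2 are made up for the example.

```python
def text_pipeline(wrd):
    """Generator item: one yield per provided key, in order."""
    tokens_list = [ord(c) for c in wrd]  # stand-in for a real tokenizer
    yield tokens_list
    yield [1] + tokens_list  # hypothetical BOS index 1
    yield tokens_list + [2]  # hypothetical EOS index 2


provides = ["tokens_list", "tokens_bos", "tokens_eos"]


def run_generator_item(func, provides, *args):
    """Pair each yielded value with the provided key at the same position."""
    return dict(zip(provides, func(*args)))


items = run_generator_item(text_pipeline, provides, "hi")
print(items)

# A generator can also be advanced one step at a time, so code after
# the first yield does not run until a later key is actually needed.
gen = text_pipeline("hi")
first = next(gen)
print(first)
```

The order of the yield statements must match the order of the provided keys, since the pairing is purely positional.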
set_output_keys(keys)
Sets which keys to include in the output dict when data points are fetched. Only computations needed to produce these keys are executed.
filtered_sorted(sort_key=None, key_max_value={}, reverse=False)
Returns a filtered and/or sorted version of the dataset that shares the underlying static data. Used for duration-based sorting and filtering.
Helper Module Functions
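add_dynamic_item(datasets, func, takes=None, provides=None)
Convenience function that adds the same dynamic item to every dataset in a list. takes and provides can be omitted when func is already decorated with @takes/@provides, as audio_pipeline is in the usage example.

    from speechbrain.dataio.dataset import add_dynamic_item
    # audio_pipeline carries its own @takes/@provides decorators
    add_dynamic_item(datasets, audio_pipeline)
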
set_output_keys(datasets, output_keys)
Convenience function that sets the same output keys on multiple datasets simultaneously.

    from speechbrain.dataio.dataset import set_output_keys
    set_output_keys(datasets, ["id", "sig", "tokens_bos", "tokens_eos", "tokens"])

Usage Example: Complete dataio_prepare Pattern
The following example shows the full pipeline construction pattern from the CTC ASR recipe (recipes/CommonVoice/ASR/CTC/train_with_wav2vec.py, lines 199-310):

    import torch
    import torchaudio
    import speechbrain as sb


    def dataio_prepare(hparams, tokenizer):
        """Prepare datasets with audio and text processing pipelines."""
        data_folder = hparams["data_folder"]

        # 1. Create datasets from CSV manifests
        train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
            csv_path=hparams["train_csv"],
            replacements={"data_root": data_folder},
        )
        valid_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
            csv_path=hparams["valid_csv"],
            replacements={"data_root": data_folder},
        )
        test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
            csv_path=hparams["test_csv"],
            replacements={"data_root": data_folder},
        )

        # Optional: sort training data by duration
        if hparams["sorting"] == "ascending":
            train_data = train_data.filtered_sorted(
                sort_key="duration",
                key_max_value={"duration": hparams["avoid_if_longer_than"]},
            )
            # Shuffling in the dataloader would undo the sorting
            hparams["dataloader_options"]["shuffle"] = False

        # Validation and test data are always sorted by duration
        valid_data = valid_data.filtered_sorted(sort_key="duration")
        test_data = test_data.filtered_sorted(sort_key="duration")

        datasets = [train_data, valid_data, test_data]

        # 2. Define audio pipeline using @takes/@provides decorators
        @sb.utils.data_pipeline.takes("wav")
        @sb.utils.data_pipeline.provides("sig")
        def audio_pipeline(wav):
            info = torchaudio.info(wav)
            sig = sb.dataio.dataio.read_audio(wav)
            # Resample from the file's rate to the rate the model expects
            resampled = torchaudio.transforms.Resample(
                info.sample_rate,
                hparams["sample_rate"],
            )(sig)
            return resampled

        sb.dataio.dataset.add_dynamic_item(datasets, audio_pipeline)

        # 3. Define text pipeline (generator for multiple outputs)
        @sb.utils.data_pipeline.takes("wrd")
        @sb.utils.data_pipeline.provides(
            "tokens_list", "tokens_bos", "tokens_eos", "tokens"
        )
        def text_pipeline(wrd):
            tokens_list = tokenizer.sp.encode_as_ids(wrd)
            yield tokens_list
            tokens_bos = torch.LongTensor(
                [hparams["bos_index"]] + tokens_list
            )
            yield tokens_bos
            tokens_eos = torch.LongTensor(
                tokens_list + [hparams["eos_index"]]
            )
            yield tokens_eos
            tokens = torch.LongTensor(tokens_list)
            yield tokens

        sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)

        # 4. Set output keys for all datasets
        sb.dataio.dataset.set_output_keys(
            datasets,
            ["id", "sig", "tokens_bos", "tokens_eos", "tokens"],
        )

        return train_data, valid_data, test_data

Pipeline Data Flow
The computation graph for the CTC ASR pipeline:

    Static CSV columns:
      ID, duration, wav, spk_id, wrd
                     |            |
                     v            v
              audio_pipeline   text_pipeline
              takes: "wav"     takes: "wrd"
              provides: "sig"  provides: "tokens_list", "tokens_bos",
                     |                   "tokens_eos", "tokens"
                     v            v
    Output keys: ["id", "sig", "tokens_bos", "tokens_eos", "tokens"]

Sorting and Filtering
The filtered_sorted() method supports:
| Parameter | Type | Description |
|---|---|---|
| sort_key | str | CSV column or dynamic item key to sort by (e.g., "duration") |
| reverse | bool | If True, sort in descending order |
| key_max_value | dict | Filter out items where data_point[key] > limit |
| key_min_value | dict | Filter out items where data_point[key] < limit |
| select_n | int | Keep at most N data points (for debugging) |
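The filtering and sorting rules in the table can be sketched over a plain dict of data points. This is a simplified model, not SpeechBrain's implementation: filtered_sorted_ids and the sample durations are hypothetical, and select_n is left out because its interaction with sorting is not specified here.

```python
def filtered_sorted_ids(data, key_min_value=None, key_max_value=None,
                        sort_key=None, reverse=False):
    """Return data point IDs that pass the filters, optionally sorted."""
    key_min_value = key_min_value or {}
    key_max_value = key_max_value or {}
    ids = []
    for data_id, point in data.items():
        too_small = any(point[k] < lim for k, lim in key_min_value.items())
        too_large = any(point[k] > lim for k, lim in key_max_value.items())
        if not (too_small or too_large):
            ids.append(data_id)
    if sort_key is not None:
        ids.sort(key=lambda i: data[i][sort_key], reverse=reverse)
    return ids


# Hypothetical durations in seconds
data = {
    "utt1": {"duration": 4.0},
    "utt2": {"duration": 1.5},
    "utt3": {"duration": 12.0},
    "utt4": {"duration": 2.5},
}

# Drop utterances longer than 10 s, then sort shortest first
kept = filtered_sorted_ids(
    data, key_max_value={"duration": 10.0}, sort_key="duration"
)
print(kept)

# No filtering, longest first
longest_first = filtered_sorted_ids(data, sort_key="duration", reverse=True)
print(longest_first)
```

Sorting by duration groups utterances of similar length into the same batch, which reduces padding; this is why the usage example disables dataloader shuffling once the training data is sorted.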
Dependencies
- speechbrain.dataio.dataio.load_data_csv -- for parsing CSV files into dict-of-dicts format
- speechbrain.utils.data_pipeline.DataPipeline -- the underlying computation graph engine
- speechbrain.utils.data_pipeline.takes / provides -- decorators for pipeline function definition
- torch.utils.data.Dataset -- base class providing PyTorch DataLoader compatibility