Implementation:Speechbrain Speechbrain DynamicItemDataset From Csv

Field	Value
Implementation Name	DynamicItemDataset_From_Csv
API Signature	`DynamicItemDataset.from_csv(csv_path, replacements={}, dynamic_items=[], output_keys=[])`
Source File	speechbrain/dataio/dataset.py:L23 (class), L413-418 (from_csv classmethod)
Import	`from speechbrain.dataio.dataset import DynamicItemDataset`
Type	API Doc
Related Principle	Principle:Speechbrain_Speechbrain_Dataset_Pipeline_Construction

Description

DynamicItemDataset.from_csv() is a classmethod that creates a DynamicItemDataset from a CSV manifest file. The dataset provides a lazy computation pipeline: static data is loaded from CSV columns, and dynamic items (audio loading, tokenization, etc.) are computed on-the-fly based on declarative dependency specifications. Only the computations needed to produce the requested output keys are executed.

Inputs

Parameter	Type	Default	Description
`csv_path`	str	(required)	Path to a CSV manifest, typically produced by a data preparation step (e.g., the output of `prepare_common_voice`). The file must contain an `ID` column, whose values become the data point IDs. The CommonVoice recipes use the columns ID, duration, wav, spk_id, wrd.
`replacements`	dict	{}	Values substituted for `$name` placeholders in the CSV fields. Keys are placeholder names (written without the `$`); values are the strings to insert. Commonly used to resolve path placeholders (e.g., `{"data_root": "/actual/data/path"}` turns `$data_root/clips/a.wav` into `/actual/data/path/clips/a.wav`).
`dynamic_items`	list	[]	List of dynamic item configurations. Each can be a `DynamicItem` object or a dict with keys `"func"`, `"takes"`, `"provides"`. In practice, dynamic items are typically added after construction via `add_dynamic_item()`.
`output_keys`	list or dict	[]	Keys to include in the output when data points are fetched. Can be a list of strings or a dict mapping output names to internal keys. Typically set after construction via `set_output_keys()`.
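
To make the `replacements` behavior concrete, here is a minimal standard-library sketch of placeholder substitution over manifest rows. It is an illustration, not SpeechBrain's code: the helper name `load_csv_with_replacements` and the sample rows are hypothetical, and the sketch assumes that placeholders appear in the manifest as `$data_root` and that `duration` is parsed as a float.

```python
import csv
import io
import re

# Matches placeholders such as $data_root (assumed manifest convention).
PLACEHOLDER = re.compile(r"\$([\w.]+)")


def load_csv_with_replacements(csv_text, replacements):
    """Hypothetical helper: parse CSV text into {ID: {column: value}}."""
    data = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        data_id = row.pop("ID")  # the ID column becomes the dict key
        for key, value in row.items():
            # Replace every $name with replacements[name].
            row[key] = PLACEHOLDER.sub(
                lambda m: str(replacements[m.group(1)]), value
            )
        if "duration" in row:
            # Assumption: durations are numeric so they can be sorted.
            row["duration"] = float(row["duration"])
        data[data_id] = row
    return data


manifest = (
    "ID,duration,wav,spk_id,wrd\n"
    "utt1,2.5,$data_root/clips/utt1.wav,spk1,hello world\n"
    "utt2,1.0,$data_root/clips/utt2.wav,spk2,good morning\n"
)
data = load_csv_with_replacements(manifest, {"data_root": "/actual/data/path"})
print(data["utt1"]["wav"])
```

The result is a dict keyed by data point ID, which matches the dict-of-dicts shape that the dataset stores as its static data.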

Outputs

Returns a DynamicItemDataset instance that:

Implements the PyTorch Dataset interface (__len__, __getitem__)
When indexed, returns a dict containing only the requested output keys
Lazily evaluates the computation pipeline to produce dynamic items on demand
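
The lazy behavior described above can be illustrated with a toy dependency resolver. This is a sketch under stated assumptions, not SpeechBrain's `DataPipeline`: the class `LazyPipeline`, its `calls` log, and the stand-in functions `load_audio` and `tokenize` are all hypothetical and exist only to show that unrequested items are never computed.

```python
class LazyPipeline:
    """Toy dependency resolver (illustration only)."""

    def __init__(self):
        self.items = {}  # provided key -> (func, list of input keys)
        self.calls = []  # names of the functions that actually ran

    def add_dynamic_item(self, func, takes, provides):
        self.items[provides] = (func, list(takes))

    def compute(self, data_point, output_keys):
        cache = dict(data_point)  # static values are available up front

        def resolve(key):
            if key not in cache:
                func, takes = self.items[key]
                args = [resolve(k) for k in takes]  # resolve inputs first
                self.calls.append(func.__name__)
                cache[key] = func(*args)
            return cache[key]

        return {key: resolve(key) for key in output_keys}


def load_audio(wav):
    return f"<samples from {wav}>"  # stand-in for real audio loading


def tokenize(wrd):
    return wrd.split()  # stand-in for a real tokenizer


pipeline = LazyPipeline()
pipeline.add_dynamic_item(load_audio, takes=["wav"], provides="sig")
pipeline.add_dynamic_item(tokenize, takes=["wrd"], provides="tokens")

point = {"id": "utt1", "wav": "utt1.wav", "wrd": "hello world"}
out = pipeline.compute(point, ["id", "tokens"])
print(out)
print(pipeline.calls)
```

Because `"sig"` is not among the requested keys, `load_audio` never runs. In a real dataset the same property means that text-only passes, such as collecting transcripts, can skip audio decoding.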

Core Methods

`add_dynamic_item(func, takes=None, provides=None)`

Registers a new dynamic item (computation step) with the dataset pipeline.

Parameter	Type	Description
`func`	callable	A function or generator function that computes the dynamic item. If a generator, it yields multiple values corresponding to multiple provided keys.
`takes`	list or str	Key(s) of existing items (static CSV columns or other dynamic items) passed as positional arguments to `func`.
`provides`	str or list	Key(s) that this dynamic item produces. If a list, `func` must be a generator yielding values in order.
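
The relationship between a generator function and a list of provided keys can be sketched as follows. The helper `run_item` is hypothetical and consumes the generator eagerly for simplicity; the character-code "tokenizer" and the BOS/EOS indices 1 and 2 are placeholders chosen for the example.

```python
import inspect


def run_item(func, provides, *args):
    """Hypothetical helper: pair a function's outputs with its provided keys."""
    if isinstance(provides, str):
        provides = [provides]
    if inspect.isgeneratorfunction(func):
        values = list(func(*args))  # one yielded value per provided key
    else:
        values = [func(*args)]      # a plain function provides one key
    if len(values) != len(provides):
        raise ValueError("number of outputs does not match provided keys")
    return dict(zip(provides, values))


def text_pipeline(wrd):
    tokens_list = [ord(c) for c in wrd]  # stand-in for a real tokenizer
    yield tokens_list
    yield [1] + tokens_list              # assumed BOS index 1
    yield tokens_list + [2]              # assumed EOS index 2


result = run_item(
    text_pipeline, ["tokens_list", "tokens_bos", "tokens_eos"], "hi"
)
print(result)
```

The order of the `yield` statements must match the order of the keys in `provides`, since values are paired with keys by position.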

`set_output_keys(keys)`

Sets which keys to include in the output dict when data points are fetched. Only computations needed to produce these keys are executed.

`filtered_sorted(key_min_value={}, key_max_value={}, key_test={}, sort_key=None, reverse=False, select_n=None)`

Returns a filtered and/or sorted version of the dataset that shares the underlying static data. Used for duration-based sorting and filtering.

Helper Module Functions

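`add_dynamic_item(datasets, func, takes=None, provides=None)`

Convenience function that adds the same dynamic item to multiple datasets simultaneously. When `func` is already decorated with `takes` and `provides`, the last two arguments can be omitted.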

from speechbrain.dataio.dataset import add_dynamic_item
add_dynamic_item(datasets, audio_pipeline)

`set_output_keys(datasets, output_keys)`

Convenience function that sets the same output keys on multiple datasets simultaneously.

from speechbrain.dataio.dataset import set_output_keys
set_output_keys(datasets, ["id", "sig", "tokens_bos", "tokens_eos", "tokens"])

Usage Example: Complete dataio_prepare Pattern

The following example shows the full pipeline construction pattern from the CTC ASR recipe (recipes/CommonVoice/ASR/CTC/train_with_wav2vec.py, lines 199-310):

import torch
import torchaudio
import speechbrain as sb

def dataio_prepare(hparams, tokenizer):
    """Prepare datasets with audio and text processing pipelines."""

    data_folder = hparams["data_folder"]

    # 1. Create datasets from CSV manifests
    train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["train_csv"],
        replacements={"data_root": data_folder},
    )

    valid_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["valid_csv"],
        replacements={"data_root": data_folder},
    )

    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["test_csv"],
        replacements={"data_root": data_folder},
    )

    # Optional: sort training data by duration
    if hparams["sorting"] == "ascending":
        train_data = train_data.filtered_sorted(
            sort_key="duration",
            key_max_value={"duration": hparams["avoid_if_longer_than"]},
        )
        hparams["dataloader_options"]["shuffle"] = False

    valid_data = valid_data.filtered_sorted(sort_key="duration")
    test_data = test_data.filtered_sorted(sort_key="duration")

    datasets = [train_data, valid_data, test_data]

    # 2. Define audio pipeline using @takes/@provides decorators
    @sb.utils.data_pipeline.takes("wav")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(wav):
        info = torchaudio.info(wav)
        sig = sb.dataio.dataio.read_audio(wav)
        resampled = torchaudio.transforms.Resample(
            info.sample_rate,
            hparams["sample_rate"],
        )(sig)
        return resampled

    sb.dataio.dataset.add_dynamic_item(datasets, audio_pipeline)

    # 3. Define text pipeline (generator for multiple outputs)
    @sb.utils.data_pipeline.takes("wrd")
    @sb.utils.data_pipeline.provides(
        "tokens_list", "tokens_bos", "tokens_eos", "tokens"
    )
    def text_pipeline(wrd):
        tokens_list = tokenizer.sp.encode_as_ids(wrd)
        yield tokens_list
        tokens_bos = torch.LongTensor(
            [hparams["bos_index"]] + tokens_list
        )
        yield tokens_bos
        tokens_eos = torch.LongTensor(
            tokens_list + [hparams["eos_index"]]
        )
        yield tokens_eos
        tokens = torch.LongTensor(tokens_list)
        yield tokens

    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)

    # 4. Set output keys for all datasets
    sb.dataio.dataset.set_output_keys(
        datasets,
        ["id", "sig", "tokens_bos", "tokens_eos", "tokens"],
    )

    return train_data, valid_data, test_data

Pipeline Data Flow

The computation graph for the CTC ASR pipeline:

Static CSV columns:
  ID, duration, wav, spk_id, wrd
         |                    |
         v                    v
  audio_pipeline         text_pipeline
  takes: "wav"           takes: "wrd"
  provides: "sig"        provides: "tokens_list", "tokens_bos",
         |                         "tokens_eos", "tokens"
         v                    v
  Output keys: ["id", "sig", "tokens_bos", "tokens_eos", "tokens"]

Sorting and Filtering

The filtered_sorted() method supports:

Parameter	Type	Description
`sort_key`	str	CSV column or dynamic item key to sort by (e.g., `"duration"`)
`reverse`	bool	If True, sort in descending order
`key_max_value`	dict	Filter out items where `data_point[key] > limit`
`key_min_value`	dict	Filter out items where `data_point[key] < limit`
`select_n`	int	Keep at most N data points (for debugging)
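
The filtering and sorting semantics in the table above can be sketched over a plain dict of data points. This is an illustration, not the library's implementation: the helper name `filtered_sorted_ids` and the sample durations are hypothetical, and `select_n` is left out of the sketch.

```python
def filtered_sorted_ids(data, key_min_value=None, key_max_value=None,
                        sort_key=None, reverse=False):
    """Hypothetical helper: return data point IDs after filtering and sorting."""
    key_min_value = key_min_value or {}
    key_max_value = key_max_value or {}

    def keep(point):
        # A point is kept only if it satisfies every bound (inclusive).
        return (
            all(point[k] >= v for k, v in key_min_value.items())
            and all(point[k] <= v for k, v in key_max_value.items())
        )

    ids = [data_id for data_id, point in data.items() if keep(point)]
    if sort_key is not None:
        ids.sort(key=lambda data_id: data[data_id][sort_key], reverse=reverse)
    return ids


data = {
    "utt1": {"duration": 4.0},
    "utt2": {"duration": 1.5},
    "utt3": {"duration": 12.0},
    "utt4": {"duration": 2.5},
}
ids = filtered_sorted_ids(
    data, key_max_value={"duration": 10.0}, sort_key="duration"
)
print(ids)
```

Here `utt3` is dropped because its duration exceeds the limit, and the remaining IDs are ordered from shortest to longest, which is the arrangement the `"ascending"` branch of the usage example relies on.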

Dependencies

speechbrain.dataio.dataio.load_data_csv -- for parsing CSV files into dict-of-dicts format
speechbrain.utils.data_pipeline.DataPipeline -- the underlying computation graph engine
speechbrain.utils.data_pipeline.takes / provides -- decorators for pipeline function definition
torch.utils.data.Dataset -- base class providing PyTorch DataLoader compatibility
