Implementation:Speechbrain Speechbrain DynamicItemDataset From Csv


Field | Value
Implementation Name | DynamicItemDataset_From_Csv
API Signature | DynamicItemDataset.from_csv(csv_path, replacements={}, dynamic_items=[], output_keys=[])
Source File | speechbrain/dataio/dataset.py:L23 (class), L413-418 (from_csv classmethod)
Import | from speechbrain.dataio.dataset import DynamicItemDataset
Type | API Doc
Related Principle | Principle:Speechbrain_Speechbrain_Dataset_Pipeline_Construction

Description

DynamicItemDataset.from_csv() is a classmethod that creates a DynamicItemDataset from a CSV manifest file. The dataset provides a lazy computation pipeline: static data is loaded from CSV columns, and dynamic items (audio loading, tokenization, etc.) are computed on-the-fly based on declarative dependency specifications. Only the computations needed to produce the requested output keys are executed.
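
To make the lazy evaluation concrete, the following is a minimal toy sketch in plain Python. It is not SpeechBrain's DataPipeline: the names `items`, `resolve`, and `fetch` are hypothetical, and the sketch only illustrates the idea that a dynamic item runs when, and only when, a requested output key depends on it.

```python
# Toy illustration of lazy, dependency-driven evaluation.
# NOT SpeechBrain's implementation: `items`, `resolve`, `fetch` are hypothetical.

calls = []  # records which dynamic items actually ran


def load_audio(wav):
    calls.append("load_audio")
    return f"<signal from {wav}>"


def tokenize(wrd):
    calls.append("tokenize")
    return wrd.lower().split()


# Each dynamic item: provided key -> (function, keys it takes)
items = {
    "sig": (load_audio, ["wav"]),
    "tokens": (tokenize, ["wrd"]),
}


def resolve(key, static, cache):
    """Return the value for `key`, computing dynamic items only on demand."""
    if key in static:
        return static[key]
    if key not in cache:
        func, takes = items[key]
        args = [resolve(k, static, cache) for k in takes]
        cache[key] = func(*args)
    return cache[key]


def fetch(static, output_keys):
    """Build the output dict for one data point."""
    cache = {}
    return {k: resolve(k, static, cache) for k in output_keys}


row = {"id": "utt1", "wav": "/data/utt1.wav", "wrd": "HELLO WORLD"}
out = fetch(row, ["id", "tokens"])
```

Because "sig" is not among the requested keys, `load_audio` is never called in this sketch; only `tokenize` runs.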

Inputs

Parameter | Type | Default | Description
csv_path | str | (required) | Path to a CSV manifest produced by data preparation (e.g., the output of prepare_common_voice). The manifest must contain an ID column, whose value identifies each data point. Expected columns: ID, duration, wav, spk_id, wrd.
replacements | dict | {} | String substitutions applied to all values in the CSV. Keys are placeholder names; values are the replacement strings. A placeholder is written in the CSV as $name, so {"data_root": "/actual/data/path"} expands $data_root. Commonly used to resolve path placeholders.
dynamic_items | list | [] | List of dynamic item configurations. Each can be a DynamicItem object or a dict with keys "func", "takes", "provides". In practice, dynamic items are typically added after construction via add_dynamic_item().
output_keys | list or dict | [] | Keys to include in the output when data points are fetched. Can be a list of strings or a dict mapping output names to internal keys. Typically set after construction via set_output_keys().
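
As a rough illustration of how a manifest and its replacements fit together, the sketch below parses a small in-memory CSV with the standard library. It assumes placeholders are written as $data_root in the manifest; `read_manifest` and the sample rows are hypothetical, and this is a simplified stand-in for the library's CSV loader, not its actual code.

```python
import csv
import io
import re

# Hypothetical in-memory manifest; real manifests are files on disk.
MANIFEST = """ID,duration,wav,spk_id,wrd
utt1,2.5,$data_root/clips/utt1.wav,spk1,HELLO WORLD
utt2,1.0,$data_root/clips/utt2.wav,spk2,GOOD MORNING
"""


def read_manifest(text, replacements):
    """Parse CSV text into {ID: {column: value}}, expanding $placeholders."""
    pattern = re.compile(r"\$([\w.]+)")
    data = {}
    for row in csv.DictReader(io.StringIO(text)):
        data_id = row.pop("ID")  # the ID becomes the key of the outer dict
        clean = {
            key: pattern.sub(lambda m: str(replacements[m.group(1)]), value)
            for key, value in row.items()
        }
        clean["duration"] = float(clean["duration"])  # numeric, so it can be sorted
        data[data_id] = clean
    return data


data = read_manifest(MANIFEST, {"data_root": "/actual/data/path"})
```

The result is a dict-of-dicts keyed by ID, with every $data_root occurrence replaced by the real data folder.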

Outputs

Returns a DynamicItemDataset instance that:

  • Implements the PyTorch Dataset interface (__len__, __getitem__)
  • When indexed, returns a dict containing only the requested output keys
  • Lazily evaluates the computation pipeline to produce dynamic items on demand

Core Methods

add_dynamic_item(func, takes=None, provides=None)

Registers a new dynamic item (computation step) with the dataset pipeline.

Parameter | Type | Description
func | callable | A function or generator function that computes the dynamic item. If a generator, it yields multiple values corresponding to multiple provided keys.
takes | list or str | Key(s) of existing items (static CSV columns or other dynamic items) passed as positional arguments to func.
provides | str or list | Key(s) that this dynamic item produces. If a list, func must be a generator yielding values in order.
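
The relationship between a generator's yielded values and its provided keys can be sketched in plain Python. `run_dynamic_item` and `text_stats` are hypothetical names used only for illustration; the sketch shows the ordering contract (first yield maps to the first provided key, and so on), not how the library evaluates generators internally.

```python
import inspect

# Toy sketch: pairing a generator's yielded values with its "provides" keys.
# `run_dynamic_item` is a hypothetical helper, not part of SpeechBrain.


def text_stats(wrd):
    words = wrd.split()
    yield words       # value for the first provided key
    yield len(words)  # value for the second provided key


def run_dynamic_item(func, args, provides):
    """Call func with positional args and return {provided_key: value}."""
    if inspect.isgeneratorfunction(func):
        values = list(func(*args))
    else:
        values = [func(*args)]
    if len(values) != len(provides):
        raise ValueError("number of values must match number of provided keys")
    return dict(zip(provides, values))


result = run_dynamic_item(text_stats, ["hello brave world"], ["words", "n_words"])
```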

set_output_keys(keys)

Sets which keys to include in the output dict when data points are fetched. Only computations needed to produce these keys are executed.
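
Output keys may be given as a list or as a dict mapping output names to internal keys. The toy sketch below shows the difference between the two forms; `select_outputs` is a hypothetical helper operating on an already computed dict, so it says nothing about how the library schedules the computations.

```python
# Toy sketch of output-key selection; `select_outputs` is a hypothetical helper.


def select_outputs(computed, output_keys):
    """Pick values from `computed` using a list of keys or a name->key dict."""
    if isinstance(output_keys, dict):
        # Dict form: output name -> internal key
        return {name: computed[key] for name, key in output_keys.items()}
    # List form: output name and internal key are the same
    return {key: computed[key] for key in output_keys}


computed = {"id": "utt1", "sig": [0.1, 0.2], "tokens": [5, 9]}
as_list = select_outputs(computed, ["id", "tokens"])
as_dict = select_outputs(computed, {"utt_id": "id", "audio": "sig"})
```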

filtered_sorted(key_min_value={}, key_max_value={}, sort_key=None, reverse=False, select_n=None)

Returns a filtered and/or sorted version of the dataset that shares the underlying static data. Used for duration-based sorting and filtering.

Helper Module Functions

add_dynamic_item(datasets, func, takes=None, provides=None)

Convenience function that adds the same dynamic item to multiple datasets simultaneously.

from speechbrain.dataio.dataset import add_dynamic_item
add_dynamic_item(datasets, audio_pipeline)

set_output_keys(datasets, output_keys)

Convenience function that sets the same output keys on multiple datasets simultaneously.

from speechbrain.dataio.dataset import set_output_keys
set_output_keys(datasets, ["id", "sig", "tokens_bos", "tokens_eos", "tokens"])

Usage Example: Complete dataio_prepare Pattern

The following example shows the full pipeline construction pattern from the CTC ASR recipe (recipes/CommonVoice/ASR/CTC/train_with_wav2vec.py, lines 199-310):

import torch
import torchaudio
import speechbrain as sb

def dataio_prepare(hparams, tokenizer):
    """Prepare datasets with audio and text processing pipelines."""

    data_folder = hparams["data_folder"]

    # 1. Create datasets from CSV manifests
    train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["train_csv"],
        replacements={"data_root": data_folder},
    )

    valid_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["valid_csv"],
        replacements={"data_root": data_folder},
    )

    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["test_csv"],
        replacements={"data_root": data_folder},
    )

    # Optional: sort training data by duration
    if hparams["sorting"] == "ascending":
        train_data = train_data.filtered_sorted(
            sort_key="duration",
            key_max_value={"duration": hparams["avoid_if_longer_than"]},
        )
        hparams["dataloader_options"]["shuffle"] = False

    valid_data = valid_data.filtered_sorted(sort_key="duration")
    test_data = test_data.filtered_sorted(sort_key="duration")

    datasets = [train_data, valid_data, test_data]

    # 2. Define audio pipeline using @takes/@provides decorators
    @sb.utils.data_pipeline.takes("wav")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(wav):
        info = torchaudio.info(wav)
        sig = sb.dataio.dataio.read_audio(wav)
        resampled = torchaudio.transforms.Resample(
            info.sample_rate,
            hparams["sample_rate"],
        )(sig)
        return resampled

    sb.dataio.dataset.add_dynamic_item(datasets, audio_pipeline)

    # 3. Define text pipeline (generator for multiple outputs)
    @sb.utils.data_pipeline.takes("wrd")
    @sb.utils.data_pipeline.provides(
        "tokens_list", "tokens_bos", "tokens_eos", "tokens"
    )
    def text_pipeline(wrd):
        tokens_list = tokenizer.sp.encode_as_ids(wrd)
        yield tokens_list
        tokens_bos = torch.LongTensor(
            [hparams["bos_index"]] + tokens_list
        )
        yield tokens_bos
        tokens_eos = torch.LongTensor(
            tokens_list + [hparams["eos_index"]]
        )
        yield tokens_eos
        tokens = torch.LongTensor(tokens_list)
        yield tokens

    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)

    # 4. Set output keys for all datasets
    sb.dataio.dataset.set_output_keys(
        datasets,
        ["id", "sig", "tokens_bos", "tokens_eos", "tokens"],
    )

    return train_data, valid_data, test_data

Pipeline Data Flow

The computation graph for the CTC ASR pipeline:

Static CSV columns:
  ID, duration, wav, spk_id, wrd
         |                    |
         v                    v
  audio_pipeline         text_pipeline
  takes: "wav"           takes: "wrd"
  provides: "sig"        provides: "tokens_list", "tokens_bos",
         |                         "tokens_eos", "tokens"
         v                    v
  Output keys: ["id", "sig", "tokens_bos", "tokens_eos", "tokens"]

Sorting and Filtering

The filtered_sorted() method supports:

Parameter | Type | Description
sort_key | str | CSV column or dynamic item key to sort by (e.g., "duration")
reverse | bool | If True, sort in descending order
key_max_value | dict | Filter out items where data_point[key] > limit
key_min_value | dict | Filter out items where data_point[key] < limit
select_n | int | Keep at most N data points (for debugging)
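
The filtering and sorting semantics in the table can be sketched on a plain dict-of-dicts. `filter_and_sort` is a hypothetical function written for this illustration; it covers only the min/max limits and the sort order, and deliberately leaves out select_n.

```python
# Toy sketch of limit-based filtering and sorting on a dict-of-dicts.
# `filter_and_sort` is hypothetical; select_n is intentionally left out.


def filter_and_sort(data, key_min_value=None, key_max_value=None,
                    sort_key=None, reverse=False):
    """Return the data IDs that pass the limits, optionally sorted."""
    key_min_value = key_min_value or {}
    key_max_value = key_max_value or {}

    def keep(point):
        # Keep a point only if it is within every min and max limit.
        above_min = all(point[k] >= v for k, v in key_min_value.items())
        below_max = all(point[k] <= v for k, v in key_max_value.items())
        return above_min and below_max

    ids = [data_id for data_id, point in data.items() if keep(point)]
    if sort_key is not None:
        ids.sort(key=lambda data_id: data[data_id][sort_key], reverse=reverse)
    return ids


data = {
    "utt1": {"duration": 4.2},
    "utt2": {"duration": 1.3},
    "utt3": {"duration": 12.0},
}
short_first = filter_and_sort(
    data, key_max_value={"duration": 10.0}, sort_key="duration"
)
long_first = filter_and_sort(
    data, key_min_value={"duration": 2.0}, sort_key="duration", reverse=True
)
```

Here `short_first` drops the 12-second utterance and orders the rest by ascending duration, which mirrors the ascending-sort branch of the recipe above.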

Dependencies

  • speechbrain.dataio.dataio.load_data_csv -- for parsing CSV files into dict-of-dicts format
  • speechbrain.utils.data_pipeline.DataPipeline -- the underlying computation graph engine
  • speechbrain.utils.data_pipeline.takes / provides -- decorators for pipeline function definition
  • torch.utils.data.Dataset -- base class providing PyTorch DataLoader compatibility
