
Implementation:Speechbrain Speechbrain Prepare VoxPopuli



Knowledge Sources
Domains ASR, Data_Preparation
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the VoxPopuli dataset for ASR tasks.

Description

This script prepares CSV manifest files for the VoxPopuli dataset, a large-scale multilingual speech corpus derived from European Parliament recordings. It processes TSV metadata files, applies language-specific text normalization, and generates train/dev/test CSV manifests. The script supports parallel processing to handle the large dataset efficiently, and it can drop excessively long audio segments from the training set: the threshold is set by remove_if_longer_than and defaults to 100 seconds.
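The flow described above (read TSV metadata, normalize text, drop long segments, write a CSV manifest) can be illustrated with a minimal standard-library sketch. It is not the SpeechBrain implementation: the column names (id, text, ID, duration, wrd), the durations lookup, and the helper names are assumptions made for illustration, and normalize_text is only a placeholder for the real language-specific rules.

```python
import csv
import io


def normalize_text(text, language="en"):
    """Placeholder normalization: lowercase and collapse whitespace.

    The real script applies language-specific rules not reproduced here;
    the `language` argument is accepted but unused in this sketch.
    """
    return " ".join(text.lower().split())


def tsv_to_manifest_rows(tsv_text, durations, language="en", max_seconds=None):
    """Turn TSV metadata into manifest rows, optionally dropping long clips.

    `durations` maps an utterance id to its length in seconds (hypothetical;
    a real pipeline would measure this from the audio files).
    """
    rows = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for record in reader:
        duration = durations[record["id"]]
        if max_seconds is not None and duration > max_seconds:
            continue  # skip segments longer than the threshold
        rows.append(
            {
                "ID": record["id"],
                "duration": duration,
                "wrd": normalize_text(record["text"], language),
            }
        )
    return rows


def write_manifest(rows, handle):
    """Write manifest rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=["ID", "duration", "wrd"])
    writer.writeheader()
    writer.writerows(rows)


# Tiny in-memory example: utt2 exceeds the 100-second limit and is dropped.
tsv = "id\ttext\nutt1\tHello   World\nutt2\tA very long speech\n"
rows = tsv_to_manifest_rows(tsv, {"utt1": 3.2, "utt2": 140.0}, max_seconds=100)
out = io.StringIO()
write_manifest(rows, out)
print(out.getvalue())
```

In this sketch the length filter is applied whenever max_seconds is given; according to the description, the actual script applies it to the training set only.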

Usage

Use this script when preparing VoxPopuli data for multilingual or language-specific ASR training. It must be run before any VoxPopuli-based training recipe in SpeechBrain.

Code Reference

Source Location

Signature

def prepare_voxpopuli(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    skip_prep=False,
    language="en",
    remove_if_longer_than=100,
):

Import

from voxpopuli_prepare import prepare_voxpopuli

I/O Contract

Inputs

Name | Type | Required | Description
data_folder | str | Yes | Path to the folder where the VoxPopuli dataset is stored (must include the transcribed_data folder)
save_folder | str | Yes | Directory in which to store the CSV files
train_tsv_file | str | No | Path to the Train VoxPopuli .tsv file
dev_tsv_file | str | No | Path to the Dev VoxPopuli .tsv file
test_tsv_file | str | No | Path to the Test VoxPopuli .tsv file
skip_prep | bool | No | If True, skips data preparation (default: False)
language | str | No | Language code used for language-specific text normalization (default: "en")
remove_if_longer_than | int | No | Remove training audio files longer than this many seconds (default: 100)
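Because data_folder must contain a transcribed_data folder and the TSV paths are optional, a small pre-flight check can catch path mistakes before a long preparation run. The sketch below is a hypothetical helper (check_voxpopuli_inputs is not part of SpeechBrain) that relies only on the requirements listed in the table above.

```python
import os
import tempfile


def check_voxpopuli_inputs(data_folder, tsv_files=()):
    """Return a list of problems found; an empty list means the inputs look fine.

    Hypothetical helper: it only checks what the input table states, namely
    that data_folder contains transcribed_data and that any TSV path that
    was actually provided points to an existing file.
    """
    problems = []
    if not os.path.isdir(os.path.join(data_folder, "transcribed_data")):
        problems.append(f"missing transcribed_data folder in {data_folder}")
    for tsv in tsv_files:
        # None means "not provided", which is allowed for the optional TSVs.
        if tsv is not None and not os.path.isfile(tsv):
            problems.append(f"TSV file not found: {tsv}")
    return problems


# Demonstration on a temporary directory standing in for the dataset root.
with tempfile.TemporaryDirectory() as root:
    print(check_voxpopuli_inputs(root))  # reports the missing folder
    os.makedirs(os.path.join(root, "transcribed_data"))
    print(check_voxpopuli_inputs(root, tsv_files=[None]))  # no problems
```

Returning a list rather than raising on the first issue lets a caller report every path problem at once.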

Outputs

Name | Type | Description
train.csv | CSV file | Training set manifest with audio paths and transcriptions
dev.csv | CSV file | Development/validation set manifest
test.csv | CSV file | Test set manifest
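After preparation, it can be useful to confirm that all three manifests were written before launching training. The following sketch assumes only the output file names listed above; missing_manifests is a hypothetical helper, not a SpeechBrain function.

```python
import os
import tempfile

# File names taken from the outputs table above.
EXPECTED_MANIFESTS = ("train.csv", "dev.csv", "test.csv")


def missing_manifests(save_folder):
    """Return the expected manifest names that are absent from save_folder."""
    return [
        name
        for name in EXPECTED_MANIFESTS
        if not os.path.isfile(os.path.join(save_folder, name))
    ]


# Demonstration on a temporary directory standing in for save_folder.
with tempfile.TemporaryDirectory() as folder:
    # Create only the training manifest, then see what is still missing.
    with open(os.path.join(folder, "train.csv"), "w") as handle:
        handle.write("placeholder\n")
    print(missing_manifests(folder))
```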

Usage Examples

from voxpopuli_prepare import prepare_voxpopuli

# Prepare English VoxPopuli data
prepare_voxpopuli(
    data_folder="/path/to/VoxPopuli",
    save_folder="/path/to/output",
    language="en",
)

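# Prepare French data with a custom length filter.
# A separate save_folder keeps these manifests from colliding with the
# English train/dev/test CSV files written above.
prepare_voxpopuli(
    data_folder="/path/to/VoxPopuli",
    save_folder="/path/to/output_fr",
    language="fr",
    remove_if_longer_than=60,
)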
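When several languages are prepared in one session, giving each its own output folder avoids mixing manifests that share the same file names. The sketch below builds one set of keyword arguments per language. build_prep_kwargs is a hypothetical helper, and passing a single data_folder for all languages is an assumption that may not match a given dataset layout.

```python
import os


def build_prep_kwargs(data_folder, output_root, languages, remove_if_longer_than=100):
    """Build one keyword-argument dict per language for prepare_voxpopuli.

    Hypothetical helper: each language gets its own save_folder under
    output_root so the train/dev/test CSV files do not overwrite each other.
    """
    return [
        {
            "data_folder": data_folder,
            "save_folder": os.path.join(output_root, lang),
            "language": lang,
            "remove_if_longer_than": remove_if_longer_than,
        }
        for lang in languages
    ]


for kwargs in build_prep_kwargs("/path/to/VoxPopuli", "/path/to/output", ["en", "fr"]):
    # In a real run this line would be: prepare_voxpopuli(**kwargs)
    print(kwargs)
```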
