Implementation:Speechbrain Speechbrain Prepare VoxPopuli

Knowledge Sources	SpeechBrain
Domains	ASR, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the VoxPopuli dataset for ASR tasks.

Description

This script prepares CSV manifest files for the VoxPopuli dataset, a large-scale multilingual speech corpus derived from European Parliament recordings. It processes the TSV metadata files, applies language-specific text normalization, and generates train/dev/test CSV manifests. The script supports parallel processing to handle the large dataset efficiently. It can also drop training utterances longer than a configurable threshold, set with remove_if_longer_than (default: 100 seconds).
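The flow described above (read TSV metadata, normalize the text, drop overly long training clips, write a CSV manifest) can be sketched as follows. This is a minimal illustration, not the SpeechBrain implementation: the TSV columns `id` and `text`, the output columns `ID`, `duration`, and `wrd`, the `durations` lookup, and the uppercase-and-collapse-whitespace normalization are all hypothetical stand-ins chosen for the example.

```python
import csv
import io


def build_manifest_rows(tsv_text, durations, max_seconds=None):
    """Turn TSV metadata into manifest rows, optionally dropping long clips.

    All column names here are hypothetical; the real script may differ.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        duration = durations[record["id"]]
        # Length filter: only applied when a threshold is given
        # (the page describes it as a training-set filter).
        if max_seconds is not None and duration > max_seconds:
            continue
        # Placeholder normalization: uppercase and collapse whitespace.
        # The real script applies language-specific rules instead.
        text = " ".join(record["text"].upper().split())
        rows.append({"ID": record["id"], "duration": duration, "wrd": text})
    return rows


def rows_to_csv(rows):
    """Serialize manifest rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["ID", "duration", "wrd"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample_tsv = "id\ttext\nutt1\thello   world\nutt2\ta very long speech\n"
sample_durations = {"utt1": 3.5, "utt2": 120.0}

# Training split: filter clips longer than 100 seconds.
train_rows = build_manifest_rows(sample_tsv, sample_durations, max_seconds=100)
# Dev split: no length filter.
dev_rows = build_manifest_rows(sample_tsv, sample_durations)
train_csv = rows_to_csv(train_rows)
```

In this sketch the filter is opt-in per split, which mirrors the description of the threshold applying to the training set only.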

Usage

Use this script when preparing VoxPopuli data for multilingual or language-specific ASR training. It must be run before any VoxPopuli-based training recipe in SpeechBrain.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/VoxPopuli/voxpopuli_prepare.py

Signature

def prepare_voxpopuli(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    skip_prep=False,
    language="en",
    remove_if_longer_than=100,
):

Import

# Requires recipes/VoxPopuli/ to be the working directory or on sys.path
from voxpopuli_prepare import prepare_voxpopuli

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the folder where the VoxPopuli dataset is stored (must include transcribed_data folder)
save_folder	str	Yes	Directory where the CSV files are written
train_tsv_file	str	No	Path to the train VoxPopuli .tsv file (default: None)
dev_tsv_file	str	No	Path to the dev VoxPopuli .tsv file (default: None)
test_tsv_file	str	No	Path to the test VoxPopuli .tsv file (default: None)
skip_prep	bool	No	If True, skips data preparation (default: False)
language	str	No	The language code for language-specific text normalization (default: "en")
remove_if_longer_than	int	No	Remove training audio files longer than this many seconds (default: 100)

Outputs

Name	Type	Description
train.csv	CSV file	Training set manifest with audio paths and transcriptions
dev.csv	CSV file	Development/validation set manifest
test.csv	CSV file	Test set manifest
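After preparation, a quick sanity check on the generated manifests can catch an empty or missing split early. The helper below is a hypothetical sketch, not part of SpeechBrain: it makes no assumption about the manifest column names and simply reports the row count and the header found in each of train.csv, dev.csv, and test.csv.

```python
import csv
import os


def summarize_manifest(csv_path):
    """Return (number of data rows, list of column names) for one manifest."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        num_rows = sum(1 for _ in reader)
        return num_rows, list(reader.fieldnames or [])


def summarize_splits(save_folder, splits=("train", "dev", "test")):
    """Summarize every split manifest that exists in save_folder."""
    summary = {}
    for split in splits:
        path = os.path.join(save_folder, f"{split}.csv")
        if os.path.isfile(path):
            summary[split] = summarize_manifest(path)
    return summary
```

Calling `summarize_splits("/path/to/output")` returns a dictionary keyed by split name; a split that is absent from the result was not written to that folder.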

Usage Examples

from voxpopuli_prepare import prepare_voxpopuli

# Prepare English VoxPopuli data
prepare_voxpopuli(
    data_folder="/path/to/VoxPopuli",
    save_folder="/path/to/output",
    language="en",
)

# Prepare French data with a custom length filter.
# A separate save_folder keeps these train/dev/test CSVs apart from the English ones.
prepare_voxpopuli(
    data_folder="/path/to/VoxPopuli",
    save_folder="/path/to/output_fr",
    language="fr",
    remove_if_longer_than=60,
)

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
