Implementation:Speechbrain Speechbrain Prepare VoxPopuli
| Knowledge Sources | |
|---|---|
| Domains | ASR, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the SpeechBrain library, for preparing the VoxPopuli dataset for ASR tasks.
Description
This script prepares CSV manifest files for the VoxPopuli dataset, a large-scale multilingual speech corpus derived from European Parliament recordings. It processes the TSV metadata files, applies language-specific text normalization, and generates train/dev/test CSV manifests. The script supports parallel processing to handle the large dataset efficiently, and it can drop audio segments from the training set that exceed a configurable duration (100 seconds by default, controlled by remove_if_longer_than).
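The length filter described above can be pictured with a minimal sketch. This is not the script's actual implementation: the row layout and the `id` and `duration` keys are hypothetical, chosen only to illustrate how a threshold like `remove_if_longer_than` might be applied to training rows.

```python
def filter_long_segments(rows, max_seconds=100):
    """Keep only rows whose duration does not exceed max_seconds.

    `rows` is a list of dicts with a hypothetical "duration" key (seconds).
    """
    kept = []
    for row in rows:
        # A segment exactly at the threshold is kept; only longer ones are dropped.
        if float(row["duration"]) <= max_seconds:
            kept.append(row)
    return kept


# Toy metadata rows (illustrative values, not taken from VoxPopuli).
rows = [
    {"id": "utt1", "duration": 12.4},
    {"id": "utt2", "duration": 131.0},
    {"id": "utt3", "duration": 100.0},
]

train_rows = filter_long_segments(rows, max_seconds=100)
print([r["id"] for r in train_rows])  # prints ['utt1', 'utt3']
```

Applying such a filter only to the training split keeps the dev and test sets intact, so evaluation results stay comparable across runs.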
Usage
Use this script when preparing VoxPopuli data for multilingual or language-specific ASR training. It must be run before any VoxPopuli-based training recipe in SpeechBrain.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/VoxPopuli/voxpopuli_prepare.py
Signature
def prepare_voxpopuli(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    skip_prep=False,
    language="en",
    remove_if_longer_than=100,
):
Import
from voxpopuli_prepare import prepare_voxpopuli
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the VoxPopuli dataset is stored (must include the transcribed_data folder) |
| save_folder | str | Yes | Directory in which to store the generated CSV files |
| train_tsv_file | str | No | Path to the Train VoxPopuli .tsv file |
| dev_tsv_file | str | No | Path to the Dev VoxPopuli .tsv file |
| test_tsv_file | str | No | Path to the Test VoxPopuli .tsv file |
| skip_prep | bool | No | If True, skips data preparation (default: False) |
| language | str | No | The language code for language-specific text normalization (default: "en") |
| remove_if_longer_than | int | No | Remove training audio files longer than this many seconds (default: 100) |
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV file | Training set manifest with audio paths and transcriptions |
| dev.csv | CSV file | Development/validation set manifest |
| test.csv | CSV file | Test set manifest |
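After preparation, it can be useful to check that a manifest was written and contains rows. The sketch below uses only the Python standard library and makes no assumption about column names: it reads whatever header the file has. The `summarize_manifest` helper and the example path are hypothetical and not part of SpeechBrain.

```python
import csv
import os


def summarize_manifest(csv_path):
    """Return (column_names, number_of_data_rows) for a CSV manifest."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Accessing fieldnames consumes the header line.
        columns = list(reader.fieldnames or [])
        n_rows = sum(1 for _ in reader)
    return columns, n_rows


# Hypothetical output location; matches save_folder in the usage examples.
manifest = "/path/to/output/train.csv"
if os.path.exists(manifest):
    columns, n_rows = summarize_manifest(manifest)
    print(f"{manifest}: {n_rows} rows, columns = {columns}")
```

Running the same check on dev.csv and test.csv gives a quick sanity check that all three splits were generated.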
Usage Examples
from voxpopuli_prepare import prepare_voxpopuli

# Prepare English VoxPopuli data
prepare_voxpopuli(
    data_folder="/path/to/VoxPopuli",
    save_folder="/path/to/output",
    language="en",
)

# Prepare French data with custom length filter
prepare_voxpopuli(
    data_folder="/path/to/VoxPopuli",
    save_folder="/path/to/output",
    language="fr",
    remove_if_longer_than=60,
)
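The examples above leave the optional TSV arguments unset. When the split files live in a non-standard place, they can be passed explicitly. The following is a sketch under stated assumptions: the `build_prepare_kwargs` helper and the TSV file names (`train.tsv`, `dev.tsv`, `test.tsv`) are hypothetical and should be replaced with the names used in your copy of the dataset. The call itself is guarded so the snippet does nothing if the recipe script or the data folder is unavailable.

```python
import os


def build_prepare_kwargs(data_folder, save_folder, language="en"):
    """Collect keyword arguments for prepare_voxpopuli with explicit TSV paths."""
    # Hypothetical file names; adjust them to your local dataset layout.
    return {
        "data_folder": data_folder,
        "save_folder": save_folder,
        "train_tsv_file": os.path.join(data_folder, "train.tsv"),
        "dev_tsv_file": os.path.join(data_folder, "dev.tsv"),
        "test_tsv_file": os.path.join(data_folder, "test.tsv"),
        "language": language,
        "remove_if_longer_than": 100,
    }


kwargs = build_prepare_kwargs("/path/to/VoxPopuli", "/path/to/output")

try:
    # Requires recipes/VoxPopuli to be on the Python import path.
    from voxpopuli_prepare import prepare_voxpopuli
except ImportError:
    prepare_voxpopuli = None

if prepare_voxpopuli is not None and os.path.isdir(kwargs["data_folder"]):
    prepare_voxpopuli(**kwargs)
```

Keeping the arguments in a dictionary makes it easy to log or reuse the exact configuration for each language.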