Implementation:Speechbrain Speechbrain Prepare MEDIA
| Knowledge Sources | |
|---|---|
| Domains | Spoken_Language_Understanding, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the SpeechBrain library, for preparing the MEDIA dataset for spoken language understanding.
Description
This script creates CSV data manifest files for the MEDIA dataset, a French spoken language understanding corpus of hotel reservation dialogues. It processes the ELRA-distributed transcription/annotation archive (S0272) and audio archive (E0024), parses the XML annotation files, extracts concept labels, and generates train/dev/test CSV splits. The script supports both the full and the relaxed concept annotation schemes, and can optionally process a secondary test set (test2).
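The XML-to-CSV flow described above can be illustrated with a minimal sketch. The XML snippet, its tag and attribute names (`dialogue`, `turn`, `segment`, `concept`), and the CSV column names (`ID`, `wrd`, `concepts`) are all invented for illustration; they are not the actual MEDIA annotation schema or the columns written by `media_prepare.py`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical annotation snippet: tag and attribute names are assumptions,
# not the real MEDIA schema.
SAMPLE_XML = """
<dialogue id="d1">
  <turn id="d1_t1">
    <segment concept="command">je voudrais réserver</segment>
    <segment concept="room-type">une chambre double</segment>
  </turn>
  <turn id="d1_t2">
    <segment concept="null">merci</segment>
  </turn>
</dialogue>
"""


def xml_to_rows(xml_text):
    """Turn each <turn> into one manifest row with its words and concept labels."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for turn in root.iter("turn"):
        segments = turn.findall("segment")
        words = " ".join(seg.text.strip() for seg in segments)
        concepts = " ".join(seg.get("concept") for seg in segments)
        rows.append({"ID": turn.get("id"), "wrd": words, "concepts": concepts})
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["ID", "wrd", "concepts"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

The real script additionally handles audio extraction and split assignment, which this sketch leaves out.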
Usage
Use this when preparing the MEDIA dataset for spoken language understanding training with SpeechBrain recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/MEDIA/media_prepare.py
Signature
def prepare_media(
    data_folder,
    save_folder,
    channels_path,
    concepts_path,
    skip_wav=True,
    method="slu",
    task="full",
    skip_prep=False,
    process_test2=False,
):
Import
from recipes.MEDIA.media_prepare import prepare_media
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path where ELRA folders S0272 and E0024 are stored |
| save_folder | str | Yes | Path where CSVs and preprocessed WAVs will be stored |
| channels_path | str | Yes | Path to the channels.csv file |
| concepts_path | str | Yes | Path to the concepts_full_relax.csv file |
| skip_wav | bool | No | Skip WAV file extraction if already done (default: True) |
| method | str | No | Concept annotation scheme: "full" or "relax". The signature default is "slu", which is neither of these values, so pass the scheme explicitly |
| task | str | No | Task type: "full" or other variants (default: "full") |
| skip_prep | bool | No | If True, skip data preparation (default: False) |
| process_test2 | bool | No | Whether to process the secondary test set (default: False) |
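Because preparation depends on several paths being in place, a small pre-flight check can catch mistakes before any processing starts. The sketch below is not part of SpeechBrain; `missing_media_inputs` is a hypothetical helper, and it assumes the ELRA folders sit directly under `data_folder` with exactly the names S0272 and E0024.

```python
from pathlib import Path


def missing_media_inputs(data_folder, channels_path, concepts_path):
    """Return the required input paths that do not exist (empty list if all are present)."""
    data_folder = Path(data_folder)
    missing = []
    # Assumption: both ELRA archives are unpacked directly under data_folder.
    for name in ("S0272", "E0024"):
        if not (data_folder / name).is_dir():
            missing.append(str(data_folder / name))
    # The two auxiliary CSV files are passed as explicit paths.
    for file_path in (channels_path, concepts_path):
        if not Path(file_path).is_file():
            missing.append(str(file_path))
    return missing
```

A caller could raise `FileNotFoundError` when the returned list is non-empty, and otherwise go on to call `prepare_media`.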
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV | Training split manifest with audio paths, transcriptions, and concept annotations |
| dev.csv | CSV | Development split manifest |
| test.csv | CSV | Test split manifest |
| test2.csv | CSV | Optional secondary test split manifest (if process_test2=True) |
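Once the manifests exist, a quick sanity check is to list their columns and row counts. The following sketch uses only the standard library; `summarize_manifest` and `summarize_splits` are hypothetical helpers, and the code assumes each manifest has a header row and sits directly under `save_folder` (adjust the path if the script writes to a subfolder).

```python
import csv
from pathlib import Path


def summarize_manifest(csv_path):
    """Return (column names, number of data rows) for one CSV manifest."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        n_rows = sum(1 for _ in reader)
    return columns, n_rows


def summarize_splits(save_folder, splits=("train", "dev", "test")):
    """Summarize every split manifest that is present; absent splits are skipped."""
    summary = {}
    for split in splits:
        path = Path(save_folder) / f"{split}.csv"
        if path.is_file():
            summary[split] = summarize_manifest(path)
    return summary
```

Passing `splits=("train", "dev", "test", "test2")` would also cover the optional secondary test manifest when it has been generated.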
Usage Examples
from recipes.MEDIA.media_prepare import prepare_media

prepare_media(
    data_folder="/path/to/MEDIA",
    save_folder="/path/to/output",
    channels_path="/path/to/channels.csv",
    concepts_path="/path/to/concepts_full_relax.csv",
    skip_wav=False,
    task="full",
)