Implementation:Speechbrain Speechbrain CVSS Extract Code
| Knowledge Sources | |
|---|---|
| Domains | Speech_Translation, Feature_Extraction |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the SpeechBrain library, for extracting discrete speech units from audio using HuBERT features and K-means clustering.
Description
This script applies K-means clustering over acoustic features extracted from a HuBERT encoder to produce discrete speech unit codes for training a speech-to-unit translation model. The pipeline works as follows: (1) a pretrained HuBERT model (Wav2Vec2) extracts continuous features from a specified hidden layer, (2) a pre-fitted K-means model quantizes those features into discrete cluster indices (speech codes), and (3) the resulting code sequences are stored alongside the original dataset metadata in JSON files. The script supports skipping previously completed extractions by checking saved configuration against the current run. It processes train, valid, valid_small, and test splits of the CVSS (Common Voice Speech-to-Speech) dataset. The K-means checkpoint can be automatically downloaded from HuggingFace Hub if not found locally.
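The quantization step (2) can be illustrated with a minimal sketch. It assumes the features are a `(frames, dim)` array and the K-means model is a `(clusters, dim)` array of centroids. The `quantize` function and the toy arrays are hypothetical stand-ins and are not taken from the script, which uses a real HuBERT encoder and a pre-fitted K-means checkpoint.

```python
import numpy as np


def quantize(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each feature frame to the index of its nearest centroid."""
    # Broadcast (T, 1, D) against (1, K, D) to get per-pair differences (T, K, D)
    diffs = features[:, None, :] - centroids[None, :, :]
    # Squared Euclidean distance from every frame to every centroid: (T, K)
    dists = np.sum(diffs ** 2, axis=-1)
    # The speech code for a frame is the index of the closest centroid
    return np.argmin(dists, axis=1)


# Toy stand-ins (assumption): 3 centroids and 4 frames of 2-dimensional features
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [0.2, 0.0]])

codes = quantize(features, centroids)
print(codes.tolist())  # [0, 1, 2, 0]
```

Each utterance therefore becomes one integer per feature frame, and that integer sequence is what gets stored as the speech code sequence.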
Usage
Called as part of the CVSS S2ST (Speech-to-Speech Translation) recipe pipeline to prepare discrete speech unit targets for HiFi-GAN vocoder training. Typically invoked programmatically from a training script rather than directly from the command line.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/CVSS/S2ST/extract_code.py
Signature
def setup_logger():
    """Set up a logger with a log format and logging level."""
    ...

def get_device(use_cuda):
    """Determine and return the appropriate device for computation."""
    ...

def np_array(tensor):
    """Convert a Pytorch tensor to a Numpy array."""
    ...

def skip(splits, save_folder, conf):
    """Detects if the code extraction has been already done."""
    ...

def extract_cvss(
    data_folder,
    splits,
    kmeans_folder,
    encoder,
    layer,
    save_folder,
    sample_rate=16000,
    skip_extract=False,
):
    """Extract speech units for HiFi-GAN training on the CVSS datasets."""
    ...
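The `skip` helper is described as detecting whether extraction has already been done by comparing a saved configuration with the current one. The sketch below shows one way such a check could work. The function name `already_extracted`, the file name `extract_conf.json`, and the JSON serialization are all assumptions made for illustration; the actual script may store and compare its configuration differently.

```python
import json
import tempfile
from pathlib import Path


def already_extracted(splits, save_folder, conf, conf_name="extract_conf.json"):
    """Return True only if every split manifest exists and the saved config matches."""
    save_folder = Path(save_folder)
    # Every requested split must already have its JSON manifest
    if any(not (save_folder / f"{split}.json").is_file() for split in splits):
        return False
    # A missing config file means there is no previous run to compare against
    conf_file = save_folder / conf_name  # hypothetical file name
    if not conf_file.is_file():
        return False
    with open(conf_file, encoding="utf-8") as f:
        saved = json.load(f)
    # Any changed option (layer, encoder, ...) invalidates the previous extraction
    return saved == conf


with tempfile.TemporaryDirectory() as tmp:
    conf = {"layer": 6, "encoder": "facebook/hubert-base-ls960", "sample_rate": 16000}
    first_run = already_extracted(["train"], tmp, conf)

    # Simulate a completed extraction
    Path(tmp, "train.json").write_text("{}", encoding="utf-8")
    Path(tmp, "extract_conf.json").write_text(json.dumps(conf), encoding="utf-8")

    same_conf = already_extracted(["train"], tmp, conf)
    changed_conf = already_extracted(["train"], tmp, {**conf, "layer": 9})

print(first_run, same_conf, changed_conf)  # False True False
```

Comparing the whole configuration rather than only checking for output files means that changing the layer or encoder forces a fresh extraction, so stale codes are not silently reused.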
Import
from recipes.CVSS.S2ST.extract_code import extract_cvss
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the original CVSS dataset |
| splits | list[str] | Yes | List of splits to prepare (e.g., ["train", "valid", "test"]) |
| kmeans_folder | str | Yes | Path to folder with K-means model checkpoint (kmeans.ckpt) |
| encoder | str | Yes | URL or identifier for the HuBERT feature extractor model |
| layer | int | Yes | Hidden layer from which features are extracted |
| save_folder | str | Yes | Path where extracted speech unit codes are stored |
| sample_rate | int | No | Audio sample rate (default: 16000) |
| skip_extract | bool | No | If True, skip extraction entirely (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| train.json | JSON file | Metadata with speech codes for training split |
| valid.json | JSON file | Metadata with speech codes for validation split |
| test.json | JSON file | Metadata with speech codes for test split |
| codes/ | directory | Directory containing per-utterance discrete code files |
Usage Examples
from recipes.CVSS.S2ST.extract_code import extract_cvss
extract_cvss(
    data_folder="data/CVSS/",
    splits=["train", "valid", "test"],
    kmeans_folder="./Quantization/results/kmeans/4321/save",
    encoder="facebook/hubert-base-ls960",
    layer=6,
    save_folder="save/",
)