Implementation:Speechbrain Speechbrain CVSS Extract Code

Knowledge Sources	SpeechBrain
Domains	Speech_Translation, Feature_Extraction
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for extracting discrete speech units from audio using HuBERT features and K-means clustering.

Description

This script applies a pre-fitted K-means model to acoustic features extracted from a HuBERT encoder, producing discrete speech unit codes for training a speech-to-unit translation model. The pipeline has three steps: (1) a pretrained HuBERT model, loaded through the Wav2Vec2 wrapper class, extracts continuous features from a specified hidden layer; (2) the K-means model quantizes each feature frame into a discrete cluster index (a speech code); and (3) the resulting code sequences are stored alongside the original dataset metadata in JSON files.

The script can skip extraction when a previous run has already completed, which it detects by comparing the saved configuration against the current one. It processes the train, valid, valid_small, and test splits of the CVSS (Common Voice Speech-to-Speech) dataset. If the K-means checkpoint is not found locally, it can be downloaded automatically from the HuggingFace Hub.
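Step (2) of the pipeline, quantization, amounts to assigning each feature frame to its nearest K-means centroid. The following is a minimal NumPy sketch of that idea, not the script's actual code: the function name `quantize`, the codebook size (8), and the feature dimension (16) are illustrative assumptions, and random arrays stand in for real HuBERT hidden states and a real fitted codebook.

```python
import numpy as np


def quantize(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each feature frame to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid,
    # giving an array of shape (num_frames, num_clusters).
    diffs = features[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # The discrete speech code of a frame is the index of the closest centroid.
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, 16))   # stand-in for a pre-fitted K-means codebook
features = rng.normal(size=(50, 16))   # stand-in for one utterance's HuBERT features
codes = quantize(features, centroids)  # one integer code per frame
```

Each utterance thus becomes a sequence of integers in the range `[0, num_clusters)`, with one code per encoder frame.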

Usage

Called as part of the CVSS S2ST (Speech-to-Speech Translation) recipe pipeline to prepare discrete speech unit targets for HiFi-GAN vocoder training. Typically invoked programmatically from a training script rather than directly from the command line.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/CVSS/S2ST/extract_code.py

Signature

def setup_logger():
    """Set up a logger with a log format and logging level."""
    ...

def get_device(use_cuda):
    """Determine and return the appropriate device for computation."""
    ...

def np_array(tensor):
    """Convert a Pytorch tensor to a Numpy array."""
    ...

def skip(splits, save_folder, conf):
    """Detects if the code extraction has been already done."""
    ...

def extract_cvss(
    data_folder,
    splits,
    kmeans_folder,
    encoder,
    layer,
    save_folder,
    sample_rate=16000,
    skip_extract=False,
):
    """Extract speech units for HiFi-GAN training on the CVSS datasets."""
    ...
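The `skip` helper decides whether extraction can be bypassed. A rough sketch of that kind of check is shown here under stated assumptions: the function name `already_extracted`, the configuration file name `extract_conf.json`, and the use of JSON for the saved configuration are all hypothetical, since this page does not show how the real script stores its configuration.

```python
import json
import tempfile
from pathlib import Path


def already_extracted(splits, save_folder, conf, conf_name="extract_conf.json"):
    """Return True only if every split file exists and the saved config matches."""
    save_folder = Path(save_folder)
    # A missing split JSON means a previous run did not finish.
    if any(not (save_folder / f"{split}.json").is_file() for split in splits):
        return False
    conf_path = save_folder / conf_name  # hypothetical file name
    if not conf_path.is_file():
        return False
    # A changed setting (e.g. a different layer) invalidates the old output.
    with open(conf_path) as f:
        return json.load(f) == conf


with tempfile.TemporaryDirectory() as tmp:
    conf = {"layer": 6, "sample_rate": 16000}
    fresh = already_extracted(["train"], tmp, conf)   # nothing saved yet
    (Path(tmp) / "train.json").write_text("{}")
    (Path(tmp) / "extract_conf.json").write_text(json.dumps(conf))
    done = already_extracted(["train"], tmp, conf)    # files and config match
    changed = already_extracted(["train"], tmp, {**conf, "layer": 9})
```

Comparing the configuration as well as checking for output files guards against silently reusing codes that were extracted with different settings.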

Import

from recipes.CVSS.S2ST.extract_code import extract_cvss

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the original CVSS dataset
splits	list[str]	Yes	List of splits to prepare (e.g., ["train", "valid", "test"])
kmeans_folder	str	Yes	Path to folder with K-means model checkpoint (kmeans.ckpt)
encoder	str	Yes	URL or identifier for the HuBERT feature extractor model
layer	int	Yes	Hidden layer from which features are extracted
save_folder	str	Yes	Path where extracted speech unit codes are stored
sample_rate	int	No	Audio sample rate (default: 16000)
skip_extract	bool	No	If True, skip extraction entirely (default: False)

Outputs

Name	Type	Description
train.json	JSON file	Metadata with speech codes for training split
valid.json	JSON file	Metadata with speech codes for validation split
test.json	JSON file	Metadata with speech codes for test split
codes/	directory	Directory containing per-utterance discrete code files
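As an illustration of the per-utterance files under `codes/`, the sketch below writes one code sequence to disk and reads it back. The on-disk format is not specified on this page, so the use of NumPy `.npy` files and the file name `utt_0001.npy` are assumptions made only for this example.

```python
import tempfile
from pathlib import Path

import numpy as np

# A made-up code sequence for one utterance (one integer per encoder frame).
codes = np.array([12, 12, 47, 3, 3, 3, 91], dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "codes"
    code_dir.mkdir()
    path = code_dir / "utt_0001.npy"  # hypothetical file name
    np.save(path, codes)              # store the discrete units
    restored = np.load(path)          # load them again for training
```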

Usage Examples

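# Run from the SpeechBrain repository root so that `recipes` is importable.
from recipes.CVSS.S2ST.extract_code import extract_cvss

extract_cvss(
    data_folder="data/CVSS/",  # original CVSS dataset
    splits=["train", "valid", "test"],
    kmeans_folder="./Quantization/results/kmeans/4321/save",  # holds kmeans.ckpt
    encoder="facebook/hubert-base-ls960",  # HuBERT feature extractor
    layer=6,  # hidden layer from which features are taken
    save_folder="save/",  # JSON files and codes/ are written here
)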

Related Pages

Principle:Speechbrain_Speechbrain_Speech_To_Unit_Translation
