Implementation:Speechbrain Speechbrain CVSS Extract Code

From Leeroopedia
Revision as of 16:43, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Speechbrain_Speechbrain_CVSS_Extract_Code.md)


Knowledge Sources
Domains: Speech_Translation, Feature_Extraction
Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for extracting discrete speech units from audio using a HuBERT encoder and K-means clustering.

Description

This script applies K-means clustering to acoustic features extracted from a HuBERT encoder, producing the discrete speech unit codes used to train a speech-to-unit translation model. The pipeline has three steps: (1) a pretrained HuBERT encoder (a Wav2Vec2-style model) extracts continuous features from a specified hidden layer, (2) a pre-fitted K-means model quantizes those features into discrete cluster indices (speech codes), and (3) the resulting code sequences are stored alongside the original dataset metadata in JSON files. The script processes the train, valid, valid_small, and test splits of the CVSS (Common Voice Speech-to-Speech) dataset. It can skip an extraction that has already been completed by comparing the saved configuration with that of the current run. If the K-means checkpoint is not found locally, it can be downloaded automatically from the HuggingFace Hub.
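The quantization step (2) above can be illustrated with a toy nearest-centroid assignment. This is a minimal sketch, not the script's own code: it assumes the encoder features are already available as plain lists of floats, and the names `squared_distance` and `quantize`, as well as the toy centroids and frames, are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Map each feature frame to the index of its nearest centroid.

    The returned indices play the role of the discrete speech codes.
    """
    codes = []
    for frame in frames:
        distances = [squared_distance(frame, c) for c in centroids]
        codes.append(distances.index(min(distances)))
    return codes


# Toy stand-ins (assumptions): three 2-D centroids and four feature frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [0.8, 0.9]]

codes = quantize(frames, centroids)
print(codes)  # [0, 1, 2, 1]
```

In the actual recipe the features come from a HuBERT hidden layer and the centroids from the pre-fitted K-means checkpoint; only the assignment rule is shown here.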

Usage

The script is called as part of the CVSS S2ST (Speech-to-Speech Translation) recipe pipeline to prepare discrete speech unit targets for HiFi-GAN vocoder training. It is typically invoked programmatically from a training script rather than directly from the command line.

Code Reference

Source Location

Signature

def setup_logger():
    """Set up a logger with a log format and logging level."""
    ...

def get_device(use_cuda):
    """Determine and return the appropriate device for computation."""
    ...

def np_array(tensor):
    """Convert a Pytorch tensor to a Numpy array."""
    ...

def skip(splits, save_folder, conf):
    """Detects if the code extraction has been already done."""
    ...

def extract_cvss(
    data_folder,
    splits,
    kmeans_folder,
    encoder,
    layer,
    save_folder,
    sample_rate=16000,
    skip_extract=False,
):
    """Extract speech units for HiFi-GAN training on the CVSS datasets."""
    ...
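The idea behind `skip(splits, save_folder, conf)` can be sketched as follows. This is a hedged illustration only: the function name `is_already_extracted`, the file name `extract_conf.json`, and the use of JSON for the saved configuration are assumptions made for this example, and the real script may store and compare its configuration differently.

```python
import json
import tempfile
from pathlib import Path


def is_already_extracted(splits, save_folder, conf, conf_name="extract_conf.json"):
    """Return True only if every split manifest exists and the saved
    configuration matches the current one (hypothetical helper)."""
    save_folder = Path(save_folder)

    # Every requested split must already have its JSON manifest.
    for split in splits:
        if not (save_folder / f"{split}.json").is_file():
            return False

    # The configuration of the previous run must have been saved ...
    conf_path = save_folder / conf_name
    if not conf_path.is_file():
        return False

    # ... and must be identical to the configuration of this run.
    with open(conf_path, encoding="utf-8") as f:
        saved_conf = json.load(f)
    return saved_conf == conf


with tempfile.TemporaryDirectory() as tmp:
    conf = {"encoder": "facebook/hubert-base-ls960", "layer": 6, "splits": ["train"]}

    # Nothing has been written yet, so extraction cannot be skipped.
    first_check = is_already_extracted(["train"], tmp, conf)

    # Simulate a finished run: one manifest plus the saved configuration.
    Path(tmp, "train.json").write_text("{}", encoding="utf-8")
    Path(tmp, "extract_conf.json").write_text(json.dumps(conf), encoding="utf-8")
    second_check = is_already_extracted(["train"], tmp, conf)

    # A different layer means the earlier results no longer apply.
    third_check = is_already_extracted(["train"], tmp, {**conf, "layer": 9})

print(first_check, second_check, third_check)  # False True False
```

Comparing the whole configuration, rather than only checking that output files exist, avoids reusing codes that were extracted with a different encoder or layer.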

Import

from recipes.CVSS.S2ST.extract_code import extract_cvss

I/O Contract

Inputs

Name | Type | Required | Description
data_folder | str | Yes | Path to the original CVSS dataset
splits | list[str] | Yes | List of splits to prepare (e.g., ["train", "valid", "test"])
kmeans_folder | str | Yes | Path to folder with K-means model checkpoint (kmeans.ckpt)
encoder | str | Yes | URL or identifier for the HuBERT feature extractor model
layer | int | Yes | Hidden layer from which features are extracted
save_folder | str | Yes | Path where extracted speech unit codes are stored
sample_rate | int | No | Audio sample rate (default: 16000)
skip_extract | bool | No | If True, skip extraction entirely (default: False)

Outputs

Name | Type | Description
train.json | JSON file | Metadata with speech codes for training split
valid.json | JSON file | Metadata with speech codes for validation split
test.json | JSON file | Metadata with speech codes for test split
codes/ | directory | Directory containing per-utterance discrete code files
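After a run, the presence of these outputs can be checked with a small helper. The sketch below is illustrative: `expected_outputs` and `missing_outputs` are hypothetical names, and it assumes that the split manifests and the codes/ directory sit directly under `save_folder`, which the table above does not state explicitly.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(save_folder: str, splits: List[str]) -> Dict[str, Path]:
    """Build the output paths listed in the table above.

    Assumption: all outputs live directly under save_folder.
    """
    root = Path(save_folder)
    outputs = {split: root / f"{split}.json" for split in splits}
    outputs["codes"] = root / "codes"
    return outputs


def missing_outputs(save_folder: str, splits: List[str]) -> List[str]:
    """Return the names of expected outputs that do not exist on disk."""
    return [
        name
        for name, path in expected_outputs(save_folder, splits).items()
        if not path.exists()
    ]


paths = expected_outputs("save", ["train", "valid", "test"])
for name, path in paths.items():
    print(f"{name}: {path}")
```

Calling `missing_outputs("save", ["train", "valid", "test"])` after extraction returns an empty list when every expected file and the codes directory are present.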

Usage Examples

from recipes.CVSS.S2ST.extract_code import extract_cvss

extract_cvss(
    data_folder="data/CVSS/",
    splits=["train", "valid", "test"],
    kmeans_folder="./Quantization/results/kmeans/4321/save",
    encoder="facebook/hubert-base-ls960",
    layer=6,
    save_folder="save/",
)
