
Implementation:Speechbrain Speechbrain Train Ngram LM

From Leeroopedia


Knowledge Sources
Domains Language_Modeling, Training
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for preparing the corpus and the KenLM command used to train an n-gram language model for LibriSpeech ASR.

Description

This recipe prepares the data and prints the command needed to train a KenLM n-gram language model for use with LibriSpeech ASR systems. It does not run KenLM itself. The pipeline performs the following steps:

(1) Downloads the LibriSpeech LM training text from OpenSLR.
(2) Parses the LibriSpeech data splits via librispeech_prepare.
(3) Downloads the vocabulary file.
(4) Prepares a character-level lexicon for k2 integration.
(5) Creates the lang directory for k2.
(6) Calls dataprep_lm_training, which merges the CSV transcripts with the external LM corpus into a single text file, deduplicates lines, and prints the lmplz command that the user must run manually to build the ARPA n-gram model.

The script exits after printing the KenLM command. The dataprep_lm_training function supports a configurable n-gram order and per-order pruning thresholds.
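
The merge-and-deduplicate part of step (6) can be pictured with the sketch below. It is a minimal illustration and not the recipe's actual code: the helper name merge_corpus is hypothetical, and the external corpus is assumed to be plain text with one sentence per line. Only the wrd column name and the idea of a single deduplicated output file come from this page.

```python
import csv
import os
import tempfile


def merge_corpus(csv_files, external_corpora, output_path, text_column="wrd"):
    """Write unique lines from CSV transcripts and text corpora to output_path.

    Hypothetical helper: illustrates merging and deduplication only.
    Returns the number of lines written.
    """
    seen = set()
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # Transcripts: one row per utterance, text in the `wrd` column.
        for path in csv_files:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    line = row[text_column].strip()
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
                        written += 1
        # External corpora: assumed plain text, one sentence per line.
        for path in external_corpora:
            with open(path, encoding="utf-8") as f:
                for raw in f:
                    line = raw.strip()
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
                        written += 1
    return written


# Small self-contained demo on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    txt_path = os.path.join(tmp, "extra.txt")
    out_path = os.path.join(tmp, "corpus.txt")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("ID,wrd\n1,HELLO WORLD\n2,HELLO WORLD\n3,GOOD MORNING\n")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("GOOD MORNING\nA NEW SENTENCE\n")
    n_lines = merge_corpus([csv_path], [txt_path], out_path)
    with open(out_path, encoding="utf-8") as f:
        merged = f.read().splitlines()
```

Keeping the seen lines in a set preserves first-occurrence order in the output file, at the cost of holding every unique line in memory.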

Usage

Run as a recipe script. It prints the KenLM lmplz command, which must then be run manually to produce the final ARPA language model file. KenLM must be compiled and its lmplz binary available on the system PATH.
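
Because the final step depends on an external binary, it can help to check for lmplz before running the recipe. The snippet below is a small sketch using only the Python standard library. The helper name kenlm_available is hypothetical and is not part of SpeechBrain or KenLM.

```python
import shutil


def kenlm_available(binary="lmplz"):
    """Return True if the given KenLM binary can be found on PATH."""
    return shutil.which(binary) is not None


if not kenlm_available():
    print("lmplz not found on PATH; build KenLM before running the printed command.")
```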

Code Reference

Source Location

Signature

def download_librispeech_lm_training_text(destination):
    """Download librispeech lm training text and unpack it."""
    ...

def dataprep_lm_training(
    lm_dir,
    output_arpa,
    csv_files,
    external_lm_corpus,
    vocab_file,
    arpa_order=3,
    prune_level=[0, 1, 2],
):
    """Prepare lm txt corpus file for lm training with kenlm.
    Prints the lmplz command and exits."""
    ...

Import

python train_ngram.py hparams/train.yaml --data_folder /path/to/LibriSpeech

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| hparams_file | str | Yes | Path to YAML hyperparameter file |
| --data_folder | str | Yes | Path to LibriSpeech dataset root |
| lm_dir | str | Yes | Directory to store LM text corpus |
| output_arpa | str | Yes | Target path for the output ARPA LM file |
| csv_files | list[str] | Yes | CSV files with transcripts (wrd column) |
| external_lm_corpus | list[str] | Yes | Paths to external LM text corpora |
| vocab_file | str | Yes | Path to vocabulary file for pruning unknown n-grams |
| arpa_order | int | No | Order of the ARPA LM (default: 3) |
| prune_level | list[int] | No | Pruning thresholds per order (default: [0, 1, 2]) |
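
To make the role of arpa_order, prune_level, and vocab_file concrete, the sketch below assembles a command string of the same shape as the one shown under Usage Examples. It is an illustration under assumptions: build_lmplz_command is a hypothetical helper, not the function in the recipe, and the exact string the recipe prints may differ. The lmplz flags used (-o, --prune, --limit_vocab_file) are the ones that appear on this page.

```python
import shlex


def build_lmplz_command(corpus, output_arpa, vocab_file,
                        arpa_order=3, prune_level=(0, 1, 2)):
    """Assemble a shell command string for KenLM's lmplz (hypothetical helper)."""
    # One pruning threshold per n-gram order, space-separated.
    prune = " ".join(str(p) for p in prune_level)
    return (
        f"lmplz -o {arpa_order} --prune {prune} "
        f"--limit_vocab_file {shlex.quote(vocab_file)} "
        f"< {shlex.quote(corpus)} "
        "| sed '1,20s/<unk>/<UNK>/1' "
        f"> {shlex.quote(output_arpa)}"
    )


cmd = build_lmplz_command(
    corpus="results/libri_lm_corpus.txt",
    output_arpa="results/lm.arpa",
    vocab_file="lang/words.txt",
)
print(cmd)
```

Quoting the paths with shlex.quote keeps the command safe to paste into a shell when a directory name contains spaces.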

Outputs

| Name | Type | Description |
|---|---|---|
| libri_lm_corpus.txt | text file | Merged and deduplicated LM training corpus |
| lmplz command | stdout | Printed KenLM command to build the ARPA model |
| output_arpa | file | ARPA LM file (after manual lmplz execution) |

Usage Examples

# Step 1: Run the script to prepare corpus and get the lmplz command
python train_ngram.py hparams/train.yaml --data_folder /data/LibriSpeech

# Step 2: The script prints a command of this form (paths abbreviated here):
# lmplz -o 3 --prune 0 1 2 --limit_vocab_file words.txt < corpus.txt | sed '1,20s/<unk>/<UNK>/1' > lm.arpa

# Step 3: Run the printed lmplz command manually
lmplz -o 3 --prune 0 1 2 --limit_vocab_file lang/words.txt \
    < results/libri_lm_corpus.txt \
    | sed '1,20s/<unk>/<UNK>/1' \
    > results/lm.arpa

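The sed expression in Step 3, '1,20s/<unk>/<UNK>/1', acts only on the first 20 lines of lmplz output and replaces the first <unk> on each of those lines with <UNK>. In an ARPA file those lines cover the header and the start of the unigram list, so the effect is to rename the unknown-word token. The Python sketch below mirrors that behaviour for illustration. The helper name uppercase_unk and the sample lines are assumptions, not part of the recipe.

```python
def uppercase_unk(lines, first_n=20):
    """Replace the first '<unk>' with '<UNK>' on each of the first `first_n` lines."""
    out = []
    for number, line in enumerate(lines, start=1):
        if number <= first_n:
            # count=1 mirrors the trailing /1 in the sed expression.
            line = line.replace("<unk>", "<UNK>", 1)
        out.append(line)
    return out


# Hypothetical start of an ARPA file, for illustration only.
sample = [
    "\\data\\",
    "ngram 1=3",
    "ngram 2=2",
    "",
    "\\1-grams:",
    "-1.0\t<unk>\t0",
    "0\t<s>\t-0.3",
]
fixed = uppercase_unk(sample)
```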