Implementation:Speechbrain Speechbrain Train Ngram LM

Knowledge Sources	SpeechBrain
Domains	Language_Modeling, Training
Last Updated	2026-02-09 00:00 GMT

Overview

A SpeechBrain recipe script that prepares the training corpus for a KenLM n-gram language model for LibriSpeech ASR and prints the lmplz command that builds the model.

Description

This recipe prepares the data and generates the command for training a KenLM n-gram language model for use with LibriSpeech ASR systems. The pipeline performs the following steps:

1. Downloads the LibriSpeech LM training text from OpenSLR.
2. Parses the LibriSpeech data splits via librispeech_prepare.
3. Downloads the vocabulary file.
4. Prepares a character-level lexicon for k2 integration.
5. Creates the lang directory for k2.
6. Calls dataprep_lm_training, which merges the CSV transcripts with the external LM corpus into a single text file, deduplicates lines, and prints the exact lmplz command that the user must run manually to build the ARPA n-gram model.

The script exits after printing the command; it does not run lmplz itself. The dataprep_lm_training function supports a configurable n-gram order and configurable pruning levels.
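The merge-and-deduplicate step can be sketched as follows. This is a minimal illustration under stated assumptions, not the recipe's actual code: the helper names merge_corpus and iter_lines are hypothetical, the transcript column is assumed to be wrd (as listed in the I/O contract), and skipping empty lines is a choice made here for illustration.

```python
import csv


def iter_lines(csv_files, external_lm_corpus, text_column):
    """Yield raw text lines from CSV transcripts, then from plain-text corpora."""
    for path in csv_files:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield row[text_column]
    for path in external_lm_corpus:
        with open(path, encoding="utf-8") as f:
            yield from f


def merge_corpus(csv_files, external_lm_corpus, out_path, text_column="wrd"):
    """Write each distinct, stripped line once; return how many lines were written.

    Hypothetical helper: names and the empty-line policy are assumptions.
    """
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for raw in iter_lines(csv_files, external_lm_corpus, text_column):
            text = raw.strip()
            if text and text not in seen:  # skip blanks and repeated lines
                seen.add(text)
                out.write(text + "\n")
    return len(seen)
```

Keeping every distinct line in an in-memory set is simple but memory-hungry on a corpus as large as the LibriSpeech LM text; an external sort with unique filtering would be one alternative.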

Usage

Run the file as a recipe script. It prints the KenLM lmplz command, which must then be run manually to produce the final ARPA language model file. KenLM must be compiled and its lmplz binary must be available on the system PATH.
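Before running the recipe, it can help to confirm that the lmplz binary is actually reachable. The check below is a small sketch using only the standard library; find_lmplz is a hypothetical helper name and is not part of SpeechBrain or KenLM.

```python
import shutil


def find_lmplz(binary="lmplz"):
    """Return the absolute path of the KenLM binary, or None if it is not on PATH."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_lmplz()
    if path is None:
        print("lmplz not found on PATH; build KenLM and add its bin directory to PATH")
    else:
        print(f"Using lmplz at {path}")
```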

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/LibriSpeech/LM/train_ngram.py

Signature

def download_librispeech_lm_training_text(destination):
    """Download librispeech lm training text and unpack it."""
    ...

def dataprep_lm_training(
    lm_dir,
    output_arpa,
    csv_files,
    external_lm_corpus,
    vocab_file,
    arpa_order=3,
    prune_level=[0, 1, 2],
):
    """Prepare lm txt corpus file for lm training with kenlm.
    Prints the lmplz command and exits."""
    ...

Import

The script is run from the command line rather than imported:

python train_ngram.py hparams/train.yaml --data_folder /path/to/LibriSpeech

I/O Contract

Inputs

Name	Type	Required	Description
hparams_file	str	Yes	Path to YAML hyperparameter file
--data_folder	str	Yes	Path to LibriSpeech dataset root
lm_dir	str	Yes	Directory to store LM text corpus
output_arpa	str	Yes	Target path for the output ARPA LM file
csv_files	list[str]	Yes	CSV files with transcripts (wrd column)
external_lm_corpus	list[str]	Yes	Paths to external LM text corpora
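vocab_file	str	Yes	Path to the vocabulary file passed to lmplz --limit_vocab_file, which restricts the model to the listed words
arpa_order	int	No	Order of the ARPA LM, passed to lmplz -o (default: 3)
prune_level	list[int]	No	Count-pruning thresholds passed to lmplz --prune, one per n-gram order (default: [0, 1, 2])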
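To show how these inputs map onto the printed command, here is a hedged sketch that assembles an lmplz command string from them. The function build_lmplz_command is hypothetical and only mirrors the command form shown on this page; the length check reflects the assumption that lmplz accepts at most one pruning threshold per n-gram order.

```python
import shlex


def build_lmplz_command(lm_corpus, output_arpa, vocab_file,
                        arpa_order=3, prune_level=(0, 1, 2)):
    """Assemble an lmplz command line (hypothetical helper, for illustration)."""
    # Assumption: one threshold per order at most, so len(prune_level) <= arpa_order.
    if len(prune_level) > arpa_order:
        raise ValueError("more pruning thresholds than n-gram orders")
    prune = " ".join(str(p) for p in prune_level)
    # shlex.quote keeps paths with spaces or shell metacharacters safe to paste.
    return (
        f"lmplz -o {arpa_order} --prune {prune} "
        f"--limit_vocab_file {shlex.quote(vocab_file)} "
        f"< {shlex.quote(lm_corpus)} > {shlex.quote(output_arpa)}"
    )


if __name__ == "__main__":
    print(build_lmplz_command(
        "results/libri_lm_corpus.txt", "results/lm.arpa", "lang/words.txt"
    ))
```

Returning a string rather than executing it matches the recipe's behaviour of leaving the lmplz run to the user.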

Outputs

Name	Type	Description
libri_lm_corpus.txt	text file	Merged and deduplicated LM training corpus
lmplz command	stdout	Printed KenLM command to build the ARPA model
output_arpa	file	ARPA LM file (after manual lmplz execution)
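Once lmplz has been run, a quick sanity check on the resulting file is to read the n-gram counts from its header. The sketch below assumes the standard ARPA text layout, in which a \data\ line is followed by lines of the form "ngram N=COUNT"; read_arpa_counts is a hypothetical helper, not part of SpeechBrain or KenLM.

```python
def read_arpa_counts(path):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header and line.startswith("ngram "):
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif in_header and line.startswith("\\"):
                break  # reached the first n-gram section, e.g. \1-grams:
    return counts
```

For a model built with arpa_order=3, the returned dictionary would be expected to contain keys 1, 2 and 3.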

Usage Examples

# Step 1: Run the script to prepare corpus and get the lmplz command
python train_ngram.py hparams/train.yaml --data_folder /data/LibriSpeech

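# Step 2: The script prints a command of this general form
# (the actual paths depend on your configuration):
# lmplz -o 3 --prune 0 1 2 --limit_vocab_file words.txt < corpus.txt > lm.arpa

# Step 3: Run the printed command manually. In this example the output is
# also piped through sed, which rewrites the first <unk> on each of the
# first 20 lines as <UNK> before the ARPA file is written.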
lmplz -o 3 --prune 0 1 2 --limit_vocab_file lang/words.txt \
    < results/libri_lm_corpus.txt \
    | sed '1,20s/<unk>/<UNK>/1' \
    > results/lm.arpa
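For readers less familiar with sed addresses, the expression '1,20s/<unk>/<UNK>/1' applies only to lines 1 through 20 and replaces the first occurrence of <unk> on each of those lines. A rough Python equivalent is sketched below; uppercase_unk is a hypothetical name used only for illustration.

```python
def uppercase_unk(lines, last_line=20):
    """Mimic sed '1,20s/<unk>/<UNK>/1' over an iterable of lines."""
    for lineno, line in enumerate(lines, start=1):
        if lineno <= last_line:
            # Replace only the first <unk> on this line, as the trailing /1 does.
            line = line.replace("<unk>", "<UNK>", 1)
        yield line
```

Restricting the substitution to the top of the file means only the early part of the ARPA output is touched, rather than every line of a potentially very large file being rewritten.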

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
