Implementation:Speechbrain Speechbrain Train Ngram LM
| Knowledge Sources | |
|---|---|
| Domains | Language_Modeling, Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the SpeechBrain library, for preparing the corpus and the KenLM command used to train an n-gram language model for LibriSpeech ASR.
Description
This recipe prepares data and generates the command for training a KenLM n-gram language model for use with LibriSpeech ASR systems. The pipeline performs the following steps:
1. Downloads the LibriSpeech LM training text from OpenSLR.
2. Parses the LibriSpeech data splits via librispeech_prepare.
3. Downloads the vocabulary file.
4. Prepares a character-level lexicon for k2 integration.
5. Creates the lang directory for k2.
6. Calls dataprep_lm_training, which merges the CSV transcripts with the external LM corpus into a single text file, deduplicates lines, and prints the exact lmplz command that the user must run manually to build the ARPA n-gram model.
The script exits after printing the KenLM command. The dataprep_lm_training function supports a configurable n-gram order and pruning levels.
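The corpus-merging and command-building step can be sketched as follows. This is a minimal illustration under stated assumptions, not the SpeechBrain implementation: the helper names `build_corpus` and `lmplz_command` are hypothetical, duplicates are dropped by exact string match after stripping whitespace, and the command omits the sed stage shown in the usage examples.

```python
import csv
import shlex


def iter_lines(csv_files, external_corpora, text_column="wrd"):
    """Yield transcript lines from the CSV files, then lines from the text corpora."""
    for path in csv_files:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield row[text_column]
    for path in external_corpora:
        with open(path, encoding="utf-8") as f:
            yield from f


def build_corpus(csv_files, external_corpora, corpus_path, text_column="wrd"):
    """Write a merged corpus, keeping the first occurrence of each line.

    Hypothetical helper: returns the number of lines written.
    """
    seen = set()
    with open(corpus_path, "w", encoding="utf-8") as out:
        for line in iter_lines(csv_files, external_corpora, text_column):
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                out.write(line + "\n")
    return len(seen)


def lmplz_command(corpus_path, output_arpa, vocab_file, order=3, prune=(0, 1, 2)):
    """Build (but do not run) an lmplz command line as a string."""
    prune_args = " ".join(str(p) for p in prune)
    return (
        f"lmplz -o {order} --prune {prune_args} "
        f"--limit_vocab_file {shlex.quote(str(vocab_file))} "
        f"< {shlex.quote(str(corpus_path))} > {shlex.quote(str(output_arpa))}"
    )
```

Keeping the command as a printed string mirrors the recipe's behavior of leaving the actual lmplz run to the user; a tuple default for `prune` is used here only to avoid a mutable default argument in the sketch.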
Usage
Run as a recipe script. After execution, it prints the KenLM lmplz command that must be run manually to produce the final ARPA language model file. Requires KenLM to be compiled and available on the system PATH.
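Because the printed command only works if lmplz is on the PATH, it can be worth checking for the binary before running the recipe. The following is a small sketch using only the Python standard library; the helper name `find_lmplz` is hypothetical and is not part of SpeechBrain or KenLM.

```python
import shutil


def find_lmplz(binary="lmplz"):
    """Return the absolute path of the KenLM binary, or None if it is not on PATH."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_lmplz()
    if path is None:
        print("lmplz not found on PATH; build KenLM and add its bin directory to PATH")
    else:
        print(f"lmplz found at {path}")
```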
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/LibriSpeech/LM/train_ngram.py
Signature
def download_librispeech_lm_training_text(destination):
    """Download librispeech lm training text and unpack it."""
    ...

def dataprep_lm_training(
    lm_dir,
    output_arpa,
    csv_files,
    external_lm_corpus,
    vocab_file,
    arpa_order=3,
    prune_level=[0, 1, 2],
):
    """Prepare lm txt corpus file for lm training with kenlm.
    Prints the lmplz command and exits."""
    ...
Import
python train_ngram.py hparams/train.yaml --data_folder /path/to/LibriSpeech
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hparams_file | str | Yes | Path to YAML hyperparameter file |
| --data_folder | str | Yes | Path to LibriSpeech dataset root |
| lm_dir | str | Yes | Directory to store LM text corpus |
| output_arpa | str | Yes | Target path for the output ARPA LM file |
| csv_files | list[str] | Yes | CSV files with transcripts (wrd column) |
| external_lm_corpus | list[str] | Yes | Paths to external LM text corpora |
| vocab_file | str | Yes | Path to vocabulary file for pruning unknown n-grams |
| arpa_order | int | No | Order of the ARPA LM (default: 3) |
| prune_level | list[int] | No | Pruning thresholds per order (default: [0, 1, 2]) |
Outputs
| Name | Type | Description |
|---|---|---|
| libri_lm_corpus.txt | text file | Merged and deduplicated LM training corpus |
| lmplz command | stdout | Printed KenLM command to build the ARPA model |
| output_arpa | file | ARPA LM file (after manual lmplz execution) |
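Once lmplz has been run manually, a quick sanity check on the resulting ARPA file is to read the n-gram counts from its header. The sketch below relies on the standard ARPA layout (a `\data\` line followed by `ngram N=COUNT` lines); the function name `read_arpa_counts` is hypothetical, and the sample lines are made up for illustration.

```python
import re


def read_arpa_counts(lines):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts = {}
    in_data = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if in_data:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                # Reached the first n-gram section (e.g. \1-grams:), header is done.
                break
    return counts


# Illustrative header only; real counts depend on the corpus and pruning.
sample = ["\\data\\", "ngram 1=4", "ngram 2=7", "ngram 3=5", "", "\\1-grams:"]
counts = read_arpa_counts(sample)
```

With `arpa_order=3`, the header of a real output file would be expected to list orders 1 through 3; to check an actual file, pass an open file object instead of the sample list.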
Usage Examples
# Step 1: Run the script to prepare corpus and get the lmplz command
python train_ngram.py hparams/train.yaml --data_folder /data/LibriSpeech
# Step 2: The script prints a command of this form (actual paths come from the hparams file):
# lmplz -o 3 --prune 0 1 2 --limit_vocab_file VOCAB_FILE < LM_CORPUS | sed '1,20s/<unk>/<UNK>/1' > OUTPUT_ARPA
# Step 3: Run the printed lmplz command manually
lmplz -o 3 --prune 0 1 2 --limit_vocab_file lang/words.txt \
< results/libri_lm_corpus.txt \
| sed '1,20s/<unk>/<UNK>/1' \
> results/lm.arpa
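The sed stage in the command above rewrites the first `<unk>` on each of the first 20 lines of the lmplz output to `<UNK>`. The source does not state why; presumably the downstream lexicon expects the uppercase spelling, which is an assumption. For readers less familiar with sed, the same transformation can be sketched in Python (the function name `rename_unk` is hypothetical):

```python
def rename_unk(lines, first_n=20, old="<unk>", new="<UNK>"):
    """Replace the first occurrence of `old` on each of the first `first_n` lines.

    Mirrors sed '1,20s/<unk>/<UNK>/1': lines after `first_n` pass through unchanged.
    """
    for number, line in enumerate(lines, start=1):
        yield line.replace(old, new, 1) if number <= first_n else line


# Illustrative input only; real ARPA files contain many more entries.
sample = ["\\data\\", "ngram 1=2", "", "\\1-grams:", "-1.0\t<unk>\t0", "-0.5\tHELLO\t0"]
renamed = list(rename_unk(sample))
```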