Implementation:Speechbrain Speechbrain AISHELL1 Seq2seq Hparams

From Leeroopedia


Knowledge Sources
Domains: Speech Recognition, Hyperparameter Configuration
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete configuration for AISHELL-1 sequence-to-sequence ASR training provided by the SpeechBrain library.

Description

This YAML hyperparameter file defines the full training configuration for an end-to-end attention-based ASR system on the AISHELL-1 Mandarin Chinese speech corpus. It configures a CRDNN encoder with a GRU-based attentional decoder, SentencePiece subword tokenization with a 5000-token vocabulary, and a joint CTC+NLL loss. The file also specifies data augmentation (noise addition, speed perturbation, frequency dropping, and time-chunk dropping), beam search decoding parameters, and learning rate scheduling via NewBob.

Usage

Use this when training a sequence-to-sequence ASR model on the AISHELL-1 Mandarin Chinese dataset with SpeechBrain.

Code Reference

Source Location

Key Hyperparameters

Training Parameters

| Parameter | Value | Description |
|---|---|---|
| seed | 1 | Random seed for reproducibility |
| number_of_epochs | 40 | Total training epochs |
| number_of_ctc_epochs | 10 | Number of initial epochs that include the CTC term in the joint CTC+NLL loss; later epochs train on the NLL loss alone |
| batch_size | 16 | Static batch size |
| lr | 0.0003 | Learning rate for the Adam optimizer |
| ctc_weight | 0.5 | Weight of the CTC loss in the joint CTC+NLL objective |
| sorting | ascending | Sort utterances by length (ascending) |
| precision | fp32 | Training precision (bf16, fp16, or fp32) |
| dynamic_batching | True | Enable dynamic batching by duration |
| max_batch_length | 15 | Maximum batch length in seconds for dynamic batching |
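
The interaction between ctc_weight and number_of_ctc_epochs can be sketched as below. This is a minimal illustration, not the recipe's training code: the helper name `joint_loss` is hypothetical, and the schedule (CTC term active for the first number_of_ctc_epochs epochs, NLL only afterwards) is an assumption about how train.py consumes these values.

```python
def joint_loss(loss_ctc, loss_nll, epoch, ctc_weight=0.5, number_of_ctc_epochs=10):
    """Combine per-batch CTC and NLL losses (hypothetical helper).

    Assumption: the CTC term is only used while epoch <= number_of_ctc_epochs.
    """
    if epoch <= number_of_ctc_epochs:
        # Weighted sum: ctc_weight on CTC, the remainder on NLL.
        return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_nll
    # After the CTC phase, train on the attention decoder's NLL alone.
    return loss_nll


# Illustrative loss values, not measured results.
early = joint_loss(loss_ctc=4.0, loss_nll=2.0, epoch=3)
late = joint_loss(loss_ctc=4.0, loss_nll=2.0, epoch=11)
print(early, late)
```

With ctc_weight set to 0.5, both terms contribute equally during the CTC phase.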

Feature Parameters

| Parameter | Value | Description |
|---|---|---|
| sample_rate | 16000 | Audio sample rate in Hz |
| n_fft | 400 | FFT size for feature extraction |
| n_mels | 40 | Number of mel filterbank channels |
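
To make these numbers concrete, the arithmetic below derives the analysis window duration and spectral resolution they imply. It is a rough sketch: the hop length is not stated in this page, so the 10 ms value is an assumption used only to estimate the frame rate.

```python
sample_rate = 16000  # Hz, from the table above
n_fft = 400          # samples per FFT, from the table above
n_mels = 40          # mel channels, from the table above
hop_ms = 10          # ASSUMPTION: hop length is not given in this page

# A 400-sample FFT at 16 kHz spans 25 ms of audio.
window_ms = 1000 * n_fft / sample_rate

# A real-valued FFT yields n_fft // 2 + 1 frequency bins.
fft_bins = n_fft // 2 + 1

# Spacing between adjacent FFT bins in Hz.
bin_width_hz = sample_rate / n_fft

# Approximate frames per second under the assumed hop length.
frames_per_second = 1000 / hop_ms

print(window_ms, fft_bins, bin_width_hz, frames_per_second)
```

Under these assumptions, each second of audio becomes roughly 100 frames of 40 mel features, with the 201 FFT bins pooled into the 40 mel channels.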

Model Architecture

| Parameter | Value | Description |
|---|---|---|
| cnn_blocks | 2 | Number of CNN blocks in the CRDNN encoder |
| cnn_channels | (128, 256) | CNN channel sizes per block |
| rnn_layers | 4 | Number of LSTM layers in the encoder |
| rnn_neurons | 1024 | Hidden size of each LSTM layer |
| rnn_bidirectional | True | Use bidirectional LSTM |
| dnn_blocks | 2 | Number of DNN blocks after the RNN |
| dnn_neurons | 512 | DNN hidden layer size |
| emb_size | 128 | Decoder embedding dimension |
| dec_neurons | 1024 | Decoder GRU hidden size |
| output_neurons | 5000 | BPE vocabulary size |
| dropout | 0.15 | Dropout rate |

Decoding Parameters

| Parameter | Value | Description |
|---|---|---|
| beam_size | 80 | Beam search width |
| eos_threshold | 1.5 | End-of-sequence threshold |
| coverage_penalty | 1.5 | Coverage penalty weight |
| temperature | 1.25 | Softmax temperature for decoding |
| max_attn_shift | 240 | Maximum attention shift constraint |
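
The effect of a temperature above 1.0 can be illustrated with a generic temperature-scaled softmax. This is a standalone sketch of the general technique, not the SpeechBrain beam searcher's code; the function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.25):
    """Softmax over logits divided by a temperature (generic sketch)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values only
sharp = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=1.25)
print(sharp)
print(smooth)
```

A temperature of 1.25 flattens the distribution slightly, so the top token loses some probability mass to lower-ranked tokens, which tends to keep more diverse hypotheses alive in the beam.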

Data Augmentation

The configuration applies four augmentation techniques combined via an Augmenter with probability 1.0:

  • AddNoise: Adds noise from a downloaded noise dataset at an SNR between 0 and 15 dB
  • SpeedPerturb: Speed perturbation at factors [0.95, 1.0, 1.05]
  • DropFreq: Randomly zeroes out 1-3 frequency bands
  • DropChunk: Randomly drops 1-5 temporal chunks of 1000-2000 samples
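
As an illustration of the time-domain dropping listed above, the sketch below zeroes 1-5 random chunks of 1000-2000 samples in a waveform. It uses only the standard library and a hypothetical `drop_chunks` helper; it is not SpeechBrain's DropChunk implementation, and treating a "dropped" chunk as zeroed samples is an assumption.

```python
import random


def drop_chunks(signal, rng, min_chunks=1, max_chunks=5, min_len=1000, max_len=2000):
    """Return a copy of signal with random chunks set to zero (sketch)."""
    out = list(signal)
    n_chunks = rng.randint(min_chunks, max_chunks)  # inclusive bounds
    for _ in range(n_chunks):
        # Clamp the chunk length so it never exceeds the signal.
        length = min(rng.randint(min_len, max_len), len(out))
        start = rng.randint(0, len(out) - length)
        out[start:start + length] = [0.0] * length
    return out


rng = random.Random(1)        # fixed seed so the example is repeatable
one_second = [1.0] * 16000    # one second of dummy audio at 16 kHz
augmented = drop_chunks(one_second, rng)
zeroed = augmented.count(0.0)
print(len(augmented), zeroed)
```

Chunks may overlap, so the number of zeroed samples lies between 1000 and 10000 for these settings, and the waveform length is unchanged.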

Model Components

  • Encoder: speechbrain.lobes.models.CRDNN.CRDNN
  • Embedding: speechbrain.nnet.embedding.Embedding
  • Decoder: speechbrain.nnet.RNN.AttentionalRNNDecoder (GRU with location-based attention)
  • Beam Search: speechbrain.decoders.S2SRNNBeamSearcher with CTC and coverage scorers
  • Tokenizer: SentencePiece BPE (pretrained, loaded via pretrainer)
  • Scheduler: speechbrain.nnet.schedulers.NewBobScheduler (annealing factor 0.8)
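
The NewBob rule listed above can be sketched as follows: the learning rate is multiplied by the annealing factor whenever the relative improvement of the validation loss falls below a threshold. The helper name `newbob_step` is hypothetical, the 0.0025 threshold is an assumption not stated in this page, and patience handling is omitted for brevity.

```python
def newbob_step(lr, prev_loss, curr_loss, annealing_factor=0.8,
                improvement_threshold=0.0025):
    """Return the next learning rate under a simplified NewBob rule.

    ASSUMPTION: improvement_threshold=0.0025 is illustrative; only the
    annealing factor of 0.8 comes from the configuration described here.
    """
    if prev_loss is None:
        return lr  # nothing to compare against on the first epoch
    improvement = (prev_loss - curr_loss) / prev_loss
    if improvement < improvement_threshold:
        return lr * annealing_factor
    return lr


lr = 0.0003
prev = None
for loss in [2.0, 1.5, 1.499]:  # made-up validation losses
    lr = newbob_step(lr, prev, loss)
    prev = loss
print(lr)
```

In this run the second epoch improves by 25% and keeps the rate, while the third improves by less than 0.1%, so the rate is annealed once to 0.8 times its starting value.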

Usage Example

cd recipes/AISHELL-1/ASR/seq2seq
python train.py hparams/train.yaml --data_folder=/path/to/aishell1
