Implementation:Speechbrain Speechbrain AISHELL1 Seq2seq Hparams
| Knowledge Sources | |
|---|---|
| Domains | Speech Recognition, Hyperparameter Configuration |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete configuration for AISHELL-1 sequence-to-sequence ASR training provided by the SpeechBrain library.
Description
This YAML hyperparameter file defines the full training configuration for an end-to-end attention-based ASR system on the AISHELL-1 Mandarin Chinese speech corpus. It configures a CRDNN encoder with a GRU-based attentional decoder, BPE tokenization with 5000 unigram tokens, and a joint CTC+NLL loss. The file also specifies data augmentation strategies (noise addition, speed perturbation, frequency drop, time drop), beam search decoding parameters, and learning rate scheduling via NewBob.
Usage
Use this when training a sequence-to-sequence ASR model on the AISHELL-1 Mandarin Chinese dataset with SpeechBrain.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/AISHELL-1/ASR/seq2seq/hparams/train.yaml
Key Hyperparameters
Training Parameters
| Parameter | Value | Description |
|---|---|---|
| seed | 1 | Random seed for reproducibility |
| number_of_epochs | 40 | Total training epochs |
| number_of_ctc_epochs | 10 | Number of initial epochs trained with the joint CTC+NLL loss; later epochs use the NLL loss only |
| batch_size | 16 | Static batch size |
| lr | 0.0003 | Learning rate for Adam optimizer |
| ctc_weight | 0.5 | Weight of CTC loss in the joint CTC+NLL objective |
| sorting | ascending | Sort utterances by length (ascending) |
| precision | fp32 | Training precision (bf16, fp16, or fp32) |
| dynamic_batching | True | Enable dynamic batching by duration |
| max_batch_length | 15 | Maximum batch length in seconds for dynamic batching |
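How ctc_weight and number_of_ctc_epochs interact can be sketched as a small function. This is a minimal sketch, assuming the schedule commonly used in SpeechBrain seq2seq recipes, where the CTC term is mixed in only during the first number_of_ctc_epochs epochs. The function name joint_loss and the scalar loss values are illustrative and are not taken from the recipe's train.py.

```python
def joint_loss(loss_ctc, loss_seq, epoch, ctc_weight=0.5, number_of_ctc_epochs=10):
    """Combine CTC and NLL (seq2seq) losses for a given 1-based epoch.

    Assumption: the CTC term is active only while epoch <= number_of_ctc_epochs;
    afterwards the NLL loss is used on its own.
    """
    if epoch <= number_of_ctc_epochs:
        return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_seq
    return loss_seq


# Made-up loss values, purely for illustration
print(joint_loss(loss_ctc=2.0, loss_seq=1.0, epoch=1))   # inside the CTC phase
print(joint_loss(loss_ctc=2.0, loss_seq=1.0, epoch=11))  # after the CTC phase
```

With ctc_weight set to 0.5, both losses contribute equally while the CTC phase is active.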
Feature Parameters
| Parameter | Value | Description |
|---|---|---|
| sample_rate | 16000 | Audio sample rate in Hz |
| n_fft | 400 | FFT size for feature extraction |
| n_mels | 40 | Number of mel filterbank channels |
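A short calculation shows what these values imply for the spectral front end. It uses only the numbers in the table; the hop length is not listed here, so frame counts are not computed.

```python
sample_rate = 16000  # Hz
n_fft = 400          # samples per FFT window
n_mels = 40          # mel filterbank channels

window_ms = 1000 * n_fft / sample_rate  # time span covered by one FFT window
n_freq_bins = n_fft // 2 + 1            # one-sided spectrum bins for a real signal
bin_width_hz = sample_rate / n_fft      # spacing between adjacent FFT bins
nyquist_hz = sample_rate / 2            # highest representable frequency

print(window_ms, n_freq_bins, bin_width_hz, nyquist_hz)
```

The mel filterbank then compresses the one-sided spectrum (n_freq_bins values per frame) down to n_mels features per frame.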
Model Architecture
| Parameter | Value | Description |
|---|---|---|
| cnn_blocks | 2 | Number of CNN blocks in CRDNN encoder |
| cnn_channels | (128, 256) | CNN channel sizes per block |
| rnn_layers | 4 | Number of LSTM layers in encoder |
| rnn_neurons | 1024 | Hidden size of each LSTM layer |
| rnn_bidirectional | True | Use bidirectional LSTM |
| dnn_blocks | 2 | Number of DNN blocks after RNN |
| dnn_neurons | 512 | DNN hidden layer size |
| emb_size | 128 | Decoder embedding dimension |
| dec_neurons | 1024 | Decoder GRU hidden size |
| output_neurons | 5000 | BPE vocabulary size |
| dropout | 0.15 | Dropout rate |
Decoding Parameters
| Parameter | Value | Description |
|---|---|---|
| beam_size | 80 | Beam search width |
| eos_threshold | 1.5 | End-of-sequence threshold |
| coverage_penalty | 1.5 | Coverage penalty weight |
| temperature | 1.25 | Softmax temperature for decoding |
| max_attn_shift | 240 | Maximum attention shift constraint |
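The effect of the temperature setting can be illustrated with a plain softmax. This is a sketch of the general technique of temperature scaling, not of how the SpeechBrain beam searcher applies it internally; the function name and the logits are made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature=1.25):
    """Softmax over logits divided by a temperature (values > 1 flatten the output)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical decoder scores
sharp = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=1.25)
print(max(sharp), max(smooth))
```

A temperature above 1, such as the 1.25 used here, lowers the probability of the top token and spreads mass over the alternatives, which tends to keep more hypotheses competitive within the beam.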
Data Augmentation
The configuration applies four augmentation techniques combined via an Augmenter with probability 1.0:
- AddNoise: Adds noise from a downloaded noise dataset at an SNR between 0 and 15 dB
- SpeedPerturb: Speed perturbation at factors [0.95, 1.0, 1.05]
- DropFreq: Randomly zeroes out 1-3 frequency bands
- DropChunk: Randomly drops 1-5 temporal chunks of 1000-2000 samples
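To make the time-drop idea concrete, the following is a simplified sketch of chunk dropping using the ranges listed above (1 to 5 chunks of 1000 to 2000 samples). It is not SpeechBrain's DropChunk implementation; the function name drop_chunks, the list-based waveform, and the fixed seed are assumptions made for illustration.

```python
import random


def drop_chunks(signal, rng, min_chunks=1, max_chunks=5, min_len=1000, max_len=2000):
    """Return a copy of signal with a random number of chunks set to zero."""
    out = list(signal)
    n_chunks = rng.randint(min_chunks, max_chunks)
    for _ in range(n_chunks):
        # Clamp the chunk length so it never exceeds the signal length
        length = min(rng.randint(min_len, max_len), len(out))
        start = rng.randint(0, len(out) - length)
        out[start:start + length] = [0.0] * length
    return out


rng = random.Random(0)    # fixed seed so the example is repeatable
waveform = [1.0] * 16000  # one second of dummy audio at 16 kHz
augmented = drop_chunks(waveform, rng)
print(len(augmented), augmented.count(0.0))
```

The output keeps the original length; only the zeroed regions change, and chunks may overlap.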
Model Components
- Encoder: speechbrain.lobes.models.CRDNN.CRDNN
- Embedding: speechbrain.nnet.embedding.Embedding
- Decoder: speechbrain.nnet.RNN.AttentionalRNNDecoder (GRU with location-based attention)
- Beam Search: speechbrain.decoders.S2SRNNBeamSearcher with CTC and coverage scorers
- Tokenizer: SentencePiece BPE (pretrained, loaded via pretrainer)
- Scheduler: speechbrain.nnet.schedulers.NewBobScheduler (annealing factor 0.8)
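The NewBob rule behind the scheduler can be sketched in a few lines: when the relative improvement in validation loss falls below a threshold, the learning rate is multiplied by the annealing factor. This is a simplified sketch rather than the library's code. The improvement_threshold value of 0.0025 is an assumption not stated in this page, and the patience mechanism is omitted.

```python
def newbob_step(lr, prev_loss, curr_loss, annealing_factor=0.8, improvement_threshold=0.0025):
    """Return the learning rate for the next epoch under a simplified NewBob rule."""
    if prev_loss is None:
        return lr  # first epoch: nothing to compare against
    if prev_loss == 0:
        improvement = 0.0  # avoid division by zero
    else:
        improvement = (prev_loss - curr_loss) / prev_loss
    if improvement < improvement_threshold:
        return lr * annealing_factor
    return lr


lr = 0.0003
prev = None
for val_loss in [1.00, 0.90, 0.899, 0.80]:  # made-up validation losses
    lr = newbob_step(lr, prev, val_loss)
    prev = val_loss
print(lr)
```

In this made-up run only the step from 0.90 to 0.899 improves by less than the threshold, so the learning rate is annealed once.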
Usage Example
cd recipes/AISHELL-1/ASR/seq2seq
python train.py hparams/train.yaml --data_folder=/path/to/aishell1