Implementation:Speechbrain Speechbrain AISHELL1 Seq2seq Hparams

Knowledge Sources	SpeechBrain
Domains	Speech Recognition, Hyperparameter Configuration
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete configuration for AISHELL-1 sequence-to-sequence ASR training provided by the SpeechBrain library.

Description

This YAML hyperparameter file defines the full training configuration for an end-to-end attention-based ASR system on the AISHELL-1 Mandarin Chinese speech corpus. It configures a CRDNN encoder with a GRU-based attentional decoder, SentencePiece subword tokenization with a 5000-token vocabulary, and a joint CTC+NLL loss. The file also specifies four data augmentation strategies (noise addition, speed perturbation, frequency drop, time drop), beam search decoding parameters, and learning rate scheduling via NewBob.

Usage

Use this when training a sequence-to-sequence ASR model on the AISHELL-1 Mandarin Chinese dataset with SpeechBrain.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/AISHELL-1/ASR/seq2seq/hparams/train.yaml

Key Hyperparameters

Training Parameters

Parameter	Value	Description
seed	1	Random seed for reproducibility
number_of_epochs	40	Total training epochs
number_of_ctc_epochs	10	Number of initial epochs trained with the joint CTC+NLL loss; later epochs use the NLL loss only
batch_size	16	Static batch size
lr	0.0003	Learning rate for Adam optimizer
ctc_weight	0.5	Weight of CTC loss in the joint CTC+NLL objective
sorting	ascending	Sort utterances by length (ascending)
precision	fp32	Training precision (bf16, fp16, or fp32)
dynamic_batching	True	Enable dynamic batching by duration
max_batch_length	15	Maximum batch length in seconds for dynamic batching
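
The interaction between ctc_weight and number_of_ctc_epochs can be sketched as follows. This is a minimal illustration, assuming the loss combination commonly used in SpeechBrain seq2seq recipes, where the CTC term is active only during the first number_of_ctc_epochs epochs. The function name joint_loss and the example loss values are hypothetical; the recipe's train.py remains the authoritative reference.

```python
def joint_loss(loss_ctc, loss_nll, epoch, ctc_weight=0.5, number_of_ctc_epochs=10):
    """Combine CTC and NLL losses (illustrative sketch, not the recipe's code)."""
    if epoch <= number_of_ctc_epochs:
        # Early epochs: weighted sum of the two objectives.
        return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_nll
    # Later epochs: the attention decoder's NLL loss alone.
    return loss_nll


# Hypothetical loss values, chosen only to make the arithmetic visible.
early = joint_loss(4.0, 2.0, epoch=1)   # 0.5 * 4.0 + 0.5 * 2.0
late = joint_loss(4.0, 2.0, epoch=11)   # NLL only
print(early, late)
```

With ctc_weight set to 0.5, both objectives contribute equally while the CTC term is active.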

Feature Parameters

Parameter	Value	Description
sample_rate	16000	Audio sample rate in Hz
n_fft	400	FFT size for feature extraction
n_mels	40	Number of mel filterbank channels
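
The quantities implied by these three values can be worked out directly. The snippet below is plain arithmetic on the table entries; the variable names fft_span_ms, n_freq_bins, and nyquist_hz are introduced here for illustration, and window and hop lengths are not covered because this page does not list them.

```python
sample_rate = 16000  # Hz
n_fft = 400          # samples per FFT frame
n_mels = 40          # mel filterbank channels

# Time span covered by one FFT frame, in milliseconds.
fft_span_ms = 1000 * n_fft / sample_rate

# Number of non-redundant frequency bins of a real-valued FFT.
n_freq_bins = n_fft // 2 + 1

# Highest frequency representable at this sample rate.
nyquist_hz = sample_rate / 2

print(fft_span_ms, n_freq_bins, nyquist_hz, n_mels)
```

The mel filterbank therefore compresses 201 linear frequency bins into 40 channels per frame.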

Model Architecture

Parameter	Value	Description
cnn_blocks	2	Number of CNN blocks in CRDNN encoder
cnn_channels	(128, 256)	CNN channel sizes per block
rnn_layers	4	Number of LSTM layers in encoder
rnn_neurons	1024	Hidden size of each LSTM layer
rnn_bidirectional	True	Use bidirectional LSTM
dnn_blocks	2	Number of DNN blocks after RNN
dnn_neurons	512	DNN hidden layer size
emb_size	128	Decoder embedding dimension
dec_neurons	1024	Decoder GRU hidden size
output_neurons	5000	BPE vocabulary size
dropout	0.15	Dropout rate

Decoding Parameters

Parameter	Value	Description
beam_size	80	Beam search width
eos_threshold	1.5	End-of-sequence threshold
coverage_penalty	1.5	Coverage penalty weight
temperature	1.25	Softmax temperature for decoding
max_attn_shift	240	Maximum attention shift constraint
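
To illustrate what the temperature parameter does, the sketch below divides logits by the temperature before applying a softmax. It is a standalone example using only the Python standard library, not the beam searcher's code; the function name softmax_with_temperature and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.25):
    """Softmax over logits scaled by 1 / temperature (illustrative sketch)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical decoder scores
sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=1.25)
print(sharp)
print(flat)
```

A temperature above 1 flattens the distribution, so lower-ranked tokens keep more probability mass and the beam can explore more alternatives.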

Data Augmentation

The configuration applies four augmentation techniques combined via an Augmenter with probability 1.0:

AddNoise: Adds noise from a downloaded noise dataset at an SNR between 0 and 15 dB
SpeedPerturb: Speed perturbation at factors [0.95, 1.0, 1.05]
DropFreq: Randomly zeroes out 1-3 frequency bands
DropChunk: Randomly drops 1-5 temporal chunks of 1000-2000 samples
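
As an illustration of the time-drop idea, the sketch below zeroes 1 to 5 chunks of 1000 to 2000 samples in a waveform held as a plain Python list. It is a simplified stand-in that assumes dropped samples are set to zero; it is not SpeechBrain's DropChunk implementation, and the function name drop_chunks and its arguments are hypothetical.

```python
import random


def drop_chunks(signal, rng, min_chunks=1, max_chunks=5, min_len=1000, max_len=2000):
    """Return a copy of signal with random chunks set to zero (illustrative sketch)."""
    out = list(signal)
    n_chunks = rng.randint(min_chunks, max_chunks)
    for _ in range(n_chunks):
        # Clamp the chunk length so it never exceeds the signal length.
        length = min(rng.randint(min_len, max_len), len(out))
        start = rng.randint(0, len(out) - length)
        out[start:start + length] = [0.0] * length
    return out


rng = random.Random(1)   # fixed seed so the example is repeatable
wave = [1.0] * 16000     # one second of dummy audio at 16 kHz
augmented = drop_chunks(wave, rng)
zeroed = augmented.count(0.0)
print(len(augmented), zeroed)
```

Chunks may overlap, so the number of zeroed samples can be smaller than the sum of the chunk lengths.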

Model Components

Encoder: speechbrain.lobes.models.CRDNN.CRDNN
Embedding: speechbrain.nnet.embedding.Embedding
Decoder: speechbrain.nnet.RNN.AttentionalRNNDecoder (GRU with location-based attention)
Beam Search: speechbrain.decoders.S2SRNNBeamSearcher with CTC and coverage scorers
Tokenizer: SentencePiece BPE (pretrained, loaded via pretrainer)
Scheduler: speechbrain.nnet.schedulers.NewBobScheduler (annealing factor 0.8)
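
The NewBob idea of annealing the learning rate when validation progress stalls can be sketched as follows. This is a simplified illustration, not the NewBobScheduler source: the improvement_threshold value of 0.0025 is an assumption (this page does not list it), any patience mechanism is omitted, and the function name newbob_step and the metric values are hypothetical.

```python
def newbob_step(lr, prev_metric, metric, annealing_factor=0.8, improvement_threshold=0.0025):
    """Return the next learning rate given the last two validation metrics."""
    if prev_metric is None:
        return lr  # first epoch: nothing to compare against
    if prev_metric == 0:
        improvement = 0.0
    else:
        # Relative improvement; positive when the metric (lower is better) decreases.
        improvement = (prev_metric - metric) / prev_metric
    if improvement < improvement_threshold:
        return lr * annealing_factor
    return lr


lr = 0.0003
prev = None
for metric in [10.0, 9.0, 8.99, 8.0]:  # hypothetical validation error values
    lr = newbob_step(lr, prev, metric)
    prev = metric
print(lr)
```

In this run only the step from 9.0 to 8.99 falls below the assumed threshold, so the learning rate is multiplied by 0.8 once.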

Usage Example

cd recipes/AISHELL-1/ASR/seq2seq
python train.py hparams/train.yaml --data_folder=/path/to/aishell1

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
