Implementation:Speechbrain Speechbrain Train TimersAndSuch Multistage
| Knowledge Sources | |
|---|---|
| Domains | Spoken_Language_Understanding, Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete training recipe for multistage spoken language understanding (SLU), provided by the SpeechBrain library.
Description
This recipe implements a "multistage" SLU pipeline: speech is first transcribed to text by a pretrained ASR model (trained on LibriSpeech), and the transcriptions are then fed into a sequence-to-sequence model that maps them to semantic representations. The SLU class extends sb.Brain and runs both the ASR forward pass and the NLU (natural language understanding) forward pass inside compute_forward. The ASR transcriptions are tokenized, embedded, and passed through an SLU encoder and decoder; beam search is used for decoding at inference time. Training minimizes the negative log-likelihood of the target semantic token sequences. Transcribing online during training, rather than precomputing transcripts offline, makes it possible to apply augmentation and to sample multiple possible transcriptions.
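Because each utterance yields a different number of ASR tokens, the transcriptions have to be brought to a common length before they can be embedded as a batch. The following is a minimal sketch of one way to do that, assuming right-padding and relative lengths (each length divided by the longest in the batch). The function name `pad_token_batch` and the pad id `0` are illustrative assumptions, not taken from the recipe.

```python
from typing import List, Tuple


def pad_token_batch(
    token_lists: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[float]]:
    """Right-pad token id lists to a common length.

    Returns the padded lists and each sequence's length relative to the
    longest sequence in the batch.
    """
    # Treat an empty transcription as length 1 so no relative length is zero.
    lengths = [max(len(tokens), 1) for tokens in token_lists]
    max_len = max(lengths)
    padded = [tokens + [pad_id] * (max_len - len(tokens)) for tokens in token_lists]
    rel_lens = [n / max_len for n in lengths]
    return padded, rel_lens


# Hypothetical token ids for three transcriptions, one of them empty.
padded, rel_lens = pad_token_batch([[5, 9, 2], [7], []])
```

The padded lists can then be converted to a tensor and passed to the embedding layer, with the relative lengths telling the decoder which positions are padding.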
Evaluation metrics include CER (Character Error Rate), WER (Word Error Rate), and SER (Sentence Error Rate) on semantic output sequences.
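To make the three metrics concrete, here is a small sketch that scores predicted semantic strings against targets using Levenshtein distance. It is an illustration, not SpeechBrain's metric code: the helper names and the example strings are hypothetical, and whitespace tokenization for WER and percentage reporting are assumptions.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def error_rates(refs: List[str], hyps: List[str]) -> Dict[str, float]:
    """CER, WER, and SER in percent over a list of reference/hypothesis pairs."""
    if not refs or len(refs) != len(hyps):
        raise ValueError("refs and hyps must be non-empty and of equal length")
    pairs = list(zip(refs, hyps))
    char_errs = sum(edit_distance(r, h) for r, h in pairs)
    word_errs = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    sent_errs = sum(r != h for r, h in pairs)  # any mismatch counts as one error
    return {
        "CER": 100.0 * char_errs / sum(len(r) for r in refs),
        "WER": 100.0 * word_errs / sum(len(r.split()) for r in refs),
        "SER": 100.0 * sent_errs / len(refs),
    }


# Hypothetical target and predicted strings.
rates = error_rates(
    ["set timer five minutes", "convert two cups"],
    ["set timer nine minutes", "convert two cups"],
)
```

SER is the strictest of the three: a single wrong token anywhere in the output marks the whole utterance as an error.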
Usage
Run as a training recipe with a YAML hyperparameter file. The script handles data preparation from CSV files, model training with learning rate annealing based on SER, and checkpointing.
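The annealing step can be sketched as follows. The actual scheduler and its settings are defined in the YAML file, which is not shown here, so this is only one plausible scheme under stated assumptions: the learning rate is multiplied by a factor whenever the relative improvement in validation SER falls below a threshold. The class name `SerAnnealer` and the default values are hypothetical.

```python
from typing import Optional, Tuple


class SerAnnealer:
    """Reduce the learning rate when validation SER stops improving (sketch)."""

    def __init__(
        self,
        initial_lr: float,
        annealing_factor: float = 0.5,       # assumed value
        improvement_threshold: float = 0.0025,  # assumed value
    ) -> None:
        self.lr = initial_lr
        self.annealing_factor = annealing_factor
        self.improvement_threshold = improvement_threshold
        self.prev_metric: Optional[float] = None

    def step(self, metric: float) -> Tuple[float, float]:
        """Update with the latest SER and return (old_lr, new_lr)."""
        old_lr = self.lr
        if self.prev_metric is not None:
            # Avoid dividing by zero when the previous SER was already 0.
            denom = self.prev_metric if self.prev_metric != 0 else 1.0
            improvement = (self.prev_metric - metric) / denom
            if improvement < self.improvement_threshold:
                self.lr *= self.annealing_factor
        self.prev_metric = metric
        return old_lr, self.lr


# SER improves after the first epoch, then stalls, which triggers annealing.
annealer = SerAnnealer(initial_lr=0.001)
history = [annealer.step(ser) for ser in [40.0, 30.0, 30.0]]
```

In this sketch the first call never anneals, since there is no earlier SER to compare against.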
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/timers-and-such/multistage/train.py
Signature
class SLU(sb.Brain):
    def compute_forward(self, batch, stage):
        """Forward computations from waveform batches to output probabilities."""
        ...

    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss (NLL) given predictions and targets."""
        ...

    def on_stage_start(self, stage, epoch):
        """Gets called at the beginning of each epoch."""
        ...

    def on_stage_end(self, stage, stage_loss, epoch):
        """Gets called at the end of an epoch."""
        ...


def dataio_prepare(hparams):
    """Prepares the datasets to be used in the brain class."""
    ...
Import
python train.py hparams/train.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hparams_file | str | Yes | Path to YAML hyperparameter file |
| batch.sig | tuple(torch.Tensor, torch.Tensor) | Yes | Padded waveform tensor and relative lengths |
| batch.tokens_bos | tuple(torch.Tensor, torch.Tensor) | Yes | Target semantic tokens with BOS prepended (decoder input) and their relative lengths |
| batch.tokens_eos | tuple(torch.Tensor, torch.Tensor) | Yes | Target semantic tokens with EOS appended (loss target) and their relative lengths |
| batch.semantics | list[str] | Yes | Target semantic strings for evaluation |
| asr_model | Pretrained | Yes | Pretrained ASR model for transcription (e.g., LibriSpeech-trained) |
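The two token inputs differ only in where the special symbol is placed. A minimal sketch of how such a pair is typically built from one tokenized semantic string, assuming hypothetical index values for BOS and EOS (the real ones come from the hyperparameter file):

```python
from typing import List, Tuple


def make_decoder_targets(
    token_ids: List[int], bos_index: int, eos_index: int
) -> Tuple[List[int], List[int]]:
    """Build the decoder input and the prediction target for one sequence."""
    tokens_bos = [bos_index] + token_ids  # fed to the decoder (teacher forcing)
    tokens_eos = token_ids + [eos_index]  # what the decoder should predict
    return tokens_bos, tokens_eos


# Hypothetical token ids and special-symbol indices.
tokens_bos, tokens_eos = make_decoder_targets([12, 7, 31], bos_index=0, eos_index=0)
```

Both sequences have the same length, so position t of the decoder input lines up with the token the model must predict at position t.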
Outputs
| Name | Type | Description |
|---|---|---|
| p_seq | torch.Tensor | Log-probabilities over semantic token sequences |
| p_tokens | torch.Tensor | Beam search decoded token predictions (at inference) |
| CER | float | Character Error Rate on semantic output |
| WER | float | Word Error Rate on semantic output |
| SER | float | Sentence Error Rate on semantic output |
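The relationship between p_seq and the training loss can be illustrated with a plain-Python sketch of a length-masked negative log-likelihood. This is not the library's loss function, which may apply its own reduction or smoothing; the function name `sequence_nll` and the toy values are assumptions, and lengths are given here as absolute counts for simplicity.

```python
import math
from typing import List


def sequence_nll(
    log_probs: List[List[List[float]]],  # [batch][time][vocab] log-probabilities
    targets: List[List[int]],            # [batch][time] target token ids (with EOS)
    lengths: List[int],                  # [batch] number of non-padded positions
) -> float:
    """Mean negative log-likelihood over the non-padded target positions."""
    total, count = 0.0, 0
    for lp_seq, tgt_seq, n in zip(log_probs, targets, lengths):
        for t in range(n):
            total -= lp_seq[t][tgt_seq[t]]
            count += 1
    if count == 0:
        raise ValueError("no valid target positions")
    return total / count


# Toy batch: two sequences over a two-symbol vocabulary.
log_probs = [
    [[math.log(0.7), math.log(0.3)], [math.log(0.4), math.log(0.6)]],
    [[math.log(0.5), math.log(0.5)], [math.log(0.9), math.log(0.1)]],
]
targets = [[0, 1], [1, 0]]
lengths = [2, 1]  # the second position of the second sequence is padding
loss = sequence_nll(log_probs, targets, lengths)
```

Masking by length keeps padded positions from contributing to the loss.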
Usage Examples
# Train the multistage SLU model
python train.py hparams/train.yaml --data_folder /path/to/timers-and-such
# The pipeline: speech -> ASR transcription -> NLU -> semantic parse
# Illustrative semantic output (the exact serialization is defined by the prepared CSV files): "action: set | object: timer | duration: 5 minutes"