Implementation: Tencent Ncnn Whisper Example
| Knowledge Sources | |
|---|---|
| Domains | Audio, Speech_Recognition |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Concrete tool for speech-to-text transcription using OpenAI Whisper with ncnn.
Description
This example implements OpenAI Whisper automatic speech recognition using the ncnn inference framework. It provides a complete encoder-decoder transformer pipeline that converts PCM 16-bit, 16 kHz mono WAV audio into text transcriptions. The implementation includes a Tokenizer class for BPE vocabulary decoding, a Whisper class orchestrating the full pipeline, and support for multiple model sizes (tiny, base, small, medium, large-v3-turbo). The pipeline computes log-mel spectrograms from raw audio, runs an encoder model to produce audio features, then uses an autoregressive decoder with greedy search and KV-cache optimization to generate the token sequence. Language detection runs automatically before transcription, with support for 99 languages.
Usage
Use this example when you need to perform speech recognition on WAV audio files using ncnn. It demonstrates how to implement encoder-decoder transformer architectures with autoregressive generation and KV-cache in ncnn, making it a reference for any sequence-to-sequence task on edge devices.
Code Reference
Source Location
- Repository: Tencent_Ncnn
- File: examples/whisper.cpp
- Lines: 1-940
Signature
class Whisper
{
public:
int load();
int detect_lang(const std::vector<short>& samples, std::string& lang) const;
int transcribe(const std::vector<short>& samples, const char* lang, std::string& text) const;
protected:
int extract_fbank_feature(const std::vector<short>& samples, ncnn::Mat& input_features) const;
int run_encoder(const ncnn::Mat& input_features, ncnn::Mat& encoder_states) const;
int run_decoder_prefill(const std::vector<int>& tokens, const ncnn::Mat& encoder_states,
ncnn::Mat& last_logits, std::vector<ncnn::Mat>& out_kvcache) const;
int run_decoder_step(const std::vector<int>& tokens, const ncnn::Mat& encoder_states,
ncnn::Mat& last_logits, const std::vector<ncnn::Mat>& kvcache,
std::vector<ncnn::Mat>& out_kvcache) const;
};
class Tokenizer
{
public:
std::vector<std::string> reverse_vocab;
void generate_byte_decoder();
int load(const char* path);
std::string decode(const std::vector<int>& tokens) const;
};
Import
#include "net.h"
#include "layer.h"
#include "layer_type.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| wavpath | const char* | Yes | Path to PCM s16le 16kHz mono WAV audio file |
Outputs
| Name | Type | Description |
|---|---|---|
| lang | std::string | Detected language code (e.g., "en", "zh", "ja") |
| text | std::string | Transcribed text from the audio |
Model Files
| File | Description |
|---|---|
| whisper_tiny_fbank.ncnn.param/bin | Log-mel spectrogram feature extractor |
| whisper_tiny_encoder.ncnn.param/bin | Audio encoder model |
| whisper_tiny_embed_token.ncnn.param/bin | Token embedding model |
| whisper_tiny_embed_position.ncnn.param/bin | Positional embedding model |
| whisper_tiny_decoder.ncnn.param/bin | Autoregressive decoder model |
| whisper_tiny_proj_out.ncnn.param/bin | Output projection model |
| whisper_vocab.txt | BPE vocabulary file (converted from vocab.json) |
Usage Examples
Running the Example
./whisper audio.wav
Key Code Pattern
// Load WAV audio samples
std::vector<short> samples;
load_wav_samples(wavpath, samples);
// Truncate to the 30-second window (16000 Hz * 30 s = 480000 samples)
if (samples.size() > 480000)
samples.resize(480000);
// Initialize and load Whisper models
Whisper whisper;
whisper.load();
// Detect language
std::string lang;
whisper.detect_lang(samples, lang);
// Transcribe audio to text
std::string text;
whisper.transcribe(samples, lang.c_str(), text);
Implementation Details
Pipeline Architecture
The Whisper implementation uses six separate ncnn::Net instances to compose the full encoder-decoder pipeline:
- fbank - Computes log-mel spectrogram features from raw audio
- encoder - Processes spectrograms into audio feature representations
- embed_token - Converts token IDs to embeddings
- embed_position - Adds positional encoding to token embeddings
- decoder - Autoregressive transformer decoder with MultiHeadAttention layers
- proj_out - Projects decoder output to vocabulary logits
KV-Cache Optimization
The decoder resolves KV-cache blob indexes at load time by iterating through MultiHeadAttention layers. During generation, run_decoder_prefill processes the initial token sequence, and run_decoder_step processes one token at a time reusing cached key-value states from previous steps.
Audio Requirements
Input audio must be PCM s16le, mono, 16kHz WAV format. Audio longer than 30 seconds is truncated. The conversion command is:
ffmpeg -i input.xxx -vn -c:a pcm_s16le -ac 1 -ar 16000 -fflags bitexact output.wav