Implementation: Tencent Ncnn Whisper Example
| Knowledge Sources | |
|---|---|
| Domains | Audio, Speech_Recognition |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Concrete tool for speech-to-text transcription using OpenAI Whisper with ncnn.
Description
This example implements OpenAI Whisper automatic speech recognition using the ncnn inference framework. It provides a complete encoder-decoder transformer pipeline that converts PCM 16-bit, 16 kHz mono WAV audio into text transcriptions. The implementation includes a Tokenizer class for BPE vocabulary decoding, a Whisper class orchestrating the full pipeline, and support for multiple model sizes (tiny, base, small, medium, large-v3-turbo). The pipeline computes log-mel spectrograms from raw audio, runs an encoder model to produce audio features, then uses an autoregressive decoder with greedy search and KV-cache optimization to generate the token sequence. Language detection runs automatically before transcription, with support for 99 languages.
Usage
Use this example when you need to perform speech recognition on WAV audio files using ncnn. It demonstrates how to implement encoder-decoder transformer architectures with autoregressive generation and KV-cache in ncnn, making it a reference for any sequence-to-sequence task on edge devices.
Code Reference
Source Location
- Repository: Tencent_Ncnn
- File: examples/whisper.cpp
- Lines: 1-940
Signature
class Whisper
{
public:
int load();
int detect_lang(const std::vector<short>& samples, std::string& lang) const;
int transcribe(const std::vector<short>& samples, const char* lang, std::string& text) const;
protected:
int extract_fbank_feature(const std::vector<short>& samples, ncnn::Mat& input_features) const;
int run_encoder(const ncnn::Mat& input_features, ncnn::Mat& encoder_states) const;
int run_decoder_prefill(const std::vector<int>& tokens, const ncnn::Mat& encoder_states,
ncnn::Mat& last_logits, std::vector<ncnn::Mat>& out_kvcache) const;
int run_decoder_step(const std::vector<int>& tokens, const ncnn::Mat& encoder_states,
ncnn::Mat& last_logits, const std::vector<ncnn::Mat>& kvcache,
std::vector<ncnn::Mat>& out_kvcache) const;
};
class Tokenizer
{
public:
std::vector<std::string> reverse_vocab;
void generate_byte_decoder();
int load(const char* path);
std::string decode(const std::vector<int>& tokens) const;
};
Import
#include "net.h"
#include "layer.h"
#include "layer_type.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| wavpath | const char* | Yes | Path to PCM s16le 16kHz mono WAV audio file |
Outputs
| Name | Type | Description |
|---|---|---|
| lang | std::string | Detected language code (e.g., "en", "zh", "ja") |
| text | std::string | Transcribed text from the audio |
Model Files
| File | Description |
|---|---|
| whisper_tiny_fbank.ncnn.param/bin | Log-mel spectrogram feature extractor |
| whisper_tiny_encoder.ncnn.param/bin | Audio encoder model |
| whisper_tiny_embed_token.ncnn.param/bin | Token embedding model |
| whisper_tiny_embed_position.ncnn.param/bin | Positional embedding model |
| whisper_tiny_decoder.ncnn.param/bin | Autoregressive decoder model |
| whisper_tiny_proj_out.ncnn.param/bin | Output projection model |
| whisper_vocab.txt | BPE vocabulary file (converted from vocab.json) |
Usage Examples
Running the Example
./whisper audio.wav
Key Code Pattern
// Load WAV audio samples
std::vector<short> samples;
load_wav_samples(wavpath, samples);
// Truncate to the 30-second window (16000 Hz * 30 s = 480000 samples)
if (samples.size() > 480000)
samples.resize(480000);
// Initialize and load Whisper models
Whisper whisper;
whisper.load();
// Detect language
std::string lang;
whisper.detect_lang(samples, lang);
// Transcribe audio to text
std::string text;
whisper.transcribe(samples, lang.c_str(), text);
Implementation Details
Pipeline Architecture
The Whisper implementation uses six separate ncnn::Net instances to compose the full encoder-decoder pipeline:
- fbank - Computes log-mel spectrogram features from raw audio
- encoder - Processes spectrograms into audio feature representations
- embed_token - Converts token IDs to embeddings
- embed_position - Adds positional encoding to token embeddings
- decoder - Autoregressive transformer decoder with MultiHeadAttention layers
- proj_out - Projects decoder output to vocabulary logits
KV-Cache Optimization
The decoder resolves KV-cache blob indexes at load time by iterating through MultiHeadAttention layers. During generation, run_decoder_prefill processes the initial token sequence, and run_decoder_step processes one token at a time reusing cached key-value states from previous steps.
Audio Requirements
Input audio must be PCM s16le, mono, 16kHz WAV format. Audio longer than 30 seconds is truncated. The conversion command is:
ffmpeg -i input.xxx -vn -c:a pcm_s16le -ac 1 -ar 16000 -fflags bitexact output.wav