
Implementation:Tencent Ncnn Whisper Example

From Leeroopedia


Knowledge Sources
Domains: Audio, Speech_Recognition
Last Updated: 2026-02-09 19:00 GMT

Overview

A concrete example of speech-to-text transcription using OpenAI Whisper with ncnn.

Description

This example implements OpenAI Whisper automatic speech recognition using the ncnn inference framework. It provides a complete encoder-decoder transformer pipeline that converts PCM 16-bit 16kHz WAV audio into text transcriptions. The implementation includes a Tokenizer class for BPE vocabulary decoding, a Whisper class orchestrating the full pipeline, and support for multiple model sizes (tiny, base, small, medium, large-v3-turbo). The pipeline computes log-mel spectrograms from raw audio, runs an encoder model to produce audio features, then uses an autoregressive decoder with greedy search and KV-cache optimization to generate token sequences. Language detection is performed automatically before transcription, with support for 99 languages.

Usage

Use this example when you need to perform speech recognition on WAV audio files using ncnn. It demonstrates how to implement encoder-decoder transformer architectures with autoregressive generation and KV-cache in ncnn, making it a reference for any sequence-to-sequence task on edge devices.

Code Reference

Source Location

Signature

class Whisper
{
public:
    int load();
    int detect_lang(const std::vector<short>& samples, std::string& lang) const;
    int transcribe(const std::vector<short>& samples, const char* lang, std::string& text) const;

protected:
    int extract_fbank_feature(const std::vector<short>& samples, ncnn::Mat& input_features) const;
    int run_encoder(const ncnn::Mat& input_features, ncnn::Mat& encoder_states) const;
    int run_decoder_prefill(const std::vector<int>& tokens, const ncnn::Mat& encoder_states,
                            ncnn::Mat& last_logits, std::vector<ncnn::Mat>& out_kvcache) const;
    int run_decoder_step(const std::vector<int>& tokens, const ncnn::Mat& encoder_states,
                         ncnn::Mat& last_logits, const std::vector<ncnn::Mat>& kvcache,
                         std::vector<ncnn::Mat>& out_kvcache) const;
};

class Tokenizer
{
public:
    std::vector<std::string> reverse_vocab;
    void generate_byte_decoder();
    int load(const char* path);
    std::string decode(const std::vector<int>& tokens) const;
};

Import

#include "net.h"
#include "layer.h"
#include "layer_type.h"

I/O Contract

Inputs

Name    | Type        | Required | Description
wavpath | const char* | Yes      | Path to PCM s16le 16kHz mono WAV audio file

Outputs

Name | Type        | Description
lang | std::string | Detected language code (e.g., "en", "zh", "ja")
text | std::string | Transcribed text from the audio

Model Files

File                                       | Description
whisper_tiny_fbank.ncnn.param/bin          | Log-mel spectrogram feature extractor
whisper_tiny_encoder.ncnn.param/bin        | Audio encoder model
whisper_tiny_embed_token.ncnn.param/bin    | Token embedding model
whisper_tiny_embed_position.ncnn.param/bin | Positional embedding model
whisper_tiny_decoder.ncnn.param/bin        | Autoregressive decoder model
whisper_tiny_proj_out.ncnn.param/bin       | Output projection model
whisper_vocab.txt                          | BPE vocabulary file (converted from vocab.json)

Usage Examples

Running the Example

./whisper audio.wav

Key Code Pattern

// Load WAV audio samples
std::vector<short> samples;
load_wav_samples(wavpath, samples);

// Truncate to 30 seconds maximum
if (samples.size() > 480000)
    samples.resize(480000);

// Initialize and load Whisper models
Whisper whisper;
whisper.load();

// Detect language
std::string lang;
whisper.detect_lang(samples, lang);

// Transcribe audio to text
std::string text;
whisper.transcribe(samples, lang.c_str(), text);

Implementation Details

Pipeline Architecture

The Whisper implementation uses six separate ncnn::Net instances to compose the full encoder-decoder pipeline:

  1. fbank - Computes log-mel spectrogram features from raw audio
  2. encoder - Processes spectrograms into audio feature representations
  3. embed_token - Converts token IDs to embeddings
  4. embed_position - Adds positional encoding to token embeddings
  5. decoder - Autoregressive transformer decoder with MultiHeadAttention layers
  6. proj_out - Projects decoder output to vocabulary logits

KV-Cache Optimization

The decoder resolves KV-cache blob indices at load time by iterating over its MultiHeadAttention layers. During generation, run_decoder_prefill processes the initial token sequence in one pass, and run_decoder_step then generates one token at a time, reusing the cached key/value states from previous steps instead of recomputing attention over the entire prefix.

Audio Requirements

Input audio must be PCM s16le, mono, 16kHz WAV format. Audio longer than 30 seconds is truncated. The conversion command is:

ffmpeg -i input.xxx -vn -c:a pcm_s16le -ac 1 -ar 16000 -fflags bitexact output.wav
