Principle:Tencent Ncnn Autoregressive Sequence Decoding
| Knowledge Sources | |
|---|---|
| Domains | Sequence Modeling, Natural Language Processing |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
An encoder-decoder inference pattern where the encoder processes the full input once and the decoder generates output tokens iteratively, each step conditioned on all previously generated tokens, until an end-of-sequence signal is emitted.
Description
Autoregressive sequence decoding is the inference-time strategy used by encoder-decoder models to produce variable-length output sequences. The process is divided into two distinct phases.
In the encoding phase, the entire input (such as an audio spectrogram or an image feature map) is passed through the encoder network in a single forward pass. The encoder produces a sequence of hidden state vectors that capture the contextual representation of the input. This encoding is computed only once and reused throughout all subsequent decoding steps.
In the decoding phase, the decoder generates one token at a time in a left-to-right fashion. At each time step, the decoder receives the previously generated token (or a start-of-sequence token at the first step), attends to the encoder hidden states via cross-attention, and produces a probability distribution over the output vocabulary. The token with the highest probability (greedy decoding) or a token sampled from the distribution is selected, appended to the output sequence, and fed back as input for the next step. This loop continues until the decoder emits an end-of-sequence token or a maximum length is reached.
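The token-selection step described above can be sketched as follows. This is a minimal illustration in plain Python, not ncnn code: the helper names `softmax` and `select_token`, the `temperature` parameter, and the example logits are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_token(logits, greedy=True, temperature=1.0, rng=None):
    """Pick the next token id from one decoding step's logits."""
    if greedy:
        # Greedy decoding: index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    probs = softmax([x / temperature for x in logits])
    # Sampling: draw one index with probability given by probs.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [0.1, 2.5, -1.0, 0.7]  # made-up scores for a 4-token vocabulary
greedy_id = select_token(logits)
sampled_id = select_token(logits, greedy=False, temperature=0.8,
                          rng=random.Random(0))
```

Greedy selection is deterministic, while sampling can return a different token on each run unless the random generator is seeded.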
A critical optimization in this pattern is key-value caching: the decoder caches the key and value projections from previous time steps in its self-attention layers, so that each new step computes projections only for the latest token and attends over the cached entries, rather than re-processing the entire generated sequence.
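To make the caching idea concrete, the sketch below compares attention computed with an incrementally filled cache against attention recomputed over the whole prefix. The `KVCache` class, the `attend` helper, and the toy vectors are hypothetical. A real decoder would produce queries, keys, and values from learned projection matrices.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over lists of keys/values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

class KVCache:
    """Hypothetical per-layer cache: one key and one value vector per past step."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the newest token's key/value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy (query, key, value) triples, one per decoding step.
steps = [([1.0, 0.0], [0.5, 0.5], [1.0, 2.0]),
         ([0.0, 1.0], [0.2, 0.9], [3.0, 4.0]),
         ([1.0, 1.0], [0.7, 0.1], [5.0, 6.0])]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in steps]

# Reference: recompute attention over the whole prefix at the last step.
full_out = attend(steps[-1][0],
                  [k for _, k, _ in steps],
                  [v for _, _, v in steps])
```

The last cached output matches the full recomputation, but each cached step adds only one key and one value instead of rebuilding them for every earlier token.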
Usage
This principle applies in any inference scenario involving sequence-to-sequence generation:
- Speech recognition: Transcribing audio frames into text token sequences (e.g., Whisper).
- Optical character recognition: Decoding detected text regions into character sequences using attention-based decoders.
- Machine translation: Converting a sentence from one language to another.
- Image captioning: Generating natural language descriptions from visual features.
Theoretical Basis
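The autoregressive factorization of the output sequence probability, for an input x and an output sequence y_1, ..., y_T:
$$P(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$$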
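In practice the factorization is usually evaluated in log space, where the product of per-step conditional probabilities becomes a sum. The sketch below illustrates this with a hypothetical toy model: `next_token_probs` is invented for the example, ignores the input, and looks only at the previous token, whereas a real decoder conditions on the whole prefix and on the encoder output.

```python
import math

def next_token_probs(prefix):
    """Toy conditional distribution over a 3-token vocabulary {0, 1, 2}."""
    last = prefix[-1] if prefix else None
    if last is None:
        return [0.5, 0.3, 0.2]   # distribution for the first token
    if last == 0:
        return [0.1, 0.6, 0.3]
    return [0.2, 0.2, 0.6]

def sequence_log_prob(tokens):
    """Sum of log P(y_t | y_<t), i.e. the log of the factorized product."""
    total = 0.0
    for t, tok in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[tok])
    return total

# The sequence [0, 1, 2] has probability 0.5 * 0.6 * 0.6 under the toy model.
log_p = sequence_log_prob([0, 1, 2])
```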
The decoding loop in pseudo-code:
// Encoding phase (run once)
encoder_hidden = Encoder(input_features)

// Decoding phase (iterative)
token = START_OF_SEQUENCE
output_tokens = []
kv_cache = empty
while len(output_tokens) < max_length:
    logits, kv_cache = Decoder(token, encoder_hidden, kv_cache)
    token = argmax(logits)  // greedy decoding
    if token == END_OF_SEQUENCE:
        break  // stop without adding the end marker to the output
    output_tokens.append(token)
return output_tokens
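A runnable version of this loop might look like the following. It is a sketch under stated assumptions: `greedy_decode`, `toy_decoder_step`, and the token ids are hypothetical. The toy decoder copies the encoder output token by token and stands in for a real network. In this sketch the end-of-sequence marker is not added to the returned list.

```python
def greedy_decode(decoder_step, encoder_hidden, sos_id, eos_id, max_length):
    """Call decoder_step until it emits eos_id or max_length tokens exist."""
    token, kv_cache, output_tokens = sos_id, [], []
    while len(output_tokens) < max_length:
        logits, kv_cache = decoder_step(token, encoder_hidden, kv_cache)
        token = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        if token == eos_id:
            break  # stop without emitting the end marker
        output_tokens.append(token)
    return output_tokens

VOCAB, SOS, EOS = 5, 0, 1  # assumed vocabulary size and special token ids

def toy_decoder_step(token, encoder_hidden, kv_cache):
    """Stand-in decoder: emits encoder_hidden[i] at step i, then EOS."""
    kv_cache = kv_cache + [token]  # stands in for cached keys/values
    step = len(kv_cache) - 1
    target = encoder_hidden[step] if step < len(encoder_hidden) else EOS
    logits = [1.0 if i == target else 0.0 for i in range(VOCAB)]
    return logits, kv_cache

result = greedy_decode(toy_decoder_step, [3, 4, 2], SOS, EOS, max_length=10)
truncated = greedy_decode(toy_decoder_step, [3, 4, 2], SOS, EOS, max_length=2)
```

The first call ends because the toy decoder emits the end-of-sequence id, and the second ends because the length limit is reached first. These are the two exit conditions of the loop.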
Cross-attention at each decoder layer allows the decoder to focus on relevant parts of the encoder output:
Q = W_q * decoder_hidden // query from decoder state
K = W_k * encoder_hidden // key from encoder output
V = W_v * encoder_hidden // value from encoder output
attention_weights = softmax(Q * K^T / sqrt(d_k))
context = attention_weights * V
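The same computation for a single decoder position and a single attention head can be written out in plain Python. This is an illustrative sketch: `matvec`, `cross_attention`, the identity projection matrices, and the example vectors are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross_attention(decoder_hidden, encoder_hidden, W_q, W_k, W_v):
    """Single-head cross-attention for one decoder position."""
    q = matvec(W_q, decoder_hidden)                   # query from decoder state
    keys = [matvec(W_k, h) for h in encoder_hidden]   # one key per encoder position
    values = [matvec(W_v, h) for h in encoder_hidden] # one value per encoder position
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)                                # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]               # softmax over encoder positions
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return context, weights

identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder positions
decoder_hidden = [2.0, 0.0]
context, weights = cross_attention(decoder_hidden, encoder_hidden,
                                   identity, identity, identity)
```

Because the keys and values depend only on the encoder output, they can be computed once after the encoding phase and reused at every decoding step.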