# Heuristic: Tencent ncnn FP16 Precision Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Precision |
| Last Updated | 2026-02-09 19:00 GMT |
## Overview
Precision selection guide for choosing between fp32, fp16, and int8 inference modes in ncnn, balancing speed gains (2-8x) against potential accuracy loss and NaN overflow risks.
## Description
ncnn supports three precision modes: fp32 (default full precision), fp16 (half-precision with packed/storage/arithmetic sub-options), and int8 (post-training quantized). FP16 is enabled by default and provides 2-4x speedup on hardware with native half-precision support (ARM fp16, x86 F16C, Vulkan). However, fp16 has a smaller dynamic range than fp32, and some models produce NaN values or incorrect results when fp16 arithmetic is enabled. Int8 quantization provides the largest speedup (4-8x) but requires a calibration step with representative data.
## Usage
Use this heuristic when tuning inference performance or debugging NaN/incorrect results. If a model produces NaN or wildly incorrect outputs, the first troubleshooting step is to disable fp16 flags. For maximum performance on supported hardware, keep fp16 enabled. For models with extreme dynamic range (e.g., detection heads, attention logits), consider disabling fp16 arithmetic while keeping fp16 storage.
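The middle ground described above can be sketched against ncnn's `Net`/`Option` API as follows (the model file names are placeholders, not from the source):

```cpp
#include "net.h"  // ncnn

int main() {
    ncnn::Net net;
    // Keep fp16 storage/packing for the memory-bandwidth win,
    // but compute in fp32 to avoid overflow in layers with
    // extreme dynamic range (detection heads, attention logits).
    net.opt.use_fp16_packed = true;
    net.opt.use_fp16_storage = true;
    net.opt.use_fp16_arithmetic = false;  // compute in fp32

    net.load_param("model.param");  // placeholder paths
    net.load_model("model.bin");
    return 0;
}
```

Note that `opt` must be set before `load_param`/`load_model`, since layer pipelines are specialized at load time.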
## The Insight (Rule of Thumb)
- Action: Start with defaults (fp16 all enabled). If NaN or wrong results appear, progressively disable fp16 flags.
- Value: Default: `use_fp16_packed=true, use_fp16_storage=true, use_fp16_arithmetic=true`. Fallback: set all three to `false`.
- Trade-off: FP16 saves 2-4x compute/bandwidth but risks overflow on models with large activations. Int8 saves 4-8x but requires calibration data and may lose accuracy on some layers.
- Priority Order for Disabling: First disable `use_fp16_arithmetic`, then `use_fp16_storage`, then `use_fp16_packed`.
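The priority order above can be expressed as a small fallback ladder; this is a hypothetical helper (the function name and `level` scheme are illustrative, not part of ncnn), where the caller re-runs inference and checks for NaN after each step:

```cpp
#include "net.h"  // ncnn

// Hypothetical fallback ladder: each level disables one more fp16 flag,
// in the recommended order: arithmetic first, then storage, then packed.
// level 0: all defaults; level 3: full fp32 path.
void apply_fp16_fallback(ncnn::Option& opt, int level) {
    opt.use_fp16_arithmetic = (level < 1);
    opt.use_fp16_storage    = (level < 2);
    opt.use_fp16_packed     = (level < 3);
}
```

The caller would increase `level` only until outputs become NaN-free, keeping as much of the fp16 speedup as the model tolerates.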
## Reasoning
FP16 has a maximum representable value of ~65504, compared to FP32's ~3.4e38. Models with unbounded intermediate values (e.g., pre-softmax logits, certain normalization layers) can overflow to NaN in fp16. The three fp16 flags control different aspects: packed controls SIMD register packing, storage controls memory layout, and arithmetic controls compute precision. Disabling arithmetic alone often fixes overflow while retaining most of the storage bandwidth benefit.
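The ~65504 ceiling can be verified by decoding the largest finite binary16 bit pattern (`0x7BFF`) by hand; this standalone sketch has no ncnn dependency and implements the standard IEEE 754 half-to-float decode:

```cpp
#include <cmath>
#include <cstdint>

// Decode an IEEE 754 binary16 bit pattern into a float.
inline float half_to_float(uint16_t h) {
    const uint32_t sign = (h >> 15) & 0x1;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t man  = h & 0x3FF;
    const float s = sign ? -1.0f : 1.0f;
    if (exp == 0)  return s * std::ldexp((float)man, -24);       // subnormal
    if (exp == 31) return man ? NAN : s * INFINITY;              // inf / NaN
    return s * std::ldexp(1.0f + man / 1024.0f, (int)exp - 15);  // normal
}
// half_to_float(0x7BFF) == 65504.0f, the largest finite fp16 value.
// Any activation above this saturates to +inf; a subsequent inf - inf
// or 0 * inf then produces the NaN seen in broken fp16 runs.
```

By contrast, fp32's largest finite value is about 3.4e38, which is why the same model runs cleanly with fp16 arithmetic disabled.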
Code evidence for defaults from `src/option.cpp:37-39`:

```cpp
use_fp16_packed = true;
use_fp16_storage = true;
use_fp16_arithmetic = true;
```
Int8 defaults from `src/option.cpp:40-42`:

```cpp
use_int8_packed = true;
use_int8_storage = true;
use_int8_arithmetic = false;
```
Documented fix for NaN from the FAQ:

```cpp
// Fix FP16 overflow producing NaN
net.opt.use_fp16_packed = false;
net.opt.use_fp16_storage = false;
net.opt.use_fp16_arithmetic = false;
```