| Field | Value |
| --- | --- |
| implementation_name | LLM_Config_JSON |
| implementation_type | Pattern Doc |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Runtime Configuration |
| source_file | docs/transformers/llm.md (L365-502) |
| last_updated | 2026-02-10 14:00 GMT |
## Summary

This implementation documents the `config.json` configuration file format used to control MNN LLM inference behavior at runtime. The file resides in the exported model directory alongside `llm.mnn`, `llm.mnn.weight`, and the other model artifacts. It provides a declarative interface for tuning the hardware backend, precision, memory, sampling, and generation parameters without modifying the model itself.
## API Signature

Edit `config.json` in the model directory (JSON key-value configuration).

The `config.json` is read by `Llm::createLLM(config_path)` in the C++ runtime and can also be updated dynamically via the `llm->set_config(json_string)` API.
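Because the file is plain JSON, runtime settings can be prepared or patched offline before the C++ runtime loads them. A minimal sketch of that workflow, assuming a hypothetical `update_config` helper (not part of MNN) that merges override keys into an exported model's `config.json`:

```python
import json
from pathlib import Path

def update_config(model_dir: str, overrides: dict) -> dict:
    """Merge override keys into the config.json of an exported model directory.
    Hypothetical helper for illustration; the real runtime reads the file itself."""
    path = Path(model_dir) / "config.json"
    # Start from the exported defaults if the file exists, else from empty.
    config = json.loads(path.read_text()) if path.exists() else {}
    config.update(overrides)
    path.write_text(json.dumps(config, indent=2))
    return config
```

Keys not mentioned in the overrides keep the defaults written by `llmexport.py`.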
## Source Reference

The full configuration schema is documented in the MNN LLM documentation: `docs/transformers/llm.md` (Lines 365-502).
## Configuration Schema

### Model File Information

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `base_dir` | string | (directory of config.json) | Base directory for resolving model file paths |
| `llm_config` | string | "config.json" | Path to llm_config.json (relative to `base_dir`) |
| `llm_model` | string | "llm.mnn" | Path to the MNN model file |
| `llm_weight` | string | "llm.mnn.weight" | Path to the MNN weight file |
| `block_model` | string | "block_{idx}.mnn" | Path pattern for segmented block models |
| `lm_model` | string | "lm.mnn" | Path to the segmented LM-head model |
| `embedding_model` | string | "embedding.mnn" | Path to the embedding model (when using model-based embedding) |
| `embedding_file` | string | "embeddings_bf16.bin" | Path to the binary embedding file |
| `tokenizer_file` | string | "tokenizer.txt" | Path to the tokenizer file |
| `visual_model` | string | "visual.mnn" | Path to the vision encoder (VL models) |
| `audio_model` | string | "audio.mnn" | Path to the audio encoder (audio models) |
### Hardware Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `backend_type` | string | "cpu" | Inference backend: "cpu", "opencl", or "metal" |
| `thread_num` | int | 4 | CPU thread count. For OpenCL, use 68 (buffer mode + wide tuning) |
| `precision` | string | "low" | Precision strategy: "low" (fp16) or "high" (fp32) |
| `memory` | string | "low" | Memory strategy: "low" (runtime quantization enabled) or "high" (disabled) |
### Inference Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `max_new_tokens` | int | 512 | Maximum number of tokens to generate per response |
| `reuse_kv` | bool | false | Reuse the KV cache across multi-turn conversations |
| `attention_mode` | int | 8 | CPU: 0/1/2 (no Flash Attention) or 8/9/10 (Flash Attention) with QKV quantization levels. GPU: 0/8/16 |
| `use_mmap` | bool | false | Use mmap for weight loading (writes weights to disk when memory is insufficient) |
| `kvcache_mmap` | bool | false | Use mmap for the KV cache (writes to disk when memory is insufficient) |
| `chunk` | int | (none) | Maximum tokens per processing step (splits long prompts to reduce memory) |
| `chunk_limits` | array | (none) | Token processing limits, e.g., [128, 1]. Overrides `chunk` |
| `tmp_path` | string | (none) | Temporary directory for mmap cache files |
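The `chunk` option bounds how many prompt tokens are processed per prefill step; conceptually, it slices the prompt into fixed-size pieces as sketched below (illustrative only, not MNN's implementation):

```python
def split_prompt(token_ids: list[int], chunk: int) -> list[list[int]]:
    """Split a prompt into slices of at most `chunk` tokens, mirroring how a
    chunk limit bounds per-step memory during prefill (hypothetical helper)."""
    return [token_ids[i:i + chunk] for i in range(0, len(token_ids), chunk)]
```

With `"chunk": 128`, a 300-token prompt would be prefilled in three steps of 128, 128, and 44 tokens, so peak activation memory scales with the chunk size rather than the full prompt length.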
### CPU Dynamic Quantization Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `dynamic_option` | int | 0 | Feature-map quantization mode: 0 (per-channel), 1 (per-tensor), 2 (per-block), 8+ (decode acceleration) |
| `cpu_sme2_neon_division_ratio` | int | 41 | SME2/NEON workload ratio (format: 8*x+y, where x = prefill ratio and y = decode ratio) |
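Under the 8*x+y encoding, the default `cpu_sme2_neon_division_ratio` of 41 decodes to a prefill ratio of 5 and a decode ratio of 1. A small sketch of the packing (helper names are hypothetical):

```python
def pack_division_ratio(prefill_ratio: int, decode_ratio: int) -> int:
    """Pack the SME2/NEON workload split into the 8*x + y encoding."""
    return 8 * prefill_ratio + decode_ratio

def unpack_division_ratio(value: int) -> tuple[int, int]:
    """Recover (prefill_ratio, decode_ratio) from the packed value."""
    return divmod(value, 8)
```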
### Sampler Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `sampler_type` | string | "greedy" | Sampler type: "greedy", "temperature", "topK", "topP", "minP", "tfs", "typical", "penalty", or "mixed" |
| `mixed_samplers` | array | ["topK", "tfs", "typical", "topP", "min_p", "temperature"] | Sampler chain for "mixed" mode |
| `temperature` | float | 1.0 | Sampling temperature |
| `topK` | int | 40 | Top-K filtering threshold |
| `topP` | float | 0.9 | Top-P (nucleus) filtering threshold |
| `minP` | float | 0.1 | Min-P filtering threshold |
| `tfsZ` | float | 1.0 | Tail-free sampling Z parameter (1.0 = disabled) |
| `typical` | float | 1.0 | Typical sampling p parameter (1.0 = disabled) |
| `penalty` | float | 0.0 | Repetition penalty (0.0 = disabled; 1.05-1.5 recommended) |
| `n_gram` | int | 8 | Maximum n-gram size for the repetition penalty |
| `ngram_factor` | float | 1.0 | Extra penalty for repeated n-grams (n > 1) |
| `penalty_sampler` | string | "greedy" | Sampling strategy after penalty application ("greedy" or "temperature") |
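To make the sampler parameters concrete, here is an illustrative sketch of one top-K plus temperature step over raw logits (a simplified stand-in, not MNN's actual sampler code):

```python
import math

def topk_temperature_probs(logits: list[float], k: int, temperature: float) -> list[float]:
    """Keep the top-k logits, apply temperature scaling, and renormalize with
    softmax. Illustrates the effect of the topK/temperature config keys."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    # Tokens outside the top-k are masked out before the softmax.
    scaled = [logits[i] / temperature if i in keep else float("-inf")
              for i in range(len(logits))]
    m = max(scaled)
    exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

In "mixed" mode, each sampler in `mixed_samplers` applies a filtering step like this in sequence before the final token is drawn.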
### Speculative Decoding Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `speculative_type` | string | (none) | Speculative decoding algorithm: "lookahead" |
| `draft_predict_length` | int | 4 | Draft sequence length (2-8) |
| `draft_match_strictness` | string | "low" | Draft matching strictness: "low", "medium", or "high" |
| `draft_selection_rule` | string | "freqxlen" | Draft selection rule: "freqxlen" or "fcfs" |
| `lookup_file` | string | "lookup_file.txt" | External knowledge-base file for lookahead decoding |
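The acceptance step common to speculative decoders can be sketched as follows: the target model verifies the draft tokens position by position and keeps the longest agreeing prefix. This is a simplified illustration; MNN's lookahead matching with its strictness levels is more involved.

```python
def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Accept the longest prefix of the draft that the target model's own
    position-wise predictions agree with; the first diverging position is
    replaced by the target's token (simplified sketch)."""
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            accepted.append(t)  # target's token replaces the first mismatch
            return accepted
        accepted.append(d)
    return accepted
```

Longer `draft_predict_length` values pay off only when enough draft tokens are accepted per step, which is why the documented range is 2-8.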
## Inputs

- Exported model directory from `llmexport.py` containing the model files and a default `config.json`

## Outputs

- A configured `config.json` file ready for use by `llm_demo`, `llm_bench`, or the C++ API
## Usage Examples

### Minimal CPU Configuration

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low"
}
```

### Full Configuration with Mixed Sampling

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "sampler_type": "mixed",
  "mixed_samplers": ["topK", "tfs", "typical", "topP", "min_p", "temperature"],
  "temperature": 1.0,
  "topK": 40,
  "topP": 0.9,
  "tfsZ": 1.0,
  "minP": 0.1,
  "reuse_kv": true
}
```

### OpenCL GPU Configuration

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "opencl",
  "thread_num": 68,
  "precision": "low",
  "memory": "low",
  "max_new_tokens": 512,
  "sampler_type": "temperature",
  "temperature": 0.7
}
```

### Mobile Configuration with Memory Optimization

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "use_mmap": true,
  "kvcache_mmap": true,
  "chunk": 128,
  "max_new_tokens": 256,
  "reuse_kv": true
}
```
## Notes

- The `config.json` is auto-generated by `llmexport.py` during model export with sensible defaults; manual editing is only needed for tuning.
- When using the OpenCL backend, the first run performs kernel tuning, which is slow. Measure performance on subsequent runs, after the tuning cache has been generated.
- On iOS, `tmp_path` should be set to a temporary directory, e.g., via `NSTemporaryDirectory()`.
- The `attention_mode` parameter replaces the deprecated `quant_qkv` parameter.
- Dynamic configuration updates at runtime are supported via `llm->set_config(json_string)` in the C++ API.
## Related Pages