| Field | Value |
| --- | --- |
| implementation_name | LLM_Config_JSON |
| implementation_type | Pattern Doc |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Runtime Configuration |
| source_file | docs/transformers/llm.md (L365-502) |
| last_updated | 2026-02-10 14:00 GMT |
## Summary

This implementation documents the `config.json` configuration file format used to control MNN LLM inference behavior at runtime. The file resides in the exported model directory alongside `llm.mnn`, `llm.mnn.weight`, and the other model artifacts. It provides a declarative interface for tuning the hardware backend, precision, memory, sampling, and generation parameters without modifying the model itself.
## API Signature

Edit `config.json` in the model directory (JSON key-value configuration).

The `config.json` is read by `Llm::createLLM(config_path)` in the C++ runtime and can also be updated dynamically via the `llm->set_config(json_string)` API.
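Because the file is plain JSON, runtime settings can be prepared or patched offline before the C++ runtime loads them. A minimal sketch of that workflow, assuming a hypothetical `update_config` helper (not part of MNN) that merges override keys into an exported model's `config.json`:

```python
import json
from pathlib import Path

def update_config(model_dir: str, overrides: dict) -> dict:
    """Merge override keys into the config.json of an exported model directory.
    Hypothetical helper for illustration; the real runtime reads the file itself."""
    path = Path(model_dir) / "config.json"
    # Start from the exported defaults if the file exists, else from empty.
    config = json.loads(path.read_text()) if path.exists() else {}
    config.update(overrides)
    path.write_text(json.dumps(config, indent=2))
    return config
```

Keys not mentioned in the overrides keep the defaults written by `llmexport.py`.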
## Source Reference

The full configuration schema is documented in the MNN LLM documentation: `docs/transformers/llm.md` (Lines 365-502).
## Configuration Schema

### Model File Information

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `base_dir` | string | (directory of config.json) | Base directory for resolving model file paths |
| `llm_config` | string | "config.json" | Path to llm_config.json (relative to `base_dir`) |
| `llm_model` | string | "llm.mnn" | Path to the MNN model file |
| `llm_weight` | string | "llm.mnn.weight" | Path to the MNN weight file |
| `block_model` | string | "block_{idx}.mnn" | Path pattern for segmented block models |
| `lm_model` | string | "lm.mnn" | Path to the segmented LM-head model |
| `embedding_model` | string | "embedding.mnn" | Path to the embedding model (when using model-based embedding) |
| `embedding_file` | string | "embeddings_bf16.bin" | Path to the binary embedding file |
| `tokenizer_file` | string | "tokenizer.txt" | Path to the tokenizer file |
| `visual_model` | string | "visual.mnn" | Path to the vision encoder (VL models) |
| `audio_model` | string | "audio.mnn" | Path to the audio encoder (audio models) |
### Hardware Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `backend_type` | string | "cpu" | Inference backend: "cpu", "opencl", or "metal" |
| `thread_num` | int | 4 | CPU thread count. For OpenCL, use 68 (buffer mode + wide tuning) |
| `precision` | string | "low" | Precision strategy: "low" (fp16) or "high" (fp32) |
| `memory` | string | "low" | Memory strategy: "low" (runtime quantization enabled) or "high" (disabled) |
### Inference Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `max_new_tokens` | int | 512 | Maximum number of tokens to generate per response |
| `reuse_kv` | bool | false | Reuse the KV cache across multi-turn conversations |
| `attention_mode` | int | 8 | CPU: 0/1/2 (no Flash Attention) or 8/9/10 (Flash Attention) with QKV quantization levels. GPU: 0/8/16 |
| `use_mmap` | bool | false | Use mmap for weight loading (writes weights to disk when memory is insufficient) |
| `kvcache_mmap` | bool | false | Use mmap for the KV cache (writes to disk when memory is insufficient) |
| `chunk` | int | (none) | Maximum tokens per processing step (splits long prompts to reduce memory) |
| `chunk_limits` | array | (none) | Token processing limits, e.g., [128, 1]. Overrides `chunk` |
| `tmp_path` | string | (none) | Temporary directory for mmap cache files |
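The `chunk` option bounds how many prompt tokens are processed per prefill step; conceptually, it slices the prompt into fixed-size pieces as sketched below (illustrative only, not MNN's implementation):

```python
def split_prompt(token_ids: list[int], chunk: int) -> list[list[int]]:
    """Split a prompt into slices of at most `chunk` tokens, mirroring how a
    chunk limit bounds per-step memory during prefill (hypothetical helper)."""
    return [token_ids[i:i + chunk] for i in range(0, len(token_ids), chunk)]
```

With `"chunk": 128`, a 300-token prompt would be prefilled in three steps of 128, 128, and 44 tokens, so peak activation memory scales with the chunk size rather than the full prompt length.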
### CPU Dynamic Quantization Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `dynamic_option` | int | 0 | Feature-map quantization mode: 0 (per-channel), 1 (per-tensor), 2 (per-block), 8+ (decode acceleration) |
| `cpu_sme2_neon_division_ratio` | int | 41 | SME2/NEON workload ratio (format: 8*x+y, where x = prefill ratio and y = decode ratio) |
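Under the 8*x+y encoding, the default `cpu_sme2_neon_division_ratio` of 41 decodes to a prefill ratio of 5 and a decode ratio of 1. A small sketch of the packing (helper names are hypothetical):

```python
def pack_division_ratio(prefill_ratio: int, decode_ratio: int) -> int:
    """Pack the SME2/NEON workload split into the 8*x + y encoding."""
    return 8 * prefill_ratio + decode_ratio

def unpack_division_ratio(value: int) -> tuple[int, int]:
    """Recover (prefill_ratio, decode_ratio) from the packed value."""
    return divmod(value, 8)
```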
### Sampler Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `sampler_type` | string | "greedy" | Sampler type: "greedy", "temperature", "topK", "topP", "minP", "tfs", "typical", "penalty", or "mixed" |
| `mixed_samplers` | array | ["topK", "tfs", "typical", "topP", "min_p", "temperature"] | Sampler chain for "mixed" mode |
| `temperature` | float | 1.0 | Sampling temperature |
| `topK` | int | 40 | Top-K filtering threshold |
| `topP` | float | 0.9 | Top-P (nucleus) filtering threshold |
| `minP` | float | 0.1 | Min-P filtering threshold |
| `tfsZ` | float | 1.0 | Tail-free sampling Z parameter (1.0 = disabled) |
| `typical` | float | 1.0 | Typical sampling p parameter (1.0 = disabled) |
| `penalty` | float | 0.0 | Repetition penalty (0.0 = disabled; 1.05-1.5 recommended) |
| `n_gram` | int | 8 | Maximum n-gram size for the repetition penalty |
| `ngram_factor` | float | 1.0 | Extra penalty for repeated n-grams (n > 1) |
| `penalty_sampler` | string | "greedy" | Sampling strategy after penalty application ("greedy" or "temperature") |
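To make the sampler parameters concrete, here is an illustrative sketch of one top-K plus temperature step over raw logits (a simplified stand-in, not MNN's actual sampler code):

```python
import math

def topk_temperature_probs(logits: list[float], k: int, temperature: float) -> list[float]:
    """Keep the top-k logits, apply temperature scaling, and renormalize with
    softmax. Illustrates the effect of the topK/temperature config keys."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    # Tokens outside the top-k are masked out before the softmax.
    scaled = [logits[i] / temperature if i in keep else float("-inf")
              for i in range(len(logits))]
    m = max(scaled)
    exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

In "mixed" mode, each sampler in `mixed_samplers` applies a filtering step like this in sequence before the final token is drawn.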
### Speculative Decoding Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `speculative_type` | string | (none) | Speculative decoding algorithm: "lookahead" |
| `draft_predict_length` | int | 4 | Draft sequence length (2-8) |
| `draft_match_strictness` | string | "low" | Draft matching strictness: "low", "medium", or "high" |
| `draft_selection_rule` | string | "freqxlen" | Draft selection rule: "freqxlen" or "fcfs" |
| `lookup_file` | string | "lookup_file.txt" | External knowledge-base file for lookahead decoding |
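The acceptance step common to speculative decoders can be sketched as follows: the target model verifies the draft tokens position by position and keeps the longest agreeing prefix. This is a simplified illustration; MNN's lookahead matching with its strictness levels is more involved.

```python
def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Accept the longest prefix of the draft that the target model's own
    position-wise predictions agree with; the first diverging position is
    replaced by the target's token (simplified sketch)."""
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            accepted.append(t)  # target's token replaces the first mismatch
            return accepted
        accepted.append(d)
    return accepted
```

Longer `draft_predict_length` values pay off only when enough draft tokens are accepted per step, which is why the documented range is 2-8.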
## Inputs

- Exported model directory from `llmexport.py` containing the model files and a default `config.json`

## Outputs

- A configured `config.json` file ready for use by `llm_demo`, `llm_bench`, or the C++ API
## Usage Examples

### Minimal CPU Configuration

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low"
}
```

### Full Configuration with Mixed Sampling

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "sampler_type": "mixed",
  "mixed_samplers": ["topK", "tfs", "typical", "topP", "min_p", "temperature"],
  "temperature": 1.0,
  "topK": 40,
  "topP": 0.9,
  "tfsZ": 1.0,
  "minP": 0.1,
  "reuse_kv": true
}
```

### OpenCL GPU Configuration

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "opencl",
  "thread_num": 68,
  "precision": "low",
  "memory": "low",
  "max_new_tokens": 512,
  "sampler_type": "temperature",
  "temperature": 0.7
}
```

### Mobile Configuration with Memory Optimization

```json
{
  "llm_model": "qwen2-1.5b-int4.mnn",
  "llm_weight": "qwen2-1.5b-int4.mnn.weight",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "use_mmap": true,
  "kvcache_mmap": true,
  "chunk": 128,
  "max_new_tokens": 256,
  "reuse_kv": true
}
```
## Notes

- The `config.json` is auto-generated by `llmexport.py` during model export with sensible defaults; manual editing is only needed for tuning.
- When using the OpenCL backend, the first run performs kernel tuning, which is slow. Measure performance on subsequent runs, after the tuning cache has been generated.
- On iOS, `tmp_path` should be set to a temporary directory, e.g., via `NSTemporaryDirectory()`.
- The `attention_mode` parameter replaces the deprecated `quant_qkv` parameter.
- Dynamic configuration updates at runtime are supported via `llm->set_config(json_string)` in the C++ API.
## Related Pages