
Implementation:Alibaba MNN LLM Config JSON

From Leeroopedia


Field | Value
implementation_name | LLM_Config_JSON
implementation_type | Pattern Doc
repository | Alibaba_MNN
workflow | LLM_Deployment_Pipeline
pipeline_stage | Runtime Configuration
source_file | docs/transformers/llm.md (L365-502)
last_updated | 2026-02-10 14:00 GMT

Summary

This implementation documents the config.json configuration file format used to control MNN LLM inference behavior at runtime. The file resides in the exported model directory alongside llm.mnn, llm.mnn.weight, and other model artifacts. It provides a declarative interface for tuning hardware backend, precision, memory, sampling, and generation parameters without modifying the model itself.

API Signature

Edit config.json in the model directory (JSON key-value configuration)

The config.json is read by Llm::createLLM(config_path) in the C++ runtime and can also be updated dynamically via the llm->set_config(json_string) API.
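For example, a runtime update passed to set_config is simply a JSON object containing the keys to change. A minimal sketch, using keys from the schema below (which keys actually take effect after load is a runtime detail not covered here):

```json
{
    "max_new_tokens": 256,
    "sampler_type": "temperature",
    "temperature": 0.8
}
```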

Source Reference

The full configuration schema is documented in the MNN LLM documentation:

docs/transformers/llm.md (Lines 365-502)

Configuration Schema

Model File Information

Key | Type | Default | Description
base_dir | string | (directory of config.json) | Base directory for resolving model file paths
llm_config | string | "config.json" | Path to llm_config.json (relative to base_dir)
llm_model | string | "llm.mnn" | Path to the MNN model file
llm_weight | string | "llm.mnn.weight" | Path to the MNN weight file
block_model | string | "block_{idx}.mnn" | Path pattern for segmented block models
lm_model | string | "lm.mnn" | Path for the segmented LM head model
embedding_model | string | "embedding.mnn" | Path for the embedding model (when using model-based embedding)
embedding_file | string | "embeddings_bf16.bin" | Path for the binary embedding file
tokenizer_file | string | "tokenizer.txt" | Path to the tokenizer file
visual_model | string | "visual.mnn" | Path for the vision encoder (VL models)
audio_model | string | "audio.mnn" | Path for the audio encoder (audio models)

Hardware Configuration

Key | Type | Default | Description
backend_type | string | "cpu" | Inference backend: "cpu", "opencl", or "metal"
thread_num | int | 4 | CPU thread count. For OpenCL, use 68 (buffer mode + wide tuning)
precision | string | "low" | Precision strategy: "low" (fp16) or "high" (fp32)
memory | string | "low" | Memory strategy: "low" (runtime quantization enabled) or "high" (disabled)

Inference Configuration

Key | Type | Default | Description
max_new_tokens | int | 512 | Maximum number of tokens to generate per response
reuse_kv | bool | false | Reuse the KV cache across multi-turn conversations
attention_mode | int | 8 | CPU: 0/1/2 (no Flash Attention) or 8/9/10 (Flash Attention), with increasing QKV quantization levels. GPU: 0/8/16
use_mmap | bool | false | Use mmap for weight loading (writes weights to disk when memory is insufficient)
kvcache_mmap | bool | false | Use mmap for the KV cache (writes to disk when memory is insufficient)
chunk | int | (none) | Maximum tokens per processing step (splits long prompts to reduce memory)
chunk_limits | array | (none) | Token processing limits, e.g. [128, 1]. Overrides chunk
tmp_path | string | (none) | Temporary directory for mmap cache files

CPU Dynamic Quantization Configuration

Key | Type | Default | Description
dynamic_option | int | 0 | Feature-map quantization mode: 0 (per-channel), 1 (per-tensor), 2 (per-block), 8+ (decode acceleration)
cpu_sme2_neon_division_ratio | int | 41 | SME2/NEON workload ratio (encoded as 8*x+y, where x is the prefill ratio and y the decode ratio)
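As an illustrative fragment (the dynamic_option value here is a sketch, not a tuning recommendation), enabling per-block feature-map quantization alongside the usual CPU settings would look like this:

```json
{
    "backend_type": "cpu",
    "memory": "low",
    "dynamic_option": 2
}
```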

Sampler Configuration

Key | Type | Default | Description
sampler_type | string | "greedy" | Sampler type: "greedy", "temperature", "topK", "topP", "minP", "tfs", "typical", "penalty", or "mixed"
mixed_samplers | array | ["topK", "tfs", "typical", "topP", "min_p", "temperature"] | Sampler chain for "mixed" mode
temperature | float | 1.0 | Sampling temperature
topK | int | 40 | Top-K filtering threshold
topP | float | 0.9 | Top-P (nucleus) filtering threshold
minP | float | 0.1 | Min-P filtering threshold
tfsZ | float | 1.0 | Tail-free sampling Z parameter (1.0 = disabled)
typical | float | 1.0 | Typical sampling p parameter (1.0 = disabled)
penalty | float | 0.0 | Repetition penalty (0.0 = disabled; 1.05-1.5 recommended)
n_gram | int | 8 | Maximum n-gram size for the repetition penalty
ngram_factor | float | 1.0 | Extra penalty for repeated n-grams (n > 1)
penalty_sampler | string | "greedy" | Sampling strategy after the penalty is applied ("greedy" or "temperature")
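No usage example below exercises the penalty sampler, so here is a sketch of one; the penalty value of 1.1 is an illustrative pick from the 1.05-1.5 range noted above, and the other values are the table defaults:

```json
{
    "sampler_type": "penalty",
    "penalty": 1.1,
    "n_gram": 8,
    "ngram_factor": 1.0,
    "penalty_sampler": "greedy"
}
```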

Speculative Decoding Configuration

Key | Type | Default | Description
speculative_type | string | (none) | Speculative decoding algorithm: "lookahead"
draft_predict_length | int | 4 | Draft sequence length (2-8)
draft_match_strictness | string | "low" | Draft matching strictness: "low", "medium", or "high"
draft_selection_rule | string | "freqxlen" | Draft selection rule: "freqxlen" or "fcfs"
lookup_file | string | "lookup_file.txt" | External knowledge base file for lookahead decoding
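Combining these keys, a minimal lookahead setup might look like the fragment below; all values are the table defaults, and lookup_file is omitted since it is only needed when an external knowledge base is supplied:

```json
{
    "speculative_type": "lookahead",
    "draft_predict_length": 4,
    "draft_match_strictness": "low",
    "draft_selection_rule": "freqxlen"
}
```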

Inputs

  • Exported model directory from llmexport.py containing model files and a default config.json

Outputs

  • A configured config.json file ready for use by llm_demo, llm_bench, or the C++ API

Usage Examples

Minimal CPU Configuration

{
    "llm_model": "qwen2-1.5b-int4.mnn",
    "llm_weight": "qwen2-1.5b-int4.mnn.weight",

    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low"
}

Full Configuration with Mixed Sampling

{
    "llm_model": "qwen2-1.5b-int4.mnn",
    "llm_weight": "qwen2-1.5b-int4.mnn.weight",

    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low",
    "sampler_type": "mixed",
    "mixed_samplers": ["topK", "tfs", "typical", "topP", "min_p", "temperature"],
    "temperature": 1.0,
    "topK": 40,
    "topP": 0.9,
    "tfsZ": 1.0,
    "minP": 0.1,
    "reuse_kv": true
}

OpenCL GPU Configuration

{
    "llm_model": "qwen2-1.5b-int4.mnn",
    "llm_weight": "qwen2-1.5b-int4.mnn.weight",

    "backend_type": "opencl",
    "thread_num": 68,
    "precision": "low",
    "memory": "low",
    "max_new_tokens": 512,
    "sampler_type": "temperature",
    "temperature": 0.7
}

Mobile Configuration with Memory Optimization

{
    "llm_model": "qwen2-1.5b-int4.mnn",
    "llm_weight": "qwen2-1.5b-int4.mnn.weight",

    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low",
    "use_mmap": true,
    "kvcache_mmap": true,
    "chunk": 128,
    "max_new_tokens": 256,
    "reuse_kv": true
}

Notes

  • The config.json is auto-generated with sensible defaults by llmexport.py during model export; manual editing is needed only for tuning.
  • When using OpenCL backend, the first run performs kernel tuning (which is slow). Performance should be measured on subsequent runs after the tuning cache is generated.
  • For iOS, the tmp_path should be set to a temporary directory, e.g., using NSTemporaryDirectory().
  • The attention_mode parameter replaces the deprecated quant_qkv parameter.
  • Dynamic configuration updates at runtime are supported via llm->set_config(json_string) in the C++ API.
