Implementation:Alibaba MNN Llmexport Script
| Field | Value |
|---|---|
| implementation_name | Llmexport_Script |
| implementation_type | API Doc |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Model Export |
| source_file | transformers/llm/export/llmexport.py (L676-758) |
| last_updated | 2026-02-10 14:00 GMT |
Summary
The llmexport.py script is the primary tool for converting HuggingFace-format LLM models into MNN inference format. It handles model loading, ONNX export, MNN conversion, weight quantization, tokenizer extraction, and configuration generation. The script is built around the LlmExporter class (inheriting from torch.nn.Module) and supports a wide range of model architectures through the ModelMapper system.
API Signature
python llmexport.py --path <model_dir> --export mnn [options]
Full CLI Signature
usage: llmexport.py [-h] --path PATH [--type TYPE] [--tokenizer_path TOKENIZER_PATH]
[--eagle_path EAGLE_PATH] [--lora_path LORA_PATH] [--gptq_path GPTQ_PATH]
[--dst_path DST_PATH] [--verbose] [--test TEST] [--export EXPORT]
[--onnx_slim] [--quant_bit QUANT_BIT] [--quant_block QUANT_BLOCK]
[--visual_quant_bit VISUAL_QUANT_BIT] [--visual_quant_block VISUAL_QUANT_BLOCK]
[--lm_quant_bit LM_QUANT_BIT] [--lm_quant_block LM_QUANT_BLOCK]
[--mnnconvert MNNCONVERT] [--ppl] [--awq] [--hqq] [--omni]
[--transformer_fuse] [--group_conv_native] [--smooth] [--sym]
[--visual_sym] [--seperate_embed] [--lora_split]
[--calib_data CALIB_DATA] [--act_bit ACT_BIT] [--embed_bit EMBED_BIT]
[--act_sym] [--quant_config QUANT_CONFIG] [--generate_for_npu]
[--skip_weight] [--omni_epochs OMNI_EPOCHS] [--omni_lr OMNI_LR]
[--omni_wd OMNI_WD]
Source Reference
The argument parser is defined in the build_args() function at line 676, the programmatic export() entry point is at line 724, and the CLI main() entry point is at line 737 of:
transformers/llm/export/llmexport.py
Key Imports
import onnx
import torch
from utils.model import LlmModel, EmbeddingModel
from utils.tokenizer import LlmTokenizer
from utils.spinner import spinner_run
from utils.custom_op import FakeLinear
from utils.onnx_rebuilder import OnnxRebuilder
from utils.mnn_converter import MNNConverter
from utils.awq_quantizer import AwqQuantizer
from utils.smooth_quantizer import SmoothQuantizer
from utils.omni_quantizer import OmniQuantizer
from utils.torch_utils import onnx_export
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --path | str | required | Path to the HuggingFace model directory or model ID |
| --export | str | None | Export format: 'onnx' or 'mnn' |
| --quant_bit | int | 4 | Quantization bit-width for model weights (4 or 8) |
| --quant_block | int | 64 | Quantization block size (0 = channel-wise) |
| --lm_quant_bit | int | None | Separate quantization bit-width for the LM head (defaults to --quant_bit) |
| --lm_quant_block | int | None | Separate quantization block size for the LM head (defaults to --quant_block) |
| --hqq | flag | False | Enable Half-Quadratic Quantization for improved accuracy |
| --awq | flag | False | Enable Activation-Aware Weight Quantization |
| --omni | flag | False | Enable OmniQuant learned quantization |
| --smooth | flag | False | Enable Smooth Quantization |
| --sym | flag | False | Use symmetric quantization (no zero-point) |
| --dst_path | str | './model' | Output directory for exported model files |
| --tokenizer_path | str | None | Custom tokenizer path (defaults to --path) |
| --lora_path | str | None | Path to LoRA weights for merging before export |
| --gptq_path | str | None | Path to pre-quantized GPTQ weights |
| --eagle_path | str | None | Path to an EAGLE speculative-decoding model |
| --mnnconvert | str | '../../../build/MNNConvert' | Path to a local MNNConvert binary (falls back to pymnn) |
| --seperate_embed | flag | False | Write embedding weights to embeddings_bf16.bin to keep them unquantized |
| --embed_bit | int | 16 | Embedding export bit precision (choices: 16, 8, 4) |
| --visual_quant_bit | int | None | Quantization bit-width for the visual encoder |
| --visual_sym | flag | False | Symmetric quantization for the visual model |
| --transformer_fuse | flag | False | Fuse vision transformer operations |
| --calib_data | str | None | Calibration data path for quantization methods that require it |
| --act_bit | int | 16 | Smooth quantization activation bit-width (8 or 16) |
| --test | str | None | Test model inference with the given query string |
| --skip_weight | flag | False | Skip loading model weights (for testing the export flow) |
| --generate_for_npu | flag | False | Generate a model for NPU deployment |
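As an illustration of the block-size semantics in the table, here is a hedged sketch (not code from llmexport.py) of the grouping that a --quant_block setting implies for block-wise weight quantization, where 0 means channel-wise:

```python
# Hedged sketch (not from llmexport.py): grouping implied by --quant_block.
# A block size of 0 means channel-wise quantization: one scale per output
# channel, i.e. a single group spanning the whole row.
def num_quant_groups(in_features: int, quant_block: int) -> int:
    if quant_block == 0:
        return 1  # channel-wise: the entire row shares one scale
    if in_features % quant_block != 0:
        raise ValueError("in_features must be divisible by quant_block")
    return in_features // quant_block
```

For example, a layer with 4096 input features and the default block size of 64 gets 64 quantization groups (scales) per output row.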
Inputs
- HuggingFace model directory containing config.json, tokenizer files, and weight files (.safetensors or .bin)
Outputs
The exported model directory (default ./model) contains:
| File | Description |
|---|---|
| llm.mnn | The MNN model graph file |
| llm.mnn.weight | The quantized weight data file |
| llm.mnn.json | MNN model JSON (for LoRA/GPTQ post-processing) |
| llm_config.json | Model configuration for the MNN runtime (hidden_size, layer_nums, key_value_shape, prompt_template) |
| config.json | Runtime inference configuration (modifiable by the user) |
| tokenizer.txt | Extracted tokenizer in MNN-compatible text format |
| embeddings_bf16.bin | Embedding weights in bf16 (only with --seperate_embed or for non-tied-embedding models) |
| onnx/llm.onnx | Intermediate ONNX model (not needed for inference) |
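After an export, the presence of the core files from this table can be checked with a small helper. This is a hedged convenience sketch, not part of llmexport.py; the artifact names come from the table above:

```python
from pathlib import Path

# Hedged helper (not part of llmexport.py): check that the core export
# artifacts listed in the Outputs table exist in the destination directory.
CORE_ARTIFACTS = ["llm.mnn", "llm.mnn.weight", "llm_config.json",
                  "config.json", "tokenizer.txt"]

def missing_artifacts(dst_path: str = "./model") -> list:
    root = Path(dst_path)
    return [name for name in CORE_ARTIFACTS if not (root / name).exists()]
```

An empty return value means all core files are present; otherwise the returned names identify what the export failed to produce.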
Usage Examples
Basic 4-bit Export
cd transformers/llm/export
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--export mnn
Export with HQQ Quantization
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--export mnn \
--hqq \
--quant_bit 4 \
--quant_block 64 \
--dst_path ./qwen2_mnn
Export with LoRA Merging
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--lora_path /path/to/lora_weights \
--export mnn \
--quant_bit 8
Two-Step Export via ONNX
# Step 1: Export to ONNX
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--export onnx
# Step 2: Manual MNN conversion with custom quantization
./MNNConvert --modelFile ./model/onnx/llm.onnx \
--MNNModel llm.mnn \
--keepInputFormat \
--weightQuantBits=4 \
--weightQuantBlock=128 \
-f ONNX \
--transformerFuse=1 \
--allowCustomOp \
--saveExternalData
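For automation, the two commands above can be assembled as argument lists and driven from Python (e.g. with subprocess.run). This sketch only mirrors the flags shown in the example; the paths and parameter defaults are illustrative:

```python
# Hedged sketch: build the two command lines from the two-step example
# above as argument lists suitable for subprocess.run. Flags mirror the
# example exactly; model_path and dst are illustrative placeholders.
def two_step_cmds(model_path, dst="./model", quant_bits=4, quant_block=128):
    onnx_cmd = ["python", "llmexport.py",
                "--path", model_path,
                "--export", "onnx"]
    mnn_cmd = ["./MNNConvert",
               "--modelFile", f"{dst}/onnx/llm.onnx",
               "--MNNModel", "llm.mnn",
               "--keepInputFormat",
               f"--weightQuantBits={quant_bits}",
               f"--weightQuantBlock={quant_block}",
               "-f", "ONNX",
               "--transformerFuse=1",
               "--allowCustomOp",
               "--saveExternalData"]
    return onnx_cmd, mnn_cmd
```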
Programmatic API
from llmexport import export
export('/path/to/Qwen2-0.5B-Instruct', export='mnn', quant_bit=4, hqq=True)
The export() function (line 724) provides a programmatic interface:
def export(path, **kwargs):
parser = argparse.ArgumentParser()
build_args(parser)
args = parser.parse_args(['--path', path])
for k, v in kwargs.items():
setattr(args, k, v)
if 'bge' in path:
llm_exporter = EmbeddingExporter(args)
else:
llm_exporter = LlmExporter(args)
llm_exporter.export(args.export)
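The parse-then-override pattern used by export() can be reproduced standalone. The sketch below is a tiny stand-in for the real build_args() parser (only three of its options), showing how keyword arguments overlay the parsed defaults via setattr():

```python
import argparse

# Standalone sketch of the kwargs-override pattern in export(). The parser
# here is a minimal stand-in for build_args(), not the real option set.
def make_args(path, **kwargs):
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", required=True)
    parser.add_argument("--export", default=None)
    parser.add_argument("--quant_bit", type=int, default=4)
    args = parser.parse_args(["--path", path])
    for k, v in kwargs.items():
        setattr(args, k, v)  # keyword arguments win over parser defaults
    return args
```

Note a caveat of this pattern: setattr() performs no validation, so a misspelled keyword silently creates a new attribute instead of raising an error.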
Notes
- The LlmExporter.__init__ method sets max_new_tokens = 1024 and dst_name = 'llm' as defaults.
- If --lm_quant_bit or --lm_quant_block is not specified, they default to the values of --quant_bit and --quant_block respectively.
- Embedding models (BGE, GTE, Qwen3-Embedding) are handled by the EmbeddingExporter class instead of LlmExporter.
- The script requires either pymnn installed or a valid MNNConvert binary path for the MNN conversion step.
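The LM-head fallback described in the notes can be expressed as a one-line resolution rule; this is a hedged sketch of the behavior, not code from llmexport.py:

```python
# Hedged sketch of the fallback noted above: LM-head quantization settings
# default to the global --quant_bit / --quant_block when left unspecified.
def resolve_lm_quant(quant_bit, quant_block, lm_quant_bit=None, lm_quant_block=None):
    return (lm_quant_bit if lm_quant_bit is not None else quant_bit,
            lm_quant_block if lm_quant_block is not None else quant_block)
```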
Related Pages
- Principle:Alibaba_MNN_LLM_Model_Export
- Environment:Alibaba_MNN_Python_Export_Environment
- Environment:Alibaba_MNN_HuggingFace_Ecosystem_Environment
- Implementation:Alibaba_MNN_HuggingFace_Model_Download - Previous step: downloading the model
- Implementation:Alibaba_MNN_CMake_Build_LLM - Next step: compiling the inference engine