
Implementation:Alibaba MNN Llmexport Script

From Leeroopedia


Field Value
implementation_name Llmexport_Script
implementation_type API Doc
repository Alibaba_MNN
workflow LLM_Deployment_Pipeline
pipeline_stage Model Export
source_file transformers/llm/export/llmexport.py (L676-758)
last_updated 2026-02-10 14:00 GMT

Summary

The llmexport.py script is the primary tool for converting HuggingFace-format LLM models into MNN inference format. It handles model loading, ONNX export, MNN conversion, weight quantization, tokenizer extraction, and configuration generation. The script is built around the LlmExporter class (inheriting from torch.nn.Module) and supports a wide range of model architectures through the ModelMapper system.

API Signature

python llmexport.py --path <model_dir> --export mnn [options]

Full CLI Signature

usage: llmexport.py [-h] --path PATH [--type TYPE] [--tokenizer_path TOKENIZER_PATH]
                    [--eagle_path EAGLE_PATH] [--lora_path LORA_PATH] [--gptq_path GPTQ_PATH]
                    [--dst_path DST_PATH] [--verbose] [--test TEST] [--export EXPORT]
                    [--onnx_slim] [--quant_bit QUANT_BIT] [--quant_block QUANT_BLOCK]
                    [--visual_quant_bit VISUAL_QUANT_BIT] [--visual_quant_block VISUAL_QUANT_BLOCK]
                    [--lm_quant_bit LM_QUANT_BIT] [--lm_quant_block LM_QUANT_BLOCK]
                    [--mnnconvert MNNCONVERT] [--ppl] [--awq] [--hqq] [--omni]
                    [--transformer_fuse] [--group_conv_native] [--smooth] [--sym]
                    [--visual_sym] [--seperate_embed] [--lora_split]
                    [--calib_data CALIB_DATA] [--act_bit ACT_BIT] [--embed_bit EMBED_BIT]
                    [--act_sym] [--quant_config QUANT_CONFIG] [--generate_for_npu]
                    [--skip_weight] [--omni_epochs OMNI_EPOCHS] [--omni_lr OMNI_LR]
                    [--omni_wd OMNI_WD]

Source Reference

The argument parser is defined in the build_args() function at line 676, the programmatic export() entry point is at line 724, and the CLI main() entry point is at line 737 of:

transformers/llm/export/llmexport.py

Key Imports

import onnx
import torch

from utils.model import LlmModel, EmbeddingModel
from utils.tokenizer import LlmTokenizer
from utils.spinner import spinner_run
from utils.custom_op import FakeLinear
from utils.onnx_rebuilder import OnnxRebuilder
from utils.mnn_converter import MNNConverter
from utils.awq_quantizer import AwqQuantizer
from utils.smooth_quantizer import SmoothQuantizer
from utils.omni_quantizer import OmniQuantizer
from utils.torch_utils import onnx_export

Key Parameters

Parameter Type Default Description
--path str required Path to the HuggingFace model directory or model ID
--export str None Export format: 'onnx' or 'mnn'
--quant_bit int 4 Quantization bit-width for model weights (4 or 8)
--quant_block int 64 Quantization block size (0 = channel-wise)
--lm_quant_bit int None Separate quantization bit-width for LM head (defaults to --quant_bit)
--lm_quant_block int None Separate quantization block size for LM head (defaults to --quant_block)
--hqq flag False Enable Half-Quadratic Quantization for improved accuracy
--awq flag False Enable Activation-Aware Weight Quantization
--omni flag False Enable OmniQuant learned quantization
--smooth flag False Enable Smooth Quantization
--sym flag False Use symmetric quantization (no zero-point)
--dst_path str './model' Output directory for exported model files
--tokenizer_path str None Custom tokenizer path (defaults to --path)
--lora_path str None Path to LoRA weights for merging before export
--gptq_path str None Path to pre-quantized GPTQ weights
--eagle_path str None Path to EAGLE speculative decoding model
--mnnconvert str '../../../build/MNNConvert' Path to local MNNConvert binary (falls back to pymnn)
--seperate_embed flag False Separate embedding weights to embeddings_bf16.bin to avoid quantization
--embed_bit int 16 Embedding export bit precision (choices: 16, 8, 4)
--visual_quant_bit int None Quantization bit-width for visual encoder
--visual_sym flag False Symmetric quantization for visual model
--transformer_fuse flag False Fuse vision transformer operations
--calib_data str None Calibration data path for quantization methods that require it
--act_bit int 16 Smooth quantization activation bit-width (8 or 16)
--test str None Test model inference with the given query string
--skip_weight flag False Skip loading model weights (for testing export flow)
--generate_for_npu flag False Generate model for NPU deployment
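The fallback behavior of the LM-head options in the table above can be made concrete with a small sketch. The function name `resolve_lm_quant` is illustrative, not part of llmexport.py; it only mirrors the documented rule that `--lm_quant_bit` and `--lm_quant_block` default to `--quant_bit` and `--quant_block`:

```python
def resolve_lm_quant(quant_bit, quant_block, lm_quant_bit=None, lm_quant_block=None):
    """Return the effective (bit, block) pair for the LM head.

    Mirrors the documented defaults: --lm_quant_bit / --lm_quant_block
    fall back to --quant_bit / --quant_block when not specified.
    """
    bit = lm_quant_bit if lm_quant_bit is not None else quant_bit
    block = lm_quant_block if lm_quant_block is not None else quant_block
    return bit, block

# Global 4-bit / block 64, with the LM head overridden to 8-bit:
print(resolve_lm_quant(4, 64, lm_quant_bit=8))  # (8, 64)
```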

Inputs

  • HuggingFace model directory containing config.json, tokenizer files, and weight files (.safetensors or .bin)

Outputs

The exported model directory (default ./model) contains:

File Description
llm.mnn The MNN model graph file
llm.mnn.weight The quantized weight data file
llm.mnn.json MNN model JSON (for LoRA/GPTQ post-processing)
llm_config.json Model configuration for the MNN runtime (hidden_size, layer_nums, key_value_shape, prompt_template)
config.json Runtime inference configuration (modifiable by user)
tokenizer.txt Extracted tokenizer in MNN-compatible text format
embeddings_bf16.bin Embedding weights in bf16 (only if --seperate_embed or non-tie-embedding model)
onnx/llm.onnx Intermediate ONNX model (not needed for inference)
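A quick sanity check after an export is to confirm the core output files exist and that llm_config.json parses. The helper below is an illustration built from the file list above, not part of the script, and assumes the default output layout:

```python
import json
from pathlib import Path

# Core files every MNN export should produce (from the table above).
EXPECTED = ["llm.mnn", "llm.mnn.weight", "llm_config.json", "config.json", "tokenizer.txt"]

def check_export(dst_path="./model"):
    """Print any missing core files; return the parsed llm_config.json or None."""
    root = Path(dst_path)
    missing = [name for name in EXPECTED if not (root / name).exists()]
    if missing:
        print("missing:", missing)
    cfg_file = root / "llm_config.json"
    if cfg_file.exists():
        return json.loads(cfg_file.read_text())
    return None
```

Keys such as `hidden_size`, `layer_nums`, and `prompt_template` in the returned dict give a fast signal that the export matched the intended model.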

Usage Examples

Basic 4-bit Export

cd transformers/llm/export
python llmexport.py \
    --path /path/to/Qwen2-0.5B-Instruct \
    --export mnn

Export with HQQ Quantization

python llmexport.py \
    --path /path/to/Qwen2-0.5B-Instruct \
    --export mnn \
    --hqq \
    --quant_bit 4 \
    --quant_block 64 \
    --dst_path ./qwen2_mnn

Export with LoRA Merging

python llmexport.py \
    --path /path/to/Qwen2-0.5B-Instruct \
    --lora_path /path/to/lora_weights \
    --export mnn \
    --quant_bit 8

Two-Step Export via ONNX

# Step 1: Export to ONNX
python llmexport.py \
    --path /path/to/Qwen2-0.5B-Instruct \
    --export onnx

# Step 2: Manual MNN conversion with custom quantization
./MNNConvert --modelFile ./model/onnx/llm.onnx \
    --MNNModel llm.mnn \
    --keepInputFormat \
    --weightQuantBits=4 \
    --weightQuantBlock=128 \
    -f ONNX \
    --transformerFuse=1 \
    --allowCustomOp \
    --saveExternalData

Programmatic API

from llmexport import export

export('/path/to/Qwen2-0.5B-Instruct', export='mnn', quant_bit=4, hqq=True)

The export() function (line 724) provides a programmatic interface:

def export(path, **kwargs):
    parser = argparse.ArgumentParser()
    build_args(parser)
    args = parser.parse_args(['--path', path])
    for k, v in kwargs.items():
        setattr(args, k, v)
    if 'bge' in path:
        llm_exporter = EmbeddingExporter(args)
    else:
        llm_exporter = LlmExporter(args)
    llm_exporter.export(args.export)

Notes

  • The LlmExporter.__init__ method sets max_new_tokens = 1024 and dst_name = 'llm' as defaults.
  • If --lm_quant_bit or --lm_quant_block is not specified, they default to the values of --quant_bit and --quant_block respectively.
  • Embedding models (BGE, GTE, Qwen3-Embedding) are handled by the EmbeddingExporter class instead of LlmExporter.
  • The script requires either pymnn installed or a valid MNNConvert binary path for the MNN conversion step.
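The last note can be verified before starting a long export. The sketch below is illustrative (the function name and check order are not from the script), and it assumes the pymnn package is imported under the name `MNN`:

```python
import importlib.util
import os

def find_converter(mnnconvert_path="../../../build/MNNConvert"):
    """Return how MNN conversion can run: the binary path, 'pymnn', or None.

    Follows the documented fallback: prefer a local MNNConvert binary,
    otherwise fall back to the pymnn package (import name `MNN` is an
    assumption about that package).
    """
    if os.path.isfile(mnnconvert_path) and os.access(mnnconvert_path, os.X_OK):
        return mnnconvert_path
    if importlib.util.find_spec("MNN") is not None:
        return "pymnn"
    return None
```

If this returns `None`, the ONNX stage of the export would still run, but the final MNN conversion step would fail.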
