Implementation: mlc-ai/mlc-llm gen_config
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Deployment, Configuration_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for generating deployment configuration that bridges model architecture with runtime requirements (quantization, parallelism, context windows), provided by MLC-LLM.
Description
The gen_config function is the entrypoint for MLC Chat configuration generation. It reads a model's config.json, applies user-specified overrides for quantization, context window, parallelism, and other deployment parameters, then produces an mlc-chat-config.json file along with all necessary tokenizer files in the output directory. The function performs the following steps:
- Initializes an MLCChatConfig from the model architecture config, with user overrides applied via ModelConfigOverride.
- Loads optional generation_config.json and config.json for generation-related defaults (temperature, top-p, etc.).
- Copies tokenizer files (tokenizer.model, tokenizer.json, vocab.json, merges.txt, added_tokens.json, tokenizer_config.json) to the output directory.
- Handles special tokenizer formats: converts RWKV vocabulary files to binary, converts a SentencePiece tokenizer.model to tokenizer.json via HuggingFace transformers, and converts tiktoken files.
- Detects tokenizer metadata and validates tokenizer.json for duplicate tokens.
- Applies system default values for any remaining unset fields.
- Writes the final mlc-chat-config.json to the output directory.
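The merge logic in the steps above can be distilled into a minimal standalone sketch. Everything here is illustrative: sketch_gen_config, TOKENIZER_FILES, and SYSTEM_DEFAULTS are hypothetical names, and the real implementation in gen_config.py handles many more formats and fields.

```python
import json
import shutil
from pathlib import Path

# Tokenizer files gen_config looks for next to config.json (from the list above).
TOKENIZER_FILES = [
    "tokenizer.model", "tokenizer.json", "vocab.json",
    "merges.txt", "added_tokens.json", "tokenizer_config.json",
]
# Illustrative system defaults; the real defaults depend on the model.
SYSTEM_DEFAULTS = {"temperature": 0.7, "top_p": 0.95}

def sketch_gen_config(config: Path, overrides: dict, output: Path) -> dict:
    """Hypothetical distillation of the gen_config flow."""
    # Start from the native architecture config.
    chat_config = dict(json.loads(config.read_text()))
    # A user override of None means "keep the native value".
    for key, value in overrides.items():
        if value is not None:
            chat_config[key] = value
    # Fill any still-unset fields with system defaults.
    for key, value in SYSTEM_DEFAULTS.items():
        chat_config.setdefault(key, value)
    # Copy whichever tokenizer files exist next to config.json.
    for name in TOKENIZER_FILES:
        src = config.parent / name
        if src.exists():
            shutil.copy(src, output / name)
    (output / "mlc-chat-config.json").write_text(json.dumps(chat_config, indent=2))
    return chat_config
```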
Usage
Use this function as the second step of the MLC-LLM compilation pipeline, after downloading the source model and before converting weights or compiling the model library. It is also used standalone when you need to regenerate configuration for a model with different deployment parameters.
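As a usage sketch, the pipeline steps map onto the mlc_llm CLI roughly as follows. The subcommand names (gen_config, convert_weight, compile) match MLC-LLM's CLI; exact flags and paths may differ by version and are best checked against the installed tool's --help output.

```shell
# 1. Download the source model (e.g., via git-lfs from HuggingFace).
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

# 2. Generate mlc-chat-config.json and tokenizer files (wraps gen_config).
mlc_llm gen_config ./Llama-2-7b-chat-hf \
    --quantization q4f16_1 \
    --conv-template llama-2 \
    -o ./Llama-2-7b-chat-q4f16_1-MLC

# 3. Convert weights, then compile the model library.
mlc_llm convert_weight ./Llama-2-7b-chat-hf \
    --quantization q4f16_1 \
    -o ./Llama-2-7b-chat-q4f16_1-MLC
mlc_llm compile ./Llama-2-7b-chat-q4f16_1-MLC/mlc-chat-config.json \
    -o ./Llama-2-7b-chat-q4f16_1.so
```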
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/interface/gen_config.py (lines 89-287)
Signature
def gen_config(
    config: Path,
    model: Model,
    quantization: Quantization,
    conv_template: str,
    context_window_size: Optional[int],
    sliding_window_size: Optional[int],
    prefill_chunk_size: Optional[int],
    attention_sink_size: Optional[int],
    tensor_parallel_shards: Optional[int],
    pipeline_parallel_stages: Optional[int],
    disaggregation: Optional[bool],
    max_batch_size: int,
    output: Path,
):
    """Entrypoint of MLC Chat configuration generation."""
Import
from mlc_llm.interface.gen_config import gen_config
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Path | Yes | Path to the model's config.json file. The parent directory is expected to contain tokenizer files and optionally a generation_config.json. |
| model | Model | Yes | The MLC model descriptor object that provides the model configuration class and quantization methods. Obtained from the MLC model registry. |
| quantization | Quantization | Yes | The quantization scheme to apply (e.g., q4f16_1, q3f16_0). Determines the quantization field in the output config. |
| conv_template | str | Yes | Name of the conversation template to use (e.g., "llama-3", "chatml", "vicuna_v1.1"). Must be registered in ConvTemplateRegistry or provided as a raw JSON string. |
| context_window_size | Optional[int] | No | Override for the maximum context window size. When None, the value from the model's native config is used. |
| sliding_window_size | Optional[int] | No | Override for the sliding window attention size. When None, the value from the model's native config is used. |
| prefill_chunk_size | Optional[int] | No | Override for the prefill chunk size, which controls how many tokens are processed in a single prefill step. |
| attention_sink_size | Optional[int] | No | Override for the attention sink size, used in streaming attention mechanisms to retain a fixed number of initial tokens. |
| tensor_parallel_shards | Optional[int] | No | Number of tensor-parallel shards for multi-GPU inference. When None, defaults to the model's native setting. |
| pipeline_parallel_stages | Optional[int] | No | Number of pipeline-parallel stages for multi-GPU inference across model layers. |
| disaggregation | Optional[bool] | No | Whether to enable disaggregated serving, where prefill and decode run on separate workers. |
| max_batch_size | int | Yes | Maximum batch size for the serving engine. |
| output | Path | Yes | Path to the output directory where mlc-chat-config.json and tokenizer files will be written. |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | None | The function returns nothing. Side effects include writing mlc-chat-config.json and copying tokenizer files to the output directory. |
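Because gen_config communicates only through side effects, a caller can sanity-check the output directory afterwards. The sketch below is an assumption-laden helper (check_output_dir is not part of MLC-LLM, and the exact field set of mlc-chat-config.json varies by model and version):

```python
import json
from pathlib import Path

def check_output_dir(output: Path) -> dict:
    """Verify that gen_config produced a readable mlc-chat-config.json."""
    config_file = output / "mlc-chat-config.json"
    if not config_file.exists():
        raise FileNotFoundError("gen_config did not write mlc-chat-config.json")
    chat_config = json.loads(config_file.read_text())
    # Assumed keys; the real schema depends on the model and MLC-LLM version.
    for key in ("quantization", "conv_template"):
        if key not in chat_config:
            raise KeyError(f"missing expected field: {key}")
    return chat_config
```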
Usage Examples
Basic Usage
from pathlib import Path
from mlc_llm.interface.gen_config import gen_config
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization
# Assume model source is at ./Llama-2-7b-chat-hf/
config_path = Path("./Llama-2-7b-chat-hf/config.json")
output_path = Path("./Llama-2-7b-chat-q4f16_1-MLC/")
output_path.mkdir(parents=True, exist_ok=True)
# Look up model type and quantization from the MLC registry
model_type = Model.from_name("llama")
quantization = Quantization.from_name("q4f16_1")
gen_config(
    config=config_path,
    model=model_type,
    quantization=quantization,
    conv_template="llama-2",
    context_window_size=None,
    sliding_window_size=None,
    prefill_chunk_size=None,
    attention_sink_size=None,
    tensor_parallel_shards=None,
    pipeline_parallel_stages=None,
    disaggregation=None,
    max_batch_size=1,
    output=output_path,
)
# Output: mlc-chat-config.json and tokenizer files written to output_path
With Tensor Parallelism
from pathlib import Path
from mlc_llm.interface.gen_config import gen_config
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization
# Look up model type and quantization from the MLC registry
model_type = Model.from_name("llama")
quantization = Quantization.from_name("q4f16_1")
gen_config(
    config=Path("./Llama-2-70b-chat-hf/config.json"),
    model=model_type,
    quantization=quantization,
    conv_template="llama-2",
    context_window_size=4096,
    sliding_window_size=None,
    prefill_chunk_size=2048,
    attention_sink_size=None,
    tensor_parallel_shards=4,
    pipeline_parallel_stages=None,
    disaggregation=None,
    max_batch_size=8,
    output=Path("./Llama-2-70b-chat-q4f16_1-MLC/"),
)