
Implementation: mlc-ai/mlc-llm gen_config

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Deployment, Configuration_Management
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete MLC-LLM tool for generating deployment configuration, bridging a model's architecture with runtime requirements such as quantization, parallelism, and context windows.

Description

The gen_config function is the entrypoint for MLC Chat configuration generation. It reads a model's config.json, applies user-specified overrides for quantization, context window, parallelism, and other deployment parameters, then produces an mlc-chat-config.json file along with all necessary tokenizer files in the output directory. The function performs the following steps:

  1. Initializes an MLCChatConfig from the model architecture config with user overrides applied via ModelConfigOverride.
  2. Loads the optional generation_config.json (falling back to config.json) for generation-related defaults (temperature, top-p, etc.).
  3. Copies tokenizer files (tokenizer.model, tokenizer.json, vocab.json, merges.txt, added_tokens.json, tokenizer_config.json) to the output directory.
  4. Handles special tokenizer formats: converts RWKV vocabulary files to binary, converts SentencePiece tokenizer.model to tokenizer.json via HuggingFace transformers, and converts tiktoken files.
  5. Detects tokenizer metadata and validates tokenizer.json for duplicate tokens.
  6. Applies system default values for any remaining unset fields.
  7. Writes the final mlc-chat-config.json to the output directory.
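The override-and-defaults flow of steps 1, 6, and 7 can be sketched as follows. This is a simplified illustration with made-up field names and a hypothetical sketch_gen_config helper, not the actual MLCChatConfig implementation:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def sketch_gen_config(model_config: dict, overrides: dict,
                      defaults: dict, out_dir: Path) -> dict:
    """Sketch of steps 1, 6, and 7: start from the architecture config,
    apply explicit user overrides, fill gaps with system defaults, and
    write mlc-chat-config.json to the output directory."""
    merged = dict(model_config)
    # Step 1: user overrides win, but only when explicitly set (not None).
    merged.update({k: v for k, v in overrides.items() if v is not None})
    # Step 6: system defaults fill any field that is still unset.
    for key, value in defaults.items():
        merged.setdefault(key, value)
    # Step 7: write the final config file.
    (out_dir / "mlc-chat-config.json").write_text(json.dumps(merged, indent=2))
    return merged

with TemporaryDirectory() as tmp:
    cfg = sketch_gen_config(
        {"context_window_size": 4096, "vocab_size": 32000},
        {"context_window_size": 2048, "prefill_chunk_size": None},
        {"temperature": 0.7},
        Path(tmp),
    )
    print(cfg["context_window_size"])  # 2048: the explicit override wins
```

Note that a None override is treated as "not set", so the model's native value (or a system default) survives; this matches the Optional[...] parameters in the signature below.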

Usage

Use this function as the second step of the MLC-LLM compilation pipeline, after downloading the source model and before converting weights or compiling the model library. It is also used standalone when you need to regenerate configuration for a model with different deployment parameters.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/interface/gen_config.py (lines 89-287)

Signature

def gen_config(
    config: Path,
    model: Model,
    quantization: Quantization,
    conv_template: str,
    context_window_size: Optional[int],
    sliding_window_size: Optional[int],
    prefill_chunk_size: Optional[int],
    attention_sink_size: Optional[int],
    tensor_parallel_shards: Optional[int],
    pipeline_parallel_stages: Optional[int],
    disaggregation: Optional[bool],
    max_batch_size: int,
    output: Path,
):
    """Entrypoint of MLC Chat configuration generation."""

Import

from mlc_llm.interface.gen_config import gen_config

I/O Contract

Inputs

Name Type Required Description
config Path Yes Path to the model's config.json file. The parent directory is expected to contain tokenizer files and optionally a generation_config.json.
model Model Yes The MLC model descriptor object that provides model configuration class and quantization methods. Obtained from the MLC model registry.
quantization Quantization Yes The quantization scheme to apply (e.g., q4f16_1, q3f16_0). Determines the quantization field in the output config.
conv_template str Yes Name of the conversation template to use (e.g., "llama-3", "chatml", "vicuna_v1.1"). Must be registered in ConvTemplateRegistry or provided as a raw JSON string.
context_window_size Optional[int] No Override for the maximum context window size. When None, the value from the model's native config is used.
sliding_window_size Optional[int] No Override for sliding window attention size. When None, the value from the model's native config is used.
prefill_chunk_size Optional[int] No Override for the prefill chunk size, controlling how many tokens are processed in a single prefill step.
attention_sink_size Optional[int] No Override for attention sink size, used in streaming attention mechanisms to retain a fixed number of initial tokens.
tensor_parallel_shards Optional[int] No Number of tensor parallel shards for multi-GPU inference. When None, defaults to the model's native setting.
pipeline_parallel_stages Optional[int] No Number of pipeline parallel stages for multi-GPU inference across model layers.
disaggregation Optional[bool] No Whether to enable disaggregated serving mode, where prefill and decode run on separate workers.
max_batch_size int Yes Maximum batch size for the serving engine.
output Path Yes Path to the output directory where mlc-chat-config.json and tokenizer files will be written.
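The two forms accepted for conv_template (a registered template name, or a raw JSON string) can be told apart roughly as sketched below. This is an illustrative helper, not the ConvTemplateRegistry lookup MLC-LLM actually performs:

```python
import json

def resolve_conv_template(value: str):
    """Illustration of the two accepted conv_template forms: a registry
    name such as "llama-3", or a raw JSON object defining a template."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return value   # not valid JSON: treat as a registered template name
    return parsed      # valid JSON: treat as an inline template definition

print(resolve_conv_template("llama-3"))                      # registry name
print(resolve_conv_template('{"name": "custom"}')["name"])   # inline JSON
```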

Outputs

Name Type Description
return value None The function returns nothing. Side effects include writing mlc-chat-config.json and copying tokenizer files to the output directory.
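Because the function returns None, callers typically verify its side effects instead. A minimal sketch of such a check, using a hypothetical check_output_dir helper over a simulated output directory (the file list follows the tokenizer files named in the Description):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

TOKENIZER_FILES = {"tokenizer.model", "tokenizer.json", "vocab.json",
                   "merges.txt", "added_tokens.json", "tokenizer_config.json"}

def check_output_dir(out_dir: Path) -> bool:
    """True if the directory holds the main config plus at least one
    tokenizer artifact, i.e. the expected side effects of gen_config."""
    has_config = (out_dir / "mlc-chat-config.json").is_file()
    has_tokenizer = any((out_dir / name).is_file() for name in TOKENIZER_FILES)
    return has_config and has_tokenizer

with TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Simulate a completed gen_config run with dummy files.
    (out / "mlc-chat-config.json").write_text(json.dumps({"model_type": "llama"}))
    (out / "tokenizer.json").write_text("{}")
    print(check_output_dir(out))  # True for this simulated output
```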

Usage Examples

Basic Usage

from pathlib import Path
from mlc_llm.interface.gen_config import gen_config
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Assume model source is at ./Llama-2-7b-chat-hf/
config_path = Path("./Llama-2-7b-chat-hf/config.json")
output_path = Path("./Llama-2-7b-chat-q4f16_1-MLC/")
output_path.mkdir(parents=True, exist_ok=True)

# Look up the model type and quantization scheme in the MLC registries
model_type = MODELS["llama"]
quantization = QUANTIZATION["q4f16_1"]

gen_config(
    config=config_path,
    model=model_type,
    quantization=quantization,
    conv_template="llama-2",
    context_window_size=None,
    sliding_window_size=None,
    prefill_chunk_size=None,
    attention_sink_size=None,
    tensor_parallel_shards=None,
    pipeline_parallel_stages=None,
    disaggregation=None,
    max_batch_size=1,
    output=output_path,
)
# Output: mlc-chat-config.json and tokenizer files written to output_path

With Tensor Parallelism

from pathlib import Path
from mlc_llm.interface.gen_config import gen_config
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

gen_config(
    config=Path("./Llama-2-70b-chat-hf/config.json"),
    model=MODELS["llama"],
    quantization=QUANTIZATION["q4f16_1"],
    conv_template="llama-2",
    context_window_size=4096,
    sliding_window_size=None,
    prefill_chunk_size=2048,
    attention_sink_size=None,
    tensor_parallel_shards=4,
    pipeline_parallel_stages=None,
    disaggregation=None,
    max_batch_size=8,
    output=Path("./Llama-2-70b-chat-q4f16_1-MLC/"),
)
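As a rough capacity check (a general rule for composing the two parallelism dimensions, not an MLC-LLM API), the number of GPU workers a configuration implies is the product of tensor shards and pipeline stages, with None standing in for the default of 1:

```python
from typing import Optional

def required_workers(tensor_parallel_shards: Optional[int],
                     pipeline_parallel_stages: Optional[int]) -> int:
    # None means "use the model's native setting", taken here to default to 1.
    return (tensor_parallel_shards or 1) * (pipeline_parallel_stages or 1)

print(required_workers(4, None))  # 4 GPUs for the example above
print(required_workers(2, 2))     # 4 GPUs: 2-way TP across 2 pipeline stages
```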

Related Pages

Implements Principle

Environment Links
