Implementation: mlc-ai/mlc-llm gen_config
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Deployment, Configuration_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for generating deployment configuration that bridges model architecture with runtime requirements (quantization, parallelism, context windows), provided by MLC-LLM.
Description
The gen_config function is the entrypoint for MLC Chat configuration generation. It reads a model's config.json, applies user-specified overrides for quantization, context window, parallelism, and other deployment parameters, then produces an mlc-chat-config.json file along with all necessary tokenizer files in the output directory. The function performs the following steps:
- Initializes an MLCChatConfig from the model architecture config, with user overrides applied via ModelConfigOverride.
- Loads optional generation_config.json and config.json for generation-related defaults (temperature, top-p, etc.).
- Copies tokenizer files (tokenizer.model, tokenizer.json, vocab.json, merges.txt, added_tokens.json, tokenizer_config.json) to the output directory.
- Handles special tokenizer formats: converts RWKV vocabulary files to binary, converts a SentencePiece tokenizer.model to tokenizer.json via HuggingFace transformers, and converts tiktoken files.
- Detects tokenizer metadata and validates tokenizer.json for duplicate tokens.
- Applies system default values for any remaining unset fields.
- Writes the final mlc-chat-config.json to the output directory.
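The merge logic in the steps above can be distilled into a minimal standalone sketch. Everything here is illustrative: sketch_gen_config, TOKENIZER_FILES, and SYSTEM_DEFAULTS are hypothetical names, and the real implementation in gen_config.py handles many more formats and fields.

```python
import json
import shutil
from pathlib import Path

# Tokenizer files gen_config looks for next to config.json (from the list above).
TOKENIZER_FILES = [
    "tokenizer.model", "tokenizer.json", "vocab.json",
    "merges.txt", "added_tokens.json", "tokenizer_config.json",
]
# Illustrative system defaults; the real defaults depend on the model.
SYSTEM_DEFAULTS = {"temperature": 0.7, "top_p": 0.95}

def sketch_gen_config(config: Path, overrides: dict, output: Path) -> dict:
    """Hypothetical distillation of the gen_config flow."""
    # Start from the native architecture config.
    chat_config = dict(json.loads(config.read_text()))
    # A user override of None means "keep the native value".
    for key, value in overrides.items():
        if value is not None:
            chat_config[key] = value
    # Fill any still-unset fields with system defaults.
    for key, value in SYSTEM_DEFAULTS.items():
        chat_config.setdefault(key, value)
    # Copy whichever tokenizer files exist next to config.json.
    for name in TOKENIZER_FILES:
        src = config.parent / name
        if src.exists():
            shutil.copy(src, output / name)
    (output / "mlc-chat-config.json").write_text(json.dumps(chat_config, indent=2))
    return chat_config
```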
Usage
Use this function as the second step of the MLC-LLM compilation pipeline, after downloading the source model and before converting weights or compiling the model library. It is also used standalone when you need to regenerate configuration for a model with different deployment parameters.
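As a usage sketch, the pipeline steps map onto the mlc_llm CLI roughly as follows. The subcommand names (gen_config, convert_weight, compile) match MLC-LLM's CLI; exact flags and paths may differ by version and are best checked against the installed tool's --help output.

```shell
# 1. Download the source model (e.g., via git-lfs from HuggingFace).
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

# 2. Generate mlc-chat-config.json and tokenizer files (wraps gen_config).
mlc_llm gen_config ./Llama-2-7b-chat-hf \
    --quantization q4f16_1 \
    --conv-template llama-2 \
    -o ./Llama-2-7b-chat-q4f16_1-MLC

# 3. Convert weights, then compile the model library.
mlc_llm convert_weight ./Llama-2-7b-chat-hf \
    --quantization q4f16_1 \
    -o ./Llama-2-7b-chat-q4f16_1-MLC
mlc_llm compile ./Llama-2-7b-chat-q4f16_1-MLC/mlc-chat-config.json \
    -o ./Llama-2-7b-chat-q4f16_1.so
```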
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/interface/gen_config.py (lines 89-287)
Signature
def gen_config(
    config: Path,
    model: Model,
    quantization: Quantization,
    conv_template: str,
    context_window_size: Optional[int],
    sliding_window_size: Optional[int],
    prefill_chunk_size: Optional[int],
    attention_sink_size: Optional[int],
    tensor_parallel_shards: Optional[int],
    pipeline_parallel_stages: Optional[int],
    disaggregation: Optional[bool],
    max_batch_size: int,
    output: Path,
):
    """Entrypoint of MLC Chat configuration generation."""
Import
from mlc_llm.interface.gen_config import gen_config
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Path | Yes | Path to the model's config.json file. The parent directory is expected to contain tokenizer files and optionally a generation_config.json. |
| model | Model | Yes | The MLC model descriptor object that provides the model configuration class and quantization methods. Obtained from the MLC model registry. |
| quantization | Quantization | Yes | The quantization scheme to apply (e.g., q4f16_1, q3f16_0). Determines the quantization field in the output config. |
| conv_template | str | Yes | Name of the conversation template to use (e.g., "llama-3", "chatml", "vicuna_v1.1"). Must be registered in ConvTemplateRegistry or provided as a raw JSON string. |
| context_window_size | Optional[int] | No | Override for the maximum context window size. When None, the value from the model's native config is used. |
| sliding_window_size | Optional[int] | No | Override for the sliding window attention size. When None, the value from the model's native config is used. |
| prefill_chunk_size | Optional[int] | No | Override for the prefill chunk size, which controls how many tokens are processed in a single prefill step. |
| attention_sink_size | Optional[int] | No | Override for the attention sink size, used in streaming attention mechanisms to retain a fixed number of initial tokens. |
| tensor_parallel_shards | Optional[int] | No | Number of tensor-parallel shards for multi-GPU inference. When None, defaults to the model's native setting. |
| pipeline_parallel_stages | Optional[int] | No | Number of pipeline-parallel stages for multi-GPU inference across model layers. |
| disaggregation | Optional[bool] | No | Whether to enable disaggregated serving, where prefill and decode run on separate workers. |
| max_batch_size | int | Yes | Maximum batch size for the serving engine. |
| output | Path | Yes | Path to the output directory where mlc-chat-config.json and tokenizer files will be written. |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | None | The function returns nothing. Side effects include writing mlc-chat-config.json and copying tokenizer files to the output directory. |
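Because gen_config communicates only through side effects, a caller can sanity-check the output directory afterwards. The sketch below is an assumption-laden helper (check_output_dir is not part of MLC-LLM, and the exact field set of mlc-chat-config.json varies by model and version):

```python
import json
from pathlib import Path

def check_output_dir(output: Path) -> dict:
    """Verify that gen_config produced a readable mlc-chat-config.json."""
    config_file = output / "mlc-chat-config.json"
    if not config_file.exists():
        raise FileNotFoundError("gen_config did not write mlc-chat-config.json")
    chat_config = json.loads(config_file.read_text())
    # Assumed keys; the real schema depends on the model and MLC-LLM version.
    for key in ("quantization", "conv_template"):
        if key not in chat_config:
            raise KeyError(f"missing expected field: {key}")
    return chat_config
```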
Usage Examples
Basic Usage
from pathlib import Path
from mlc_llm.interface.gen_config import gen_config
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization
# Assume model source is at ./Llama-2-7b-chat-hf/
config_path = Path("./Llama-2-7b-chat-hf/config.json")
output_path = Path("./Llama-2-7b-chat-q4f16_1-MLC/")
output_path.mkdir(parents=True, exist_ok=True)
# Look up model type and quantization from the MLC registry
model_type = Model.from_name("llama")
quantization = Quantization.from_name("q4f16_1")
gen_config(
    config=config_path,
    model=model_type,
    quantization=quantization,
    conv_template="llama-2",
    context_window_size=None,
    sliding_window_size=None,
    prefill_chunk_size=None,
    attention_sink_size=None,
    tensor_parallel_shards=None,
    pipeline_parallel_stages=None,
    disaggregation=None,
    max_batch_size=1,
    output=output_path,
)
# Output: mlc-chat-config.json and tokenizer files written to output_path
With Tensor Parallelism
from pathlib import Path
from mlc_llm.interface.gen_config import gen_config
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization
# Look up model type and quantization from the MLC registry
model_type = Model.from_name("llama")
quantization = Quantization.from_name("q4f16_1")
gen_config(
    config=Path("./Llama-2-70b-chat-hf/config.json"),
    model=model_type,
    quantization=quantization,
    conv_template="llama-2",
    context_window_size=4096,
    sliding_window_size=None,
    prefill_chunk_size=2048,
    attention_sink_size=None,
    tensor_parallel_shards=4,
    pipeline_parallel_stages=None,
    disaggregation=None,
    max_batch_size=8,
    output=Path("./Llama-2-70b-chat-q4f16_1-MLC/"),
)