Implementation:Mlc ai Mlc llm CLI Model Metadata
Overview
The file python/mlc_llm/cli/model_metadata.py implements a CLI tool for inspecting the metadata embedded in compiled MLC LLM model libraries. It can display the full metadata as JSON or perform a detailed analysis of memory usage, including parameter sizes, temporary buffer requirements, and KV cache costs.
Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/cli/model_metadata.py
- Lines: 197
Key Components
_extract_metadata
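def _extract_metadata(model_lib: Path) -> Dict[str, Any]:
    # Import TVM lazily so the runtime is loaded only when metadata is extracted.
    from tvm.runtime import device, load_module
    from tvm.runtime.vm import VirtualMachine

    # Call the embedded "_metadata" function and parse the JSON string it returns.
    return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())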
This function loads the compiled model library using TVM's load_module, instantiates a VirtualMachine on the CPU device, and calls the embedded _metadata function. The returned JSON string is parsed into a Python dictionary. The imports are performed inside the function so that the TVM runtime is not loaded until it is needed.
_report_all
def _report_all(metadata: Dict[str, Any]) -> None:
Formats and prints the full metadata as beautified JSON. The function applies special formatting to the "params" list so that each parameter entry is compacted onto a single line, while the rest of the metadata remains indented for readability.
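The formatting described above can be sketched as follows. This is a minimal illustration rather than the file's actual implementation: the function name format_metadata, the placeholder string, and the sample metadata are hypothetical, and the sketch assumes the metadata contains a "params" list.

```python
import json
from typing import Any, Dict


def format_metadata(metadata: Dict[str, Any], indent: int = 2) -> str:
    """Pretty-print metadata, keeping each "params" entry on a single line."""
    placeholder = "__PARAMS_PLACEHOLDER__"  # hypothetical sentinel, assumed absent from the data
    params = metadata["params"]
    # Work on a shallow copy so the caller's dictionary is left untouched.
    skeleton = {**metadata, "params": placeholder}
    text = json.dumps(skeleton, indent=indent)
    if params:
        pad = " " * (2 * indent)
        rows = ",\n".join(pad + json.dumps(param) for param in params)
        params_block = "[\n" + rows + "\n" + " " * indent + "]"
    else:
        params_block = "[]"
    # Swap the quoted placeholder for the compact list; the result is still valid JSON.
    return text.replace(json.dumps(placeholder), params_block, 1)


sample = {
    "model_type": "llama",
    "quantization": "q4f16_1",
    "params": [
        {"name": "w", "shape": [2, 3], "dtype": "float16"},
        {"name": "b", "shape": ["vocab_size"], "dtype": "float32"},
    ],
}
formatted = format_metadata(sample)
print(formatted)
```

Because only the placeholder is substituted, the output parses back to the original dictionary, which makes the formatting easy to verify.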
_read_dynamic_shape
def _read_dynamic_shape(shape: List[Union[int, str]], config: Union[Dict, ConfigBase]) -> List[int]:
Resolves dynamic shapes in parameter definitions. When a parameter shape contains string elements (e.g., "vocab_size"), this function looks up the concrete integer value from the model configuration dictionary. It raises:
- AttributeError if no configuration is provided but dynamic shapes are encountered.
- KeyError if the dynamic shape key is not found in the configuration.
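The lookup and its two error cases can be sketched as below. This is a simplified illustration, not the file's code: it handles only a plain dictionary configuration (the ConfigBase case is omitted), and the function name and error messages are hypothetical.

```python
from typing import Any, Dict, List, Optional, Union


def read_dynamic_shape(
    shape: List[Union[int, str]], config: Optional[Dict[str, Any]]
) -> List[int]:
    """Replace symbolic dimensions such as "vocab_size" with values from config."""
    resolved: List[int] = []
    for dim in shape:
        if isinstance(dim, int):
            # Static dimension: keep it unchanged.
            resolved.append(dim)
            continue
        if config is None:
            raise AttributeError(
                f"Dynamic dimension {dim!r} found, but no configuration was provided."
            )
        if dim not in config:
            raise KeyError(f"Dynamic dimension {dim!r} is not present in the configuration.")
        resolved.append(int(config[dim]))
    return resolved


resolved_shape = read_dynamic_shape(["vocab_size", 4096], {"vocab_size": 32000})
print(resolved_shape)  # [32000, 4096]
```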
_compute_memory_usage
def _compute_memory_usage(metadata: Dict[str, Any], config: Union[Dict, ConfigBase]):
Computes two memory quantities from the metadata:
- Parameter bytes -- The total memory required for all model parameters, computed by summing over the parameters the product of each parameter's shape dimensions multiplied by its data type size (using tvm.runtime.DataType.itemsize).
- Temporary function bytes -- The peak temporary buffer memory across all functions, determined by taking the maximum of the memory_usage entries in the metadata.
Returns both values as a tuple (params_bytes, temp_func_bytes).
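A minimal sketch of this computation follows. Several details are assumptions made for illustration: the ITEMSIZE table is a hypothetical stand-in for tvm.runtime.DataType.itemsize so the example runs without TVM, all shapes are taken to be static already, and memory_usage is assumed to map function names to byte counts.

```python
import math
from typing import Any, Dict, Tuple

# Hypothetical stand-in for tvm.runtime.DataType(dtype).itemsize.
ITEMSIZE = {"float32": 4, "float16": 2, "bfloat16": 2, "uint32": 4, "int8": 1}


def compute_memory_usage(metadata: Dict[str, Any]) -> Tuple[int, int]:
    """Return (params_bytes, temp_func_bytes) for metadata with static shapes."""
    # Sum of element count times element size over all parameters.
    params_bytes = sum(
        math.prod(param["shape"]) * ITEMSIZE[param["dtype"]]
        for param in metadata["params"]
    )
    # Peak temporary buffer size: the largest entry across all functions.
    temp_func_bytes = max(metadata["memory_usage"].values(), default=0)
    return params_bytes, temp_func_bytes


example_metadata = {
    "params": [
        {"name": "weight", "shape": [4096, 512], "dtype": "uint32"},
        {"name": "scale", "shape": [4096], "dtype": "float16"},
    ],
    "memory_usage": {"prefill": 1_000_000, "decode": 4096},
}
params_bytes, temp_func_bytes = compute_memory_usage(example_metadata)
print(params_bytes, temp_func_bytes)  # 8396800 1000000
```

The maximum is used for temporary buffers, rather than the sum, because only one function is assumed to run at a time.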
_report_memory_usage
def _report_memory_usage(metadata: Dict[str, Any], config: Union[Dict, ConfigBase]) -> None:
Generates a detailed memory report including:
- Total memory usage without KV cache -- Sum of parameter bytes and temporary buffer bytes, reported in megabytes.
- KV cache size per token -- Computed when the config provides head_dim, num_hidden_layers, and num_key_value_heads, and the metadata includes a quantization field. The formula is:
bytes_per_token = head_dim * num_hidden_layers * num_key_value_heads * dtype_bytes * 2
The factor of 2 accounts for both key and value tensors. The dtype is inferred from the quantization string (f32 = 4 bytes, f16/bf16 = 2 bytes).
- Total memory with 4K KV cache -- The total memory usage assuming a context window of 4096 tokens.
- A hint to tweak prefill_chunk_size, context_window_size, and sliding_window_size to reduce memory consumption.
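The per-token formula above can be checked with a short worked example. The sketch below makes several assumptions: the dtype is inferred by substring matching on the quantization string, megabytes are taken as 1024 * 1024 bytes, and the model dimensions and the quantization name "q4f16_1" are illustrative values only.

```python
from typing import Optional


def kv_cache_bytes_per_token(
    head_dim: int,
    num_hidden_layers: int,
    num_key_value_heads: int,
    quantization: str,
) -> Optional[int]:
    """Apply the per-token KV cache formula; return None if the dtype is unknown."""
    if "f32" in quantization:
        dtype_bytes = 4
    elif "f16" in quantization:  # this substring also matches "bf16"
        dtype_bytes = 2
    else:
        return None
    # The trailing factor of 2 covers both the key and the value tensor.
    return head_dim * num_hidden_layers * num_key_value_heads * dtype_bytes * 2


per_token = kv_cache_bytes_per_token(128, 32, 32, "q4f16_1")
print(per_token)  # 524288 bytes per token

# Size of a 4096-token KV cache in megabytes (assuming 1 MB = 1024 * 1024 bytes).
kv_4k_mb = per_token * 4096 / 1024 / 1024
print(kv_4k_mb)  # 2048.0
```

With these illustrative dimensions the 4K KV cache alone takes 2048 MB, which shows why shrinking the context window is an effective way to reduce memory consumption.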
CLI Entry Point
def main():
    parser = ArgumentParser(description="A tool that inspects the metadata of a model lib.")
    parser.add_argument("model_lib", type=Path, help="...")
    parser.add_argument("--mlc-chat-config", type=Path, help="...")
    parser.add_argument("--memory-only", action="store_true", help="...")
    parsed = parser.parse_args()
CLI arguments:
| Argument | Type | Required | Description |
|---|---|---|---|
| model_lib (positional) | Path | Yes | Path to the compiled model library (.so or .a). |
| --mlc-chat-config | Path | No | Path to mlc-chat-config.json. Required only when --memory-only is set and the model library contains dynamic parameter shapes. |
| --memory-only | flag | No | When set, only memory usage analysis is displayed. Otherwise, the full metadata JSON is printed. |
Execution flow:
- The metadata is extracted from the model library using _extract_metadata. If extraction fails (e.g., legacy model library format), the error is logged and the tool exits gracefully.
- If --mlc-chat-config is provided, the JSON configuration is loaded from disk.
- If --memory-only is set, _report_memory_usage is called. Otherwise, _report_all prints the full metadata.
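The execution flow above can be sketched as a small driver. This is an illustrative outline, not the file's main function: the run function, its boolean return value, and the injected extract, report_all, and report_memory callables are hypothetical, introduced so the flow can be exercised without TVM or a real model library.

```python
import json
import logging
from argparse import ArgumentParser
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


def run(
    argv: List[str],
    extract: Callable[[Path], Dict[str, Any]],
    report_all: Callable[[Dict[str, Any]], None],
    report_memory: Callable[[Dict[str, Any], Optional[Dict[str, Any]]], None],
) -> bool:
    """Parse arguments, extract metadata, and dispatch to one of the reports."""
    parser = ArgumentParser(description="A tool that inspects the metadata of a model lib.")
    parser.add_argument("model_lib", type=Path)
    parser.add_argument("--mlc-chat-config", type=Path)
    parser.add_argument("--memory-only", action="store_true")
    parsed = parser.parse_args(argv)

    try:
        metadata = extract(parsed.model_lib)
    except Exception:  # broad on purpose: any failure is logged, not propagated
        logger.exception("Cannot extract metadata from %s", parsed.model_lib)
        return False

    config = None
    if parsed.mlc_chat_config is not None:
        with open(parsed.mlc_chat_config, encoding="utf-8") as config_file:
            config = json.load(config_file)

    if parsed.memory_only:
        report_memory(metadata, config)
    else:
        report_all(metadata)
    return True


calls = []
ok = run(
    ["model.so", "--memory-only"],
    extract=lambda path: {"params": []},
    report_all=lambda metadata: calls.append(("all", metadata)),
    report_memory=lambda metadata, config: calls.append(("memory", metadata, config)),
)
print(ok, calls)
```

Passing the extraction and reporting steps as callables keeps the control flow testable on its own; the real tool calls its module-level helpers directly.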
Design Notes
- The tool gracefully handles legacy model libraries that lack metadata sections by catching all exceptions during extraction and logging an informative error message.
- Dynamic shape resolution allows the tool to work with model libraries compiled with symbolic dimensions, which is common in MLC LLM's compilation pipeline.
- The KV cache calculation includes a TODO comment noting that quantized KV caches are not yet supported in the size calculation.
- The file also supports direct execution via the if __name__ == "__main__" guard.