Implementation:Mlc ai Mlc llm CLI Model Metadata

Overview

The file python/mlc_llm/cli/model_metadata.py implements a CLI tool for inspecting the metadata embedded in compiled MLC LLM model libraries. It can display the full metadata as JSON or perform a detailed analysis of memory usage, including parameter sizes, temporary buffer requirements, and KV cache costs.

Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/cli/model_metadata.py
Lines: 197

Key Components

_extract_metadata

def _extract_metadata(model_lib: Path) -> Dict[str, Any]:
    from tvm.runtime import device, load_module
    from tvm.runtime.vm import VirtualMachine

    return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())

This function loads the compiled model library with TVM's load_module, instantiates a VirtualMachine on the CPU device, and calls the embedded _metadata function. The returned JSON string is parsed into a Python dictionary. The imports are placed inside the function so that the TVM runtime is loaded only when metadata is actually extracted.

_report_all

def _report_all(metadata: Dict[str, Any]) -> None:

Formats and prints the full metadata as pretty-printed JSON. The function applies special formatting to the "params" list so that each parameter entry is compacted onto a single line, while the rest of the metadata remains indented for readability.
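One way to get this mixed layout is sketched below. It is not the code from model_metadata.py: the helper name format_metadata and the placeholder trick are assumptions chosen for illustration, using only the standard json module.

```python
import json
from typing import Any, Dict


def format_metadata(metadata: Dict[str, Any], indent: int = 2) -> str:
    """Pretty-print metadata while keeping each "params" entry on one line.

    Hypothetical helper for illustration; not the function used by the tool.
    """
    # Work on a shallow copy so the caller's dictionary is left untouched.
    data = dict(metadata)
    placeholders: Dict[str, str] = {}
    if "params" in data:
        # Swap every parameter entry for a unique placeholder string and
        # remember its compact single-line JSON form.
        for i, param in enumerate(data["params"]):
            placeholders[f"@@param_{i}@@"] = json.dumps(param)
        data["params"] = list(placeholders.keys())
    text = json.dumps(data, indent=indent)
    # Replace each quoted placeholder with the compact JSON it stands for.
    for key, compact in placeholders.items():
        text = text.replace(json.dumps(key), compact)
    return text


example = {
    "model_type": "llama",
    "params": [{"name": "w", "shape": [2, 3], "dtype": "float16"}],
}
print(format_metadata(example))
```

The output is still valid JSON, so it can be parsed back into the original dictionary.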

_read_dynamic_shape

def _read_dynamic_shape(shape: List[Union[int, str]], config: Union[Dict, ConfigBase]) -> List[int]:

Resolves dynamic shapes in parameter definitions. When a parameter shape contains string elements (e.g., "vocab_size"), this function looks up the concrete integer value from the model configuration dictionary. It raises:

AttributeError if no configuration is provided but dynamic shapes are encountered.
KeyError if the dynamic shape key is not found in the configuration.
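The lookup described above can be sketched as follows. This is a simplified illustration that accepts only a plain dictionary, not ConfigBase; the name resolve_shape and the error messages are assumptions, not the tool's own.

```python
from typing import Any, Dict, List, Optional, Union


def resolve_shape(
    shape: List[Union[int, str]], config: Optional[Dict[str, Any]]
) -> List[int]:
    """Replace symbolic dimensions such as "vocab_size" with config values.

    Hypothetical, dictionary-only sketch of the behavior described above.
    """
    resolved: List[int] = []
    for dim in shape:
        if isinstance(dim, int):
            # Static dimension: keep as is.
            resolved.append(dim)
            continue
        if config is None:
            # A symbolic dimension cannot be resolved without a config.
            raise AttributeError(
                f"Dynamic dimension {dim!r} found but no config was provided"
            )
        if dim not in config:
            raise KeyError(f"Dynamic dimension {dim!r} not found in config")
        resolved.append(int(config[dim]))
    return resolved


print(resolve_shape(["vocab_size", 4096], {"vocab_size": 32000}))
```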

_compute_memory_usage

def _compute_memory_usage(metadata: Dict[str, Any], config: Union[Dict, ConfigBase]):

Computes two memory quantities from the metadata:

Parameter bytes -- The total memory required for all model parameters, computed as the product of each parameter's shape dimensions multiplied by its data type size (using tvm.runtime.DataType.itemsize).
Temporary function bytes -- The peak temporary buffer memory across all functions, determined by taking the maximum of the memory_usage entries in the metadata.

Returns both values as a tuple (params_bytes, temp_func_bytes).
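A minimal sketch of the two quantities follows. It makes several assumptions: each "params" entry carries concrete integer "shape" and string "dtype" fields, "memory_usage" maps function names to byte counts, and a small hand-written DTYPE_BYTES table stands in for tvm.runtime.DataType so the example runs without TVM.

```python
import math
from typing import Any, Dict, Tuple

# Assumed stand-in for the TVM data type size lookup; covers only a few dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "uint32": 4, "int8": 1}


def compute_memory_usage(metadata: Dict[str, Any]) -> Tuple[int, int]:
    """Return (params_bytes, temp_func_bytes) for metadata with static shapes."""
    # Parameter bytes: number of elements times element size, summed over params.
    params_bytes = sum(
        math.prod(param["shape"]) * DTYPE_BYTES[param["dtype"]]
        for param in metadata["params"]
    )
    # Temporary bytes: the peak over all functions, not the sum.
    temp_func_bytes = max(metadata["memory_usage"].values(), default=0)
    return params_bytes, temp_func_bytes


example = {
    "params": [
        {"shape": [2, 3], "dtype": "float16"},  # 6 elements * 2 bytes = 12
        {"shape": [4], "dtype": "float32"},     # 4 elements * 4 bytes = 16
    ],
    "memory_usage": {"prefill": 100, "decode": 40},
}
print(compute_memory_usage(example))  # (28, 100)
```

Taking the maximum for temporary buffers treats the functions as running one at a time, so only the largest requirement counts toward the estimate.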

_report_memory_usage

def _report_memory_usage(metadata: Dict[str, Any], config: Union[Dict, ConfigBase]) -> None:

Generates a detailed memory report including:

Total memory usage without KV cache -- Sum of parameter bytes and temporary buffer bytes, reported in megabytes.
KV cache size per token -- Computed when the config provides head_dim, num_hidden_layers, and num_key_value_heads, and the metadata includes a quantization field. The formula is:

bytes_per_token = head_dim * num_hidden_layers * num_key_value_heads * dtype_bytes * 2

The factor of 2 accounts for storing both the key and the value tensor for each token. The element size dtype_bytes is inferred from the quantization string (f32 = 4 bytes, f16/bf16 = 2 bytes).

Total memory with 4K KV cache -- The total memory usage assuming a context window of 4096 tokens.
A hint to tweak prefill_chunk_size, context_window_size, and sliding_window_size to reduce memory consumption.
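The KV cache formula above can be checked with a short worked example. The configuration values and the quantization name below are hypothetical, the substring-based dtype inference is an assumption about how the string is interpreted, and 1 MB is taken as 1024 * 1024 bytes.

```python
def dtype_bytes_from_quantization(quantization: str) -> int:
    """Infer the KV cache element size from a quantization name (assumed rule)."""
    if "f32" in quantization:
        return 4
    if "f16" in quantization:  # also matches names containing "bf16"
        return 2
    raise ValueError(f"Cannot infer element size from {quantization!r}")


def kv_cache_bytes_per_token(
    head_dim: int, num_hidden_layers: int, num_key_value_heads: int, dtype_bytes: int
) -> int:
    # The trailing factor of 2 covers the key tensor and the value tensor.
    return head_dim * num_hidden_layers * num_key_value_heads * dtype_bytes * 2


# Hypothetical configuration values, chosen only for illustration.
per_token = kv_cache_bytes_per_token(
    head_dim=128,
    num_hidden_layers=32,
    num_key_value_heads=32,
    dtype_bytes=dtype_bytes_from_quantization("q4f16_1"),
)
kv_4k = per_token * 4096  # KV cache for a 4096-token context window

print(per_token / 1024 / 1024)  # 0.5 MB per token
print(kv_4k / 1024 / 1024)      # 2048.0 MB for 4096 tokens
```

Adding the 4K figure to the parameter and temporary buffer bytes gives the "total memory with 4K KV cache" line of the report.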

CLI Entry Point

def main():
    parser = ArgumentParser(description="A tool that inspects the metadata of a model lib.")
    parser.add_argument("model_lib", type=Path, help="...")
    parser.add_argument("--mlc-chat-config", type=Path, help="...")
    parser.add_argument("--memory-only", action="store_true", help="...")
    parsed = parser.parse_args()

CLI arguments:

Argument	Type	Required	Description
`model_lib` (positional)	`Path`	Yes	Path to the compiled model library (`.so` or `.a`).
`--mlc-chat-config`	`Path`	No	Path to `mlc-chat-config.json`. Used only with `--memory-only`: it is required when the model library contains dynamic parameter shapes, and it supplies the fields needed for the KV cache estimate.
`--memory-only`	flag	No	When set, only memory usage analysis is displayed. Otherwise, the full metadata JSON is printed.

Execution flow:

The metadata is extracted from the model library using _extract_metadata. If extraction fails (e.g., legacy model library format), the error is logged and the tool exits gracefully.
If --mlc-chat-config is provided, the JSON configuration is loaded from disk.
If --memory-only is set, _report_memory_usage is called. Otherwise, _report_all prints the full metadata.
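The three steps can be sketched as a small dispatcher. This is an illustration, not the body of main: the function run, its injected callables, the boolean return value, and loading the config with json.loads are all assumptions made so the example runs without TVM or MLC LLM installed.

```python
import json
import logging
from argparse import ArgumentParser
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


def run(
    argv: List[str],
    extract: Callable[[Path], Dict[str, Any]],
    report_all: Callable[[Dict[str, Any]], None],
    report_memory: Callable[[Dict[str, Any], Optional[Dict[str, Any]]], None],
) -> bool:
    """Parse arguments and dispatch; returns False if extraction failed."""
    parser = ArgumentParser(description="Inspect the metadata of a model lib.")
    parser.add_argument("model_lib", type=Path)
    parser.add_argument("--mlc-chat-config", type=Path)
    parser.add_argument("--memory-only", action="store_true")
    parsed = parser.parse_args(argv)

    # Step 1: extract metadata; any failure is logged instead of propagated.
    try:
        metadata = extract(parsed.model_lib)
    except Exception:  # broad on purpose, mirroring the catch-all described above
        logger.exception("Cannot extract metadata from %s", parsed.model_lib)
        return False

    # Step 2: load the optional chat config from disk.
    config = None
    if parsed.mlc_chat_config is not None:
        config = json.loads(parsed.mlc_chat_config.read_text(encoding="utf-8"))

    # Step 3: choose the report.
    if parsed.memory_only:
        report_memory(metadata, config)
    else:
        report_all(metadata)
    return True
```

Passing the extraction and reporting functions as arguments keeps the control flow testable with stubs in place of a real model library.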

Design Notes

The tool gracefully handles legacy model libraries that lack metadata sections by catching all exceptions during extraction and logging an informative error message.
Dynamic shape resolution allows the tool to work with model libraries compiled with symbolic dimensions, which is common in MLC LLM's compilation pipeline.
The KV cache size calculation does not yet account for quantized KV caches; the source marks this limitation with a TODO comment.
The file also supports direct execution via the if __name__ == "__main__" guard.
