Principle:Triton inference server Server Weight Conversion
Metadata
| Field | Value |
|---|---|
| Type | Principle |
| Principle_type | External Tool Doc |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L88-95 |
| Domains | NLP, Model_Optimization |
| Knowledge_Sources | TRT-LLM Docs|https://nvidia.github.io/TensorRT-LLM/, source::Repo|Triton Server|https://github.com/triton-inference-server/server |
| implemented_by | Implementation:Triton_inference_server_Server_Convert_Checkpoint |
| 2026-02-13 17:00 GMT |
Overview
The process of transforming model weights from a framework-native format into the checkpoint format that TensorRT-LLM compiles into an optimized engine.
Description
TensorRT-LLM requires weights in its own checkpoint format before engine compilation. This conversion step reorganizes weights, applies dtype casting, and prepares sharding metadata for tensor/pipeline parallelism.
The conversion process performs several transformations:
- Weight reorganization — Restructures weight tensors from the source framework's layout (e.g., HuggingFace) to TRT-LLM's expected layout
- Data type casting — Converts weights to the target precision (float16, bfloat16) to match the desired engine precision
- Parallelism metadata — Generates sharding configuration for tensor parallelism (TP) and pipeline parallelism (PP), describing how weights should be split across GPUs
- Vocabulary embedding handling — Processes embedding tables and language model heads with proper padding for hardware alignment
This step is distinct from engine compilation because it operates on weights only, without performing graph optimization or kernel selection. The output is a portable checkpoint that can be compiled into engines for different hardware targets.
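To make the data type casting step concrete, the sketch below round-trips values through float16 and an emulated bfloat16 using only the Python standard library. It illustrates the numeric formats and is not how TRT-LLM performs the cast. The helper names `to_float16`, `to_bfloat16`, and `fits_in_float16` are hypothetical, and the bfloat16 emulation truncates instead of rounding.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Truncation is used here for simplicity; real converters typically round.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_float16(x: float) -> bool:
    """Return True if x can be packed as a finite float16 value."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    for value in (0.1, 1000.0, 70000.0):
        half = to_float16(value) if fits_in_float16(value) else "overflow"
        print(value, "float16:", half, "bfloat16:", to_bfloat16(value))
```

The trade-off this shows is that float16 keeps more mantissa bits but has a smaller exponent range, while bfloat16 keeps the float32 exponent range at lower precision.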
Usage
This principle is applied after model weights are downloaded and before engine compilation. It is a required intermediate step in the TRT-LLM deployment pipeline.
Workflow context:
- Precedes: Principle:Triton_inference_server_Server_TensorRT_Engine_Build
- Depends on: Principle:Triton_inference_server_Server_Model_Weight_Download, Principle:Triton_inference_server_Server_TRT_LLM_Environment_Setup
Theoretical Basis
The weight conversion follows this pipeline:
HF safetensors → TRT-LLM checkpoint format with dtype conversion and parallelism metadata
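In practice this pipeline is usually driven by a model-specific `convert_checkpoint.py` script from the TensorRT-LLM examples. The sketch below only assembles such a command line and does not run it. The script path and directories are placeholders, the function name `build_convert_command` is hypothetical, and the flag names (`--model_dir`, `--output_dir`, `--dtype`, `--tp_size`, `--pp_size`) are assumed from the TensorRT-LLM example scripts, so they should be checked against the installed version.

```python
import shlex
from typing import List


def build_convert_command(
    script: str,
    model_dir: str,
    output_dir: str,
    dtype: str = "float16",
    tp_size: int = 1,
    pp_size: int = 1,
) -> List[str]:
    """Assemble (but do not run) a checkpoint conversion command.

    Flag names are an assumption based on the TensorRT-LLM example scripts.
    """
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive integers")
    return [
        "python3", script,
        "--model_dir", model_dir,
        "--output_dir", output_dir,
        "--dtype", dtype,
        "--tp_size", str(tp_size),
        "--pp_size", str(pp_size),
    ]


if __name__ == "__main__":
    # All paths below are placeholders, not locations defined by the source.
    cmd = build_convert_command(
        script="convert_checkpoint.py",
        model_dir="/path/to/hf_model",
        output_dir="/path/to/trtllm_ckpt",
        dtype="float16",
        tp_size=2,
    )
    print(shlex.join(cmd))
```

Returning an argument list instead of a shell string keeps the command safe to pass to `subprocess.run` without extra quoting.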
The conversion addresses several technical requirements:
- Format compatibility: Different frameworks store weights in different tensor layouts. For example, attention weights may be stored as separate Q/K/V matrices or as a fused QKV matrix. TRT-LLM has specific expectations about tensor layout.
- Precision management: The `--dtype` parameter controls the numerical precision of the converted weights. Common choices:
  - `float16`: Standard half-precision, widely compatible
  - `bfloat16`: Better dynamic range, preferred for training-adjacent workloads
- Sharding for parallelism: For multi-GPU deployment, the converter pre-computes how weight tensors should be partitioned:
  - Tensor parallelism (TP): Splits individual weight matrices across GPUs (e.g., splitting the attention head dimension)
  - Pipeline parallelism (PP): Assigns entire transformer layers to different GPUs
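The two partitioning schemes can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the converter's actual logic: it splits a weight matrix by contiguous column blocks for TP and assigns contiguous, equally sized layer ranges for PP, and it assumes both divide evenly. The names `split_columns` and `assign_layers` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def split_columns(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Split a 2-D weight into tp_size contiguous column blocks, one per rank."""
    cols = len(weight[0])
    if cols % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    width = cols // tp_size
    return [
        [row[rank * width:(rank + 1) * width] for row in weight]
        for rank in range(tp_size)
    ]


def assign_layers(num_layers: int, pp_size: int) -> List[range]:
    """Assign contiguous layer index ranges to pp_size pipeline stages."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by pp_size")
    per_stage = num_layers // pp_size
    return [
        range(stage * per_stage, (stage + 1) * per_stage)
        for stage in range(pp_size)
    ]


if __name__ == "__main__":
    weight = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    for rank, shard in enumerate(split_columns(weight, tp_size=2)):
        print("TP rank", rank, shard)
    for stage, layers in enumerate(assign_layers(num_layers=32, pp_size=4)):
        print("PP stage", stage, "layers", layers.start, "to", layers.stop - 1)
```

With TP every rank holds a slice of each layer, whereas with PP every rank holds whole layers, which is why the two settings can be combined.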
The output checkpoint directory contains:
- Converted weight files in safetensors or binary format
- `config.json` describing the model architecture in TRT-LLM terms
- Parallelism mapping metadata
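A quick way to sanity-check a converted checkpoint is to read its `config.json` and list the weight files next to it. The sketch below assumes the config exposes `architecture`, `dtype`, and a `mapping` object with `tp_size` and `pp_size`, and that weights use the `.safetensors` extension. These key names are assumptions and may differ between TRT-LLM versions, so missing keys are tolerated. The function name `summarize_checkpoint` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def summarize_checkpoint(ckpt_dir: str) -> Dict[str, Any]:
    """Summarize a converted checkpoint directory.

    The config key names used here are assumed, not taken from the source.
    """
    root = Path(ckpt_dir)
    config = json.loads((root / "config.json").read_text())
    mapping = config.get("mapping", {})
    return {
        "architecture": config.get("architecture"),
        "dtype": config.get("dtype"),
        "tp_size": mapping.get("tp_size", 1),
        "pp_size": mapping.get("pp_size", 1),
        "weight_files": sorted(p.name for p in root.glob("*.safetensors")),
    }


if __name__ == "__main__":
    # Placeholder path; point this at the converter's output directory.
    print(summarize_checkpoint("/path/to/trtllm_ckpt"))
```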
Related Pages
- Implementation:Triton_inference_server_Server_Convert_Checkpoint
- Principle:Triton_inference_server_Server_Model_Weight_Download — Prerequisite step
- Principle:Triton_inference_server_Server_TensorRT_Engine_Build — Next step after conversion