Principle:Triton inference server Server Weight Conversion

Metadata

Field	Value
Type	Principle
Principle_type	External Tool Doc
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L88-95
Domains	NLP, Model_Optimization
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repository (https://github.com/triton-inference-server/server)
implemented_by	Implementation:Triton_inference_server_Server_Convert_Checkpoint
2026-02-13 17:00 GMT

Overview

Weight conversion is the process of transforming model weights from a framework-native format into the checkpoint format that the optimized engine builder accepts.

Description

TensorRT-LLM requires weights in its own checkpoint format before engine compilation. This conversion step reorganizes weights, applies dtype casting, and prepares sharding metadata for tensor/pipeline parallelism.

The conversion process performs several transformations:

Weight reorganization — Restructures weight tensors from the source framework's layout (e.g., HuggingFace) to TRT-LLM's expected layout
Data type casting — Converts weights to the target precision (float16, bfloat16) to match the desired engine precision
Parallelism metadata — Generates sharding configuration for tensor parallelism (TP) and pipeline parallelism (PP), describing how weights should be split across GPUs
Vocabulary embedding handling — Processes embedding tables and language model heads with proper padding for hardware alignment

This step is distinct from engine compilation because it operates on weights only, without performing graph optimization or kernel selection. The output is a portable checkpoint that can be compiled into engines for different hardware targets.
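The vocabulary padding mentioned above can be illustrated with a minimal sketch. It assumes the embedding table is padded with zero rows up to a multiple of some alignment value; the helper names and the alignment of 4 are hypothetical and are not taken from the TRT-LLM converter.

```python
def pad_vocab_size(vocab_size, multiple):
    """Round vocab_size up to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((vocab_size + multiple - 1) // multiple) * multiple


def pad_embedding_rows(table, padded_rows):
    """Return a new table with zero rows appended until it has padded_rows rows."""
    if padded_rows < len(table):
        raise ValueError("padded_rows is smaller than the table")
    width = len(table[0]) if table else 0
    return table + [[0.0] * width for _ in range(padded_rows - len(table))]


# Toy example: a 5-token vocabulary with hidden size 2, aligned to a multiple of 4.
# The alignment value is an assumption chosen only for illustration.
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
padded = pad_embedding_rows(embedding, pad_vocab_size(len(embedding), 4))
print(len(embedding), "->", len(padded))
```

The padded rows never correspond to real tokens, so padding changes the tensor shape without changing which tokens the model can represent.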

Usage

This principle is applied after model weights are downloaded and before engine compilation. It is a required intermediate step in the TRT-LLM deployment pipeline.
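As a rough illustration of where this step sits in the pipeline, the sketch below assembles a command line for a converter script. The script name convert_checkpoint.py and the flags --model_dir, --output_dir, --tp_size, and --pp_size are assumptions based on the TensorRT-LLM example scripts (only --dtype is named on this page), and the directory paths are placeholders; check the script for your model family before relying on them.

```python
import shlex


def build_convert_command(model_dir, output_dir, dtype="float16", tp_size=1, pp_size=1):
    """Build an argv list for a TRT-LLM style checkpoint converter.

    Script name and flag names are assumptions, not a confirmed interface.
    """
    if dtype not in ("float16", "bfloat16"):
        raise ValueError(f"unsupported dtype for this sketch: {dtype}")
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be at least 1")
    cmd = [
        "python3", "convert_checkpoint.py",
        "--model_dir", str(model_dir),    # downloaded framework-native weights
        "--output_dir", str(output_dir),  # converted checkpoint is written here
        "--dtype", dtype,
    ]
    # Only pass parallelism flags when sharding is actually requested.
    if tp_size > 1:
        cmd += ["--tp_size", str(tp_size)]
    if pp_size > 1:
        cmd += ["--pp_size", str(pp_size)]
    return cmd


# Placeholder paths; the command is printed, not executed.
cmd = build_convert_command("./hf-model", "./trt-checkpoint", dtype="float16")
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run once the weights are downloaded; the output directory then becomes the input of the engine build step.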

Workflow context:

Precedes: Principle:Triton_inference_server_Server_TensorRT_Engine_Build
Depends on: Principle:Triton_inference_server_Server_Model_Weight_Download, Principle:Triton_inference_server_Server_TRT_LLM_Environment_Setup

Theoretical Basis

The conversion is a weight serialization transformation that follows this pipeline:

HF safetensors → TRT-LLM checkpoint format with dtype conversion and parallelism metadata
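The dtype conversion in this pipeline is lossy, and the two half-precision formats lose information in different ways. The sketch below emulates both roundings with the Python standard library only; it is an illustration for finite inputs, not the converter's own casting code, and the helper names are hypothetical.

```python
import struct


def round_to_float16(x):
    """Round x to IEEE 754 half precision; values out of range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def round_to_bfloat16(x):
    """Round a finite x to bfloat16, i.e. the upper 16 bits of its float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A large value overflows float16 but survives in bfloat16 (coarsely).
print(round_to_float16(70000.0), round_to_bfloat16(70000.0))
# A value close to 1 keeps more digits in float16 than in bfloat16.
print(round_to_float16(1.001), round_to_bfloat16(1.001))
```

float16 trades range for precision and bfloat16 does the opposite, which is why the target dtype is usually chosen to match the precision the model was trained or validated in.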

The conversion addresses several technical requirements:

Format compatibility — Different frameworks store weights in different tensor layouts. For example, attention weights may be stored as separate Q/K/V matrices or as a fused QKV matrix. TRT-LLM has specific expectations about tensor layout.
Precision management — The --dtype parameter controls the numerical precision of the converted weights. Common choices:
- float16 — Standard half-precision, widely compatible
- bfloat16 — Same exponent range as float32 with fewer mantissa bits than float16; often preferred when the model was trained in bfloat16
Sharding for parallelism — For multi-GPU deployment, the converter pre-computes how weight tensors should be partitioned:
- Tensor parallelism (TP) — Splits individual weight matrices across GPUs (e.g., splitting the attention head dimension)
- Pipeline parallelism (PP) — Assigns entire transformer layers to different GPUs
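The layout and sharding requirements above interact: when Q, K, and V are fused and tensor parallelism is used, each matrix has to be split per rank before fusing, otherwise one rank would receive only Q rows. The following is a minimal sketch using nested lists in place of tensors. It assumes rows correspond to output features and that sizes divide evenly; the function names are hypothetical and do not describe the TRT-LLM converter's internals.

```python
def split_rows(matrix, tp_size, rank):
    """Return the contiguous block of rows owned by one tensor-parallel rank."""
    if len(matrix) % tp_size != 0:
        raise ValueError("row count must be divisible by tp_size")
    rows_per_rank = len(matrix) // tp_size
    return matrix[rank * rows_per_rank:(rank + 1) * rows_per_rank]


def shard_fused_qkv(q, k, v, tp_size, rank):
    """Split Q, K and V separately, then stack the slices into one fused matrix."""
    return (split_rows(q, tp_size, rank)
            + split_rows(k, tp_size, rank)
            + split_rows(v, tp_size, rank))


def layers_for_stage(num_layers, pp_size, stage):
    """Return the transformer layer indices assigned to one pipeline stage."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by pp_size")
    per_stage = num_layers // pp_size
    return list(range(stage * per_stage, (stage + 1) * per_stage))


# Toy weights: two output rows per projection, hidden size 2.
q = [[1, 1], [2, 2]]
k = [[3, 3], [4, 4]]
v = [[5, 5], [6, 6]]
for rank in range(2):
    print("TP rank", rank, shard_fused_qkv(q, k, v, tp_size=2, rank=rank))
print("PP stage 1 of 4 owns layers", layers_for_stage(32, 4, 1))
```

Each tensor-parallel rank ends up with its own slice of every projection, while each pipeline stage owns a contiguous run of whole layers.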

The output checkpoint directory contains:

Converted weight files in safetensors or binary format
config.json describing the model architecture in TRT-LLM terms
Parallelism mapping metadata
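A quick way to sanity-check a converted checkpoint before engine compilation is to read its config.json and list the weight files. The sketch below assumes config.json carries top-level "architecture" and "dtype" keys and a "mapping" object with "tp_size" and "pp_size"; these key names are assumptions and may differ between TRT-LLM versions, so missing keys fall back to defaults.

```python
import json
from pathlib import Path


def summarize_checkpoint(checkpoint_dir):
    """Summarize a converted checkpoint directory (key names are assumptions)."""
    root = Path(checkpoint_dir)
    with open(root / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    mapping = config.get("mapping", {})
    # Collect weight files by extension; the extensions are illustrative.
    weight_files = sorted(
        p.name for p in root.iterdir() if p.suffix in (".safetensors", ".bin")
    )
    return {
        "architecture": config.get("architecture"),
        "dtype": config.get("dtype"),
        "tp_size": mapping.get("tp_size", 1),
        "pp_size": mapping.get("pp_size", 1),
        "weight_files": weight_files,
    }
```

Calling summarize_checkpoint on the converter's output directory gives a small dictionary that can be compared against the dtype and parallelism settings requested at conversion time.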

Related Pages

Implementation:Triton_inference_server_Server_Convert_Checkpoint
Principle:Triton_inference_server_Server_Model_Weight_Download — Prerequisite step
Principle:Triton_inference_server_Server_TensorRT_Engine_Build — Next step after conversion
