Implementation:Triton inference server Server Trtllm Build

Metadata

Field	Value
Type	Implementation
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L97-112
Domains	Model_Optimization, GPU_Computing, NLP
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server)
implements	Principle:Triton_inference_server_Server_TensorRT_Engine_Build
2026-02-13 17:00 GMT

Overview

The trtllm-build CLI compiles TRT-LLM checkpoints into TensorRT engines. This page documents the trtllm-build command shown in docs/getting_started/llm.md and the key parameters it uses to build optimized inference engines.

Description

The trtllm-build CLI is the primary interface for compiling TRT-LLM checkpoints into hardware-optimized TensorRT engines. It reads the checkpoint directory produced by convert_checkpoint.py, performs graph optimization and kernel auto-tuning, and writes a serialized engine that can be loaded by Triton Inference Server.

The build is GPU-intensive and slow: build times range from minutes for small models to hours for large ones. The resulting engine files are hardware-specific, so an engine must be rebuilt whenever the target GPU type changes.

Usage

Run trtllm-build after checkpoint conversion. The command is installed with TRT-LLM and should be executed on the same GPU type that will be used for inference.
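
Because an engine is tied to the GPU type it was built on, one way to catch a mismatch early is to record the GPU name at build time and compare it before deployment. The sketch below is only an illustration: the function names and the build_gpu.txt marker file are hypothetical and not part of TRT-LLM or Triton. It assumes nvidia-smi is on the PATH when no GPU name is passed explicitly.

```shell
# Hypothetical helpers; not part of TRT-LLM or Triton.

# Print the name of the first visible GPU, or "unknown" without nvidia-smi.
current_gpu_name() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1
    else
        echo "unknown"
    fi
}

# Record the GPU name next to the engine (call right after trtllm-build).
# Usage: record_build_gpu ENGINE_DIR [GPU_NAME]
record_build_gpu() {
    local engine_dir="$1"
    local gpu_name="${2:-$(current_gpu_name)}"
    # build_gpu.txt is an assumed marker file name.
    printf '%s\n' "$gpu_name" > "$engine_dir/build_gpu.txt"
}

# Return 0 if the recorded GPU name matches the given (or current) GPU,
# 1 on a mismatch, and 2 if no marker file was recorded.
# Usage: check_build_gpu ENGINE_DIR [GPU_NAME]
check_build_gpu() {
    local engine_dir="$1"
    local gpu_name="${2:-$(current_gpu_name)}"
    local recorded
    recorded="$(cat "$engine_dir/build_gpu.txt" 2>/dev/null)" || return 2
    [ "$recorded" = "$gpu_name" ]
}
```

A typical use would be to call record_build_gpu ./phi-engine once the build finishes, then check_build_gpu ./phi-engine on the serving host before loading the engine.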

Code Reference

Source Location

Item	Value
File	docs/getting_started/llm.md
Lines	L97-112
Repo	https://github.com/triton-inference-server/server

Signature

trtllm-build \
    --checkpoint_dir ./phi-checkpoint \
    --output_dir ./phi-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --tp_size 1 \
    --pp_size 1

Import / Verification

# Verify engine directory was created
ls -lh ./phi-engine/

# Check for engine file and config
ls ./phi-engine/*.engine
cat ./phi-engine/config.json
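
The manual checks above can be wrapped in a small function that fails loudly when the build output is incomplete. This is a minimal sketch: verify_engine_dir is a hypothetical helper, and it only checks that config.json and at least one *.engine file exist, not that the engine loads.

```shell
# Hypothetical helper; not part of TRT-LLM or Triton.
# Usage: verify_engine_dir ENGINE_DIR
verify_engine_dir() {
    local dir="$1"
    if [ ! -f "$dir/config.json" ]; then
        echo "missing $dir/config.json" >&2
        return 1
    fi
    # Count *.engine files; the glob stays literal when nothing matches,
    # so each candidate is tested with -f before it is counted.
    local count=0
    local f
    for f in "$dir"/*.engine; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "no *.engine files in $dir" >&2
        return 1
    fi
    echo "found $count engine file(s) in $dir"
}
```

For the example build, verify_engine_dir ./phi-engine returns a non-zero status if either artifact is missing, which makes it usable as a gate in a deployment script.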

I/O Contract

Inputs

Name	Type	Description
`--checkpoint_dir`	Directory path	Path to TRT-LLM formatted checkpoint directory from convert_checkpoint.py
`--output_dir`	Directory path	Target directory for compiled engine output
`--gemm_plugin`	String	Data type used by the GEMM plugin, e.g. `float16` or `bfloat16`
`--max_batch_size`	Integer	Maximum batch size the engine will support at runtime
`--max_input_len`	Integer	Maximum input sequence length in tokens
`--max_seq_len`	Integer	Maximum total sequence length (input + output) in tokens
`--tp_size`	Integer	Tensor parallelism degree (number of GPUs across which each layer's weights are split)
`--pp_size`	Integer	Pipeline parallelism degree (number of pipeline stages)
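
Since `--max_seq_len` counts input plus output tokens, the two length limits are related: a prompt that uses the full `--max_input_len` leaves `max_seq_len - max_input_len` tokens for generation. The sketch below checks that relationship before a build is started. output_token_budget is a hypothetical helper, not a TRT-LLM command.

```shell
# Hypothetical helper; not part of TRT-LLM.
# Prints the tokens left for generation when the prompt is max_input_len long.
# Usage: output_token_budget MAX_INPUT_LEN MAX_SEQ_LEN
output_token_budget() {
    local max_input_len="$1"
    local max_seq_len="$2"
    if [ "$max_input_len" -gt "$max_seq_len" ]; then
        echo "max_input_len ($max_input_len) exceeds max_seq_len ($max_seq_len)" >&2
        return 1
    fi
    echo $((max_seq_len - max_input_len))
}

# Values from the example command on this page.
output_token_budget 1024 2048   # prints 1024
```

Shorter prompts leave correspondingly more room for output, up to the `--max_seq_len` total.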

Outputs

Name	Type	Description
Engine directory	Directory	The `--output_dir` target (`./phi-engine/` in the examples), containing the compiled engine files
Engine file(s)	Binary	`*.engine` file(s) — the compiled TensorRT engine
config.json	JSON file	Engine configuration with build parameters and model metadata

Usage Examples

Single-GPU engine build for Phi-3-mini

trtllm-build \
    --checkpoint_dir ./phi-checkpoint \
    --output_dir ./phi-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --tp_size 1 \
    --pp_size 1

Multi-GPU engine build with 2-way tensor parallelism

trtllm-build \
    --checkpoint_dir ./phi-checkpoint-tp2 \
    --output_dir ./phi-engine-tp2 \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --tp_size 2 \
    --pp_size 1
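
With parallelism enabled, the number of ranks (and GPUs needed at runtime) is tp_size × pp_size. The sketch below lists the engine files one would expect for a given configuration. It assumes one engine per rank named rank<N>.engine, which should be confirmed against the actual build output. expected_engine_files is a hypothetical helper.

```shell
# Hypothetical helper; not part of TRT-LLM.
# Usage: expected_engine_files TP_SIZE PP_SIZE
expected_engine_files() {
    local tp_size="$1"
    local pp_size="$2"
    # Total ranks = tensor-parallel degree times pipeline-parallel degree.
    local world_size=$((tp_size * pp_size))
    local rank=0
    while [ "$rank" -lt "$world_size" ]; do
        # Assumption: one engine per rank, named rank<N>.engine.
        echo "rank${rank}.engine"
        rank=$((rank + 1))
    done
}

# For the 2-way tensor-parallel example: prints rank0.engine and rank1.engine.
expected_engine_files 2 1
```

Comparing this list with the contents of ./phi-engine-tp2 is a quick way to spot a build that stopped partway through.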

Key Parameters

Parameter	Description	Example Value
`--checkpoint_dir`	Input checkpoint directory	`./phi-checkpoint`
`--output_dir`	Output engine directory	`./phi-engine`
`--gemm_plugin`	GEMM precision for Tensor Core usage	`float16`
`--max_batch_size`	Runtime batch size limit	`8`
`--max_input_len`	Maximum prompt length in tokens	`1024`
`--max_seq_len`	Maximum total sequence length	`2048`
`--tp_size`	Tensor parallelism GPU count	`1`
`--pp_size`	Pipeline parallelism stage count	`1`

Related Pages

Principle:Triton_inference_server_Server_TensorRT_Engine_Build
Implementation:Triton_inference_server_Server_Convert_Checkpoint — Prerequisite: checkpoint conversion
Implementation:Triton_inference_server_Server_TRT_LLM_Run — Next step: engine validation
Implementation:Triton_inference_server_Server_Fill_Template — Uses the engine directory for model repo setup
Environment:Triton_inference_server_Server_TRT_LLM_Deployment
