Implementation:Triton inference server Server Trtllm Build

From Leeroopedia

Metadata

Field Value
Type Implementation
Workflow LLM_Deployment_With_TRT_LLM
Repo Triton_inference_server_Server
Source docs/getting_started/llm.md:L97-112
Domains Model_Optimization, GPU_Computing, NLP
Knowledge_Sources TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server)
implements Principle:Triton_inference_server_Server_TensorRT_Engine_Build
2026-02-13 17:00 GMT

Overview

The trtllm-build CLI compiles TRT-LLM checkpoints into TensorRT engines. This page documents the trtllm-build command used in the Triton getting-started guide and the key parameters that shape the resulting inference engine.

Description

The trtllm-build CLI is the primary interface for compiling TRT-LLM checkpoints into hardware-optimized TensorRT engines. It reads the checkpoint directory produced by convert_checkpoint.py, performs graph optimization and kernel auto-tuning, and writes a serialized engine that can be loaded by Triton Inference Server.

The build is GPU-intensive and slow: it takes minutes for small models and can take hours for large ones. The resulting engine files are specific to the GPU type they were built on and must be rebuilt when the GPU type changes.
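Because engines are tied to the GPU they were built on, one way to catch a mismatch early is to record the GPU model next to the engine and compare it before serving. The following is a minimal sketch, not part of TRT-LLM or Triton: the helper names and the gpu.txt stamp file are hypothetical, and comparing model names is a simplification of what actually determines engine compatibility.

```shell
# Hypothetical helpers; neither TRT-LLM nor Triton provides them.

# Print the model name of the first visible GPU, or "unknown" when
# nvidia-smi is not installed.
current_gpu() {
    if command -v nvidia-smi > /dev/null 2>&1; then
        nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1
    else
        echo "unknown"
    fi
}

# Return 0 (rebuild needed) when the stamp file is missing or records a
# different GPU model than the one passed as the second argument.
needs_rebuild() {
    stamp=$1
    current=$2
    [ -f "$stamp" ] || return 0
    [ "$(cat "$stamp")" != "$current" ]
}

# ASSUMPTION: the build step wrote the GPU name to ./phi-engine/gpu.txt.
if needs_rebuild ./phi-engine/gpu.txt "$(current_gpu)"; then
    echo "engine missing or built for a different GPU: rebuild"
else
    echo "engine matches this GPU"
fi
```

To use this, the build step would write the stamp right after a successful trtllm-build, for example with current_gpu > ./phi-engine/gpu.txt.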

Usage

Run trtllm-build after checkpoint conversion. The command is installed with TRT-LLM and should be run on the same GPU type that will serve inference.

Code Reference

Source Location

Item Value
File docs/getting_started/llm.md
Lines L97-112
Repo https://github.com/triton-inference-server/server

Signature

trtllm-build \
    --checkpoint_dir ./phi-checkpoint \
    --output_dir ./phi-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --tp_size 1 \
    --pp_size 1

Import / Verification

# Verify engine directory was created
ls -lh ./phi-engine/

# Check for engine file and config
ls ./phi-engine/*.engine
cat ./phi-engine/config.json
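The manual checks above can be wrapped in a small function so that a deployment script fails fast when the build did not finish. This is a sketch only; verify_engine_dir is a hypothetical helper name, and it checks nothing beyond the presence of config.json and at least one *.engine file.

```shell
# Hypothetical helper: succeed only if the directory holds config.json
# and at least one *.engine file.
verify_engine_dir() {
    dir=$1
    if [ ! -f "$dir/config.json" ]; then
        echo "missing $dir/config.json" >&2
        return 1
    fi
    found=0
    # An unmatched glob stays literal, so test that the path really exists.
    for f in "$dir"/*.engine; do
        if [ -e "$f" ]; then
            found=1
            break
        fi
    done
    if [ "$found" -ne 1 ]; then
        echo "no *.engine files in $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}

verify_engine_dir ./phi-engine || echo "engine build incomplete"
```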

I/O Contract

Inputs

Name | Type | Description
--checkpoint_dir | Directory path | Path to the TRT-LLM checkpoint directory produced by convert_checkpoint.py
--output_dir | Directory path | Target directory for the compiled engine output
--gemm_plugin | String | Data type for the GEMM plugin, for example float16 or bfloat16
--max_batch_size | Integer | Maximum batch size the engine supports at runtime
--max_input_len | Integer | Maximum input (prompt) length in tokens
--max_seq_len | Integer | Maximum total sequence length (input + output) in tokens
--tp_size | Integer | Tensor parallelism degree (number of GPUs each layer is split across)
--pp_size | Integer | Pipeline parallelism degree (number of pipeline stages)
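Since --max_seq_len covers input plus output, the two length limits together fix how many tokens remain for generation when a prompt uses the full input budget. The sketch below works through that arithmetic and rejects an inconsistent pair before a long build starts. check_limits is a hypothetical helper, not a trtllm-build feature.

```shell
# Hypothetical pre-flight check for the two length limits.
# Prints the tokens left for generation when the prompt is max_input_len long.
check_limits() {
    max_input_len=$1
    max_seq_len=$2
    if [ "$max_input_len" -gt "$max_seq_len" ]; then
        echo "max_input_len ($max_input_len) exceeds max_seq_len ($max_seq_len)" >&2
        return 1
    fi
    echo $(( max_seq_len - max_input_len ))
}

# Values from the example command: 2048 - 1024 = 1024 output tokens.
check_limits 1024 2048
```

With the example values, a full-length 1024-token prompt leaves 1024 tokens for the generated output.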

Outputs

Name | Type | Description
Engine directory | Directory | ./phi-engine/, containing the compiled engine files
Engine file(s) | Binary | *.engine file(s): the compiled TensorRT engine
config.json | JSON file | Engine configuration with build parameters and model metadata

Usage Examples

Single-GPU engine build for Phi-3-mini

trtllm-build \
    --checkpoint_dir ./phi-checkpoint \
    --output_dir ./phi-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --tp_size 1 \
    --pp_size 1

Multi-GPU engine build with 2-way tensor parallelism

trtllm-build \
    --checkpoint_dir ./phi-checkpoint-tp2 \
    --output_dir ./phi-engine-tp2 \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --tp_size 2 \
    --pp_size 1
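The number of GPUs a build targets is the product of the tensor and pipeline parallelism degrees, so the command above targets 2 x 1 = 2 GPUs. The sketch below computes that world size and lists the engine files one might expect afterwards. Both helper names are hypothetical, and the rank<N>.engine naming (one file per rank) is an assumption to confirm against your TRT-LLM version. The example's directory name also suggests that the checkpoint was converted for 2-way tensor parallelism beforehand.

```shell
# World size: total ranks (GPUs) = tp_size * pp_size.
world_size() {
    echo $(( $1 * $2 ))
}

# Print the engine file names expected for a given tp_size and pp_size.
# ASSUMPTION: the builder writes one rank<N>.engine file per rank.
expected_engines() {
    n=$(world_size "$1" "$2")
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "rank${i}.engine"
        i=$(( i + 1 ))
    done
}

# 2-way tensor parallelism, no pipeline parallelism.
expected_engines 2 1
```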

Key Parameters

Parameter | Description | Example Value
--checkpoint_dir | Input checkpoint directory | ./phi-checkpoint
--output_dir | Output engine directory | ./phi-engine
--gemm_plugin | Data type used by the GEMM plugin | float16
--max_batch_size | Runtime batch size limit | 8
--max_input_len | Maximum prompt length in tokens | 1024
--max_seq_len | Maximum total sequence length in tokens | 2048
--tp_size | Tensor parallelism GPU count | 1
--pp_size | Pipeline parallelism stage count | 1
