Overview
Reference for trtllm-build, the TensorRT-LLM (TRT-LLM) command-line tool that compiles LLM checkpoints into TensorRT engines. This page covers the command and its key parameters for building optimized inference engines.
Description
The trtllm-build CLI is the primary interface for compiling TRT-LLM checkpoints into hardware-optimized TensorRT engines. It reads the checkpoint directory produced by convert_checkpoint.py, performs graph optimization and kernel auto-tuning, and writes a serialized engine that can be loaded by Triton Inference Server.
The build is GPU-intensive: it takes minutes for small models and can take hours for large ones. The resulting engine files are specific to the GPU type they were built on and must be rebuilt when the GPU type changes.
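Because an engine is tied to the GPU type it was built on, it can help to record that GPU at build time and compare it before loading. The sketch below assumes a sidecar file, build_gpu.txt, written by your own build script; trtllm-build does not create it, and the function name check_engine_gpu is hypothetical. Matching on the GPU name is a simple heuristic, not an official compatibility check.

```shell
# Hypothetical helper: compare the GPU recorded at build time with the current GPU.
# build_gpu.txt is a sidecar file invented for this sketch, not a trtllm-build output.
# Returns 0 on match, 1 on mismatch, 2 if no record exists.
check_engine_gpu() {
    engine_dir="$1"
    current_gpu="$2"
    sidecar="$engine_dir/build_gpu.txt"
    if [ ! -f "$sidecar" ]; then
        echo "no GPU record found in $engine_dir" >&2
        return 2
    fi
    built_gpu="$(cat "$sidecar")"
    if [ "$built_gpu" != "$current_gpu" ]; then
        echo "engine was built on '$built_gpu' but this host has '$current_gpu'; rebuild" >&2
        return 1
    fi
    return 0
}
```

On a GPU host, the name could be obtained with `nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1`, written to build_gpu.txt after the build, and passed as the second argument before loading.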
Usage
Run trtllm-build after checkpoint conversion. The command is installed with TRT-LLM and should be executed on the same GPU type that will be used for inference.
Code Reference
Source Location
Signature
trtllm-build \
--checkpoint_dir ./phi-checkpoint \
--output_dir ./phi-engine \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 1024 \
--max_seq_len 2048 \
--tp_size 1 \
--pp_size 1
Import / Verification
# Verify engine directory was created
ls -lh ./phi-engine/
# Check for engine file and config
ls ./phi-engine/*.engine
cat ./phi-engine/config.json
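The manual checks above could be wrapped in a small function for use in scripts. This is a minimal sketch: verify_engine_dir is a hypothetical name, and it only checks that config.json is non-empty and that at least one *.engine file exists, not that the engine can actually be loaded.

```shell
# Sketch: fail fast if the build output directory looks incomplete.
# Prints the number of *.engine files found and returns 0 on success.
verify_engine_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -s "$dir/config.json" ]; then
        echo "missing or empty config.json in $dir" >&2
        return 1
    fi
    # Count engine files; the -f test guards against an unmatched glob.
    count=0
    for f in "$dir"/*.engine; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "no .engine files in $dir" >&2
        return 1
    fi
    echo "$count"
}
```

For example, `verify_engine_dir ./phi-engine || exit 1` could gate a deployment step on a complete build.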
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| --checkpoint_dir | Directory path | Path to TRT-LLM formatted checkpoint directory from convert_checkpoint.py |
| --output_dir | Directory path | Target directory for compiled engine output |
| --gemm_plugin | String | Precision for GEMM operations: float16, bfloat16 |
| --max_batch_size | Integer | Maximum batch size the engine will support at runtime |
| --max_input_len | Integer | Maximum input sequence length in tokens |
| --max_seq_len | Integer | Maximum total sequence length (input + output) in tokens |
| --tp_size | Integer | Tensor parallelism degree (number of GPUs for each layer) |
| --pp_size | Integer | Pipeline parallelism degree (number of pipeline stages) |
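Since --max_seq_len covers input plus output, the two length limits constrain each other. As a sketch (check_seq_limits is a hypothetical helper, not part of TRT-LLM), a pre-build check can reject an input limit larger than the total limit and report how many output tokens remain for a full-length prompt:

```shell
# Hypothetical pre-build check for the two length limits.
# Prints the output-token budget left when the prompt uses all of max_input_len.
check_seq_limits() {
    max_input_len="$1"
    max_seq_len="$2"
    if [ "$max_input_len" -gt "$max_seq_len" ]; then
        echo "max_input_len ($max_input_len) exceeds max_seq_len ($max_seq_len)" >&2
        return 1
    fi
    echo $((max_seq_len - max_input_len))
}

check_seq_limits 1024 2048   # prints 1024
```

With the example values on this page, a 1024-token prompt leaves room for up to 1024 generated tokens; shorter prompts leave more.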
Outputs
| Name | Type | Description |
|---|---|---|
| Engine directory | Directory | ./phi-engine/ containing compiled engine files |
| Engine file(s) | Binary | *.engine file(s): the compiled TensorRT engine |
| config.json | JSON file | Engine configuration with build parameters and model metadata |
Usage Examples
Single-GPU engine build for Phi-3-mini
trtllm-build \
--checkpoint_dir ./phi-checkpoint \
--output_dir ./phi-engine \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 1024 \
--max_seq_len 2048 \
--tp_size 1 \
--pp_size 1
Multi-GPU engine build with 2-way tensor parallelism
trtllm-build \
--checkpoint_dir ./phi-checkpoint-tp2 \
--output_dir ./phi-engine-tp2 \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 1024 \
--max_seq_len 2048 \
--tp_size 2 \
--pp_size 1
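A build with parallelism targets tp_size × pp_size GPUs in total. The sketch below lists the engine files one might expect afterwards, assuming one engine per rank named rank<N>.engine; that naming is an assumption to confirm against the actual output directory, and expected_engines is a hypothetical helper.

```shell
# Hypothetical helper: list the engine files expected for a given parallel layout.
# Assumes one engine per rank, named rank<N>.engine (verify against your build output).
expected_engines() {
    tp_size="$1"
    pp_size="$2"
    world_size=$((tp_size * pp_size))
    rank=0
    while [ "$rank" -lt "$world_size" ]; do
        echo "rank${rank}.engine"
        rank=$((rank + 1))
    done
}

# 2-way tensor parallelism, no pipeline parallelism: two ranks.
expected_engines 2 1
```

Comparing this list with `ls ./phi-engine-tp2/` is a quick way to spot a build that stopped partway through.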
Key Parameters
| Parameter | Description | Example Value |
|---|---|---|
| --checkpoint_dir | Input checkpoint directory | ./phi-checkpoint |
| --output_dir | Output engine directory | ./phi-engine |
| --gemm_plugin | GEMM precision for Tensor Core usage | float16 |
| --max_batch_size | Runtime batch size limit | 8 |
| --max_input_len | Maximum prompt length in tokens | 1024 |
| --max_seq_len | Maximum total sequence length | 2048 |
| --tp_size | Tensor parallelism GPU count | 1 |
| --pp_size | Pipeline parallelism stage count | 1 |
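To reuse one build script across models, the parameters above could be taken from environment variables, with the example values as defaults. This is only a sketch: the variable names (CHECKPOINT_DIR, MAX_BATCH_SIZE, and so on) and the function print_build_cmd are hypothetical, and the function prints the command line for review rather than running it.

```shell
# Hypothetical wrapper: print a trtllm-build command line built from environment
# variables, falling back to the example values in the table. Nothing is executed.
print_build_cmd() {
    printf '%s' "trtllm-build"
    printf ' %s' \
        --checkpoint_dir "${CHECKPOINT_DIR:-./phi-checkpoint}" \
        --output_dir "${OUTPUT_DIR:-./phi-engine}" \
        --gemm_plugin "${GEMM_PLUGIN:-float16}" \
        --max_batch_size "${MAX_BATCH_SIZE:-8}" \
        --max_input_len "${MAX_INPUT_LEN:-1024}" \
        --max_seq_len "${MAX_SEQ_LEN:-2048}" \
        --tp_size "${TP_SIZE:-1}" \
        --pp_size "${PP_SIZE:-1}"
    printf '\n'
}

print_build_cmd
```

Overriding a single value, for example `MAX_BATCH_SIZE=16 print_build_cmd`, changes only that flag in the printed command.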