Environment:NVIDIA NeMo Aligner TensorRT LLM Acceleration Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization, Inference |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
Optional TensorRT-LLM v0.13.0 acceleration environment for high-throughput LLM generation during PPO and REINFORCE rollouts.
Description
TensorRT-LLM provides optimized inference for LLM text generation during the rollout phase of PPO and REINFORCE training. It compiles the model into a TensorRT engine, enabling significantly faster generation compared to native PyTorch inference. This environment extends the base NeMo Framework GPU environment with TRT-LLM dependencies. It supports resharding from pipeline-parallel to tensor-parallel-only during inference for further speedup.
Usage
Use this environment when running PPO or REINFORCE training workflows that need faster rollout generation. Enable it via `trainer.ppo.trt_llm.enable=True` in the training config. The Docker image must be built with TRT-LLM support enabled. TRT-LLM is optional: training works without it, falling back to native PyTorch generation.
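An illustrative launch command follows. The script path and the `reshard` override are assumptions based on the NeMo-Aligner examples layout; only `trainer.ppo.trt_llm.enable` is confirmed by this document.

```shell
# Sketch of a PPO actor launch with TRT-LLM rollout acceleration enabled.
# Run inside the container built from the NeMo-Aligner Dockerfile.
python examples/nlp/gpt/train_gpt_ppo_actor.py \
    trainer.ppo.trt_llm.enable=True \
    trainer.ppo.trt_llm.reshard=True
```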
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (via NGC container) | Must use the NeMo-Aligner Dockerfile build |
| Hardware | NVIDIA GPU with TensorRT support | A100/H100 recommended |
| GPU Memory | Additional VRAM for TRT engine | Engine compilation requires extra memory |
| Disk | 50GB+ additional | For TRT-LLM engine cache at `/tmp/trt_llm_model` |
Dependencies
System Packages
- TensorRT (installed via TRT-LLM build script)
- CUDA 12.x compatibility libraries
Python Packages
- `tensorrt-llm` == v0.13.0
- `pynvml` == 11.5.3 (strict pin, 12.0.0 has breaking changes)
- All dependencies from Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment
Credentials
- `DISABLE_TORCH_DEVICE_SET`: Set to `1` automatically by NeMo-Aligner (prevents device reassignment within TRT-LLM)
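NeMo-Aligner sets this variable at import time (see the Code Evidence section). If you import `tensorrt_llm` outside of NeMo-Aligner, a minimal sketch of the same guard, using `setdefault` rather than the unconditional assignment in the source, is:

```python
import os

# Prevent TRT-LLM from reassigning the CUDA device already chosen by the
# trainer; must be set before tensorrt_llm is imported.
os.environ.setdefault("DISABLE_TORCH_DEVICE_SET", "1")
```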
Quick Install
# TRT-LLM must be built from the Dockerfile - it cannot be pip installed standalone
git clone https://github.com/NVIDIA/NeMo-Aligner.git
cd NeMo-Aligner
docker buildx build -t aligner:latest .
# Inside the container, TRT-LLM is available automatically
# Enable in config:
# trainer.ppo.trt_llm.enable=True
# trainer.ppo.trt_llm.reshard=True # optional: reshard PP to TP-only for inference
Code Evidence
TRT-LLM availability check from `nemo_aligner/utils/trt_llm.py:26-32`:
try:
    import tensorrt_llm

    HAVE_TRTLLM = True
except (ImportError, ModuleNotFoundError) as e:
    logging.info(f"got error message {e} when importing trt-llm dependencies, disabling")
    HAVE_TRTLLM = False
Runtime error when TRT-LLM is not available from `nemo_aligner/utils/trt_llm.py:71-74`:
if not HAVE_TRTLLM:
    raise RuntimeError(
        "You are trying to use NeMo-Aligner's TensorRT-LLM acceleration for LLM generation. "
        "Please build the dockerfile to enable this feature: "
        "https://github.com/NVIDIA/NeMo-Aligner/blob/main/Dockerfile"
    )
Sequence length constraint validation from `nemo_aligner/utils/trt_llm.py:76-80`:
assert max_input_len > 0
assert max_generation_length > 0
assert (
    max_input_len + max_generation_length <= model_cfg.encoder_seq_length
), f"We require max_input_len ({max_input_len}) + max_generation_length ({max_generation_length}) <= model_cfg.encoder_seq_length ({model_cfg.encoder_seq_length})"
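A standalone sketch of the same constraint, with `encoder_seq_length` as a plain integer rather than a model config attribute, shows how the budget splits between prompt and generation:

```python
def validate_lengths(max_input_len: int, max_generation_length: int,
                     encoder_seq_length: int) -> None:
    """Check that prompt plus generated tokens fit in the model's sequence length."""
    assert max_input_len > 0
    assert max_generation_length > 0
    assert max_input_len + max_generation_length <= encoder_seq_length, (
        f"max_input_len ({max_input_len}) + max_generation_length "
        f"({max_generation_length}) must be <= encoder_seq_length ({encoder_seq_length})"
    )

# A 3072-token prompt plus 1024 generated tokens fits a 4096-token model.
validate_lengths(3072, 1024, 4096)
```

If the sum exceeds the model's sequence length (e.g. 3072 + 2048 against 4096), the assertion fires; reduce one of the two limits in the training config.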
Device reassignment prevention from `nemo_aligner/__init__.py:17`:
os.environ["DISABLE_TORCH_DEVICE_SET"] = "1"
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: You are trying to use NeMo-Aligner's TensorRT-LLM acceleration` | TRT-LLM not installed | Build the NeMo-Aligner Dockerfile with TRT-LLM support |
| `AssertionError: max_input_len + max_generation_length <= encoder_seq_length` | Input + output exceeds model sequence length | Reduce `max_input_len` or `max_generation_length` in config |
| `pynvml.NVMLError` | pynvml version 12.0.0 breaking change | Pin `pynvml==11.5.3` |
| `'use_greedy=True' overrides sample_top_k to 1` | Greedy mode overrides sampling config | This is a warning only; greedy forces top_k=1 |
Compatibility Notes
- Build Required: TRT-LLM cannot be installed standalone via `pip`; it must be built from source via the Dockerfile.
- Model Types: Supports `llama` model type by default (configurable via `trt_llm.model_type`).
- Resharding: When `reshard=True`, pipeline parallelism is converted to tensor parallelism during inference. This speeds up generation but adds resharding overhead.
- Engine Unloading: Set `unload_engine_train=True` to free GPU memory occupied by the TRT engine during the training phase.
- Patching: The Dockerfile applies a patch (`setup/trtllm.patch`) to TRT-LLM for NeMo-Aligner compatibility.