Environment:NVIDIA NeMo Aligner TensorRT LLM Acceleration Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization, Inference |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
Optional TensorRT-LLM v0.13.0 acceleration environment for high-throughput LLM generation during PPO and REINFORCE rollouts.
Description
TensorRT-LLM provides optimized inference for LLM text generation during the rollout phase of PPO and REINFORCE training. It compiles the model into a TensorRT engine, enabling significantly faster generation compared to native PyTorch inference. This environment extends the base NeMo Framework GPU environment with TRT-LLM dependencies. It supports resharding from pipeline-parallel to tensor-parallel-only during inference for further speedup.
Usage
Use this environment when running PPO or REINFORCE training workflows that need faster rollout generation. Enable it via `trainer.ppo.trt_llm.enable=True` in the training config. The Docker image must be built with TRT-LLM support enabled. TRT-LLM is optional: training works without it, falling back to native PyTorch generation.
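An illustrative launch command follows. The script path and the `reshard` override are assumptions based on the NeMo-Aligner examples layout; only `trainer.ppo.trt_llm.enable` is confirmed by this document.

```shell
# Sketch of a PPO actor launch with TRT-LLM rollout acceleration enabled.
# Run inside the container built from the NeMo-Aligner Dockerfile.
python examples/nlp/gpt/train_gpt_ppo_actor.py \
    trainer.ppo.trt_llm.enable=True \
    trainer.ppo.trt_llm.reshard=True
```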
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (via NGC container) | Must use the NeMo-Aligner Dockerfile build |
| Hardware | NVIDIA GPU with TensorRT support | A100/H100 recommended |
| GPU Memory | Additional VRAM for TRT engine | Engine compilation requires extra memory |
| Disk | 50GB+ additional | For TRT-LLM engine cache at `/tmp/trt_llm_model` |
Dependencies
System Packages
- TensorRT (installed via TRT-LLM build script)
- CUDA 12.x compatibility libraries
Python Packages
- `tensorrt-llm` == v0.13.0
- `pynvml` == 11.5.3 (strict pin, 12.0.0 has breaking changes)
- All dependencies from Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment
Credentials
- `DISABLE_TORCH_DEVICE_SET`: Set to `1` automatically by NeMo-Aligner (prevents device reassignment within TRT-LLM)
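NeMo-Aligner sets this variable at import time (see the Code Evidence section). If you import `tensorrt_llm` outside of NeMo-Aligner, a minimal sketch of the same guard, using `setdefault` rather than the unconditional assignment in the source, is:

```python
import os

# Prevent TRT-LLM from reassigning the CUDA device already chosen by the
# trainer; must be set before tensorrt_llm is imported.
os.environ.setdefault("DISABLE_TORCH_DEVICE_SET", "1")
```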
Quick Install
# TRT-LLM must be built from the Dockerfile - it cannot be pip installed standalone
git clone https://github.com/NVIDIA/NeMo-Aligner.git
cd NeMo-Aligner
docker buildx build -t aligner:latest .
# Inside the container, TRT-LLM is available automatically
# Enable in config:
# trainer.ppo.trt_llm.enable=True
# trainer.ppo.trt_llm.reshard=True # optional: reshard PP to TP-only for inference
Code Evidence
TRT-LLM availability check from `nemo_aligner/utils/trt_llm.py:26-32`:
try:
    import tensorrt_llm

    HAVE_TRTLLM = True
except (ImportError, ModuleNotFoundError) as e:
    logging.info(f"got error message {e} when importing trt-llm dependencies, disabling")
    HAVE_TRTLLM = False
Runtime error when TRT-LLM is not available from `nemo_aligner/utils/trt_llm.py:71-74`:
if not HAVE_TRTLLM:
    raise RuntimeError(
        "You are trying to use NeMo-Aligner's TensorRT-LLM acceleration for LLM generation. "
        "Please build the dockerfile to enable this feature: "
        "https://github.com/NVIDIA/NeMo-Aligner/blob/main/Dockerfile"
    )
Sequence length constraint validation from `nemo_aligner/utils/trt_llm.py:76-80`:
assert max_input_len > 0
assert max_generation_length > 0
assert (
    max_input_len + max_generation_length <= model_cfg.encoder_seq_length
), f"We require max_input_len ({max_input_len}) + max_generation_length ({max_generation_length}) <= model_cfg.encoder_seq_length ({model_cfg.encoder_seq_length})"
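A standalone sketch of the same constraint, with `encoder_seq_length` as a plain integer rather than a model config attribute, shows how the budget splits between prompt and generation:

```python
def validate_lengths(max_input_len: int, max_generation_length: int,
                     encoder_seq_length: int) -> None:
    """Check that prompt plus generated tokens fit in the model's sequence length."""
    assert max_input_len > 0
    assert max_generation_length > 0
    assert max_input_len + max_generation_length <= encoder_seq_length, (
        f"max_input_len ({max_input_len}) + max_generation_length "
        f"({max_generation_length}) must be <= encoder_seq_length ({encoder_seq_length})"
    )

# A 3072-token prompt plus 1024 generated tokens fits a 4096-token model.
validate_lengths(3072, 1024, 4096)
```

If the sum exceeds the model's sequence length (e.g. 3072 + 2048 against 4096), the assertion fires; reduce one of the two limits in the training config.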
Device reassignment prevention from `nemo_aligner/__init__.py:17`:
os.environ["DISABLE_TORCH_DEVICE_SET"] = "1"
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: You are trying to use NeMo-Aligner's TensorRT-LLM acceleration` | TRT-LLM not installed | Build the NeMo-Aligner Dockerfile with TRT-LLM support |
| `AssertionError: max_input_len + max_generation_length <= encoder_seq_length` | Input + output exceeds model sequence length | Reduce `max_input_len` or `max_generation_length` in config |
| `pynvml.NVMLError` | pynvml version 12.0.0 breaking change | Pin `pynvml==11.5.3` |
| `'use_greedy=True' overrides sample_top_k to 1` | Greedy mode overrides sampling config | This is a warning only; greedy forces top_k=1 |
Compatibility Notes
- Build Required: TRT-LLM cannot be installed standalone via `pip`; it must be built from source via the Dockerfile.
- Model Types: Supports `llama` model type by default (configurable via `trt_llm.model_type`).
- Resharding: When `reshard=True`, pipeline parallelism is converted to tensor parallelism during inference. This speeds up generation but adds resharding overhead.
- Engine Unloading: Set `unload_engine_train=True` to free GPU memory occupied by the TRT engine during the training phase.
- Patching: The Dockerfile applies a patch (`setup/trtllm.patch`) to TRT-LLM for NeMo-Aligner compatibility.