
Environment: NVIDIA NeMo Aligner TensorRT-LLM Acceleration Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Optimization, Inference
Last Updated: 2026-02-07 22:00 GMT

Overview

Optional TensorRT-LLM v0.13.0 acceleration environment for high-throughput LLM generation during PPO and REINFORCE rollouts.

Description

TensorRT-LLM provides optimized inference for LLM text generation during the rollout phase of PPO and REINFORCE training. It compiles the model into a TensorRT engine, enabling significantly faster generation compared to native PyTorch inference. This environment extends the base NeMo Framework GPU environment with TRT-LLM dependencies. It supports resharding from pipeline-parallel to tensor-parallel-only during inference for further speedup.

Usage

Use this environment when running PPO or REINFORCE training workflows that need faster rollout generation. Enable it via `trainer.ppo.trt_llm.enable=True` in the training config. The Dockerfile must be built with TRT-LLM support enabled. TRT-LLM is optional; without it, training falls back to native PyTorch generation.

System Requirements

Category   | Requirement                       | Notes
OS         | Linux (via NGC container)         | Must use the NeMo-Aligner Dockerfile build
Hardware   | NVIDIA GPU with TensorRT support  | A100/H100 recommended
GPU Memory | Additional VRAM for TRT engine    | Engine compilation requires extra memory
Disk       | 50 GB+ additional                 | For TRT-LLM engine cache at `/tmp/trt_llm_model`
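As a quick pre-flight check of the disk requirement above, a minimal sketch (the cache path `/tmp/trt_llm_model` and the 50 GB+ guideline come from the table; the helper name is illustrative, not part of NeMo-Aligner):

```python
import shutil

def free_gib(path: str = "/tmp") -> float:
    """Free disk space at `path` in GiB; the TRT-LLM engine cache defaults to /tmp/trt_llm_model."""
    return shutil.disk_usage(path).free / 2**30

# Warn if the 50 GB+ guideline from the table above is not met.
if free_gib() < 50:
    print("warning: less than 50 GiB free for the TRT-LLM engine cache")
```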

Dependencies

System Packages

  • TensorRT (installed via TRT-LLM build script)
  • CUDA 12.x compatibility libraries

Python Packages

  • `tensorrt_llm` (v0.13.0, built from source via the Dockerfile; not pip-installable standalone)
  • `pynvml==11.5.3` (12.0.0 introduces a breaking change; see Common Errors)

Credentials

  • `DISABLE_TORCH_DEVICE_SET`: Set to `1` automatically by NeMo-Aligner (prevents device reassignment within TRT-LLM)

Quick Install

# TRT-LLM must be built from the Dockerfile - it cannot be pip installed standalone
git clone https://github.com/NVIDIA/NeMo-Aligner.git
cd NeMo-Aligner
docker buildx build -t aligner:latest .

# Inside the container, TRT-LLM is available automatically
# Enable in config:
# trainer.ppo.trt_llm.enable=True
# trainer.ppo.trt_llm.reshard=True  # optional: reshard PP to TP-only for inference

Code Evidence

TRT-LLM availability check from `nemo_aligner/utils/trt_llm.py:26-32`:

try:
    import tensorrt_llm
    HAVE_TRTLLM = True
except (ImportError, ModuleNotFoundError) as e:
    logging.info(f"got error message {e} when importing trt-llm dependencies, disabling")
    HAVE_TRTLLM = False
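The same guarded-import pattern generalizes to any optional dependency; a self-contained sketch (the package name `some_optional_pkg` is a placeholder, deliberately nonexistent here):

```python
import logging

# Mirror of the excerpt above: try the optional import, record availability,
# and log rather than crash when the package is absent.
try:
    import some_optional_pkg  # placeholder for an optional accelerator package
    HAVE_PKG = True
except (ImportError, ModuleNotFoundError) as e:
    logging.info(f"got error message {e} when importing optional dependency, disabling")
    HAVE_PKG = False
```

Downstream code then branches on the flag (as `HAVE_TRTLLM` is used in the runtime check below) instead of importing at the point of use.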

Runtime error when TRT-LLM is not available from `nemo_aligner/utils/trt_llm.py:71-74`:

if not HAVE_TRTLLM:
    raise RuntimeError(
        "You are trying to use NeMo-Aligner's TensorRT-LLM acceleration for LLM generation. "
        "Please build the dockerfile to enable this feature: "
        "https://github.com/NVIDIA/NeMo-Aligner/blob/main/Dockerfile"
    )

Sequence length constraint validation from `nemo_aligner/utils/trt_llm.py:76-80`:

assert max_input_len > 0
assert max_generation_length > 0
assert (
    max_input_len + max_generation_length <= model_cfg.encoder_seq_length
), f"We require max_input_len ({max_input_len}) + max_generation_length ({max_generation_length}) <= model_cfg.encoder_seq_length ({model_cfg.encoder_seq_length})"
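Wrapped as a standalone function, the constraint can be checked before engine compilation; a minimal sketch (the function name is illustrative, the logic mirrors the assertions above):

```python
def check_lengths(max_input_len: int, max_generation_length: int, encoder_seq_length: int) -> None:
    """Prompt length plus generation budget must fit in the model's sequence length."""
    assert max_input_len > 0
    assert max_generation_length > 0
    assert max_input_len + max_generation_length <= encoder_seq_length, (
        f"We require max_input_len ({max_input_len}) + max_generation_length "
        f"({max_generation_length}) <= encoder_seq_length ({encoder_seq_length})"
    )

# e.g. a 4096-token model can take a 2048-token prompt and generate 1024 tokens,
# but a 4000-token prompt with 200 generated tokens overflows and raises.
check_lengths(2048, 1024, 4096)
```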

Device reassignment prevention from `nemo_aligner/__init__.py:17`:

os.environ["DISABLE_TORCH_DEVICE_SET"] = "1"

Common Errors

Error Message | Cause | Solution
`RuntimeError: You are trying to use NeMo-Aligner's TensorRT-LLM acceleration` | TRT-LLM not installed | Build the NeMo-Aligner Dockerfile with TRT-LLM support
`AssertionError: max_input_len + max_generation_length <= encoder_seq_length` | Input + output exceeds model sequence length | Reduce `max_input_len` or `max_generation_length` in config
`pynvml.NVMLError` | pynvml 12.0.0 breaking change | Pin `pynvml==11.5.3`
`'use_greedy=True' overrides sample_top_k to 1` | Greedy mode overrides sampling config | Warning only; greedy decoding forces top_k=1
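For the pynvml incompatibility above, one workaround is pinning the package before (or inside) the container build; a config fragment, to be adapted to your build flow:

```shell
# pynvml 12.0.0 introduces a breaking change; pin the last known-good release.
pip install "pynvml==11.5.3"
```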

Compatibility Notes

  • Build Required: TRT-LLM cannot be `pip install`ed; it must be built from source via the Dockerfile.
  • Model Types: Supports `llama` model type by default (configurable via `trt_llm.model_type`).
  • Resharding: When `reshard=True`, pipeline parallelism is converted to tensor parallelism during inference. This speeds up generation but adds resharding overhead.
  • Engine Unloading: Set `unload_engine_train=True` to free GPU memory occupied by the TRT engine during the training phase.
  • Patching: The Dockerfile applies a patch (`setup/trtllm.patch`) to TRT-LLM for NeMo-Aligner compatibility.
