Implementation:Triton inference server Server TRT LLM Run

Metadata

Field	Value
Type	Implementation
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L114-135
Domains	Quality_Assurance, NLP
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server)
implements	Principle:Triton_inference_server_Server_Engine_Validation
2026-02-13 17:00 GMT

Overview

TRT-LLM provides validation scripts for testing the output of a compiled engine. This page covers qualitative text generation testing with run.py and quantitative summarization benchmarking with summarize.py.

Description

Two validation approaches are provided:

run.py — Interactive text generation script for qualitative validation. Takes a prompt and generates text, allowing visual inspection of output quality
summarize.py — Automated summarization benchmark using the CNN/DailyMail dataset with ROUGE-1 metric computation and threshold checking

Both scripts load the compiled TensorRT engine and the original HuggingFace tokenizer to perform end-to-end inference validation.

Usage

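Run these scripts after engine compilation. Both are located in the TRT-LLM examples directory; the commands on this page invoke them as ../run.py and ../summarize.py, so they are meant to be run from a subdirectory one level below examples. Use run.py for quick qualitative checks and summarize.py for quantitative regression detection.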

Code Reference

Source Location

Item	Value
File	docs/getting_started/llm.md
Lines	L114-135
Repo	https://github.com/triton-inference-server/server
Scripts	TensorRT-LLM/examples/run.py, TensorRT-LLM/examples/summarize.py

Signature (run.py)

python3 ../run.py \
    --engine_dir ./phi-engine \
    --max_output_len 500 \
    --tokenizer_dir ./Phi-3-mini-4k-instruct \
    --input_text "How do I count to nine in French?"

Signature (summarize.py)

python3 ../summarize.py \
    --test_trt_llm \
    --engine_dir ./phi-engine \
    --check_accuracy \
    --tensorrt_llm_rouge1_threshold 20

Import / Verification

# run.py outputs generated text to stdout
# summarize.py outputs ROUGE-1 score and pass/fail status

I/O Contract

Inputs

Name	Type	Description
`--engine_dir`	Directory path	Path to compiled TRT engine directory (e.g., `./phi-engine`)
`--tokenizer_dir`	Directory path	Path to HuggingFace tokenizer directory (e.g., `./Phi-3-mini-4k-instruct`)
`--max_output_len`	Integer	Maximum number of output tokens to generate (run.py)
`--input_text`	String	Prompt text for generation (run.py)
`--test_trt_llm`	Flag	Enables TRT-LLM engine testing mode (summarize.py)
`--check_accuracy`	Flag	Enables ROUGE-1 accuracy checking (summarize.py)
`--tensorrt_llm_rouge1_threshold`	Float	Minimum ROUGE-1 score to pass validation (summarize.py)

Outputs

Name	Type	Description
Generated text	stdout	Text generated by the engine in response to the input prompt (run.py)
ROUGE-1 score	stdout	Computed ROUGE-1 score against reference summaries (summarize.py)
Pass/fail status	stdout	Whether the ROUGE-1 score meets the threshold (summarize.py)
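
The I/O contract above can be driven from a small wrapper. The following is a minimal sketch, not part of the Triton or TRT-LLM sources: `build_run_cmd`, `build_summarize_cmd`, and `validate` are hypothetical helper names, and the script paths default to the relative paths used on this page. It only assembles the documented flags and relies on the exit code of each script.

```python
import subprocess

def build_run_cmd(engine_dir, tokenizer_dir, input_text,
                  max_output_len=500, script="../run.py"):
    """Assemble the run.py invocation from the documented inputs (hypothetical helper)."""
    return [
        "python3", script,
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", input_text,
    ]

def build_summarize_cmd(engine_dir, rouge1_threshold=20, script="../summarize.py"):
    """Assemble the summarize.py invocation from the documented inputs (hypothetical helper)."""
    return [
        "python3", script,
        "--test_trt_llm",
        "--engine_dir", engine_dir,
        "--check_accuracy",
        "--tensorrt_llm_rouge1_threshold", str(rouge1_threshold),
    ]

def validate(engine_dir, tokenizer_dir, prompt):
    """Run both checks in order; check=True raises CalledProcessError on a non-zero exit."""
    generated = subprocess.run(
        build_run_cmd(engine_dir, tokenizer_dir, prompt),
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(build_summarize_cmd(engine_dir), check=True)
    return generated
```

Passing the arguments as a list avoids shell quoting issues with prompts that contain spaces or question marks.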

Usage Examples

Qualitative validation with run.py

python3 ../run.py \
    --engine_dir ./phi-engine \
    --max_output_len 500 \
    --tokenizer_dir ./Phi-3-mini-4k-instruct \
    --input_text "How do I count to nine in French?"

The generated text should contain the French numbers one through nine: un, deux, trois, quatre, cinq, six, sept, huit, neuf. The exact wording around them may vary.
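
The visual check can also be approximated in code. This is a minimal sketch under stated assumptions: `missing_french_numbers` is a hypothetical helper (run.py performs no such check itself) that takes text captured from the script's stdout and reports which of the nine expected words are absent. Matching on lowercase whole words is an assumption about how strict the check should be.

```python
import re

# Expected words for the prompt "How do I count to nine in French?"
FRENCH_ONE_TO_NINE = ["un", "deux", "trois", "quatre", "cinq",
                      "six", "sept", "huit", "neuf"]

def missing_french_numbers(generated_text: str) -> list[str]:
    """Return the expected number words that do not occur in the text.

    Hypothetical helper for illustration only.
    """
    # Split into lowercase words so that "Un," and "un" both count as a match.
    words = set(re.findall(r"\w+", generated_text.lower()))
    return [w for w in FRENCH_ONE_TO_NINE if w not in words]

sample = "Un, deux, trois, quatre, cinq, six, sept, huit, neuf."
print(missing_french_numbers(sample))  # prints []
```

An empty list means every expected word was found. Note that "six" is spelled the same in English, so this check alone does not prove the answer is in French.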

Quantitative validation with summarize.py

python3 ../summarize.py \
    --test_trt_llm \
    --engine_dir ./phi-engine \
    --check_accuracy \
    --tensorrt_llm_rouge1_threshold 20

This script:

Loads test samples from the CNN/DailyMail dataset
Generates summaries using the TRT-LLM engine
Computes ROUGE-1 scores against reference summaries
Fails with a non-zero exit code if the score is below the threshold (20)
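
To make the metric and the threshold check concrete, here is a simplified stand-in for the last two steps. It is not the code summarize.py uses: it assumes whitespace tokenization, the F1 variant of ROUGE-1, a 0 to 100 score scale (which the threshold value 20 suggests), and a plain mean over samples. `rouge1_f1` and `check_threshold` are hypothetical names.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on a 0-100 scale (simplified: lowercase whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100.0 * 2 * precision * recall / (precision + recall)

def check_threshold(scores: list[float], threshold: float) -> float:
    """Return the mean score, or exit with a non-zero status if it is below the threshold."""
    mean = sum(scores) / len(scores)
    if mean < threshold:
        # SystemExit with a string message yields exit status 1.
        raise SystemExit(f"ROUGE-1 {mean:.2f} is below threshold {threshold}")
    return mean

scores = [
    rouge1_f1("the cat sat", "the cat sat"),
    rouge1_f1("the cat", "the cat sat on mat"),
]
print(check_threshold(scores, threshold=20))
```

Production ROUGE implementations typically add stemming and more careful tokenization, so absolute scores from this sketch will not match the script's output.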

Key Parameters

Parameter	Script	Description	Example Value
`--engine_dir`	Both	Compiled TRT engine directory	`./phi-engine`
`--tokenizer_dir`	run.py	HuggingFace tokenizer path	`./Phi-3-mini-4k-instruct`
`--max_output_len`	run.py	Max tokens to generate	`500`
`--input_text`	run.py	Input prompt	`"How do I count to nine in French?"`
`--tensorrt_llm_rouge1_threshold`	summarize.py	Minimum ROUGE-1 score	`20`

Related Pages

Principle:Triton_inference_server_Server_Engine_Validation
Implementation:Triton_inference_server_Server_Trtllm_Build — Prerequisite: engine compilation
Implementation:Triton_inference_server_Server_Fill_Template — Next step: model repository setup
Environment:Triton_inference_server_Server_TRT_LLM_Deployment
