Implementation:Triton inference server Server TRT LLM Run
Metadata
| Field | Value |
|---|---|
| Type | Implementation |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L114-135 |
| Domains | Quality_Assurance, NLP |
| Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server) |
| implements | Principle:Triton_inference_server_Server_Engine_Validation |
| Last updated | 2026-02-13 17:00 GMT |
Overview
Concrete TRT-LLM validation scripts for testing compiled engine outputs. This implementation covers both qualitative text generation testing via run.py and quantitative summarization benchmarking via summarize.py.
Description
Two validation approaches are provided:
- run.py: Interactive text generation script for qualitative validation. It takes a prompt and generates text, allowing visual inspection of output quality.
- summarize.py: Automated summarization benchmark using the CNN/DailyMail dataset, with ROUGE-1 metric computation and threshold checking.
Both scripts load the compiled TensorRT engine and the original HuggingFace tokenizer to perform end-to-end inference validation.
Usage
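Run both scripts after engine compilation. They are located in the TRT-LLM examples directory; the commands on this page invoke them as ../run.py and ../summarize.py, which assumes the working directory is a subdirectory of examples. Use run.py for quick qualitative checks and summarize.py for quantitative regression detection.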
Code Reference
Source Location
| Item | Value |
|---|---|
| File | docs/getting_started/llm.md |
| Lines | L114-135 |
| Repo | https://github.com/triton-inference-server/server |
| Scripts | TensorRT-LLM/examples/run.py, TensorRT-LLM/examples/summarize.py |
Signature (run.py)
python3 ../run.py \
--engine_dir ./phi-engine \
--max_output_len 500 \
--tokenizer_dir ./Phi-3-mini-4k-instruct \
--input_text "How do I count to nine in French?"
Signature (summarize.py)
python3 ../summarize.py \
--test_trt_llm \
--engine_dir ./phi-engine \
--check_accuracy \
--tensorrt_llm_rouge1_threshold 20
Import / Verification
# run.py outputs generated text to stdout
# summarize.py outputs ROUGE-1 score and pass/fail status
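Because both scripts report their results on stdout, a validation step can be wrapped in a small driver that captures the output and the exit code. The following is a minimal sketch, not part of TRT-LLM: `build_run_command` and `run_and_capture` are hypothetical helper names, and the flags are only those listed on this page.

```python
import subprocess
import sys


def build_run_command(engine_dir, tokenizer_dir, input_text,
                      max_output_len=500, script="../run.py"):
    """Assemble the run.py invocation from this page as an argument list.

    Assumption: the current Python interpreter (sys.executable) stands in
    for the `python3` used in the commands on this page.
    """
    return [
        sys.executable, script,
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", input_text,
    ]


def run_and_capture(command):
    """Run a command and return (exit_code, stdout) without raising on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or question marks.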
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| --engine_dir | Directory path | Path to compiled TRT engine directory (e.g., ./phi-engine) |
| --tokenizer_dir | Directory path | Path to HuggingFace tokenizer directory (e.g., ./Phi-3-mini-4k-instruct) |
| --max_output_len | Integer | Maximum number of output tokens to generate (run.py) |
| --input_text | String | Prompt text for generation (run.py) |
| --test_trt_llm | Flag | Enables TRT-LLM engine testing mode (summarize.py) |
| --check_accuracy | Flag | Enables ROUGE-1 accuracy checking (summarize.py) |
| --tensorrt_llm_rouge1_threshold | Float | Minimum ROUGE-1 score to pass validation (summarize.py) |
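Loading an engine can be slow, so it may be worth checking the path and integer inputs before launching either script. The sketch below is a hypothetical pre-flight helper (`preflight_problems` is not a TRT-LLM function); it only checks that the two directory arguments exist and that the output length is a positive integer.

```python
from pathlib import Path


def preflight_problems(engine_dir, tokenizer_dir, max_output_len):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Both directory flags must point at existing directories.
    for flag, value in (("--engine_dir", engine_dir),
                        ("--tokenizer_dir", tokenizer_dir)):
        if not Path(value).is_dir():
            problems.append(f"{flag}: {value!r} is not a directory")
    # --max_output_len is documented as an integer token count.
    if not isinstance(max_output_len, int) or max_output_len <= 0:
        problems.append(
            f"--max_output_len: expected a positive integer, got {max_output_len!r}"
        )
    return problems
```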
Outputs
| Name | Type | Description |
|---|---|---|
| Generated text | stdout | Text generated by the engine in response to the input prompt (run.py) |
| ROUGE-1 score | stdout | Computed ROUGE-1 score against reference summaries (summarize.py) |
| Pass/fail status | stdout | Whether the ROUGE-1 score meets the threshold (summarize.py) |
Usage Examples
Qualitative validation with run.py
python3 ../run.py \
--engine_dir ./phi-engine \
--max_output_len 500 \
--tokenizer_dir ./Phi-3-mini-4k-instruct \
--input_text "How do I count to nine in French?"
Expected output should contain French numbers: un, deux, trois, quatre, cinq, six, sept, huit, neuf.
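The visual check above can be approximated in code. This is a rough sketch under the assumption that the generated text has already been captured as a string; `missing_french_numbers` is a hypothetical helper, and a word match is only a weak signal (for example, "un" is also a French article).

```python
import re

FRENCH_ONE_TO_NINE = [
    "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf",
]


def missing_french_numbers(generated_text):
    """Return the French number words (one to nine) absent from the text."""
    # Compare whole words, case-insensitively, so "Un," still counts as "un".
    words = set(re.findall(r"\w+", generated_text.lower()))
    return [number for number in FRENCH_ONE_TO_NINE if number not in words]
```

An empty result means all nine words appear somewhere in the output; it does not confirm that they appear in order or that the rest of the answer is sensible.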
Quantitative validation with summarize.py
python3 ../summarize.py \
--test_trt_llm \
--engine_dir ./phi-engine \
--check_accuracy \
--tensorrt_llm_rouge1_threshold 20
This script:
- Loads test samples from the CNN/DailyMail dataset
- Generates summaries using the TRT-LLM engine
- Computes ROUGE-1 scores against reference summaries
- Fails with a non-zero exit code if the score is below the threshold (20)
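To make the scoring and threshold steps concrete, here is a simplified illustration of a ROUGE-1 F1 score and a threshold check. It is not the code summarize.py uses: the tokenization is a plain word split, the function names are hypothetical, and it assumes the threshold is expressed on a 0 to 100 scale, as the example value 20 suggests.

```python
import re
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference, scaled to 0-100."""
    cand_tokens = re.findall(r"\w+", candidate.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it occurs in both.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 100.0 * 2 * precision * recall / (precision + recall)


def exit_code_for(score, threshold=20.0):
    """Return 0 when the score meets the threshold, 1 when it falls below."""
    return 0 if score >= threshold else 1
```

For a benchmark run, the per-sample scores would be aggregated before the comparison; how summarize.py aggregates them is not described on this page.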
Key Parameters
| Parameter | Script | Description | Example Value |
|---|---|---|---|
| --engine_dir | Both | Compiled TRT engine directory | ./phi-engine |
| --tokenizer_dir | run.py | HuggingFace tokenizer path | ./Phi-3-mini-4k-instruct |
| --max_output_len | run.py | Max tokens to generate | 500 |
| --input_text | run.py | Input prompt | "How do I count to nine in French?" |
| --tensorrt_llm_rouge1_threshold | summarize.py | Minimum ROUGE-1 score | 20 |
Related Pages
- Principle:Triton_inference_server_Server_Engine_Validation
- Implementation:Triton_inference_server_Server_Trtllm_Build — Prerequisite: engine compilation
- Implementation:Triton_inference_server_Server_Fill_Template — Next step: model repository setup
- Environment:Triton_inference_server_Server_TRT_LLM_Deployment