Implementation:Triton inference server Server TRT LLM Run

From Leeroopedia

Metadata

Field | Value
Type | Implementation
Workflow | LLM_Deployment_With_TRT_LLM
Repo | Triton_inference_server_Server
Source | docs/getting_started/llm.md:L114-135
Domains | Quality_Assurance, NLP
Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/); Triton Server repo (https://github.com/triton-inference-server/server)
implements Principle:Triton_inference_server_Server_Engine_Validation
2026-02-13 17:00 GMT

Overview

Concrete TRT-LLM validation scripts for testing compiled engine outputs. This implementation covers both qualitative text generation testing via run.py and quantitative summarization benchmarking via summarize.py.

Description

Two validation approaches are provided:

  1. run.py: Text generation script for qualitative validation. It takes a prompt via --input_text and prints the generated text so that output quality can be inspected by eye.
  2. summarize.py: Automated summarization benchmark on the CNN/DailyMail dataset that computes the ROUGE-1 metric and checks it against a threshold.

Both scripts load the compiled TensorRT engine and the original HuggingFace tokenizer to perform end-to-end inference validation.

Usage

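Run these scripts after engine compilation. Both are located in the TRT-LLM examples directory; the commands on this page reference them as ../run.py and ../summarize.py, so they are invoked from a subdirectory of it. Use run.py for quick qualitative checks and summarize.py for quantitative regression detection.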
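
When validation is driven from a test harness rather than typed by hand, the two invocations can be assembled as argument lists. The following is a minimal sketch, not part of TRT-LLM: the helper names build_run_cmd and build_summarize_cmd are hypothetical, and only the flags shown on this page are used.

```python
from typing import List


def build_run_cmd(engine_dir: str, tokenizer_dir: str, prompt: str,
                  max_output_len: int = 500,
                  script: str = "../run.py") -> List[str]:
    """Assemble the run.py argument list (hypothetical helper)."""
    return [
        "python3", script,
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", prompt,
    ]


def build_summarize_cmd(engine_dir: str, rouge1_threshold: float = 20,
                        script: str = "../summarize.py") -> List[str]:
    """Assemble the summarize.py argument list (hypothetical helper)."""
    return [
        "python3", script,
        "--test_trt_llm",
        "--engine_dir", engine_dir,
        "--check_accuracy",
        "--tensorrt_llm_rouge1_threshold", str(rouge1_threshold),
    ]
```

Passing such a list to subprocess.run, instead of a single shell string, avoids having to quote the prompt text for the shell.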

Code Reference

Source Location

Item | Value
File | docs/getting_started/llm.md
Lines | L114-135
Repo | https://github.com/triton-inference-server/server
Scripts | TensorRT-LLM/examples/run.py, TensorRT-LLM/examples/summarize.py

Signature (run.py)

python3 ../run.py \
    --engine_dir ./phi-engine \
    --max_output_len 500 \
    --tokenizer_dir ./Phi-3-mini-4k-instruct \
    --input_text "How do I count to nine in French?"

Signature (summarize.py)

python3 ../summarize.py \
    --test_trt_llm \
    --engine_dir ./phi-engine \
    --check_accuracy \
    --tensorrt_llm_rouge1_threshold 20

Import / Verification

# run.py outputs generated text to stdout
# summarize.py outputs ROUGE-1 score and pass/fail status
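
Since both scripts report their results on stdout, a harness can capture that output together with the exit code. The sketch below assumes nothing about the scripts' log format; run_and_capture is a hypothetical helper, and the command in the main guard is a stand-in so the example runs without a compiled engine.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_capture(cmd: List[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, captured stdout)."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # Stand-in command; replace with the run.py or summarize.py
    # argument list when an engine is available.
    code, out = run_and_capture(
        [sys.executable, "-c", "print('generated text')"]
    )
    print(code, out.strip())
```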

I/O Contract

Inputs

Name | Type | Description
--engine_dir | Directory path | Path to compiled TRT engine directory (e.g., ./phi-engine)
--tokenizer_dir | Directory path | Path to HuggingFace tokenizer directory (e.g., ./Phi-3-mini-4k-instruct)
--max_output_len | Integer | Maximum number of output tokens to generate (run.py)
--input_text | String | Prompt text for generation (run.py)
--test_trt_llm | Flag | Enables TRT-LLM engine testing mode (summarize.py)
--check_accuracy | Flag | Enables ROUGE-1 accuracy checking (summarize.py)
--tensorrt_llm_rouge1_threshold | Float | Minimum ROUGE-1 score to pass validation (summarize.py)

Outputs

Name | Type | Description
Generated text | stdout | Text generated by the engine in response to the input prompt (run.py)
ROUGE-1 score | stdout | Computed ROUGE-1 score against reference summaries (summarize.py)
Pass/fail status | stdout | Whether the ROUGE-1 score meets the threshold (summarize.py)

Usage Examples

Qualitative validation with run.py

python3 ../run.py \
    --engine_dir ./phi-engine \
    --max_output_len 500 \
    --tokenizer_dir ./Phi-3-mini-4k-instruct \
    --input_text "How do I count to nine in French?"

The generated text should contain the French numbers from one to nine: un, deux, trois, quatre, cinq, six, sept, huit, neuf.
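
The visual check for the nine French numbers can be approximated in code. This is an illustrative sketch only: missing_french_numbers is a hypothetical helper, not part of run.py, and it matches whole words so that "un" inside "count" is not counted.

```python
import re
from typing import List

FRENCH_ONE_TO_NINE = [
    "un", "deux", "trois", "quatre", "cinq",
    "six", "sept", "huit", "neuf",
]


def missing_french_numbers(generated: str) -> List[str]:
    """Return the expected number words absent from the generated text."""
    # Whole-word, case-insensitive match on the generated text.
    words = set(re.findall(r"\w+", generated.lower()))
    return [number for number in FRENCH_ONE_TO_NINE if number not in words]
```

An empty result means every expected word appeared. Note that "six" is spelled the same in English, so this check is a coarse sanity test and not a measure of answer quality.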

Quantitative validation with summarize.py

python3 ../summarize.py \
    --test_trt_llm \
    --engine_dir ./phi-engine \
    --check_accuracy \
    --tensorrt_llm_rouge1_threshold 20

This script:

  • Loads test samples from the CNN/DailyMail dataset
  • Generates summaries using the TRT-LLM engine
  • Computes ROUGE-1 scores against reference summaries
  • Fails with a non-zero exit code if the score is below the threshold (20)
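
To make the threshold check concrete, the sketch below computes a simplified ROUGE-1 score and compares the mean against a threshold. It is not the implementation used by summarize.py: tokenization here is plain whitespace splitting, the 0-100 scale is an assumption chosen to match a threshold value such as 20, and the names rouge1_f1 and passes_threshold are hypothetical.

```python
from collections import Counter
from typing import Sequence


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on a 0-100 scale (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100.0 * 2 * precision * recall / (precision + recall)


def passes_threshold(scores: Sequence[float], threshold: float = 20.0) -> bool:
    """True if the mean score is at or above the threshold."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(scores) / len(scores) >= threshold
```

A real ROUGE library applies its own tokenization and optional stemming, so scores from this sketch will differ from those reported by the benchmark.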

Key Parameters

Parameter | Script | Description | Example Value
--engine_dir | Both | Compiled TRT engine directory | ./phi-engine
--tokenizer_dir | run.py | HuggingFace tokenizer path | ./Phi-3-mini-4k-instruct
--max_output_len | run.py | Max tokens to generate | 500
--input_text | run.py | Input prompt | "How do I count to nine in French?"
--tensorrt_llm_rouge1_threshold | summarize.py | Minimum ROUGE-1 score | 20
