Implementation:Triton inference server Server HTTP Generate Endpoint
Metadata
| Field | Value |
|---|---|
| Type | Implementation |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | src/http_server.cc:L3297-3461, docs/protocol/extension_generate.md:L29-194 |
| Domains | NLP, HTTP_API, LLM_Deployment |
| Knowledge_Sources | Triton Server (https://github.com/triton-inference-server/server); Generate Extension doc (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html) |
| implements | Principle:Triton_inference_server_Server_Generate_API |
| Timestamp | 2026-02-13 17:00 GMT |
Overview
Concrete HTTP handler for text generation requests in Triton Inference Server. This page covers the server-side handler code, the endpoint URLs, the request and response JSON formats, and the internal conversion logic.
Description
The Generate endpoint is implemented in src/http_server.cc as the HTTPAPIServer::HandleGenerate method. It provides two HTTP endpoints:
- POST /v2/models/{model_name}/generate: synchronous, single-response generation
- POST /v2/models/{model_name}/generate_stream: streaming generation over SSE (Server-Sent Events)
The handler performs request conversion via ConvertGenerateRequest (L3507-3559), which transforms the text-based JSON request into the internal KServe v2 tensor format, then calls TRITONSERVER_ServerInferAsync for execution.
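The conversion step can be pictured with a small Python sketch. This is an illustration of the idea only, not a port of the C++ code: the helper name `convert_generate_request`, the `model_inputs` dictionary, the one-dimensional shape, and the rule that keys matching a model input become tensors while other `parameters` entries stay request parameters are all assumptions made for the example. The actual rules are defined by ConvertGenerateRequest and the protocol doc.

```python
def convert_generate_request(body, model_inputs):
    """Illustrative only: map a generate-style JSON body to a KServe v2 infer body.

    `model_inputs` is a hypothetical {input_name: datatype} view of the model
    configuration, e.g. {"text_input": "BYTES", "max_tokens": "INT32"}.
    """
    inputs = []
    request_params = {}
    request = {}

    def place(name, value, may_be_parameter):
        if name in model_inputs:
            # Assumed shape: a flat vector with one element per value.
            data = value if isinstance(value, list) else [value]
            inputs.append({
                "name": name,
                "datatype": model_inputs[name],
                "shape": [len(data)],
                "data": data,
            })
        elif may_be_parameter:
            # Assumed rule: non-tensor entries remain request parameters.
            request_params[name] = value
        else:
            raise ValueError(f"'{name}' does not match any model input")

    for key, value in body.items():
        if key == "id":
            request["id"] = value
        elif key == "parameters":
            for pname, pvalue in value.items():
                place(pname, pvalue, may_be_parameter=True)
        else:
            place(key, value, may_be_parameter=False)

    request["inputs"] = inputs
    if request_params:
        request["parameters"] = request_params
    return request
```

The returned dictionary follows the KServe v2 inference request layout (`inputs` entries with `name`, `datatype`, `shape`, and `data`), which is the form the server passes on for execution.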
Usage
Send HTTP POST requests to a running Triton server. The model name in the URL should be the ensemble model name (e.g., ensemble) for TRT-LLM deployments.
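As a small convenience, the two endpoint URLs can be built from a base address and a model name. The helper below is a hypothetical sketch (`generate_url` is not part of any Triton client library), and it assumes the server listens on the HTTP port used in the examples on this page.

```python
from urllib.parse import quote


def generate_url(base_url, model_name, stream=False):
    """Build the generate or generate_stream URL for a model (hypothetical helper)."""
    endpoint = "generate_stream" if stream else "generate"
    # Percent-encode the model name so unusual characters cannot alter the path.
    return f"{base_url.rstrip('/')}/v2/models/{quote(model_name, safe='')}/{endpoint}"


print(generate_url("http://localhost:8000", "ensemble"))
print(generate_url("http://localhost:8000", "ensemble", stream=True))
```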
Code Reference
Source Location
| Item | Value |
|---|---|
| File | src/http_server.cc |
| Lines | L3297-3461 (HandleGenerate), L3507-3559 (ConvertGenerateRequest) |
| Repo | https://github.com/triton-inference-server/server |
| Protocol doc | docs/protocol/extension_generate.md:L29-194 |
Signature
POST /v2/models/<model_name>/generate
POST /v2/models/<model_name>/generate_stream
Server-side handler:
// src/http_server.cc
void HTTPAPIServer::HandleGenerate(evhtp_request_t* req);  // L3297-3461

// Internal conversion: text JSON → KServe v2 tensor format
TRITONSERVER_Error* ConvertGenerateRequest(
    const std::string& model_name,
    evhtp_request_t* req,
    ...);  // L3507-3559
Import
No client-side import required. The endpoint is accessible via standard HTTP clients (curl, Python requests, etc.).
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| text_input | String (required) | The prompt text for generation |
| parameters.max_tokens | Integer | Maximum number of tokens to generate |
| parameters.temperature | Float | Sampling temperature (higher = more random) |
| parameters.top_k | Integer | Top-k sampling parameter |
| parameters.top_p | Float | Top-p (nucleus) sampling parameter |
| parameters.beam_width | Integer | Beam search width (1 = greedy) |
| parameters.bad_words | List of strings | Words to exclude from generation |
| parameters.stop_words | List of strings | Words that trigger generation stop |
| parameters.stream | Boolean | Enable streaming (alternative to using /generate_stream) |
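A request body with this shape can be assembled with a short helper. This is a minimal sketch: `build_generate_payload` is a hypothetical name, and the choice to drop parameters set to `None` is an assumption made for convenience, not something the endpoint requires.

```python
def build_generate_payload(text_input, **parameters):
    """Assemble a generate request body (hypothetical helper)."""
    if not isinstance(text_input, str):
        raise TypeError("text_input must be a string")
    payload = {"text_input": text_input}
    # Leave out parameters the caller did not set.
    params = {name: value for name, value in parameters.items() if value is not None}
    if params:
        payload["parameters"] = params
    return payload


payload = build_generate_payload(
    "How do I count to nine in French?",
    max_tokens=256,
    temperature=None,  # omitted from the request body
)
print(payload)
```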
Outputs
| Name | Type | Description |
|---|---|---|
| model_name | String | Name of the model that generated the response |
| model_version | String | Version of the model |
| text_output | String | Generated text response |
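On the client side, the three output fields can be read into a small typed record. The sketch below is illustrative; `GenerateResponse` and `parse_generate_response` are hypothetical names, and the strict check for all three fields is an assumption based on the table above.

```python
from typing import NamedTuple


class GenerateResponse(NamedTuple):
    model_name: str
    model_version: str
    text_output: str


def parse_generate_response(body):
    """Turn a decoded response dictionary into a GenerateResponse (hypothetical helper)."""
    missing = [field for field in GenerateResponse._fields if field not in body]
    if missing:
        raise KeyError(f"response is missing fields: {missing}")
    return GenerateResponse(*(body[field] for field in GenerateResponse._fields))


parsed = parse_generate_response(
    {"model_name": "ensemble", "model_version": "1", "text_output": "un, deux, trois"}
)
print(parsed.text_output)
```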
Usage Examples
Single-response generation with curl
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "How do I count to nine in French?",
    "parameters": {
      "max_tokens": 256,
      "bad_words": [""],
      "stop_words": [""]
    }
  }'
Response:
{
  "model_name": "ensemble",
  "model_version": "1",
  "text_output": "To count to nine in French, you say: un, deux, trois, quatre, cinq, six, sept, huit, neuf."
}
Streaming generation with curl
curl -X POST localhost:8000/v2/models/ensemble/generate_stream \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Explain quantum computing in simple terms.",
    "parameters": {
      "max_tokens": 512,
      "stream": true
    }
  }'
Streaming response (SSE format):
data: {"model_name":"ensemble","model_version":"1","text_output":"Quantum"}
data: {"model_name":"ensemble","model_version":"1","text_output":" computing"}
data: {"model_name":"ensemble","model_version":"1","text_output":" is"}
...
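A Python client can consume this stream by reading the response line by line and decoding each `data:` line. The sketch below assumes every event carries a single `data:` line holding one JSON object, as in the sample above. `iter_sse_json` and `stream_generate` are hypothetical helper names, and the use of the third-party `requests` package for the HTTP call is an assumption.

```python
import json


def iter_sse_json(lines):
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        yield json.loads(line[len("data:"):].strip())


def stream_generate(url, payload, timeout=60):
    """Assumed usage with `requests`: print text chunks as they arrive."""
    import requests  # third-party; imported lazily so the parser stays stdlib-only

    with requests.post(url, json=payload, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        for event in iter_sse_json(resp.iter_lines()):
            print(event["text_output"], end="", flush=True)
```

Calling `stream_generate("http://localhost:8000/v2/models/ensemble/generate_stream", {"text_input": "Explain quantum computing in simple terms."})` against a running server would print the chunks in arrival order. The parser alone can be exercised on any iterable of strings or bytes.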
Python client example
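import requests

# Assumes a Triton server on localhost:8000 serving a model named "ensemble".
url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "What is the capital of France?",
    "parameters": {
        "max_tokens": 128,
        "temperature": 0.7,
        "top_p": 0.9,
    },
}

# Set a timeout so a stalled server does not block the client indefinitely.
response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()  # surface HTTP errors before reading the body
result = response.json()
print(result["text_output"])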
Key Request Parameters
| Parameter | Type | Description | Example |
|---|---|---|---|
| text_input | String | Input prompt (required) | "How do I count to nine in French?" |
| max_tokens | Integer | Max output tokens | 256 |
| temperature | Float | Sampling temperature | 0.7 |
| top_k | Integer | Top-k sampling | 50 |
| top_p | Float | Nucleus sampling | 0.9 |
| beam_width | Integer | Beam search width | 1 |
| bad_words | List | Excluded words | [""] |
| stop_words | List | Stop trigger words | [""] |
Related Pages
- Principle:Triton_inference_server_Server_Generate_API
- Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script — Server must be running
- Implementation:Triton_inference_server_Server_GenAI_Perf — Benchmarking tool that uses this endpoint
- Environment:Triton_inference_server_Server_TRT_LLM_Deployment