

Implementation:Triton inference server Server HTTP Generate Endpoint

From Leeroopedia

Metadata

Field Value
Type Implementation
Workflow LLM_Deployment_With_TRT_LLM
Repo Triton_inference_server_Server
Source src/http_server.cc:L3297-3461, docs/protocol/extension_generate.md:L29-194
Domains NLP, HTTP_API, LLM_Deployment
Knowledge_Sources Triton Server (https://github.com/triton-inference-server/server); Generate Extension doc (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html)
implements Principle:Triton_inference_server_Server_Generate_API
2026-02-13 17:00 GMT

Overview

Concrete HTTP handler for text generation requests in Triton Inference Server. This page covers the server-side handler code, the endpoint URLs, the request and response JSON formats, and the internal request conversion logic.

Description

The Generate endpoint is implemented in src/http_server.cc as the HTTPAPIServer::HandleGenerate method. It provides two HTTP endpoints:

  • POST /v2/models/{model_name}/generate — Synchronous single-response generation
  • POST /v2/models/{model_name}/generate_stream — Streaming SSE (Server-Sent Events) generation

The handler performs request conversion via ConvertGenerateRequest (L3507-3559), which transforms the text-based JSON request into the internal KServe v2 tensor format, then calls TRITONSERVER_ServerInferAsync for execution.
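To make the conversion step more concrete, the following is a minimal Python sketch of the general idea: the top-level fields and the nested parameters object are flattened into a list of named tensors. It is not a port of ConvertGenerateRequest. The function name to_kserve_v2_inputs, the [1, N] shapes, and the datatype guessed from the Python value are all assumptions made for illustration; the real handler works against the model's configuration in C++.

```python
def to_kserve_v2_inputs(generate_body: dict) -> dict:
    """Illustrative only: flatten a generate-style JSON body into a
    KServe v2-style request with one named tensor per field."""

    def as_tensor(name, value):
        values = value if isinstance(value, list) else [value]
        sample = values[0] if values else ""
        # Assumption: guess the datatype from the Python value.
        # bool must be checked before int because bool subclasses int.
        if isinstance(sample, bool):
            datatype = "BOOL"
        elif isinstance(sample, int):
            datatype = "INT32"
        elif isinstance(sample, float):
            datatype = "FP32"
        else:
            datatype = "BYTES"
        return {
            "name": name,
            "shape": [1, len(values)],  # hypothetical batch-of-one layout
            "datatype": datatype,
            "data": values,
        }

    # Top-level fields first, then everything nested under "parameters".
    flat = {k: v for k, v in generate_body.items() if k != "parameters"}
    flat.update(generate_body.get("parameters", {}))
    return {"inputs": [as_tensor(name, value) for name, value in flat.items()]}


if __name__ == "__main__":
    body = {"text_input": "Hello", "parameters": {"max_tokens": 16, "temperature": 0.7}}
    for tensor in to_kserve_v2_inputs(body)["inputs"]:
        print(tensor)
```

The sketch only shows why a plain JSON prompt can be served by a tensor-based backend; field-to-tensor matching, datatypes, and shapes in the server follow the deployed model rather than the guesses used here.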

Usage

Send HTTP POST requests with a JSON body to a running Triton server; the examples on this page use localhost:8000. For TRT-LLM deployments, the model name in the URL should be the ensemble model name (e.g., ensemble).
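As a small convenience, the two endpoint URLs can be assembled from a base address and a model name. The helper below is a sketch; the name generate_url and its stream flag are hypothetical and not part of any Triton client library.

```python
from urllib.parse import quote


def generate_url(base_url: str, model_name: str, stream: bool = False) -> str:
    """Build the /generate or /generate_stream URL for a model."""
    suffix = "generate_stream" if stream else "generate"
    # Percent-encode the model name so unusual characters cannot break the path.
    return f"{base_url.rstrip('/')}/v2/models/{quote(model_name, safe='')}/{suffix}"


if __name__ == "__main__":
    print(generate_url("http://localhost:8000", "ensemble"))
    print(generate_url("http://localhost:8000", "ensemble", stream=True))
```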

Code Reference

Source Location

Item Value
File src/http_server.cc
Lines L3297-3461 (HandleGenerate), L3507-3559 (ConvertGenerateRequest)
Repo https://github.com/triton-inference-server/server
Protocol doc docs/protocol/extension_generate.md:L29-194

Signature

POST /v2/models/<model_name>/generate
POST /v2/models/<model_name>/generate_stream

Server-side handler:

// src/http_server.cc
void HTTPAPIServer::HandleGenerate(evhtp_request_t* req);  // L3297-3461

// Internal conversion: text JSON → KServe v2 tensor format
TRITONSERVER_Error* ConvertGenerateRequest(
    const std::string& model_name,
    evhtp_request_t* req,
    ...);  // L3507-3559

Import

No client-side import required. The endpoint is accessible via standard HTTP clients (curl, Python requests, etc.).

I/O Contract

Inputs

Name                      Type             Description
text_input                String           Required. The prompt text for generation
parameters.max_tokens     Integer          Maximum number of tokens to generate
parameters.temperature    Float            Sampling temperature (higher = more random)
parameters.top_k          Integer          Top-k sampling parameter
parameters.top_p          Float            Top-p (nucleus) sampling parameter
parameters.beam_width     Integer          Beam search width (1 = greedy)
parameters.bad_words      List of strings  Words to exclude from generation
parameters.stop_words     List of strings  Words that trigger generation stop
parameters.stream         Boolean          Asks the model to stream partial outputs; use it with /generate_stream, since /generate returns a single response
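The request body implied by the table above can be assembled with a short helper. This is a sketch under stated assumptions: build_generate_payload is a hypothetical name, and dropping parameters whose value is None is a client-side convenience, not server behavior.

```python
def build_generate_payload(text_input: str, **parameters) -> dict:
    """Assemble a generate request body from a prompt and optional parameters."""
    if not isinstance(text_input, str) or not text_input:
        raise ValueError("text_input must be a non-empty string")
    # Leave out unset parameters so the model's own defaults apply.
    params = {name: value for name, value in parameters.items() if value is not None}
    payload = {"text_input": text_input}
    if params:
        payload["parameters"] = params
    return payload


if __name__ == "__main__":
    print(build_generate_payload("How do I count to nine in French?",
                                 max_tokens=256, temperature=None))
```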

Outputs

Name            Type     Description
model_name      String   Name of the model that generated the response
model_version   String   Version of the model
text_output     String   Generated text response
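On the client side, the three output fields can be checked and carried around as a small typed record. The GenerateResponse class below is a hypothetical helper for illustration; it ignores any extra fields that a particular model might return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerateResponse:
    model_name: str
    model_version: str
    text_output: str

    @classmethod
    def from_json(cls, body: dict) -> "GenerateResponse":
        """Validate that the documented fields are present, then wrap them."""
        required = ("model_name", "model_version", "text_output")
        missing = [key for key in required if key not in body]
        if missing:
            raise ValueError(f"response is missing fields: {missing}")
        return cls(body["model_name"], body["model_version"], body["text_output"])


if __name__ == "__main__":
    sample = {"model_name": "ensemble", "model_version": "1", "text_output": "un, deux, trois"}
    print(GenerateResponse.from_json(sample))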

Usage Examples

Single-response generation with curl

curl -X POST localhost:8000/v2/models/ensemble/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text_input": "How do I count to nine in French?",
        "parameters": {
            "max_tokens": 256,
            "bad_words": [""],
            "stop_words": [""]
        }
    }'

Response:

{
    "model_name": "ensemble",
    "model_version": "1",
    "text_output": "To count to nine in French, you say: un, deux, trois, quatre, cinq, six, sept, huit, neuf."
}

Streaming generation with curl

curl -X POST localhost:8000/v2/models/ensemble/generate_stream \
    -H "Content-Type: application/json" \
    -d '{
        "text_input": "Explain quantum computing in simple terms.",
        "parameters": {
            "max_tokens": 512,
            "stream": true
        }
    }'

Streaming response (SSE format):

data: {"model_name":"ensemble","model_version":"1","text_output":"Quantum"}

data: {"model_name":"ensemble","model_version":"1","text_output":" computing"}

data: {"model_name":"ensemble","model_version":"1","text_output":" is"}

...
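A client can consume this stream by reading it line by line and decoding each data: line as JSON. The parser below is a minimal sketch: iter_sse_json is a hypothetical helper name, and it assumes every event is a single data: line holding one JSON object, as in the sample above.

```python
import json
from typing import Iterable, Iterator, Union


def iter_sse_json(lines: Iterable[Union[bytes, str]]) -> Iterator[dict]:
    """Yield the JSON object carried by each 'data:' line of an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        yield json.loads(line[len("data:"):].strip())


if __name__ == "__main__":
    sample = [
        b'data: {"model_name":"ensemble","model_version":"1","text_output":"Quantum"}',
        b"",
        b'data: {"model_name":"ensemble","model_version":"1","text_output":" computing"}',
    ]
    print("".join(event["text_output"] for event in iter_sse_json(sample)))
```

With the requests library, the same function can be fed from a live connection by posting with stream=True and passing response.iter_lines() as the lines argument.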

Python client example

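import requests

url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "What is the capital of France?",
    "parameters": {
        "max_tokens": 128,
        "temperature": 0.7,
        "top_p": 0.9
    }
}

# A timeout keeps the client from hanging if the server is unreachable.
response = requests.post(url, json=payload, timeout=60)
# Raise on HTTP errors (e.g., unknown model) instead of parsing an error body.
response.raise_for_status()
result = response.json()
print(result["text_output"])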

Key Request Parameters

Parameter     Type      Description               Example
text_input    String    Input prompt (required)   "How do I count to nine in French?"
max_tokens    Integer   Max output tokens         256
temperature   Float     Sampling temperature      0.7
top_k         Integer   Top-k sampling            50
top_p         Float     Nucleus sampling          0.9
beam_width    Integer   Beam search width         1
bad_words     List      Excluded words            [""]
stop_words    List      Stop trigger words        [""]
