Principle:Triton inference server Server Generate API

Metadata

Field	Value
Type	Principle
Principle_type	Source Code Doc
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	src/http_server.cc:L3297-3461, docs/protocol/extension_generate.md:L29-194
Domains	NLP, HTTP_API, LLM_Deployment
Knowledge_Sources	Triton Server\|https://github.com/triton-inference-server/server, source::Doc\|Generate Extension\|https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html
implemented_by	Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint
2026-02-13 17:00 GMT

Overview

A simplified text-in/text-out HTTP API for LLM inference that abstracts away tensor-level details.

Description

The Generate API provides a higher-level interface than the standard KServe v2 infer endpoint: it accepts plain-text prompts and returns generated text. It exposes two endpoints under /v2/models/{model}: /generate returns a single response synchronously, and /generate_stream streams responses as Server-Sent Events (SSE). LLM applications can therefore use it without manipulating tensors.

Key characteristics of the Generate API:

Text-level abstraction — Clients send plain text prompts and receive plain text responses, without needing to understand tensor shapes, data types, or tokenization
Parameter passthrough — Generation parameters (max_tokens, temperature, top_k, top_p, beam_width) are passed as JSON fields and forwarded to the model backend
Streaming support — The /generate_stream endpoint uses Server-Sent Events (SSE) to deliver tokens as they are generated, enabling real-time streaming UIs
Backward compatible — Internally converts to the standard KServe v2 inference request format, so it works with any model that accepts text input tensors
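
As an illustration of the characteristics above, the following sketch builds a generate request with only the Python standard library. The host and port (localhost:8000), the model name "ensemble", and the helper name build_generate_request are assumptions made for this example. Which generation parameters a model accepts depends on that model's configuration.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt, stream=False, **params):
    """Build (but do not send) an HTTP POST for the generate endpoint.

    Hypothetical helper: extra keyword arguments are placed next to
    text_input as top-level JSON fields, as described in the text.
    """
    endpoint = "generate_stream" if stream else "generate"
    url = f"{base_url}/v2/models/{model}/{endpoint}"
    body = {"text_input": prompt, **params}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed server address and model name; adjust for a real deployment.
req = build_generate_request(
    "http://localhost:8000",
    "ensemble",
    "What is machine learning?",
    max_tokens=20,
    temperature=0.7,
)

# Sending requires a running server, so it is left commented out here:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["text_output"])
```

Building the request separately from sending it keeps the payload easy to inspect and log before it reaches the server.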

The Generate API is an extension to the standard Triton protocol, meaning it is not part of the KServe v2 specification but is implemented as an additional endpoint alongside the standard /v2/models/{model}/infer endpoint.

Usage

The Generate API is available on any running Triton server that has HTTP endpoints enabled. It suits LLM applications well because it removes tensor serialization and deserialization from the client side.

Workflow context:

Depends on: Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch (server must be running)
Used by: Client applications, benchmarking tools

Theoretical Basis

Abstraction layer:

text prompt → tensor conversion (internally) → model inference → tensor to text conversion

The Generate API implements a facade pattern over the standard inference pipeline:

Request conversion — The server's ConvertGenerateRequest function transforms the JSON text input into KServe v2 tensor format, mapping text_input to the model's expected input tensor
Inference execution — The standard TRITONSERVER_ServerInferAsync path handles scheduling, batching, and execution
Response conversion — Output tensors are converted back to plain text and wrapped in the JSON response format
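
The three steps above can be sketched conceptually in Python. This is not the server's C++ code: the function names, the input_specs mapping, the fixed shape [1], and the rule that unrecognized keys become request parameters are all assumptions made to show the shape of the conversion.

```python
def convert_generate_request(body, input_specs):
    """Map a generate-style JSON body onto a KServe v2 infer request.

    input_specs is a hypothetical {tensor_name: datatype} mapping that
    stands in for the model's declared inputs.
    """
    inputs = []
    parameters = {}
    for key, value in body.items():
        if key in input_specs:
            # Known input: wrap the scalar in a one-element tensor.
            inputs.append({
                "name": key,
                "shape": [1],  # simplification for this sketch
                "datatype": input_specs[key],
                "data": [value],
            })
        else:
            # Assumed rule: anything else travels as a request parameter.
            parameters[key] = value
    request = {"inputs": inputs}
    if parameters:
        request["parameters"] = parameters
    return request


def convert_generate_response(infer_response):
    """Flatten KServe v2 output tensors back into plain JSON fields."""
    result = {"model_name": infer_response["model_name"]}
    for tensor in infer_response["outputs"]:
        data = tensor["data"]
        # Single-element tensors are unwrapped to a bare value.
        result[tensor["name"]] = data[0] if len(data) == 1 else data
    return result


specs = {"text_input": "BYTES", "max_tokens": "INT32"}
infer_request = convert_generate_request(
    {"text_input": "Hello", "max_tokens": 8}, specs
)
```

The middle step, inference execution, is deliberately absent: in this design the facade changes only the request and response shapes, and the existing inference path does the rest.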

SSE streaming delivers output token by token, which lowers perceived latency:

The client receives partial responses as each token (or small batch of tokens) is generated
This creates the perception of immediate response, even when full generation takes seconds
The SSE protocol uses lines prefixed with data:, and each chunk carries a partial text_output
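
A minimal sketch of client-side SSE handling follows. It assumes each data: line carries a JSON object with a text_output field and that successive chunks are increments that can be concatenated. The sample events and the function name iter_sse_text are made up for the example, not captured from a server.

```python
import json


def iter_sse_text(lines):
    """Yield the text_output of each SSE 'data:' line.

    lines may be any iterable of str or bytes, for example an HTTP
    response body read line by line.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        if "text_output" in event:
            yield event["text_output"]


# Illustrative events only.
sample_stream = [
    b'data: {"model_name": "ensemble", "text_output": "Machine"}\n',
    b"\n",
    b'data: {"model_name": "ensemble", "text_output": " learning"}\n',
    b"\n",
]
streamed_text = "".join(iter_sse_text(sample_stream))
```

Because the parser is a generator, a UI can render each chunk as soon as it arrives instead of waiting for the full response.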

Comparison with standard infer endpoint:

Feature	/v2/models/{model}/infer	/v2/models/{model}/generate
Input format	Tensor JSON (shape, datatype, data)	Plain text JSON
Output format	Tensor JSON	Plain text JSON
Streaming	None over HTTP (gRPC streaming is used instead)	SSE via /generate_stream
Client complexity	High (tensor handling)	Low (text only)

Related Pages

Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint
Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch — Server must be running
Principle:Triton_inference_server_Server_LLM_Benchmarking — Benchmarking uses this API
