Principle:Triton inference server Server Generate API
Metadata
| Field | Value |
|---|---|
| Type | Principle |
| Principle_type | Source Code Doc |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | src/http_server.cc:L3297-3461, docs/protocol/extension_generate.md:L29-194 |
| Domains | NLP, HTTP_API, LLM_Deployment |
| Knowledge_Sources | Triton Server|https://github.com/triton-inference-server/server, source::Doc|Generate Extension|https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html |
| implemented_by | Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint |
| 2026-02-13 17:00 GMT |
Overview
A simplified text-in/text-out HTTP API for LLM inference that abstracts away tensor-level details.
Description
The Generate API provides a higher-level interface than the standard KServe v2 infer endpoint: it accepts plain text prompts and returns generated text. It offers a synchronous single-response endpoint (/v2/models/{model}/generate) and a streaming endpoint (/v2/models/{model}/generate_stream) that returns Server-Sent Events (SSE), so LLM applications can use it without manipulating tensors.
Key characteristics of the Generate API:
- Text-level abstraction: Clients send plain text prompts and receive plain text responses, without needing to understand tensor shapes, data types, or tokenization
- Parameter passthrough: Generation parameters (max_tokens, temperature, top_k, top_p, beam_width) are passed as JSON fields and forwarded to the model backend
- Streaming support: The /generate_stream endpoint uses Server-Sent Events (SSE) to deliver tokens as they are generated, enabling real-time streaming UIs
- Backward compatible: Internally converts to the standard KServe v2 inference request format, so it works with any model that accepts text input tensors
The Generate API is an extension to the standard Triton protocol, meaning it is not part of the KServe v2 specification but is implemented as an additional endpoint alongside the standard /v2/models/{model}/infer endpoint.
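As a minimal sketch of what a client call could look like, the snippet below builds (but does not send) a POST request for either endpoint using only the Python standard library. The helper name build_generate_request, the base URL http://localhost:8000, and the model name ensemble are illustrative assumptions, not values taken from this page. Which generation fields a server accepts depends on the deployed model.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt, stream=False, **params):
    """Build a POST request for the generate or generate_stream endpoint.

    Hypothetical helper: extra keyword arguments (for example max_tokens)
    are placed next to text_input as flat JSON fields.
    """
    endpoint = "generate_stream" if stream else "generate"
    url = f"{base_url.rstrip('/')}/v2/models/{model}/{endpoint}"
    body = {"text_input": prompt, **params}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed host, port, and model name; adjust to the actual deployment.
request = build_generate_request(
    "http://localhost:8000", "ensemble", "What is machine learning?", max_tokens=20
)
print(request.full_url)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["text_output"])
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.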
Usage
The Generate API is available on any running Triton server that has HTTP endpoints enabled. It is the recommended interface for LLM applications because it eliminates the complexity of tensor serialization/deserialization on the client side.
Workflow context:
- Depends on: Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch (server must be running)
- Used by: Client applications, benchmarking tools
Theoretical Basis
Abstraction layer:
text prompt → tensor conversion (internally) → model inference → tensor to text conversion
The Generate API implements a facade pattern over the standard inference pipeline:
- Request conversion: The server's ConvertGenerateRequest function transforms the JSON text input into KServe v2 tensor format, mapping text_input to the model's expected input tensor
- Inference execution: The standard TRITONSERVER_ServerInferAsync path handles scheduling, batching, and execution
- Response conversion: Output tensors are converted back to plain text and wrapped in the JSON response format
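The request and response conversions can be illustrated with a small Python analogue. This is only a sketch of the idea, not the server's C++ implementation: the MODEL_INPUTS table, the [1, 1] shape, and both function names are assumptions made for the example, and the real conversion may treat unknown fields differently.

```python
# Hypothetical model input metadata: tensor name -> KServe v2 datatype.
# A real server would take this from the model configuration.
MODEL_INPUTS = {
    "text_input": "BYTES",
    "max_tokens": "INT32",
    "temperature": "FP32",
}


def generate_to_infer(generate_request, model_inputs=MODEL_INPUTS):
    """Map a flat generate-style JSON object onto a KServe v2 infer body."""
    infer_request = {"inputs": []}
    for name, value in generate_request.items():
        if name == "id":
            infer_request["id"] = value  # request id is not a tensor
            continue
        if name not in model_inputs:
            raise KeyError(f"no model input named {name!r}")
        infer_request["inputs"].append({
            "name": name,
            "shape": [1, 1],  # assumption: one request, one element per input
            "datatype": model_inputs[name],
            "data": [value],
        })
    return infer_request


def infer_to_generate(infer_response):
    """Flatten single-element output tensors back into plain JSON fields."""
    result = {"model_name": infer_response["model_name"]}
    for output in infer_response["outputs"]:
        data = output["data"]
        result[output["name"]] = data[0] if len(data) == 1 else data
    return result


print(generate_to_infer({"text_input": "Hello", "max_tokens": 8}))
```

The client only ever sees the flat JSON on both sides; the tensor envelope (name, shape, datatype, data) exists only between the two conversion steps.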
SSE streaming delivers output token by token, which lowers perceived latency:
- The client receives partial responses as each token (or small batch of tokens) is generated
- This creates the perception of immediate response, even when full generation takes seconds
- The SSE protocol uses lines prefixed with data:, with each chunk containing a partial text_output
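A client-side reader for such a stream might look like the sketch below. It assumes one data: line per event and that each chunk carries only the newly generated text; whether chunks are incremental or cumulative depends on the model, so treat that as an assumption. The function name iter_sse_text and the sample events are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text_output of each SSE event found in an iterable of lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        yield event.get("text_output", "")


# Made-up sample of what a streamed response body could contain.
sample = [
    'data: {"model_name": "ensemble", "text_output": "Machine"}\n',
    "\n",
    'data: {"model_name": "ensemble", "text_output": " learning"}\n',
    "\n",
]
print("".join(iter_sse_text(sample)))
```

Because the function accepts any iterable of lines, it can be fed decoded lines from an HTTP response object as well as the in-memory sample used here.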
Comparison with standard infer endpoint:
| Feature | /v2/models/{model}/infer | /v2/models/{model}/generate |
|---|---|---|
| Input format | Tensor JSON (shape, datatype, data) | Plain text JSON |
| Output format | Tensor JSON | Plain text JSON |
| Streaming | Not available over HTTP (decoupled models stream over gRPC) | SSE via /v2/models/{model}/generate_stream |
| Client complexity | High (tensor handling) | Low (text only) |
Related Pages
- Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint
- Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch — Server must be running
- Principle:Triton_inference_server_Server_LLM_Benchmarking — Benchmarking uses this API