Principle:Triton inference server Server Generate API

From Leeroopedia

Metadata

Field Value
Type Principle
Principle_type Source Code Doc
Workflow LLM_Deployment_With_TRT_LLM
Repo Triton_inference_server_Server
Source src/http_server.cc:L3297-3461, docs/protocol/extension_generate.md:L29-194
Domains NLP, HTTP_API, LLM_Deployment
Knowledge_Sources Triton Server (https://github.com/triton-inference-server/server); Generate Extension documentation (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html)
implemented_by Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint
2026-02-13 17:00 GMT

Overview

A simplified text-in/text-out HTTP API for LLM inference that abstracts away tensor-level details.

Description

The Generate API provides a higher-level interface than the standard KServe v2 infer endpoint: it accepts plain text prompts and returns generated text. It offers a synchronous single-response endpoint (/v2/models/{model}/generate) and a streaming endpoint (/v2/models/{model}/generate_stream) that delivers Server-Sent Events (SSE), so LLM applications can use it without manipulating tensors.

Key characteristics of the Generate API:

  • Text-level abstraction: Clients send plain text prompts and receive plain text responses, without needing to understand tensor shapes, data types, or tokenization
  • Parameter passthrough: Generation parameters (for example max_tokens, temperature, top_k, top_p, beam_width) are passed as JSON fields alongside the prompt and forwarded to the model backend; which names are accepted depends on the deployed model
  • Streaming support: The /generate_stream endpoint uses Server-Sent Events (SSE) to deliver tokens as they are generated, which enables real-time streaming UIs
  • Backward compatible: The server internally converts each request to the standard KServe v2 inference request format, so the API works with any model that accepts text input tensors
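
The text-level abstraction and parameter passthrough described above can be sketched as a small request-body builder. This is a minimal illustration, not Triton client code: the helper name build_generate_body is hypothetical, and placing the generation parameters as top-level JSON fields next to text_input is an assumption that should be checked against the deployed model's inputs.

```python
import json


def build_generate_body(prompt, **generation_params):
    """Build the JSON body for a generate request (hypothetical helper).

    Assumption: generation parameters sit next to "text_input" as
    top-level fields; the accepted names depend on the deployed model.
    """
    if not isinstance(prompt, str):
        raise TypeError("prompt must be a string")
    body = {"text_input": prompt}
    body.update(generation_params)
    return body


# Example parameter names are taken from the list above.
body = build_generate_body("What is Triton?", max_tokens=32, temperature=0.7)
print(json.dumps(body, indent=2))
```

The client only deals with a string and a few scalar values; no shapes or datatypes appear anywhere in the payload.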

The Generate API is an extension to the standard Triton protocol, meaning it is not part of the KServe v2 specification but is implemented as an additional endpoint alongside the standard /v2/models/{model}/infer endpoint.

Usage

The Generate API is available on any running Triton server that has HTTP endpoints enabled. It is the recommended interface for LLM applications because clients do not have to serialize or deserialize tensors.
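
As a rough sketch of client-side usage, the following uses only the Python standard library to POST a prompt to the generate endpoint. The function names, the base URL http://localhost:8000, and the model name "ensemble" are illustrative assumptions; the text_input and text_output field names and the endpoint path come from this page.

```python
import json
import urllib.request


def build_generate_call(base_url, model, prompt, **params):
    """Prepare (but do not send) a POST to the generate endpoint."""
    url = f"{base_url.rstrip('/')}/v2/models/{model}/generate"
    payload = json.dumps({"text_input": prompt, **params}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model, prompt, timeout=30.0, **params):
    """Send the request and return the generated text.

    Assumes the JSON reply carries the result in "text_output".
    """
    request = build_generate_call(base_url, model, prompt, **params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["text_output"]
```

With a server running, a call might look like generate("http://localhost:8000", "ensemble", "Hello", max_tokens=16), where the URL and model name are placeholders for the actual deployment.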

Workflow context: LLM_Deployment_With_TRT_LLM (see Metadata).

Theoretical Basis

Abstraction layer:

text prompt → tensor conversion (internally) → model inference → tensor to text conversion

The Generate API implements a facade pattern over the standard inference pipeline:

  1. Request conversion — The server's ConvertGenerateRequest function transforms the JSON text input into KServe v2 tensor format, mapping text_input to the model's expected input tensor
  2. Inference execution — The standard TRITONSERVER_ServerInferAsync path handles scheduling, batching, and execution
  3. Response conversion — Output tensors are converted back to plain text and wrapped in the JSON response format
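
The three steps above can be illustrated with a simplified Python model of the facade. This is not the server's ConvertGenerateRequest logic, which is C++ and operates on Triton's internal request objects. The function names, the [1] tensor shape, and the rule that fields not matching a model input become request parameters are all assumptions made for illustration.

```python
def to_infer_request(generate_body, input_dtypes):
    """Map a generate-style JSON body to a KServe v2 style infer body.

    input_dtypes: model input name -> datatype, e.g. {"text_input": "BYTES"}.
    Assumption: fields that do not name a model input are passed along
    as request parameters rather than tensors.
    """
    inputs, parameters = [], {}
    for name, value in generate_body.items():
        if name in input_dtypes:
            inputs.append({
                "name": name,
                "shape": [1],  # assumed: one element per request
                "datatype": input_dtypes[name],
                "data": [value],
            })
        else:
            parameters[name] = value
    request = {"inputs": inputs}
    if parameters:
        request["parameters"] = parameters
    return request


def to_generate_response(infer_response):
    """Flatten output tensors back into plain JSON fields."""
    reply = {}
    if "model_name" in infer_response:
        reply["model_name"] = infer_response["model_name"]
    for output in infer_response["outputs"]:
        data = output["data"]
        # Single-element tensors become scalars; others stay lists.
        reply[output["name"]] = data[0] if len(data) == 1 else data
    return reply


infer_body = to_infer_request(
    {"text_input": "hello", "max_tokens": 8},
    {"text_input": "BYTES", "max_tokens": "INT32"},
)
print(infer_body)
```

The inference step itself is omitted here: in the server it runs through the standard scheduling and execution path, so the facade only has to translate at the edges.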

SSE streaming delivers output token by token, which lowers perceived latency:

  • The client receives partial responses as each token (or small batch of tokens) is generated
  • This creates the perception of immediate response, even when full generation takes seconds
  • The SSE protocol uses data: prefixed lines, with each chunk containing a partial text_output
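
A client can consume such a stream by collecting data: lines and decoding each event when a blank line ends it. The parser below is a minimal sketch: the function name iter_sse_text is hypothetical, only the data: field is handled, and it assumes each event's payload is a JSON object with a partial text_output, as described above.

```python
import json


def iter_sse_text(lines):
    """Yield the text_output of each SSE event from an iterable of lines.

    Accepts str or bytes lines. Fields other than "data:" (comments,
    event names, ids) are ignored in this sketch.
    """
    buffer = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE drops a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line terminates the event.
            yield json.loads("\n".join(buffer)).get("text_output", "")
            buffer = []
    if buffer:
        # Flush a final event that was not followed by a blank line.
        yield json.loads("\n".join(buffer)).get("text_output", "")


stream = [
    b'data: {"text_output": "Hel"}\n',
    b"\n",
    b'data: {"text_output": "lo"}\n',
    b"\n",
]
print("".join(iter_sse_text(stream)))
```

The lines argument can be any iterable of lines, for example an open HTTP response object from urllib.request.urlopen on the /generate_stream endpoint. Joining the chunks assumes each one carries only the new text, not the cumulative output.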

Comparison with standard infer endpoint:

Feature | /v2/models/{model}/infer | /v2/models/{model}/generate
Input format | Tensor JSON (shape, datatype, data) | Plain text JSON
Output format | Tensor JSON | Plain text JSON
Streaming | Custom protocol | Standard SSE
Client complexity | High (tensor handling) | Low (text only)
