Implementation:Triton inference server Server Tritonclient Infer

From Leeroopedia
Knowledge Sources
Domains MLOps, Inference, HTTP_API
Last Updated 2026-02-13 17:00 GMT

Overview

Concrete Python client library and HTTP/gRPC endpoint for sending inference requests to Triton Inference Server.

Description

The tritonclient Python package provides InferenceServerClient classes for both HTTP and gRPC transports. The .infer() method sends input tensors to a specified model and returns output tensors. On the server side, HandleInfer in HTTPAPIServer processes REST inference requests by creating a TRITONSERVER_InferenceRequest, parsing input tensors, and dispatching via TRITONSERVER_ServerInferAsync.
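
To make the server-side hand-off concrete, the sketch below shows how a KServe v2 inference path could be split into the model_name and model_version_str arguments that HandleInfer receives. The regular expression and the parse_infer_path helper are illustrative assumptions, not the routing code in src/http_server.cc.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for illustration only; the real server has its own routing.
_INFER_PATH = re.compile(
    r"^/v2/models/(?P<model>[^/]+)(?:/versions/(?P<version>[^/]+))?/infer$"
)


def parse_infer_path(path: str) -> Optional[Tuple[str, str]]:
    """Return (model_name, model_version_str), or None if the path is not an infer path.

    An absent version is returned as an empty string, mirroring the client's
    model_version default of "".
    """
    match = _INFER_PATH.match(path)
    if match is None:
        return None
    return match.group("model"), match.group("version") or ""


print(parse_infer_path("/v2/models/simple/versions/1/infer"))
print(parse_infer_path("/v2/models/densenet_onnx/infer"))
```

The two values extracted here correspond to the model name and version that the server passes on to TRITONSERVER_InferenceRequestNew.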

Usage

Use tritonclient whenever sending inference requests from Python code. Choose tritonclient.http for REST-based inference or tritonclient.grpc for gRPC-based inference. For simple testing, use curl with the KServe v2 JSON format directly.
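
The transport choice can be sketched as a small factory. The make_client and default_url helpers are hypothetical names introduced here; the sketch assumes Triton's default ports (8000 for HTTP, 8001 for gRPC) and imports tritonclient lazily, so it only needs the package when a client is actually created.

```python
import importlib

# Transport name -> (client module, default Triton port).
_TRANSPORTS = {
    "http": ("tritonclient.http", 8000),
    "grpc": ("tritonclient.grpc", 8001),
}


def default_url(protocol: str, host: str = "localhost") -> str:
    """Build a host:port URL for the given transport using the default port."""
    if protocol not in _TRANSPORTS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    _, port = _TRANSPORTS[protocol]
    return f"{host}:{port}"


def make_client(protocol: str, host: str = "localhost"):
    """Create an InferenceServerClient for the transport (requires tritonclient)."""
    url = default_url(protocol, host)
    module_name, _ = _TRANSPORTS[protocol]
    module = importlib.import_module(module_name)
    return module.InferenceServerClient(url=url)


print(default_url("http"))
print(default_url("grpc"))
```

Calling make_client("grpc") would return a gRPC client pointed at localhost:8001, provided tritonclient is installed and a server is listening there.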

Code Reference

Source Location

  • Repository: triton-inference-server/server
  • File: src/http_server.cc
  • Lines: L3667-3795 (HandleInfer), L3709-3713 (TRITONSERVER_InferenceRequestNew), L3782-3783 (TRITONSERVER_ServerInferAsync)
  • File: docs/getting_started/quickstart.md
  • Lines: L85-122 (client usage examples)

Signature

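# tritonclient.http (abridged; the real signatures accept additional optional parameters)
from typing import Dict, List, Optional

import numpy as np
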
class InferenceServerClient:
    def __init__(self, url: str, verbose: bool = False):
        """
        Args:
            url: Server URL (host:port), e.g., "localhost:8000"
            verbose: Enable verbose logging
        """

    def infer(
        self,
        model_name: str,
        inputs: List[InferInput],
        model_version: str = "",
        outputs: Optional[List[InferRequestedOutput]] = None,
        request_id: str = "",
        headers: Optional[Dict] = None,
    ) -> InferResult:
        """Send inference request."""

class InferInput:
    def __init__(self, name: str, shape: List[int], datatype: str):
        """
        Args:
            name: Input tensor name
            shape: Tensor shape
            datatype: Data type string (FP32, INT32, BYTES, etc.)
        """
    def set_data_from_numpy(self, input_tensor: np.ndarray) -> None: ...

class InferRequestedOutput:
    def __init__(self, name: str, class_count: int = 0):
        """
        Args:
            name: Output tensor name
            class_count: Number of classes for classification (0 = raw output)
        """

// Server-side (C++): src/http_server.cc:L3667
void HTTPAPIServer::HandleInfer(
    evhtp_request_t* req,
    const std::string& model_name,
    const std::string& model_version_str);
// Internally calls:
//   TRITONSERVER_InferenceRequestNew(&irequest, server_.get(), model_name, version)
//   TRITONSERVER_ServerInferAsync(server_.get(), irequest, triton_trace)

Import

# Python client
import tritonclient.http as httpclient
# or
import tritonclient.grpc as grpcclient

# Install (quotes keep shells such as zsh from expanding the brackets):
# pip install "tritonclient[all]"

I/O Contract

Inputs

Name | Type | Required | Description
model_name | string | Yes | Target model name deployed on Triton
inputs | List[InferInput] | Yes | Input tensors with name, shape, datatype, and data
model_version | string | No | Model version (default: empty string, which lets the server choose a version according to the model's version policy)
outputs | List[InferRequestedOutput] | No | Requested output tensors (default: all)
request_id | string | No | Client-assigned request identifier
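
As a rough illustration of the input contract, the sketch below checks a KServe v2 style input dictionary before it is sent: all four fields must be present and the number of data elements must match the product of the shape. The check_input and flatten helpers are hypothetical and not part of tritonclient.

```python
from math import prod
from typing import Any, Dict, List

REQUIRED_KEYS = ("name", "shape", "datatype", "data")


def flatten(values: Any) -> List[Any]:
    """Flatten arbitrarily nested lists into a single flat list."""
    if not isinstance(values, list):
        return [values]
    flat: List[Any] = []
    for item in values:
        flat.extend(flatten(item))
    return flat


def check_input(tensor: Dict[str, Any]) -> None:
    """Raise ValueError if a field is missing or the element count disagrees with shape."""
    missing = [key for key in REQUIRED_KEYS if key not in tensor]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    expected = prod(tensor["shape"])
    actual = len(flatten(tensor["data"]))
    if expected != actual:
        raise ValueError(
            f"{tensor['name']}: shape {tensor['shape']} needs {expected} elements, got {actual}"
        )


check_input(
    {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32", "data": list(range(1, 17))}
)
print("INPUT0 is consistent")
```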

Outputs

Name | Type | Description
InferResult | object | Contains output tensors accessible via as_numpy()
model_name | string | Model that served the request
model_version | string | Model version used
request_id | string | Echo of client request ID
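
For callers that talk to the REST endpoint directly rather than through InferResult, the sketch below shows one way to read a KServe v2 JSON response. The sample body is made up for illustration (it is not captured server output), the outputs_by_name and reshape helpers are hypothetical, and the sketch assumes the response carries tensor values in a JSON "data" field rather than as binary data.

```python
import json
from typing import Any, Dict, List


def reshape(flat: List[Any], shape: List[int]) -> Any:
    """Rebuild nested lists from row-major flat data and a shape."""
    if not shape:
        return flat[0]
    if len(shape) == 1:
        return list(flat)
    if shape[0] == 0:
        return []
    step = len(flat) // shape[0]
    return [reshape(flat[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]


def outputs_by_name(response_body: str) -> Dict[str, Any]:
    """Map each output tensor name to its data, nested according to its shape."""
    response = json.loads(response_body)
    return {
        out["name"]: reshape(out["data"], out["shape"])
        for out in response["outputs"]
    }


# Hypothetical response body, written by hand for this example.
sample = json.dumps({
    "model_name": "simple",
    "model_version": "1",
    "id": "req-1",
    "outputs": [
        {"name": "OUTPUT0", "datatype": "INT32", "shape": [1, 4], "data": [2, 4, 6, 8]},
    ],
})
print(outputs_by_name(sample))
```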

Usage Examples

Image Classification with HTTP Client

import numpy as np
import tritonclient.http as httpclient
from PIL import Image

# 1. Connect to Triton
client = httpclient.InferenceServerClient(url="localhost:8000")

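# 2. Prepare input (convert to RGB so grayscale or RGBA files still yield 3 channels)
image = np.array(Image.open("image.jpg").convert("RGB").resize((224, 224))).astype(np.float32)
image = np.transpose(image, (2, 0, 1))  # HWC -> CHW
# Add a batch dimension. Keep this line only if the model configuration expects
# a leading batch dimension; drop it for models configured with max_batch_size 0.
image = np.expand_dims(image, axis=0)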

input_tensor = httpclient.InferInput("data_0", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

# 3. Request top-5 classification
output = httpclient.InferRequestedOutput("fc6_1", class_count=5)

# 4. Run inference
result = client.infer(
    model_name="densenet_onnx",
    inputs=[input_tensor],
    outputs=[output]
)

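# 5. Get results. Because class_count is set, each element is a byte string of
# the form "<score>:<index>:<label>" rather than a raw float.
output_data = result.as_numpy("fc6_1")
print(output_data)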
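
When class_count is set, the classification entries can be split into their parts. The following is a minimal sketch, assuming each entry has the form "<score>:<index>" with an optional ":<label>" suffix; parse_classification is a hypothetical helper and the sample entries are invented values, not real model output.

```python
from typing import NamedTuple, Optional, Union


class Classification(NamedTuple):
    score: float
    index: int
    label: Optional[str]


def parse_classification(entry: Union[bytes, str]) -> Classification:
    """Split one classification entry into score, class index, and optional label."""
    text = entry.decode("utf-8") if isinstance(entry, bytes) else entry
    # Split at most twice so a label that itself contains ':' stays intact.
    parts = text.split(":", 2)
    label = parts[2] if len(parts) > 2 else None
    return Classification(float(parts[0]), int(parts[1]), label)


# Invented sample entries for illustration.
samples = [b"0.75:504:COFFEE MUG", b"0.25:968"]
for entry in samples:
    print(parse_classification(entry))
```

With a real result, the same helper could be applied to each element of the flattened array returned by as_numpy("fc6_1").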

Using curl

curl -X POST localhost:8000/v2/models/simple/versions/1/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32", "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},
      {"name": "INPUT1", "shape": [1, 16], "datatype": "INT32", "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}
    ],
    "outputs": [
      {"name": "OUTPUT0"},
      {"name": "OUTPUT1"}
    ]
  }'
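
The same request can be assembled from Python with only the standard library. This is a sketch under the assumption that the server listens on http://localhost:8000; build_infer_request is a hypothetical helper, and the block only constructs the request without sending it.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_infer_request(
    base_url: str,
    model_name: str,
    inputs: List[Dict[str, Any]],
    outputs: Optional[List[Dict[str, Any]]] = None,
    model_version: str = "",
) -> urllib.request.Request:
    """Build a POST request for the KServe v2 infer endpoint."""
    path = f"/v2/models/{model_name}"
    if model_version:
        path += f"/versions/{model_version}"
    path += "/infer"
    body: Dict[str, Any] = {"inputs": inputs}
    if outputs:
        body["outputs"] = outputs
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


values = list(range(1, 17))
request = build_infer_request(
    "http://localhost:8000",
    "simple",
    inputs=[
        {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32", "data": values},
        {"name": "INPUT1", "shape": [1, 16], "datatype": "INT32", "data": values},
    ],
    outputs=[{"name": "OUTPUT0"}, {"name": "OUTPUT1"}],
    model_version="1",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it, which requires a running Triton server with the simple model loaded.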

Related Pages

Implements Principle
