
Workflow:KServe InferenceGraph Pipeline

From Leeroopedia
Knowledge Sources
Domains ML_Serving, Kubernetes, Inference_Pipelines, Model_Orchestration
Last Updated 2026-02-13 14:00 GMT

Overview

End-to-end process for building multi-model inference pipelines using KServe InferenceGraph with sequence, ensemble, splitter, and switch routing patterns.

Description

This workflow covers the creation of complex inference pipelines that chain, ensemble, split, or conditionally route requests across multiple InferenceServices. KServe InferenceGraph is a custom resource that defines a directed acyclic graph (DAG) of routing nodes. Each node can be a Sequence (chain models in order), Ensemble (fan-out to all models and combine results), Splitter (weighted traffic distribution), or Switch (conditional routing based on request content). The graph executes starting from a mandatory root node and passes request or response data between steps.
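As a minimal sketch of such a resource (service names are placeholders; the schema follows the KServe v1alpha1 API), a two-step Sequence graph that chains two InferenceServices might look like:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: model-chainer
spec:
  nodes:
    root:                      # every graph starts from a node named "root"
      routerType: Sequence
      steps:
        - serviceName: sklearn-iris   # first model in the chain
          name: sklearn-iris
        - serviceName: xgboost-iris   # second model receives data from the graph
          name: xgboost-iris
          data: $request              # forward the original request, not the previous response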

Usage

Execute this workflow when your inference pipeline requires multiple models to produce a final prediction. Common use cases include face recognition (detection then feature extraction), NLP pipelines (classification then entity extraction), A/B testing (weighted traffic splitting), and model ensembles (combining predictions from multiple models). Use InferenceGraph whenever a single InferenceService is insufficient for your inference logic.

Execution Steps

Step 1: Deploy component InferenceServices

Create and deploy each individual InferenceService that will participate in the graph. Each model must be independently accessible and in Ready state before being referenced in the graph. These serve as the building blocks of the pipeline.

Key considerations:

  • Each InferenceService must be deployed and ready before creating the graph
  • Models can use different frameworks (sklearn, xgboost, tensorflow, etc.)
  • All InferenceServices must be in the same namespace as the InferenceGraph
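A component service might be declared as follows; the model name and storageUri are illustrative, and any framework supported by KServe is wired up the same way:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn          # framework; could equally be xgboost, tensorflow, etc.
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Confirm the service reports Ready (for example via kubectl get inferenceservice sklearn-iris) before referencing it from a graph.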

Step 2: Design the graph topology

Determine the routing pattern for your pipeline. Choose from four node types based on the inference logic required:

Node types:

  • Sequence: Chain models in order, passing request or response between steps
  • Ensemble: Fan-out request to all models in parallel, return combined results
  • Splitter: Distribute traffic by weight across models (e.g., 80/20 split)
  • Switch: Route to the model whose condition, written in GJSON syntax, matches the request content
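To make the fan-out and weighted-routing semantics concrete, the following Python sketch shows conceptually what an Ensemble and a Splitter node do. The model functions are simulated stand-ins, not KServe's actual router code:

```python
import random

# Illustrative semantics only -- not KServe's router implementation.
# Each "model" is simulated by a function that returns a prediction dict.
models = {
    "model-a": lambda request: {"predictions": [0]},
    "model-b": lambda request: {"predictions": [1]},
}

def ensemble(request):
    """Fan the request out to every model; key each result by step name."""
    return {name: predict(request) for name, predict in models.items()}

def splitter(request, weights, rng=random.random):
    """Forward the request to one model chosen by normalized weight."""
    total = sum(weights.values())
    threshold = rng() * total
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return models[name](request)
    return models[name](request)  # guard against floating-point edge cases
```

With weights of 80/20, roughly four in five requests reach the first model; the ensemble always calls every model and returns all results keyed by name.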

Step 3: Write the InferenceGraph specification

Author the InferenceGraph YAML manifest defining the root node and any additional nodes. Each node specifies a routerType and a list of steps. Steps reference InferenceServices by serviceName or external endpoints by serviceUrl. Configure data passing ($request or $response) between steps, weights for Splitter steps, and conditions for Switch steps.

Key considerations:

  • Every graph must have a node named "root"
  • Steps can reference InferenceServices (serviceName) or external URLs (serviceUrl)
  • Use $request to pass original input or $response to pass previous step output
  • Switch conditions use GJSON syntax for JSON path matching
  • Nodes can reference other nodes for composability
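Putting these pieces together, here is a hedged sketch of a graph whose root Splitter sends 20% of traffic to a nested Ensemble node; all service and node names are placeholders:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: split-then-ensemble
spec:
  nodes:
    root:
      routerType: Splitter
      steps:
        - serviceName: baseline-model    # 80% of traffic goes straight to one model
          name: baseline
          weight: 80
        - nodeName: candidate-ensemble   # 20% fans out to the ensemble node below
          name: candidates
          weight: 20
    candidate-ensemble:
      routerType: Ensemble
      steps:
        - serviceName: candidate-a
          name: a
        - serviceName: candidate-b
          name: b
```

The nodeName step illustrates one node referencing another, which is what makes the four router types composable into larger topologies.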

Step 4: Apply the InferenceGraph to Kubernetes

Submit the InferenceGraph manifest to the cluster. The KServe graph controller creates a router deployment that handles request routing according to the graph topology. The router pod starts and becomes ready when all referenced InferenceServices are accessible.

What happens:

  • A router deployment is created for the graph
  • The router resolves all referenced InferenceService endpoints
  • A URL is assigned for the graph entry point
  • The graph transitions to Ready state when the router is healthy

Step 5: Send request and validate pipeline output

Send a test request to the InferenceGraph URL. The router executes the graph starting from the root node, routing the request through the defined topology. Verify the response matches the expected output format based on the graph pattern (e.g., ensemble returns keyed results from each model, sequence returns the final step output).

Expected outputs by type:

  • Sequence: Returns the response from the last step in the chain
  • Ensemble: Returns a JSON object keyed by model/route name with each model's response
  • Splitter: Returns the response from whichever model received the request
  • Switch: Returns the response from the condition-matched model, or the input if no match
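For an Ensemble graph in particular, the client receives one response per step, keyed by step name. The sketch below parses such a response and combines the predictions by majority vote; the URL and response body are simulated placeholders, not output from a live cluster:

```python
import json

# Hypothetical values: substitute the URL reported for your graph and a
# feature vector appropriate to your models.
graph_url = "http://split-then-ensemble.default.example.com"
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}  # KServe v1 protocol body

# An Ensemble node returns one response per step, keyed by step name.
# A simulated response body stands in for a live HTTP call here:
raw_body = '{"a": {"predictions": [1]}, "b": {"predictions": [1]}}'
ensemble_response = json.loads(raw_body)

def majority_vote(response):
    """Combine per-model predictions with a simple majority vote."""
    votes = [body["predictions"][0] for body in response.values()]
    return max(set(votes), key=votes.count)
```

A real client would POST payload to the graph URL and pass the decoded JSON body to a combiner like this one.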

Execution Diagram

GitHub URL

Workflow Repository