Workflow: KServe InferenceGraph Pipeline
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, Kubernetes, Inference_Pipelines, Model_Orchestration |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
End-to-end process for building multi-model inference pipelines using KServe InferenceGraph with sequence, ensemble, splitter, and switch routing patterns.
Description
This workflow covers the creation of complex inference pipelines that chain, ensemble, split, or conditionally route requests across multiple InferenceServices. KServe InferenceGraph is a custom resource that defines a directed acyclic graph (DAG) of routing nodes. Each node can be a Sequence (chain models in order), Ensemble (fan-out to all models and combine results), Splitter (weighted traffic distribution), or Switch (conditional routing based on request content). The graph executes starting from a mandatory root node and passes request or response data between steps.
Usage
Execute this workflow when your inference pipeline requires multiple models to produce a final prediction. Common use cases include face recognition (detection then feature extraction), NLP pipelines (classification then entity extraction), A/B testing (weighted traffic splitting), and model ensembles (combining predictions from multiple models). Use InferenceGraph whenever a single InferenceService is insufficient for your inference logic.
Execution Steps
Step 1: Deploy component InferenceServices
Create and deploy each individual InferenceService that will participate in the graph. Each model must be independently accessible and in Ready state before being referenced in the graph. These serve as the building blocks of the pipeline.
Key considerations:
- Each InferenceService must be deployed and ready before creating the graph
- Models can use different frameworks (sklearn, xgboost, tensorflow, etc.)
- All InferenceServices must be in the same namespace as the InferenceGraph
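As a sketch, one such building block might look like the following (the service name, namespace, and storage URI are illustrative; any framework supported by KServe works the same way):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris        # illustrative name; the graph references it via serviceName
  namespace: kserve-test    # must match the InferenceGraph's namespace
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Wait for the service to report Ready (e.g. via kubectl get inferenceservice sklearn-iris) before referencing it from a graph.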
Step 2: Design the graph topology
Determine the routing pattern for your pipeline. Choose from four node types based on the inference logic required:
Node types:
- Sequence: Chain models in order, passing request or response between steps
- Ensemble: Fan-out request to all models in parallel, return combined results
- Splitter: Distribute traffic by weight across models (e.g., 80/20 split)
- Switch: Route to a specific model based on a condition matching the request using GJSON syntax
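The four router types can be sketched side by side as nodes in a single spec (service names, weights, and the GJSON condition below are all hypothetical placeholders):

```yaml
nodes:
  root:                        # Sequence: chain models in order
    routerType: Sequence
    steps:
    - serviceName: face-detector
      name: detector
    - serviceName: feature-extractor
      name: extractor
      data: $response          # feed the detector's output forward
  ensembleNode:                # Ensemble: fan out and return keyed results
    routerType: Ensemble
    steps:
    - serviceName: sklearn-model
      name: sklearn
    - serviceName: xgboost-model
      name: xgboost
  splitterNode:                # Splitter: 80/20 weighted traffic split
    routerType: Splitter
    steps:
    - serviceName: model-v1
      weight: 80
    - serviceName: model-v2
      weight: 20
  switchNode:                  # Switch: route on a GJSON match against the request
    routerType: Switch
    steps:
    - serviceName: english-model
      condition: "[@this].#(lang==\"en\")"   # illustrative GJSON expression
```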
Step 3: Write the InferenceGraph specification
Author the InferenceGraph YAML manifest defining the root node and any additional nodes. Each node specifies its routerType and the steps or routes. Steps reference InferenceServices by serviceName or external endpoints by serviceUrl. Configure data passing ($request or $response) between steps, weights for splitters, and conditions for switch nodes.
Key considerations:
- Every graph must have a node named "root"
- Steps can reference InferenceServices (serviceName) or external URLs (serviceUrl)
- Use $request to pass original input or $response to pass previous step output
- Switch conditions use GJSON syntax for JSON path matching
- Nodes can reference other nodes for composability
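A minimal complete manifest for a two-step Sequence graph, assuming two already-deployed InferenceServices with the illustrative names below, could look like this:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: model-chainer
spec:
  nodes:
    root:                              # mandatory entry node
      routerType: Sequence
      steps:
      - serviceName: cat-dog-classifier
        name: cat-dog-classifier
      - serviceName: dog-breed-classifier
        name: dog-breed-classifier
        data: $response                # pass the first model's output, not the original request
```

Omitting data (or setting it to $request) would instead send the original input to the second step.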
Step 4: Apply the InferenceGraph to Kubernetes
Submit the InferenceGraph manifest to the cluster. The KServe graph controller creates a router deployment that handles request routing according to the graph topology. The router pod starts and becomes ready when all referenced InferenceServices are accessible.
What happens:
- A router deployment is created for the graph
- The router resolves all referenced InferenceService endpoints
- A URL is assigned for the graph entry point
- The graph transitions to Ready state when the router is healthy
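After applying the manifest (e.g. kubectl apply -f graph.yaml) and waiting for reconciliation, the graph's status should look roughly like the fragment below; the exact URL format depends on the cluster's ingress/domain configuration, and all values shown are illustrative:

```yaml
status:
  conditions:
  - status: "True"
    type: Ready
  url: http://model-chainer.kserve-test.example.com
```

kubectl get inferencegraph model-chainer surfaces the same URL and Ready columns at a glance.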
Step 5: Send request and validate pipeline output
Send a test request to the InferenceGraph URL. The router executes the graph starting from the root node, routing the request through the defined topology. Verify the response matches the expected output format based on the graph pattern (e.g., ensemble returns keyed results from each model, sequence returns the final step output).
Expected outputs by type:
- Sequence: Returns the response from the last step in the chain
- Ensemble: Returns a JSON object keyed by model/route name with each model's response
- Splitter: Returns the response from whichever model received the request
- Switch: Returns the response from the condition-matched model, or the input if no match
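The per-type output shapes above can be sketched with mock payloads. Everything in this snippet is hypothetical example data following the V1 inference protocol's instances/predictions convention; real values come from your deployed models:

```python
# Mock payloads illustrating the response shape for each router type.
# All values are hypothetical; only the structure is meaningful.

request = {"instances": [[6.8, 2.8, 4.8, 1.4]]}   # V1 protocol input

# Sequence: the caller sees only the final step's response.
sequence_out = {"predictions": ["husky"]}

# Ensemble: a JSON object keyed by each step's name.
ensemble_out = {
    "sklearn": {"predictions": [1]},
    "xgboost": {"predictions": [1]},
}

# Splitter: shape matches whichever single model served the request.
splitter_out = {"predictions": [1]}

# Switch with no matching condition: the original input passes through.
switch_no_match_out = request

assert set(ensemble_out) == {"sklearn", "xgboost"}
assert switch_no_match_out == request
print("shapes ok")
```

Validating against these shapes in a smoke test catches wiring mistakes (e.g. a Sequence step forwarding $request where $response was intended) before the graph serves real traffic.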