Implementation:Kserve Kserve HuggingFace Multinode Runtime
| Knowledge Sources | |
|---|---|
| Domains | Kubernetes, Model Serving |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A concrete ClusterServingRuntime, provided by the KServe project, for multi-node distributed inference of HuggingFace models.
Description
This file defines a ClusterServingRuntime named kserve-huggingfaceserver-multinode for distributed inference of HuggingFace models, using Ray to apply tensor and pipeline parallelism across multiple GPU nodes. The head container starts a Ray head node, waits for nodes to register (health_check.py registered_nodes, with up to 200 retries), and then launches the HuggingFace server with configurable --tensor-parallel-size and --pipeline-parallel-size arguments. Containers from a separate worker template (workerSpec) join the Ray cluster and participate in the distributed computation. The runtime supports the v1 and v2 protocols and is auto-selected for the huggingface model format with priority 2. Together, this allows serving large language models that exceed the GPU memory of a single node by distributing inference across multiple Ray nodes.
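The head container's wait for registration can be pictured as a bounded polling loop. The sketch below is not the project's health_check.py. The function name `wait_for_nodes`, its parameters, and the injected `count_nodes` callable are hypothetical and chosen only for illustration. The retry budget of 200 is the only value taken from the runtime's command.

```python
import time
from typing import Callable


def wait_for_nodes(
    count_nodes: Callable[[], int],
    expected: int,
    retries: int = 200,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until at least `expected` nodes are registered or retries run out.

    `count_nodes` is a caller-supplied function (hypothetical here). Against a
    real Ray cluster it could, for example, count the alive entries returned
    by ray.nodes().
    """
    for attempt in range(retries):
        if count_nodes() >= expected:
            return True
        # Skip the sleep after the final failed attempt.
        if attempt < retries - 1:
            sleep(delay_s)
    return False
```

Injecting `count_nodes` and `sleep` keeps the loop testable without a running Ray cluster. The server would only be launched once such a check returns True.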
Usage
Apply this ClusterServingRuntime to enable multi-node distributed inference for large HuggingFace models. Create an InferenceService that references the huggingface model format, and set the TENSOR_PARALLEL_SIZE and PIPELINE_PARALLEL_SIZE environment variables to control the parallelism strategy. The runtime automatically manages Ray cluster setup and worker registration.
Code Reference
Source Location
- Repository: Kserve_Kserve
- File: config/runtimes/kserve-huggingfaceserver-multinode.yaml
- Lines: 1-189
Signature
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-huggingfaceserver-multinode
spec:
  annotations:
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 2
  protocolVersions:
    - v2
    - v1
  containers:
    - name: kserve-container
      image: huggingfaceserver-gpu:replace
      command:
        - "bash"
        - "-c"
        - |
          export RAY_ADDRESS=${POD_IP}:${RAY_PORT}
          ray start --head --disable-usage-stats --include-dashboard false
          python ./huggingfaceserver/health_check.py registered_nodes --retries 200
          python -m huggingfaceserver ${MODEL_DIR_ARG} \
            --tensor-parallel-size=${TENSOR_PARALLEL_SIZE} \
            --pipeline-parallel-size=${PIPELINE_PARALLEL_SIZE} $0 $@
      args:
        - --model_name={{.Name}}
  workerSpec:
    pipelineParallelSize: 1
    tensorParallelSize: 1
    containers:
      - name: worker-container
        image: huggingfaceserver-gpu:replace
Import
kubectl apply -f config/runtimes/kserve-huggingfaceserver-multinode.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| TENSOR_PARALLEL_SIZE | env variable | No | Number of GPUs for tensor parallelism (default from workerSpec: 1) |
| PIPELINE_PARALLEL_SIZE | env variable | No | Number of nodes for pipeline parallelism (default from workerSpec: 1) |
| MODEL_ID | env variable | No | HuggingFace model ID to download and serve |
| MODEL_DIR | env variable | No | Local directory path for pre-downloaded model |
| RAY_PORT | env variable | Yes | Ray head node port, combined with POD_IP to form RAY_ADDRESS (default: 6379) |
| {{.Name}} | template variable | Yes | Model name injected by KServe at runtime |
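To make the relationship between these inputs and the head container's command concrete, here is a minimal Python sketch that mirrors the shell expansion. It is an illustration, not KServe code. The function name `resolve_head_args` is hypothetical, the MODEL_DIR_ARG and trailing `$0 $@` parts of the command are left out, and POD_IP appears in the command but not in the table.

```python
from typing import Dict, List, Tuple


def resolve_head_args(env: Dict[str, str], model_name: str) -> Tuple[str, List[str]]:
    """Return (RAY_ADDRESS, server argv) roughly as the head command expands them."""
    # RAY_PORT is marked required in the inputs table, so fail loudly if absent.
    if "RAY_PORT" not in env:
        raise KeyError("RAY_PORT must be set")
    ray_address = f"{env['POD_IP']}:{env['RAY_PORT']}"

    # Both parallel sizes fall back to 1, matching the workerSpec defaults.
    tp = int(env.get("TENSOR_PARALLEL_SIZE", "1"))
    pp = int(env.get("PIPELINE_PARALLEL_SIZE", "1"))
    if tp < 1 or pp < 1:
        raise ValueError("parallel sizes must be positive integers")

    argv = [
        "python", "-m", "huggingfaceserver",
        f"--tensor-parallel-size={tp}",
        f"--pipeline-parallel-size={pp}",
        f"--model_name={model_name}",  # stands in for the {{.Name}} template value
    ]
    return ray_address, argv
```

For example, `resolve_head_args({"POD_IP": "10.0.0.5", "RAY_PORT": "6379"}, "llama")` yields the address `10.0.0.5:6379` and both parallel-size flags set to 1.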
Outputs
| Name | Type | Description |
|---|---|---|
| ClusterServingRuntime | Custom Resource | Multi-node HuggingFace runtime available cluster-wide |
| Ray head node | Container | Manages the Ray cluster and runs the inference server on port 8080 |
| Ray worker nodes | Container (workerSpec) | Join the Ray cluster for distributed computation |
| Prometheus metrics | HTTP port 8080 /metrics | Model serving metrics endpoint |
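The metrics endpoint in the last row follows from the two prometheus.kserve.io annotations in the runtime spec. As a small sketch, the snippet below derives a scrape URL from them. The function name `scrape_url`, the `http` scheme, and the example host are assumptions for illustration only.

```python
from typing import Mapping

# Annotation values copied from the runtime spec.
RUNTIME_ANNOTATIONS = {
    "prometheus.kserve.io/port": "8080",
    "prometheus.kserve.io/path": "/metrics",
}


def scrape_url(annotations: Mapping[str, str], host: str) -> str:
    """Build a metrics URL from the runtime's Prometheus annotations."""
    port = annotations["prometheus.kserve.io/port"]
    path = annotations["prometheus.kserve.io/path"]
    # Normalise the path so the URL is well-formed even without a leading slash.
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}:{port}{path}"


# "10.0.0.5" is a placeholder pod IP, not a value from the source.
print(scrape_url(RUNTIME_ANNOTATIONS, "10.0.0.5"))
```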
Usage Examples
Apply the runtime
kubectl apply -f config/runtimes/kserve-huggingfaceserver-multinode.yaml
Create a multi-node InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-multinode
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      env:
        - name: TENSOR_PARALLEL_SIZE
          value: "4"
        - name: PIPELINE_PARALLEL_SIZE
          value: "2"
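As a rough sizing check for the example above, the sketch below works out how many GPUs and nodes the two settings imply. It follows the inputs table in treating the tensor-parallel size as GPUs per node and the pipeline-parallel size as the number of nodes. The class name `ParallelLayout` is hypothetical, and counting the head pod as one of the pipeline nodes is an assumption rather than something stated in the runtime file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    tensor_parallel_size: int    # GPUs used per node
    pipeline_parallel_size: int  # number of nodes (pipeline stages)

    @property
    def total_gpus(self) -> int:
        # Each pipeline stage is itself split across the tensor-parallel GPUs.
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def worker_nodes(self) -> int:
        # Assumption: the head pod counts as one of the pipeline nodes.
        return self.pipeline_parallel_size - 1


layout = ParallelLayout(tensor_parallel_size=4, pipeline_parallel_size=2)
print(layout.total_gpus, layout.worker_nodes)  # prints: 8 1
```

Under these assumptions, the example InferenceService needs 8 GPUs in total, spread over one head node and one worker node with 4 GPUs each.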