Implementation:Kserve Kserve HuggingFace Multinode Runtime

Knowledge Sources	Kserve_Kserve KServe Docs
Domains	Kubernetes, Model Serving
Last Updated	2026-02-13 00:00 GMT

Overview

A concrete ClusterServingRuntime, provided by the KServe project, for multi-node distributed inference of HuggingFace models.

Description

This file defines a ClusterServingRuntime named kserve-huggingfaceserver-multinode for distributed inference of HuggingFace models. It uses Ray to coordinate tensor and pipeline parallelism across multiple GPU nodes. The head container starts a Ray head node, waits for workers to register, and then launches the HuggingFace server with configurable --tensor-parallel-size and --pipeline-parallel-size arguments. A separate worker template (workerSpec) joins the Ray cluster and participates in distributed computation. The runtime supports the v1 and v2 protocols and registers the huggingface model format for auto-selection with priority 2. This makes it possible to serve large language models that exceed single-node GPU memory by distributing inference across multiple Ray worker nodes.

Usage

Apply this ClusterServingRuntime to enable multi-node distributed inference for large HuggingFace models. Create an InferenceService that references the huggingface model format, and set the TENSOR_PARALLEL_SIZE and PIPELINE_PARALLEL_SIZE environment variables to control the parallelism strategy. The runtime automatically manages Ray cluster setup and worker registration.

Code Reference

Source Location

Repository: Kserve_Kserve
File: config/runtimes/kserve-huggingfaceserver-multinode.yaml
Lines: 1-189

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-huggingfaceserver-multinode
spec:
  annotations:
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 2
  protocolVersions:
    - v2
    - v1
  containers:
    - name: kserve-container
      image: huggingfaceserver-gpu:replace
      command:
      - "bash"
      - "-c"
      - |
        export RAY_ADDRESS=${POD_IP}:${RAY_PORT}
        ray start --head --disable-usage-stats --include-dashboard false
        python ./huggingfaceserver/health_check.py registered_nodes --retries 200
        python -m huggingfaceserver ${MODEL_DIR_ARG} \
          --tensor-parallel-size=${TENSOR_PARALLEL_SIZE} \
          --pipeline-parallel-size=${PIPELINE_PARALLEL_SIZE} $0 $@
      args:
      - --model_name={{.Name}}
  workerSpec:
    pipelineParallelSize: 1
    tensorParallelSize: 1
    containers:
      - name: worker-container
        image: huggingfaceserver-gpu:replace
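
The trailing `$0 $@` in the head container command is easy to misread. Kubernetes appends the `args` list after `command`, and with `bash -c` the first word following the script string is bound to `$0` rather than `$1`, so the script needs both to forward every argument. The following is a minimal sketch of that behaviour outside of Kubernetes; the `--dtype=float16` argument is a hypothetical extra used for illustration, not something this runtime sets.

```shell
# With `bash -c '<script>' a b c`, bash binds a to $0 and "b c" to $@.
# The rendered --model_name argument from `args` therefore arrives in $0.
# --dtype=float16 is a hypothetical extra argument used only for illustration.
forwarded=$(bash -c 'echo $0 $@' --model_name=demo --dtype=float16)
echo "$forwarded"   # prints: --model_name=demo --dtype=float16
```

Dropping `$0` from the script would silently discard the first argument, which in this runtime is the model name.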

Import

kubectl apply -f config/runtimes/kserve-huggingfaceserver-multinode.yaml

I/O Contract

Inputs

Name	Type	Required	Description
TENSOR_PARALLEL_SIZE	env variable	No	Number of GPUs for tensor parallelism (default from workerSpec: 1)
PIPELINE_PARALLEL_SIZE	env variable	No	Number of nodes for pipeline parallelism (default from workerSpec: 1)
MODEL_ID	env variable	No	HuggingFace model ID to download and serve
MODEL_DIR	env variable	No	Local directory path for pre-downloaded model
RAY_PORT	env variable	Yes	Ray head node port, read by the startup script to build RAY_ADDRESS (default: 6379)
{{.Name}}	template variable	Yes	Model name injected by KServe at runtime
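
As a rough sizing aid, the two parallelism inputs can be combined into a total GPU count. This sketch assumes the common convention that the inference engine needs tensor-parallel size times pipeline-parallel size GPUs in total, and it falls back to 1 for an unset value, matching the workerSpec defaults. `total_gpus` is a hypothetical helper, not part of KServe.

```shell
# Hypothetical sizing helper (not part of KServe).
# Assumes total GPUs = tensor-parallel size x pipeline-parallel size,
# with each value falling back to 1 when unset, as in workerSpec.
total_gpus() {
  tp="${1:-1}"
  pp="${2:-1}"
  echo $(( tp * pp ))
}

total_gpus 4 2   # prints: 8
total_gpus       # prints: 1
```

Under that assumption, a setting of 4 and 2 calls for eight GPUs spread across two nodes.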

Outputs

Name	Type	Description
ClusterServingRuntime	Custom Resource	Multi-node HuggingFace runtime available cluster-wide
Ray head node	Container	Manages the Ray cluster and runs the inference server on port 8080
Ray worker nodes	Container (workerSpec)	Join the Ray cluster for distributed computation
Prometheus metrics	HTTP port 8080 /metrics	Model serving metrics endpoint

Usage Examples

Apply the runtime

kubectl apply -f config/runtimes/kserve-huggingfaceserver-multinode.yaml

Create a multi-node InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-multinode
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      env:
        - name: TENSOR_PARALLEL_SIZE
          value: "4"
        - name: PIPELINE_PARALLEL_SIZE
          value: "2"
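
After the manifest above is applied, readiness can be checked with `kubectl wait`. This is a minimal sketch, assuming kubectl is configured against a cluster with KServe installed. `wait_for_isvc` is a hypothetical wrapper and the timeout values are illustrative; large models may need longer to load.

```shell
# Hypothetical wrapper: block until an InferenceService reports Ready.
# Assumes kubectl can reach a cluster where KServe is installed.
wait_for_isvc() {
  name="$1"
  timeout="${2:-600s}"   # illustrative default, adjust for model size
  kubectl wait --for=condition=Ready "inferenceservice/${name}" --timeout="${timeout}"
}

# Usage (requires a live cluster, so it is left commented out):
# wait_for_isvc llama-multinode 900s
```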
