Implementation:Kserve Kserve HuggingFace Multinode Runtime

From Leeroopedia
Knowledge Sources
Domains: Kubernetes, Model Serving
Last Updated: 2026-02-13 00:00 GMT

Overview

A concrete ClusterServingRuntime, provided by the KServe project, for multi-node distributed inference of HuggingFace models.

Description

This file defines a ClusterServingRuntime named kserve-huggingfaceserver-multinode for distributed inference of HuggingFace models, using Ray to apply tensor and pipeline parallelism across multiple GPU nodes. The head container starts a Ray head node, waits for the workers to register, and then launches the HuggingFace server with the configurable --tensor-parallel-size and --pipeline-parallel-size arguments. A separate worker template (workerSpec) joins the Ray cluster and takes part in the distributed computation. The runtime supports the v1 and v2 protocols and is auto-selected for the huggingface model format with priority 2. This makes it possible to serve large language models that exceed single-node GPU memory by distributing inference across multiple Ray worker nodes.
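
The "wait for worker registration" step is a bounded retry before the server starts. The sketch below illustrates that pattern only; it is not the logic of health_check.py. The retry helper, the workers_registered check, and the counter are all hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: run a check up to N times, pausing between tries.
# This only illustrates the pattern; it is NOT what health_check.py does.
retry() {
  local retries="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$retries" ]; do
    "$@" && return 0            # check passed: stop retrying
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
  return 1                      # every attempt failed
}

# Hypothetical check that only succeeds on the third call,
# standing in for "all Ray workers have registered".
count=0
workers_registered() {
  count=$(( count + 1 ))
  [ "$count" -ge 3 ]
}

retry 5 0 workers_registered && echo "ready after ${count} checks"
```

In the runtime itself the equivalent bound is the --retries 200 argument passed to the health check.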

Usage

Apply this ClusterServingRuntime to enable multi-node distributed inference for large HuggingFace models. Then create an InferenceService that uses the huggingface model format, and set the TENSOR_PARALLEL_SIZE and PIPELINE_PARALLEL_SIZE environment variables to control the parallelism strategy; the runtime's workerSpec sets both sizes to 1 by default. The runtime handles Ray cluster setup and worker registration automatically.

Code Reference

Source Location

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-huggingfaceserver-multinode
spec:
  annotations:
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 2
  protocolVersions:
    - v2
    - v1
  containers:
    - name: kserve-container
      image: huggingfaceserver-gpu:replace
      command:
      - "bash"
      - "-c"
      - |
        export RAY_ADDRESS=${POD_IP}:${RAY_PORT}
        ray start --head --disable-usage-stats --include-dashboard false
        python ./huggingfaceserver/health_check.py registered_nodes --retries 200
        python -m huggingfaceserver ${MODEL_DIR_ARG} \
          --tensor-parallel-size=${TENSOR_PARALLEL_SIZE} \
          --pipeline-parallel-size=${PIPELINE_PARALLEL_SIZE} $0 $@
      args:
      - --model_name={{.Name}}
  workerSpec:
    pipelineParallelSize: 1
    tensorParallelSize: 1
    containers:
      - name: worker-container
        image: huggingfaceserver-gpu:replace
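
The trailing `$0 $@` in the head container command is easy to misread. Kubernetes appends the entries of args after the command, and with `bash -c` the first argument after the script string is bound to `$0` rather than `$1`, so both are needed to forward every entry. A minimal sketch, where the forward_args helper and the --example_flag argument are hypothetical and echo stands in for the server process:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the runtime command: echo replaces
# `python -m huggingfaceserver` so the forwarded arguments are visible.
forward_args() {
  # Arguments after the script string are bound to $0, $1, $2, ...
  bash -c 'echo server args: $0 $@' "$@"
}

forward_args --model_name=llama-multinode --example_flag=1
# prints: server args: --model_name=llama-multinode --example_flag=1
```

Using `$@` alone would silently drop the first argument, which in this runtime is --model_name.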

Import

kubectl apply -f config/runtimes/kserve-huggingfaceserver-multinode.yaml

I/O Contract

Inputs

Name | Type | Required | Description
TENSOR_PARALLEL_SIZE | env variable | No | Number of GPUs for tensor parallelism (default from workerSpec: 1)
PIPELINE_PARALLEL_SIZE | env variable | No | Number of nodes for pipeline parallelism (default from workerSpec: 1)
MODEL_ID | env variable | No | HuggingFace model ID to download and serve
MODEL_DIR | env variable | No | Local directory path for pre-downloaded model
RAY_PORT | env variable | Yes | Ray head node port (default: 6379)
{{.Name}} | template variable | Yes | Model name injected by KServe at runtime

Outputs

Name | Type | Description
ClusterServingRuntime | Custom Resource | Multi-node HuggingFace runtime available cluster-wide
Ray head node | Container | Manages the Ray cluster and runs the inference server on port 8080
Ray worker nodes | Container (workerSpec) | Join the Ray cluster for distributed computation
Prometheus metrics | HTTP port 8080 /metrics | Model serving metrics endpoint
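
Because the runtime lists both the v1 and v2 protocols, a client needs the matching request path. The sketch below assumes the standard KServe paths for those two protocols; the inference_url helper and the localhost:8080 host are hypothetical, and whether a given model task answers on these paths depends on the server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the inference URL for a model name.
# Assumes the standard KServe v1 (:predict) and v2 (/infer) paths.
inference_url() {
  local host="$1"
  local model="$2"
  local protocol="$3"
  case "$protocol" in
    v1) echo "http://${host}/v1/models/${model}:predict" ;;
    v2) echo "http://${host}/v2/models/${model}/infer" ;;
    *)  echo "unsupported protocol: ${protocol}" >&2; return 1 ;;
  esac
}

inference_url localhost:8080 llama-multinode v2
# prints: http://localhost:8080/v2/models/llama-multinode/infer
```

The metrics endpoint from the table sits on the same port, at http://localhost:8080/metrics under the same host assumption.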

Usage Examples

Apply the runtime

kubectl apply -f config/runtimes/kserve-huggingfaceserver-multinode.yaml

Create a multi-node InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-multinode
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      env:
        - name: TENSOR_PARALLEL_SIZE
          value: "4"
        - name: PIPELINE_PARALLEL_SIZE
          value: "2"
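
As a sanity check on sizing, the example's values can be turned into a GPU count. This sketch assumes, as the Inputs table describes, that tensor parallelism counts GPUs per node and pipeline parallelism counts nodes, so the total is their product. The total_gpus helper is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total GPUs a deployment needs, assuming one model
# shard per GPU, i.e. the product of the two parallelism degrees.
total_gpus() {
  local tensor_parallel_size="$1"
  local pipeline_parallel_size="$2"
  echo $(( tensor_parallel_size * pipeline_parallel_size ))
}

total_gpus 4 2   # prints 8
```

Under that assumption, the InferenceService above would need 2 nodes with 4 GPUs each.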
