Principle: KServe Local Model Caching

Knowledge Sources	Kserve_Kserve KServe Docs KServe Local Model Cache
Domains	MLOps, Caching, Kubernetes
Last Updated	2026-02-13 00:00 GMT

Overview

A node-level model caching strategy that pre-downloads model artifacts to local storage on cluster nodes, so that inference service cold starts do not have to wait on a fetch from remote storage.

Description

Local Model Caching addresses the cold start problem in model serving, where large model files (often gigabytes for deep learning models, tens of gigabytes for LLMs) must be downloaded from remote storage before an inference pod can begin serving. By pre-caching model artifacts on node-local storage, new pods can mount the cached model immediately rather than waiting for a full download.
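To give a rough sense of scale, the download cost can be estimated as model size divided by available bandwidth. The sketch below is a back-of-the-envelope calculation only; the model sizes and the 1 Gbit/s link speed are illustrative assumptions, not measurements, and real downloads are usually slower because of protocol overhead and contention.

```python
def download_seconds(model_gib: float, bandwidth_gbit_per_s: float) -> float:
    """Ideal transfer time for a model of model_gib GiB over the given link."""
    model_bits = model_gib * 1024**3 * 8          # GiB -> bits
    return model_bits / (bandwidth_gbit_per_s * 1e9)


# Assumed sizes (GiB) and an assumed 1 Gbit/s link to remote storage.
for size_gib in (2, 20, 70):
    seconds = download_seconds(size_gib, 1.0)
    print(f"{size_gib:>3} GiB at 1 Gbit/s: about {seconds:.0f} s")
```

Under these assumptions, every new pod pays this transfer cost on startup unless the model is already present on the node.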

The system uses three custom resources working together:

LocalModelCache -- declares which model (identified by its storage URI) should be cached and on which node group. It represents the desired caching state for a specific model.
LocalModelNodeGroup -- defines a set of nodes (via label selectors) that should participate in local model caching, along with storage configuration (persistent volume claim templates, storage limits).
LocalModelNode -- a per-node resource (managed by the controller) that tracks the actual cache state on each node: which models are cached, their download status, and disk usage.

A dedicated controller manager orchestrates the caching lifecycle, while a DaemonSet agent on each node performs the actual model downloads, storage management, and eviction when disk space is constrained.
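The relationship between the three resources, the controller, and the node agent can be sketched as a small in-memory model. This is not KServe code: the field names follow the description on this page rather than the actual CRD schema, and the status values, `select_nodes`, `reconcile`, and `agent_sync` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LocalModelNodeGroup:
    name: str
    node_selector: dict[str, str]  # labels a node must carry to participate
    storage_limit_bytes: int


@dataclass
class LocalModelCache:
    name: str
    source_model_uri: str
    node_group: str  # name of the LocalModelNodeGroup to cache on


@dataclass
class CachedModel:
    source_model_uri: str
    status: str = "Pending"  # hypothetical states: Pending, Downloaded, Failed


@dataclass
class LocalModelNode:
    node_name: str
    models: dict[str, CachedModel] = field(default_factory=dict)


def select_nodes(group, node_labels):
    """Return names of nodes whose labels satisfy every selector entry."""
    return sorted(
        name for name, labels in node_labels.items()
        if all(labels.get(k) == v for k, v in group.node_selector.items())
    )


def reconcile(cache, group, node_labels, node_states):
    """Controller step: record the desired model on every selected node."""
    if cache.node_group != group.name:
        raise ValueError(f"{cache.name} does not reference group {group.name}")
    for node_name in select_nodes(group, node_labels):
        node = node_states.setdefault(node_name, LocalModelNode(node_name))
        node.models.setdefault(cache.name, CachedModel(cache.source_model_uri))
    return node_states


def agent_sync(node, download):
    """Agent step: fetch every pending model and record the outcome."""
    for model in node.models.values():
        if model.status == "Pending":
            ok = download(model.source_model_uri)
            model.status = "Downloaded" if ok else "Failed"


# Example data; the labels and URI are made up for this sketch.
group = LocalModelNodeGroup("gpu-nodes", {"pool": "gpu"}, 500 * 1024**3)
cache = LocalModelCache("llm-a", "s3://example-bucket/llm-a", "gpu-nodes")
labels = {
    "node-1": {"pool": "gpu"},
    "node-2": {"pool": "cpu"},
    "node-3": {"pool": "gpu"},
}

states = reconcile(cache, group, labels, {})
for node in states.values():
    agent_sync(node, download=lambda uri: True)  # stand-in for a real download
```

The split mirrors the division of labor described above: the controller only writes desired state into per-node records, and each agent acts on its own record and reports status back.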

Usage

Use this principle when:

Deploying large models where download time dominates cold start latency
Running autoscaled inference services that frequently create new pods
Operating in environments with limited or expensive network bandwidth to remote storage
Serving LLMs where model weights exceed several gigabytes

Theoretical Basis

# Local model caching lifecycle (NOT implementation code)
Caching setup:
  1. Admin creates LocalModelNodeGroup:
     - nodeSelector: labels identifying eligible nodes
     - storageSpec: PVC template for local model storage
  2. Admin creates LocalModelCache:
     - sourceModelUri: remote storage URI of the model
     - nodeGroup: reference to LocalModelNodeGroup

Cache population:
  1. LocalModel controller sees new LocalModelCache resource
  2. Controller identifies target nodes via LocalModelNodeGroup selector
  3. Controller creates/updates LocalModelNode resource per node
  4. DaemonSet agent on each node detects its LocalModelNode update
  5. Agent downloads model from sourceModelUri to local PVC
  6. Agent updates LocalModelNode status (cached: true, size, hash)

Inference pod startup with cache:
  Without caching: Pod start → Init container downloads model → Ready (minutes)
  With caching:    Pod start → Mount local PVC with cached model → Ready (seconds)

Cache eviction:
  - Agent monitors disk usage on local PVC
  - When usage exceeds threshold, evict least-recently-used models
  - Update LocalModelNode status to reflect eviction
  - Controller may re-trigger download if model is still referenced
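The eviction step above can be illustrated with a least-recently-used store built on `collections.OrderedDict`. This is a minimal sketch of the LRU policy as described here, not the agent's actual logic; the `NodeModelStore` class, its method names, and the 0.8 threshold are assumptions made for the example.

```python
from collections import OrderedDict


class NodeModelStore:
    """Tracks cached models on one node and evicts by least recent use."""

    def __init__(self, capacity_bytes: int, threshold: float = 0.8):
        self.limit = capacity_bytes * threshold  # usage above this triggers eviction
        self._models: OrderedDict[str, int] = OrderedDict()  # name -> size, LRU first

    @property
    def used_bytes(self) -> int:
        return sum(self._models.values())

    def touch(self, name: str) -> None:
        """Mark a model as recently used, e.g. when a pod mounts it."""
        self._models.move_to_end(name)

    def add(self, name: str, size_bytes: int) -> list[str]:
        """Register a downloaded model and return the names evicted to make room."""
        if size_bytes > self.limit:
            raise ValueError(f"{name} alone exceeds the usage threshold")
        self._models[name] = size_bytes
        self._models.move_to_end(name)
        return self.evict_if_needed()

    def evict_if_needed(self) -> list[str]:
        evicted = []
        while self.used_bytes > self.limit:
            name, _ = self._models.popitem(last=False)  # drop least recently used
            evicted.append(name)
        return evicted


store = NodeModelStore(capacity_bytes=100, threshold=0.8)
store.add("model-a", 30)
store.add("model-b", 30)
store.touch("model-a")                # model-b is now the least recently used
evicted = store.add("model-c", 30)    # usage would reach 90, above the limit of 80
print(evicted)
```

In this example `model-b` is evicted because `model-a` was touched more recently. A real agent would also delete the files from disk and update the LocalModelNode status, as the steps above describe.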
