Principle:Kserve Kserve Local Model Caching
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Caching, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A node-level model caching strategy that pre-downloads model artifacts to local storage on cluster nodes, eliminating remote storage fetch latency during inference service cold starts.
Description
Local Model Caching addresses the cold start problem in model serving, where large model files (often gigabytes for deep learning models, tens of gigabytes for LLMs) must be downloaded from remote storage before an inference pod can begin serving. By pre-caching model artifacts on node-local storage, new pods can mount the cached model immediately rather than waiting for a full download.
The system uses three custom resources working together:
- LocalModelCache -- declares which model (identified by its storage URI) should be cached and on which node group. It represents the desired caching state for a specific model.
- LocalModelNodeGroup -- defines a set of nodes (via label selectors) that should participate in local model caching, along with storage configuration (persistent volume claim templates, storage limits).
- LocalModelNode -- a per-node resource (managed by the controller) that tracks the actual cache state on each node: which models are cached, their download status, and disk usage.
A dedicated controller manager orchestrates the caching lifecycle, while a DaemonSet agent on each node performs the actual model downloads, storage management, and eviction when disk space is constrained.
Usage
Use this principle when:
- Deploying large models where download time dominates cold start latency
- Running autoscaled inference services that frequently create new pods
- Operating in environments with limited or expensive network bandwidth to remote storage
- Serving LLMs where model weights exceed several gigabytes
Theoretical Basis
# Local model caching lifecycle (NOT implementation code)
Caching setup:
1. Admin creates LocalModelNodeGroup:
- nodeSelector: labels identifying eligible nodes
- storageSpec: PVC template for local model storage
2. Admin creates LocalModelCache:
- sourceModelUri: remote storage URI of the model
- nodeGroup: reference to LocalModelNodeGroup
Cache population:
1. LocalModel controller sees new LocalModelCache resource
2. Controller identifies target nodes via LocalModelNodeGroup selector
3. Controller creates/updates LocalModelNode resource per node
4. DaemonSet agent on each node detects its LocalModelNode update
5. Agent downloads model from sourceModelUri to local PVC
6. Agent updates LocalModelNode status (cached: true, size, hash)
Inference pod startup with cache:
Without caching: Pod start → Init container downloads model → Ready (minutes)
With caching: Pod start → Mount local PVC with cached model → Ready (seconds)
Cache eviction:
- Agent monitors disk usage on local PVC
- When usage exceeds threshold, evict least-recently-used models
- Update LocalModelNode status to reflect eviction
- Controller may re-trigger download if model is still referenced
Related Pages
Implemented By
- Implementation:Kserve_Kserve_LocalModelCache_CRD
- Implementation:Kserve_Kserve_LocalModelNode_CRD
- Implementation:Kserve_Kserve_LocalModelNodeGroup_CRD
- Implementation:Kserve_Kserve_LocalModelNode_Agent_DaemonSet
- Implementation:Kserve_Kserve_LocalModel_Manager_Deployment