Principle:Bentoml BentoML Model Loading From Store
| Principle Metadata | |
|---|---|
| Principle Name | Model Loading From Store |
| Workflow | Model_Store_Management |
| Domain | ML_Serving, Model_Management |
| Related Principle | Principle:Bentoml_BentoML_Model_Persistence |
| Implemented By | Implementation:Bentoml_BentoML_BentoModel_Descriptor |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
Model Loading From Store is the principle of resolving and loading saved model artifacts from the BentoML local model store into running services. It uses a descriptor-based approach that enables declarative model dependencies with lazy resolution, automatic cloud pull, and seamless integration with the service lifecycle.
Core Concept
Loading models from the BentoML store into services requires a mechanism that is both declarative (models are specified at class definition time) and lazy (resolution happens at runtime when the model is actually needed). The BentoModel descriptor pattern achieves this by acting as a proxy that resolves the actual model artifact on first access.
Theory
BentoModel provides a descriptor-based approach to referencing models by tag. When accessed on a service instance, it lazily resolves the model from the local store or automatically pulls it from BentoCloud. This provides a declarative way to specify model dependencies.
The key aspects of this approach are:
- Descriptor Pattern: `BentoModel` uses Python's descriptor protocol. When declared as a class attribute on a service, it intercepts attribute access to trigger model resolution. This means the model tag is declared once, and actual loading is deferred until the service needs it.
- Lazy Resolution: The model is not loaded when the service class is defined or even when it is instantiated. Resolution occurs when the model attribute is first accessed, allowing the system to defer expensive I/O until it is truly needed.
- Automatic Cloud Pull: If the model is not found in the local store, `BentoModel` can automatically pull it from BentoCloud. This eliminates the need for manual `bentoml.models.pull()` calls in deployment scripts and simplifies CI/CD pipelines.
- Store-Centric Resolution: Unlike loading models directly from external sources (e.g., HuggingFace Hub), this principle centers on the BentoML model store as the canonical source. Models must first be saved to the store (via `bentoml.models.create()`) before they can be loaded via `BentoModel`.
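The descriptor mechanics described above can be sketched in plain Python. This is an illustrative stand-in, not BentoML's actual implementation: the `LOCAL_STORE` dict, `pull_from_cloud` fallback, and per-instance cache attribute are all simplified assumptions.

```python
class LazyModelRef:
    """Illustrative descriptor: resolves a model tag on first access."""

    def __init__(self, tag: str):
        self.tag = tag

    def __set_name__(self, owner, name):
        # Name of the per-instance cache slot for the resolved model.
        self.attr = f"_{name}_resolved"

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class-level access returns the descriptor itself
        # Cache on the instance so resolution runs exactly once.
        if not hasattr(instance, self.attr):
            setattr(instance, self.attr, self._resolve())
        return getattr(instance, self.attr)

    def _resolve(self):
        # Try the local store first; fall back to a simulated cloud pull.
        model = LOCAL_STORE.get(self.tag)
        if model is None:
            model = pull_from_cloud(self.tag)  # hypothetical fallback
            LOCAL_STORE[self.tag] = model      # cache the pulled model locally
        return model


LOCAL_STORE = {"my_classifier:latest": {"path": "/models/my_classifier/1"}}

def pull_from_cloud(tag):
    """Stand-in for a BentoCloud pull."""
    return {"path": f"/cloud-cache/{tag}"}

class MyService:
    model = LazyModelRef("my_classifier:latest")

svc = MyService()
print(svc.model["path"])  # → /models/my_classifier/1 (resolved on first access)
```

Note that defining `MyService` or calling `MyService()` performs no I/O; only the first read of `svc.model` does, which is the lazy-resolution property the real descriptor relies on.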
Distinction From External Model Loading
This principle is distinct from Model Loading for Serving (e.g., loading from HuggingFace). The key differences are:
| Aspect | Model Loading From Store | External Model Loading |
|---|---|---|
| Source | BentoML local store | External provider (HuggingFace, etc.) |
| Mechanism | `BentoModel` descriptor with tag resolution | Framework-specific loaders |
| Versioning | BentoML tag-based (`name:version`) | Provider-specific versioning |
| Offline Support | Fully offline from local store | Requires network for first download |
| Cloud Fallback | Auto-pulls from BentoCloud | Provider-dependent caching |
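The tag-based, store-centric resolution in the left column can be illustrated with a short sketch. The store layout (a name-to-versions mapping) and the "latest" handling here are simplified assumptions for illustration, not BentoML internals:

```python
def resolve_tag(store: dict, tag: str) -> str:
    """Resolve a 'name:version' tag against a store shaped as
    {name: {version: artifact_path}}; 'latest' picks the newest version."""
    name, _, version = tag.partition(":")
    versions = store.get(name)
    if not versions:
        raise KeyError(f"model {name!r} not found in store")
    if version in ("", "latest"):
        version = max(versions)  # simplification: lexicographically newest
    if version not in versions:
        raise KeyError(f"version {version!r} of {name!r} not found")
    return versions[version]

store = {
    "my_classifier": {
        "v1": "/models/my_classifier/v1",
        "v2": "/models/my_classifier/v2",
    }
}
print(resolve_tag(store, "my_classifier:latest"))  # → /models/my_classifier/v2
```

Because every lookup goes through the store, the same tag resolves identically whether the artifact arrived via a local save or a cloud pull, which is what makes fully offline serving possible.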
Design Principles
Declarative Dependencies
Models are declared as class-level attributes, making it immediately clear which models a service depends on:
```python
import bentoml

@bentoml.service
class MyService:
    model = bentoml.models.BentoModel("my_classifier:latest")
```
Transparent Resolution
The resolution process (local lookup, optional cloud pull) is transparent to the service code. The service simply accesses `self.model` and receives a resolved model with a `.path` to the artifact files.
Immutable References
Once resolved, the model reference is fixed for the lifetime of the service instance, ensuring consistent behavior across requests.
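The fixed-reference behavior can be sketched as a handle that snapshots the resolved artifact on first access. The `ModelHandle` and `ResolvedModel` names are hypothetical, chosen only to illustrate the design property:

```python
class ResolvedModel:
    """Snapshot of a store entry taken at resolution time."""
    def __init__(self, path: str):
        self.path = path

class ModelHandle:
    def __init__(self, store: dict, tag: str):
        self._store, self._tag = store, tag
        self._resolved = None

    @property
    def model(self) -> ResolvedModel:
        if self._resolved is None:           # resolve exactly once
            self._resolved = ResolvedModel(self._store[self._tag])
        return self._resolved

store = {"clf:latest": "/models/clf/v1"}
handle = ModelHandle(store, "clf:latest")
first = handle.model.path                    # resolves to /models/clf/v1
store["clf:latest"] = "/models/clf/v2"       # the store moves on...
assert handle.model.path == first            # ...but the handle stays fixed
```

Pinning the reference this way means every request served by one service instance sees the same artifact, even if the "latest" tag is re-pointed mid-flight; picking up a newer version requires a new service instance.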
Relationship to Other Principles
- Model Persistence: Models must be persisted before they can be loaded from the store.
- Model Cloud Sync: BentoModel leverages push/pull to resolve models not found locally.
- Model Versioning: The tag-based resolution uses the versioning system to find the correct model.