Principle:Bentoml BentoML Model Loading From Store
| Principle Metadata | |
|---|---|
| Principle Name | Model Loading From Store |
| Workflow | Model_Store_Management |
| Domain | ML_Serving, Model_Management |
| Related Principle | Principle:Bentoml_BentoML_Model_Persistence |
| Implemented By | Implementation:Bentoml_BentoML_BentoModel_Descriptor |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
Model Loading From Store is the principle of resolving and loading saved model artifacts from the BentoML local model store into running services. It uses a descriptor-based approach that enables declarative model dependencies with lazy resolution, automatic cloud pull, and seamless integration with the service lifecycle.
Core Concept
Loading models from the BentoML store into services requires a mechanism that is both declarative (models are specified at class definition time) and lazy (resolution happens at runtime when the model is actually needed). The BentoModel descriptor pattern achieves this by acting as a proxy that resolves the actual model artifact on first access.
Theory
BentoModel provides a descriptor-based approach to referencing models by tag. When accessed on a service instance, it lazily resolves the model from the local store or automatically pulls it from BentoCloud. This provides a declarative way to specify model dependencies.
The key aspects of this approach are:
- Descriptor Pattern: `BentoModel` uses Python's descriptor protocol. When declared as a class attribute on a service, it intercepts attribute access to trigger model resolution. This means the model tag is declared once, and actual loading is deferred until the service needs it.
- Lazy Resolution: The model is not loaded when the service class is defined or even when it is instantiated. Resolution occurs when the model attribute is first accessed, allowing the system to defer expensive I/O until it is truly needed.
- Automatic Cloud Pull: If the model is not found in the local store, `BentoModel` can automatically pull it from BentoCloud. This eliminates the need for manual `bentoml.models.pull()` calls in deployment scripts and simplifies CI/CD pipelines.
- Store-Centric Resolution: Unlike loading models directly from external sources (e.g., HuggingFace Hub), this principle centers on the BentoML model store as the canonical source. Models must first be saved to the store (via `bentoml.models.create()`) before they can be loaded via `BentoModel`.
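The descriptor mechanics described above can be sketched in plain Python. This is an illustrative stand-in, not BentoML's actual implementation: the `LOCAL_STORE` dict, `pull_from_cloud` fallback, and per-instance cache attribute are all simplified assumptions.

```python
class LazyModelRef:
    """Illustrative descriptor: resolves a model tag on first access."""

    def __init__(self, tag: str):
        self.tag = tag

    def __set_name__(self, owner, name):
        # Name of the per-instance cache slot for the resolved model.
        self.attr = f"_{name}_resolved"

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class-level access returns the descriptor itself
        # Cache on the instance so resolution runs exactly once.
        if not hasattr(instance, self.attr):
            setattr(instance, self.attr, self._resolve())
        return getattr(instance, self.attr)

    def _resolve(self):
        # Try the local store first; fall back to a simulated cloud pull.
        model = LOCAL_STORE.get(self.tag)
        if model is None:
            model = pull_from_cloud(self.tag)  # hypothetical fallback
            LOCAL_STORE[self.tag] = model      # cache the pulled model locally
        return model


LOCAL_STORE = {"my_classifier:latest": {"path": "/models/my_classifier/1"}}

def pull_from_cloud(tag):
    """Stand-in for a BentoCloud pull."""
    return {"path": f"/cloud-cache/{tag}"}

class MyService:
    model = LazyModelRef("my_classifier:latest")

svc = MyService()
print(svc.model["path"])  # → /models/my_classifier/1 (resolved on first access)
```

Note that defining `MyService` or calling `MyService()` performs no I/O; only the first read of `svc.model` does, which is the lazy-resolution property the real descriptor relies on.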
Distinction From External Model Loading
This principle is distinct from Model Loading for Serving (e.g., loading from HuggingFace). The key differences are:
| Aspect | Model Loading From Store | External Model Loading |
|---|---|---|
| Source | BentoML local store | External provider (HuggingFace, etc.) |
| Mechanism | `BentoModel` descriptor with tag resolution | Framework-specific loaders |
| Versioning | BentoML tag-based (`name:version`) | Provider-specific versioning |
| Offline Support | Fully offline from local store | Requires network for first download |
| Cloud Fallback | Auto-pulls from BentoCloud | Provider-dependent caching |
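The tag-based, store-centric resolution in the left column can be illustrated with a short sketch. The store layout (a name-to-versions mapping) and the "latest" handling here are simplified assumptions for illustration, not BentoML internals:

```python
def resolve_tag(store: dict, tag: str) -> str:
    """Resolve a 'name:version' tag against a store shaped as
    {name: {version: artifact_path}}; 'latest' picks the newest version."""
    name, _, version = tag.partition(":")
    versions = store.get(name)
    if not versions:
        raise KeyError(f"model {name!r} not found in store")
    if version in ("", "latest"):
        version = max(versions)  # simplification: lexicographically newest
    if version not in versions:
        raise KeyError(f"version {version!r} of {name!r} not found")
    return versions[version]

store = {
    "my_classifier": {
        "v1": "/models/my_classifier/v1",
        "v2": "/models/my_classifier/v2",
    }
}
print(resolve_tag(store, "my_classifier:latest"))  # → /models/my_classifier/v2
```

Because every lookup goes through the store, the same tag resolves identically whether the artifact arrived via a local save or a cloud pull, which is what makes fully offline serving possible.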
Design Principles
Declarative Dependencies
Models are declared as class-level attributes, making it immediately clear which models a service depends on:
```python
import bentoml

@bentoml.service
class MyService:
    model = bentoml.models.BentoModel("my_classifier:latest")
```
Transparent Resolution
The resolution process (local lookup, optional cloud pull) is transparent to the service code. The service simply accesses `self.model` and receives a resolved model with a `.path` to the artifact files.
Immutable References
Once resolved, the model reference is fixed for the lifetime of the service instance, ensuring consistent behavior across requests.
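The fixed-reference behavior can be sketched as a handle that snapshots the resolved artifact on first access. The `ModelHandle` and `ResolvedModel` names are hypothetical, chosen only to illustrate the design property:

```python
class ResolvedModel:
    """Snapshot of a store entry taken at resolution time."""
    def __init__(self, path: str):
        self.path = path

class ModelHandle:
    def __init__(self, store: dict, tag: str):
        self._store, self._tag = store, tag
        self._resolved = None

    @property
    def model(self) -> ResolvedModel:
        if self._resolved is None:           # resolve exactly once
            self._resolved = ResolvedModel(self._store[self._tag])
        return self._resolved

store = {"clf:latest": "/models/clf/v1"}
handle = ModelHandle(store, "clf:latest")
first = handle.model.path                    # resolves to /models/clf/v1
store["clf:latest"] = "/models/clf/v2"       # the store moves on...
assert handle.model.path == first            # ...but the handle stays fixed
```

Pinning the reference this way means every request served by one service instance sees the same artifact, even if the "latest" tag is re-pointed mid-flight; picking up a newer version requires a new service instance.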
Relationship to Other Principles
- Model Persistence: Models must be persisted before they can be loaded from the store.
- Model Cloud Sync: BentoModel leverages push/pull to resolve models not found locally.
- Model Versioning: The tag-based resolution uses the versioning system to find the correct model.