Principle:Tensorflow Serving Servable Caching
| Knowledge Sources | |
|---|---|
| Domains | Model Serving, Core Framework |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The Servable Caching principle defines a pull-based, on-demand loading strategy where servables are loaded upon first request and cached for subsequent access.
Description
While the standard TensorFlow Serving pipeline uses a push-based model (Sources proactively tell the Manager which versions to load), the Caching pattern provides a complementary pull-based model. Servables are loaded lazily on first access rather than eagerly at startup.
The CachingManager delegates actual servable management to a BasicManager and uses a pluggable LoaderFactory to create loaders on demand. The key flow is:
- A request arrives for a servable.
- The CachingManager checks if the BasicManager already has it loaded.
- If yes, the handle is returned immediately.
- If not, the LoaderFactory creates a loader, which is transferred to the BasicManager for management and loading.
- The request blocks until loading completes, then the handle is returned.
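The flow above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not TensorFlow Serving's implementation: `BasicManagerSketch`, `CachingManagerSketch`, `demo_factory`, and all method names are hypothetical, and versions are ignored for brevity.

```python
from typing import Callable, Dict, List


class BasicManagerSketch:
    """Hypothetical stand-in for BasicManager: holds already-loaded servables."""

    def __init__(self) -> None:
        self._loaded: Dict[str, object] = {}

    def get(self, name: str):
        # Returns the loaded servable, or None if it has not been loaded.
        return self._loaded.get(name)

    def manage_and_load(self, name: str, loader: Callable[[], object]) -> None:
        # Takes ownership of the loader and runs it synchronously.
        self._loaded[name] = loader()


class CachingManagerSketch:
    """Hypothetical pull-based manager: loads a servable on first request."""

    def __init__(self, loader_factory: Callable[[str], Callable[[], object]]) -> None:
        self._basic = BasicManagerSketch()
        self._loader_factory = loader_factory

    def get_servable(self, name: str):
        handle = self._basic.get(name)
        if handle is not None:
            return handle  # cache hit: no load needed
        loader = self._loader_factory(name)        # create a loader on demand
        self._basic.manage_and_load(name, loader)  # caller waits for the load
        return self._basic.get(name)


load_calls: List[str] = []


def demo_factory(name: str) -> Callable[[], object]:
    # Hypothetical factory: records each load so the caching effect is visible.
    def loader() -> object:
        load_calls.append(name)
        return {"name": name}
    return loader


manager = CachingManagerSketch(demo_factory)
first = manager.get_servable("model_a")   # triggers the load
second = manager.get_servable("model_a")  # served from the cache
```

In this sketch the second call returns the same object as the first, and the loader runs only once.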
Concurrent access is handled through a per-servable mutex map: when multiple requests arrive simultaneously for the same unloaded servable, only one thread performs the load while the others block until it completes. The mutex entries are reference-counted and removed from the map once no thread holds a reference to them.
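One way to picture the per-servable mutex map is the following Python sketch. It is an illustration under stated assumptions rather than the actual implementation: the class `PerKeyLoader`, its methods, and `slow_load` are hypothetical, and unlike the flow described above, every call takes the per-key lock instead of checking an underlying manager first.

```python
import threading
import time
from typing import Callable, Dict, List


class PerKeyLoader:
    """Hypothetical sketch: at most one load per key, with ref-counted locks."""

    def __init__(self, load_fn: Callable[[str], object]) -> None:
        self._load_fn = load_fn
        self._mu = threading.Lock()          # guards both dicts; held only briefly
        self._loaded: Dict[str, object] = {}
        self._locks: Dict[str, list] = {}    # key -> [lock, reference count]

    def _acquire_entry(self, key: str) -> threading.Lock:
        with self._mu:
            entry = self._locks.setdefault(key, [threading.Lock(), 0])
            entry[1] += 1
            return entry[0]

    def _release_entry(self, key: str) -> None:
        with self._mu:
            entry = self._locks[key]
            entry[1] -= 1
            if entry[1] == 0:
                del self._locks[key]  # last reference gone: drop the map entry

    def get(self, key: str):
        key_mu = self._acquire_entry(key)
        try:
            with key_mu:  # serializes loads of the same key only
                with self._mu:
                    value = self._loaded.get(key)
                if value is None:
                    value = self._load_fn(key)  # slow work, outside self._mu
                    with self._mu:
                        self._loaded[key] = value
                return value
        finally:
            self._release_entry(key)

    def pending_lock_count(self) -> int:
        with self._mu:
            return len(self._locks)


load_calls: List[str] = []


def slow_load(key: str) -> str:
    time.sleep(0.05)  # simulate an expensive load
    load_calls.append(key)
    return f"servable:{key}"


loader = PerKeyLoader(slow_load)
results: List[object] = []
threads = [
    threading.Thread(target=lambda: results.append(loader.get("model_a")))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight threads request the same key, yet `slow_load` runs once; the other threads block on the per-key lock and then read the cached value. After all threads finish, the lock map is empty again.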
The PathPrefixLoaderFactory provides a simple concrete factory that maps servable names to file system paths by concatenating a prefix with the name.
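The name-to-path mapping might look like the sketch below. The class `PathPrefixFactorySketch` and its method are hypothetical, and the sketch assumes that the prefix and the name are joined with a path separator; it shows only the path derivation, not loader creation.

```python
import posixpath


class PathPrefixFactorySketch:
    """Hypothetical sketch of a factory that derives a path from a servable name."""

    def __init__(self, path_prefix: str) -> None:
        self._path_prefix = path_prefix

    def servable_path(self, servable_name: str) -> str:
        # Assumption: prefix and name are joined with a single "/" separator.
        return posixpath.join(self._path_prefix, servable_name)


factory = PathPrefixFactorySketch("/models")
path = factory.servable_path("model_a")  # "/models/model_a"
```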
Usage
Apply the caching pattern when the set of servables is large or unknown at startup, and only a subset will be actively used. It is also useful when servables should be loaded on a just-in-time basis rather than pre-loaded. Note that the first request for each servable incurs loading latency.
Theoretical Basis
The caching pattern implements a lazy-loading cache with per-key synchronization:
GetServable(request):
    handle = basic_manager.Get(request)
    if handle found: return handle
    version = factory.GetVersion(request.name, policy)
    loader_data = factory.CreateLoader({request.name, version})
    LoadServable(loader_data):
        mu = GetOrCreateMutex(servable_id)
        lock(mu)
        snapshot = basic_manager.GetSnapshot(servable_id)
        if snapshot exists and state == Ready:
            return OK  // already loaded by another thread
        basic_manager.Manage(loader_data)
        basic_manager.Load(servable_id)  // synchronous wait
        CleanupMutex(servable_id)
    return basic_manager.Get(request)
Key design properties:
- Pull-based vs. push-based: Unlike AspiredVersionsManager (which reacts to Source notifications), CachingManager reacts to client requests.
- Per-servable locking: The mutex map ensures that each servable is loaded at most once, without a global lock that would serialize loads of unrelated servables.
- Reference-counted mutex cleanup: Mutex entries are removed when the last reference is released, preventing unbounded map growth.
- Delegation to BasicManager: Reuses BasicManager for the actual load/unload/resource-tracking machinery, following the composition-over-inheritance principle.
- Error propagation: LoaderFactory errors are embedded in ServableData and propagated through the BasicManager's standard error handling, enabling EventBus monitoring.
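The error-propagation property can be illustrated with a small Python sketch in which a factory failure travels as data instead of being raised. Everything here is hypothetical (`ServableDataSketch`, `create_loader`, `ManagerSketch`, and the plain list standing in for an event bus); it only mirrors the idea of embedding the error in the object handed to the manager.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ServableDataSketch:
    """Hypothetical stand-in for ServableData: carries a loader or an error."""
    servable_id: str
    loader: Optional[Callable[[], object]] = None
    error: Optional[str] = None


def create_loader(servable_id: str) -> ServableDataSketch:
    # Hypothetical factory: a failure is returned as data rather than raised.
    if servable_id.startswith("missing"):
        return ServableDataSketch(servable_id, error=f"no such servable: {servable_id}")
    return ServableDataSketch(servable_id, loader=lambda: f"servable:{servable_id}")


class ManagerSketch:
    """Hypothetical manager that records state changes for observers."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str, Optional[str]]] = []  # stands in for an event bus
        self.loaded: Dict[str, object] = {}

    def manage_and_load(self, data: ServableDataSketch) -> bool:
        if data.error is not None:
            # The embedded error goes through the same reporting path as load errors.
            self.events.append((data.servable_id, "error", data.error))
            return False
        self.loaded[data.servable_id] = data.loader()
        self.events.append((data.servable_id, "available", None))
        return True


manager = ManagerSketch()
ok = manager.manage_and_load(create_loader("model_a"))
failed = manager.manage_and_load(create_loader("missing_model"))
```

Because both outcomes pass through the manager, a monitor that watches `manager.events` sees the factory failure alongside normal state changes.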