Principle:Tensorflow Serving Servable Caching

Knowledge Sources	Tensorflow_Serving
Domains	Model Serving, Core Framework
Last Updated	2026-02-13 00:00 GMT

Overview

The Servable Caching principle defines a pull-based, on-demand loading strategy where servables are loaded upon first request and cached for subsequent access.

Description

While the standard TensorFlow Serving pipeline uses a push-based model (Sources proactively tell the Manager which versions to load), the Caching pattern provides a complementary pull-based model. Servables are loaded lazily on first access rather than eagerly at startup.

The CachingManager delegates actual servable management to a BasicManager and uses a pluggable LoaderFactory to create loaders on demand. The key flow is:

1. A request arrives for a servable.
2. The CachingManager checks if the BasicManager already has it loaded.
3. If yes, the handle is returned immediately.
4. If not, the LoaderFactory creates a loader, which is transferred to the BasicManager for management and loading.
5. The request blocks until loading completes, then the handle is returned.
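
The flow above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not TensorFlow Serving code: LazyServableCache, fake_loader_factory, and the "servable:<name>" strings are hypothetical stand-ins for the CachingManager, the LoaderFactory, and real servable handles.

```python
from typing import Callable, Dict, List


class LazyServableCache:
    """Hypothetical stand-in for the CachingManager + BasicManager pair."""

    def __init__(self, create_loader: Callable[[str], Callable[[], object]]):
        self._create_loader = create_loader   # plays the role of the LoaderFactory
        self._loaded: Dict[str, object] = {}  # plays the role of the BasicManager

    def get(self, name: str) -> object:
        # Already loaded: return the handle immediately.
        if name in self._loaded:
            return self._loaded[name]
        # Not loaded: create a loader on demand and run it.
        loader = self._create_loader(name)
        self._loaded[name] = loader()  # the caller blocks while this runs
        return self._loaded[name]


load_calls: List[str] = []  # records every real load, for illustration


def fake_loader_factory(name: str) -> Callable[[], object]:
    def load() -> object:
        load_calls.append(name)
        return f"servable:{name}"
    return load


cache = LazyServableCache(fake_loader_factory)
first = cache.get("model_a")   # triggers the load
second = cache.get("model_a")  # served from the cache, no second load
```

Only the first call pays the loading cost; the second returns the object that is already stored. Concurrency is deliberately left out of this sketch.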

A key concurrency concern is handled through a per-servable mutex map: when multiple requests arrive simultaneously for the same unloaded servable, only one thread performs the actual load while the others block until it completes. The mutex entries are reference-counted and removed from the map once no request references them.
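
One way to picture the mutex map is the following Python sketch. It is an illustration under stated assumptions, not the library's implementation: PerKeyLocks, get_or_load, and the explicit integer reference count are hypothetical, chosen only to show per-key locking with cleanup.

```python
import threading
from typing import Dict, List


class PerKeyLocks:
    """Hypothetical reference-counted map holding one lock per key."""

    def __init__(self) -> None:
        self._guard = threading.Lock()        # protects the map itself
        self._entries: Dict[str, List] = {}   # key -> [lock, reference count]

    def acquire(self, key: str):
        with self._guard:
            entry = self._entries.get(key)
            if entry is None:
                entry = [threading.Lock(), 0]
                self._entries[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, key: str) -> None:
        with self._guard:
            entry = self._entries[key]
            entry[1] -= 1
            if entry[1] == 0:  # last reference released: drop the entry
                del self._entries[key]

    def size(self) -> int:
        with self._guard:
            return len(self._entries)


locks = PerKeyLocks()
loaded: Dict[str, str] = {}
load_count = 0


def get_or_load(key: str) -> str:
    global load_count
    mu = locks.acquire(key)
    try:
        with mu:  # only one thread per key is inside this block at a time
            if key not in loaded:  # re-check under the per-key lock
                load_count += 1
                loaded[key] = f"servable:{key}"
            return loaded[key]
    finally:
        locks.release(key)


threads = [threading.Thread(target=get_or_load, args=("model_a",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Requests for different keys take different locks, so a slow load of one servable does not hold up loads of others. Once all eight threads finish, the map is empty again.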

The PathPrefixLoaderFactory provides a simple concrete factory that maps servable names to file system paths by concatenating a prefix with the name.
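
The name-to-path mapping can be illustrated as follows. The helper servable_path and the example values are hypothetical, and the use of a "/" separator between prefix and name is an assumption of this sketch rather than something stated above.

```python
import posixpath


def servable_path(path_prefix: str, servable_name: str) -> str:
    """Hypothetical helper: derive a servable's path from a prefix and its name."""
    # posixpath.join adds a "/" between the parts when the prefix
    # does not already end in one.
    return posixpath.join(path_prefix, servable_name)


path = servable_path("/models", "resnet")
```

Because the path is derived purely from the name, a factory of this kind needs no registry of known servables; any name under the prefix can be requested.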

Usage

Apply the caching pattern when the set of servables is large or unknown at startup, and only a subset will be actively used. It is also useful when servables should be loaded on a just-in-time basis rather than pre-loaded. Note that the first request for each servable incurs loading latency.

Theoretical Basis

The caching pattern implements a lazy-loading cache with per-key synchronization:

GetServable(request):
  handle = basic_manager.Get(request)
  if handle found: return handle  // fast path: already loaded

  // Slow path: resolve the version, then build a loader on demand.
  version = factory.GetVersion(request.name, policy)
  servable_id = {request.name, version}
  loader_data = factory.CreateLoader(servable_id)
  LoadServable(servable_id, loader_data)  // blocks until loaded

  return basic_manager.Get(request)

LoadServable(servable_id, loader_data):
  mu = GetOrCreateMutex(servable_id)
  lock(mu)
  snapshot = basic_manager.GetSnapshot(servable_id)
  if snapshot exists and state == Ready:
    // already loaded by another thread; nothing to do
  else:
    basic_manager.Manage(loader_data)
    basic_manager.Load(servable_id)  // synchronous wait
  unlock(mu)
  CleanupMutex(servable_id)  // drop the map entry once unreferenced

Key design properties:

Pull-based vs. push-based: Unlike AspiredVersionsManager (which reacts to Source notifications), CachingManager reacts to client requests.
Per-servable locking: The mutex map ensures one-loader-per-servable semantics without a global lock that would serialize all loads.
Reference-counted mutex cleanup: Mutex entries are removed when the last reference is released, preventing unbounded map growth.
Delegation to BasicManager: Reuses BasicManager for the actual load/unload/resource-tracking machinery, following the composition-over-inheritance principle.
Error propagation: LoaderFactory errors are embedded in ServableData and propagated through the BasicManager's standard error handling, enabling EventBus monitoring.
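
The error-propagation property can be sketched as a result object that carries either a loader or an error. Everything here is hypothetical and only loosely modeled on the description above: LoaderResult stands in for ServableData, the events list stands in for an event bus, and create_loader and manage_and_load are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class LoaderResult:
    """Hypothetical analogue of ServableData: holds a loader or an error."""
    servable_name: str
    loader: Optional[Callable[[], object]] = None
    error: Optional[str] = None


def create_loader(name: str, known: Set[str]) -> LoaderResult:
    # The factory does not raise; it records the failure in the result.
    if name not in known:
        return LoaderResult(name, error=f"no servable named {name!r}")
    return LoaderResult(name, loader=lambda: f"servable:{name}")


# Stand-in for an event bus that monitoring code could subscribe to.
events: List[Tuple[str, str, Optional[str]]] = []


def manage_and_load(result: LoaderResult) -> Optional[object]:
    # The manager, not the factory, reports the outcome in one place.
    if result.error is not None:
        events.append((result.servable_name, "error", result.error))
        return None
    servable = result.loader()
    events.append((result.servable_name, "ready", None))
    return servable


ok = manage_and_load(create_loader("model_a", {"model_a"}))
failed = manage_and_load(create_loader("missing", {"model_a"}))
```

Routing factory failures through the same path as load failures means a monitor sees a single stream of state changes, whatever the source of the error.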
