Principle:Tensorflow Serving Model Warmup
| Knowledge Sources | |
|---|---|
| Domains | Performance, Deployment |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
A pre-serving optimization that executes sample inference requests during model loading to trigger lazy initializations (JIT compilation, memory allocation, XLA optimizations) before real traffic arrives.
Description
Model warmup addresses the "cold start" problem: the first inference requests to a freshly loaded model often have significantly higher latency due to lazy initializations in the TensorFlow runtime. These include:
- TF graph optimization: First-run graph optimizations and kernel selection
- XLA compilation: JIT compilation of computation kernels for the specific input shapes
- Memory allocation: Pre-allocation of GPU memory and scratch buffers
- Per-batch-size kernels: With batching enabled, kernels may be compiled separately for each allowed batch size, so warmup can run at every allowed size to cover them all
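The effect of these one-time costs can be illustrated with a toy model. The sketch below is not TensorFlow code: LazyModel, warm_up, and the latency figures are hypothetical and exist only to show why a request at a batch size not seen before is slow, and why warming every allowed batch size removes that penalty.

```python
class LazyModel:
    """Toy stand-in for a model that pays a one-time setup cost per batch size."""

    def __init__(self, compile_ms=500.0, run_ms=5.0):
        self.compile_ms = compile_ms  # hypothetical one-time cost per batch size
        self.run_ms = run_ms          # hypothetical steady-state cost per request
        self.compiled_sizes = set()

    def predict(self, batch):
        """Return (outputs, simulated latency in milliseconds)."""
        latency = self.run_ms
        if len(batch) not in self.compiled_sizes:
            # First request at this batch size triggers the lazy initialization.
            latency += self.compile_ms
            self.compiled_sizes.add(len(batch))
        return [2 * x for x in batch], latency


def warm_up(model, allowed_batch_sizes):
    """Send one dummy request per allowed batch size before real traffic."""
    for size in allowed_batch_sizes:
        model.predict([0.0] * size)
```

With the default figures, the first call at a given batch size reports 505.0 ms and later calls at that size report 5.0 ms. After warm_up(model, [1, 2, 4]), requests at any of those sizes start at the steady-state figure.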
Warmup requests are stored in a TFRecord file at assets.extra/tf_serving_warmup_requests within the SavedModel directory. The file contains serialized PredictionLog protos with sample requests.
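As a minimal sketch of that layout, the helper below builds the expected warmup path for one SavedModel directory and checks whether the file is present. The function names are hypothetical, and the example path in the usage note is an assumption; only the assets.extra/tf_serving_warmup_requests suffix comes from the text above.

```python
from pathlib import Path

WARMUP_SUBDIR = "assets.extra"
WARMUP_FILENAME = "tf_serving_warmup_requests"


def warmup_file_path(saved_model_dir):
    """Return the location where a warmup file is expected for this SavedModel."""
    return Path(saved_model_dir) / WARMUP_SUBDIR / WARMUP_FILENAME


def has_warmup_file(saved_model_dir):
    """True if the SavedModel directory already contains a warmup file."""
    return warmup_file_path(saved_model_dir).is_file()
```

For example, warmup_file_path("/models/my_model/1") points at /models/my_model/1/assets.extra/tf_serving_warmup_requests. A check like has_warmup_file can run in an export pipeline to catch a missing file before deployment.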
Usage
Warmup is enabled by default (--enable_model_warmup=true); to use it, include a warmup file in your SavedModel export. This matters most for production deployments where first-request latency is critical. A warmup file may contain at most 1000 records.
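One way to guard against the record limit is a pre-deployment check that counts the records in a warmup file. The sketch below walks the TFRecord framing directly (an 8-byte little-endian length, a 4-byte length checksum, the payload, and a 4-byte payload checksum) using only the standard library. It does not verify checksums or parse the protos. The function names are hypothetical, and treating an oversized file as an error is an assumption of this sketch.

```python
import struct

MAX_WARMUP_RECORDS = 1000  # limit quoted in the text above


def count_tfrecords(stream):
    """Count records in a binary TFRecord stream without verifying checksums."""
    count = 0
    while True:
        header = stream.read(8)  # little-endian uint64 payload length
        if not header:
            return count
        if len(header) != 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<Q", header)
        # Remaining framing: 4-byte length CRC, payload, 4-byte payload CRC.
        rest = stream.read(4 + length + 4)
        if len(rest) != length + 8:
            raise ValueError("truncated record")
        count += 1


def check_warmup_stream(stream, limit=MAX_WARMUP_RECORDS):
    """Return the record count, or raise ValueError if it exceeds the limit."""
    count = count_tfrecords(stream)
    if count > limit:
        raise ValueError(f"{count} warmup records exceeds the limit of {limit}")
    return count
```

A typical call opens the file in binary mode and passes the handle, for example with open(path, "rb") as f: check_warmup_stream(f).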
Theoretical Basis
# Abstract warmup process (NOT real implementation)
def warmup_model(saved_model_path, bundle):
    warmup_file = f"{saved_model_path}/assets.extra/tf_serving_warmup_requests"
    if not exists(warmup_file):
        return  # OK: warmup is optional
    for record in read_tfrecord(warmup_file, max_records=1000):
        prediction_log = parse_prediction_log(record)
        if prediction_log.type == PREDICT:
            run_predict(bundle.session, prediction_log.predict_request)
        elif prediction_log.type == CLASSIFY:
            run_classify(bundle.session, prediction_log.classify_request)
        # ... etc for REGRESS, MULTI_INFERENCE
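The dispatch pattern in the abstract process above can also be written as a small runnable toy. This is a sketch under stated assumptions, not TensorFlow Serving code: records are plain (log_type, request) tuples instead of parsed PredictionLog protos, the handlers are caller-supplied callables, and replay_warmup is a hypothetical name. It raises when the record limit is exceeded, which is a choice made for this sketch.

```python
MAX_WARMUP_RECORDS = 1000  # limit quoted in the Usage section


def replay_warmup(records, handlers, max_records=MAX_WARMUP_RECORDS):
    """Replay (log_type, request) pairs through per-type handlers.

    Returns the number of records replayed. Raises ValueError on an
    unsupported log type or when more than max_records are supplied.
    """
    replayed = 0
    for log_type, request in records:
        if replayed >= max_records:
            raise ValueError(f"more than {max_records} warmup records")
        handler = handlers.get(log_type)
        if handler is None:
            raise ValueError(f"unsupported log type: {log_type!r}")
        handler(request)  # run the sample request; the result is discarded
        replayed += 1
    return replayed
```

A dispatch table such as {"predict": run_predict, "classify": run_classify} keeps each request type's handling separate, so adding a type means adding one entry rather than another branch.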