Heuristic: MLflow Prompt Cache Tuning

Knowledge Sources	MLflow Environment Variables
Domains	Optimization, Prompt_Management
Last Updated	2026-02-13 20:00 GMT

Overview

Performance tuning for the MLflow prompt version cache: LRU cache size, configurable TTLs, and the different cache behavior of alias-based versus version-based lookups.

Description

MLflow caches loaded prompt versions in an LRU cache to avoid repeated network requests to the tracking server. The cache behavior differs for alias-based lookups (which may change over time and need shorter TTLs) versus version-based lookups (which are immutable and can be cached indefinitely). Tuning these cache parameters is important for production applications that load prompts at high frequency.
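The combination of LRU eviction and per-entry TTL described above can be sketched in a few lines. This is a minimal illustration, not MLflow's actual implementation: the class name `TTLLRUCache`, its methods, and the injectable `clock` parameter are all hypothetical.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Illustrative LRU cache with a per-entry TTL (hypothetical, not MLflow's class)."""

    def __init__(self, max_size=128, clock=time.monotonic):
        self.max_size = max_size
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it so the caller refetches.
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, ttl_seconds):
        if ttl_seconds <= 0:
            return  # a TTL of 0 is treated here as "do not cache"
        self._entries[key] = (value, self._clock() + ttl_seconds)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

An entry stored with `ttl_seconds=float("inf")` never expires by time in this sketch and leaves the cache only through LRU eviction, which mirrors the default for version-based lookups.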

Usage

Use this heuristic when you are loading prompts at high frequency in production applications and need to balance freshness with performance. This is particularly relevant for serving endpoints that use `mlflow.genai.load_prompt()` to fetch prompt templates on every request.

The Insight (Rule of Thumb)

Action: Configure prompt cache size and TTL via environment variables based on your access pattern.
Value:
- `MLFLOW_PROMPT_CACHE_MAX_SIZE` = 128 (default, max prompt versions in LRU cache)
- `MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS` = 60 (default, TTL for alias-based lookups)
- `MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS` = infinity (default, version-based lookups never expire)
Trade-off: Shorter alias TTL means more frequent fetches but fresher prompts. Larger cache size uses more memory but reduces network calls. Version-based lookups are safe to cache indefinitely since versions are immutable.
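A back-of-envelope estimate can make the trade-off concrete. The sketch below assumes a single process, a single cached prompt reference, and no LRU evictions; the function name `max_fetches_per_hour` is hypothetical and the result is an upper bound, not a measurement.

```python
import math


def max_fetches_per_hour(ttl_seconds, requests_per_second):
    """Upper bound on tracking-server fetches per hour for one cached reference."""
    requests = requests_per_second * 3600
    if ttl_seconds <= 0:
        return requests  # caching disabled: every request fetches
    if math.isinf(ttl_seconds):
        return min(requests, 1)  # fetched once, then served from cache
    # At most one refetch per TTL window, and never more than the request count.
    return min(requests, 3600 / ttl_seconds)
```

Under these assumptions, halving the alias TTL from 60 to 30 seconds doubles the bound from 60 to 120 fetches per hour, which is still small next to the request volume of a busy endpoint.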

Configuration:

# High-frequency serving: increase cache, short alias TTL
export MLFLOW_PROMPT_CACHE_MAX_SIZE=256
export MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS=30

# Development: disable caching for immediate prompt updates
export MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS=0
export MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS=0
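
To confirm which values a process will actually see, it can help to log the effective settings at startup. The helper below is a hypothetical sketch that reads the three variables with the defaults listed above; it does not call MLflow and does not reproduce MLflow's own parsing.

```python
import os


def read_prompt_cache_settings(environ=os.environ):
    """Return the effective prompt cache settings (hypothetical helper)."""
    return {
        "max_size": int(environ.get("MLFLOW_PROMPT_CACHE_MAX_SIZE", 128)),
        "alias_ttl": float(environ.get("MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS", 60)),
        # float("inf") parses the string "inf", matching the documented default.
        "version_ttl": float(
            environ.get("MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS", "inf")
        ),
    }


if __name__ == "__main__":
    print(read_prompt_cache_settings())
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.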

Reasoning

The prompt caching system uses two different strategies based on the lookup type:

Alias-based lookups (e.g., `load_prompt("prompts:/my-prompt@production")`) reference a mutable alias that can be repointed to a different version at any time. The default 60-second TTL balances freshness with performance: after an alias is repointed, a process may keep serving the previously cached version for up to 60 seconds.
Version-based lookups (e.g., `load_prompt("prompts:/my-prompt/1")`) reference immutable versions. The default infinite TTL is safe because a specific version number always returns the same content.
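
The staleness window for alias lookups can be illustrated with a toy simulation. Everything here is hypothetical: `registry` stands in for the tracking server, `resolve` for an alias lookup, and time is passed in explicitly so the behavior is deterministic.

```python
registry = {"production": 1}  # alias -> version number (mutable on the server)
cache = {}                    # alias -> (version, expires_at)


def resolve(alias, now, ttl=60.0):
    """Resolve an alias to a version, using a TTL cache (illustrative only)."""
    hit = cache.get(alias)
    if hit is not None and now < hit[1]:
        return hit[0]  # served from cache, possibly stale
    version = registry[alias]  # stands in for a tracking-server request
    cache[alias] = (version, now + ttl)
    return version


first = resolve("production", now=0)   # cache miss, fetches version 1
registry["production"] = 2             # alias repointed shortly afterwards
stale = resolve("production", now=30)  # within the TTL: still version 1
fresh = resolve("production", now=61)  # TTL expired: refetches version 2
```

A version-based lookup has no equivalent staleness risk, because the mapping from version number to content never changes.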

Code evidence from `mlflow/environment_variables.py`:

MLFLOW_PROMPT_CACHE_MAX_SIZE = _EnvironmentVariable(
    "MLFLOW_PROMPT_CACHE_MAX_SIZE", int, 128
)
MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS = _EnvironmentVariable(
    "MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS", float, 60
)
MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS = _EnvironmentVariable(
    "MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS", float, float("inf")
)
