Principle: BerriAI LiteLLM Budget and Rate Limiting
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| litellm/router_strategy/budget_limiter.py | Cost Management, Rate Limiting | 2026-02-15 |
Overview
Budget and rate limiting is the practice of enforcing spend caps and throughput constraints per provider, deployment, and request tag to control LLM API costs and prevent resource exhaustion.
Description
When routing requests across multiple LLM providers and deployments, unconstrained spending can lead to unexpected costs. Budget and rate limiting addresses this at three budget levels, plus a throughput dimension:
- Provider-level budgets -- Set a maximum dollar spend per provider (e.g., $100/day for OpenAI, $200/week for Azure). When a provider's accumulated spend reaches its budget, all deployments from that provider are filtered out of the routing pool.
- Deployment-level budgets -- Set spend limits on individual deployments. This is useful when different deployments have different cost profiles or when you want to cap spend on expensive fine-tuned models.
- Tag-level budgets -- Set budgets scoped to request tags, allowing spend control by team, project, or use case.
- Throughput limits (TPM/RPM) -- Each deployment can declare its tokens-per-minute and requests-per-minute capacity, which the routing strategy uses to distribute load proportionally.
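These levels can be pictured as a small configuration structure. The dataclass and field names below are illustrative assumptions for this sketch, not litellm's actual config schema:

```python
from dataclasses import dataclass, field

@dataclass
class BudgetEntry:
    """One budget: a dollar cap over a rolling window (hypothetical shape)."""
    max_budget: float     # dollar cap, e.g. 100.0
    budget_duration: str  # window, e.g. "1d", "7d"

@dataclass
class BudgetConfig:
    """Budgets keyed by provider name, deployment id, and request tag."""
    provider_budgets: dict[str, BudgetEntry] = field(default_factory=dict)
    deployment_budgets: dict[str, BudgetEntry] = field(default_factory=dict)
    tag_budgets: dict[str, BudgetEntry] = field(default_factory=dict)

config = BudgetConfig(
    provider_budgets={
        "openai": BudgetEntry(max_budget=100.0, budget_duration="1d"),
        "azure": BudgetEntry(max_budget=200.0, budget_duration="7d"),
    },
    # Cap an expensive fine-tuned deployment independently of its provider
    deployment_budgets={"ft-gpt4-legal": BudgetEntry(max_budget=50.0, budget_duration="1d")},
    # Per-team spend control on a multi-tenant proxy
    tag_budgets={"team-search": BudgetEntry(max_budget=25.0, budget_duration="1d")},
)
```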
The budget limiter operates as a pre-call filter: it runs before deployment selection and removes any deployment whose provider, deployment, or tag budget has been exceeded. This means it composes cleanly with any routing strategy (simple-shuffle, latency-based, cost-based, etc.).
Spend tracking is maintained in a dual cache (in-memory + Redis) with time-windowed keys. A background task periodically syncs in-memory spend increments to Redis, ensuring that distributed proxy instances share a consistent view of spend without adding Redis latency to every request.
Usage
Use budget and rate limiting when:
- You need to enforce hard spending caps on LLM API providers to stay within organizational budgets.
- Different deployments have different cost allocations and you want per-deployment spend control.
- You run a multi-tenant proxy and need to limit spend per team or project via tags.
- You want budget enforcement to work seamlessly alongside any routing strategy.
Theoretical Basis
Budget limiting implements a Token Bucket variant where the "tokens" are dollars of spend, the "bucket capacity" is the budget limit, and the "refill interval" is the budget duration (e.g., daily, weekly).
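A minimal sketch of that variant, using an illustrative `DollarBucket` class (not part of litellm). One difference from a classic token bucket is the refill policy: rather than trickling tokens back continuously, the spend counter resets wholesale when the budget window rolls over:

```python
class DollarBucket:
    """Token-bucket analogue: 'tokens' are dollars, capacity is the budget,
    and the 'refill' is a full reset at each budget-window boundary."""

    def __init__(self, max_budget: float, window_s: int):
        self.max_budget = max_budget
        self.window_s = window_s
        self.spent = 0.0
        self.window_start = 0.0

    def _maybe_reset(self, now: float) -> None:
        # Daily/weekly budgets reset wholesale, unlike continuous refill.
        if now - self.window_start >= self.window_s:
            self.spent = 0.0
            self.window_start = now - (now % self.window_s)

    def allow(self, now: float) -> bool:
        self._maybe_reset(now)
        return self.spent < self.max_budget

    def record(self, cost: float, now: float) -> None:
        self._maybe_reset(now)
        self.spent += cost

bucket = DollarBucket(max_budget=100.0, window_s=86_400)
bucket.record(99.5, now=0.0)
still_ok = bucket.allow(now=1_000.0)          # under the $100 cap
bucket.record(1.0, now=2_000.0)
now_blocked = not bucket.allow(now=3_000.0)   # cap reached -> filtered out
next_day_ok = bucket.allow(now=90_000.0)      # window rolled over, spend reset
```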
Pseudocode for budget-based deployment filtering:
```
ASYNC FUNCTION filter_deployments_by_budget(healthy_deployments, budget_config, request_kwargs):
    // Step 1: Collect all cache keys needed for budget checks
    cache_keys = []
    provider_configs = {}
    deployment_configs = {}
    FOR EACH deployment IN healthy_deployments:
        provider = get_provider(deployment)  // e.g., "openai", "azure"
        IF provider IN budget_config.provider_budgets:
            config = budget_config.provider_budgets[provider]
            cache_keys.append("provider_spend:{provider}:{config.budget_duration}")
            provider_configs[provider] = config
        IF deployment.id IN budget_config.deployment_budgets:
            config = budget_config.deployment_budgets[deployment.id]
            cache_keys.append("deployment_spend:{deployment.id}:{config.budget_duration}")
            deployment_configs[deployment.id] = config
    // Tag keys must be collected too, or the tag lookups in Step 3 would always miss
    FOR EACH tag IN request_kwargs.tags:
        IF tag IN budget_config.tag_budgets:
            config = budget_config.tag_budgets[tag]
            cache_keys.append("tag_spend:{tag}:{config.budget_duration}")

    // Step 2: Batch-read all current spend values from cache
    current_spends = AWAIT cache.batch_get(cache_keys)
    spend_map = dict(zip(cache_keys, current_spends))

    // Step 3: Filter deployments
    eligible = []
    FOR EACH deployment IN healthy_deployments:
        within_budget = True

        // Check provider budget
        provider = get_provider(deployment)
        IF provider IN provider_configs:
            config = provider_configs[provider]
            spend = spend_map.get("provider_spend:{provider}:{config.budget_duration}", 0)
            IF spend >= config.max_budget:
                within_budget = False

        // Check deployment budget
        IF deployment.id IN deployment_configs:
            config = deployment_configs[deployment.id]
            spend = spend_map.get("deployment_spend:{deployment.id}:{config.budget_duration}", 0)
            IF spend >= config.max_budget:
                within_budget = False

        // Check tag budgets
        FOR EACH tag IN request_kwargs.tags:
            IF tag IN budget_config.tag_budgets:
                config = budget_config.tag_budgets[tag]
                spend = spend_map.get("tag_spend:{tag}:{config.budget_duration}", 0)
                IF spend >= config.max_budget:
                    within_budget = False

        IF within_budget:
            eligible.append(deployment)

    IF eligible IS EMPTY:
        RAISE "No deployments available - crossed budget"
    RETURN eligible
```
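The filtering pseudocode above can be condensed into runnable Python. The `InMemoryCache` stand-in and the flat `budgets` key-to-limit map (with durations folded into the keys) are simplifying assumptions for this sketch, not litellm's actual classes:

```python
import asyncio

class InMemoryCache:
    """Stand-in for the dual cache; only a batch read is sketched."""
    def __init__(self, data=None):
        self.data = dict(data or {})

    async def async_batch_get(self, keys):
        return [self.data.get(k) for k in keys]

async def filter_deployments_by_budget(healthy_deployments, budgets, cache, tags=()):
    """Drop deployments whose provider, deployment, or tag budget is exhausted.

    `budgets` maps a spend cache key -> max dollar budget; the keys mirror
    what the spend tracker writes (an assumption for this sketch).
    """
    # One batched cache read up front, instead of one read per deployment
    keys = list(budgets)
    spends = await cache.async_batch_get(keys)
    spend_map = dict(zip(keys, spends))

    def over(key):
        limit = budgets.get(key)
        return limit is not None and (spend_map.get(key) or 0.0) >= limit

    eligible = []
    for dep in healthy_deployments:
        if over(f"provider_spend:{dep['provider']}"):
            continue
        if over(f"deployment_spend:{dep['id']}"):
            continue
        if any(over(f"tag_spend:{t}") for t in tags):
            continue
        eligible.append(dep)
    if not eligible:
        raise RuntimeError("No deployments available - crossed budget")
    return eligible

deployments = [{"id": "gpt4-primary", "provider": "openai"},
               {"id": "azure-fallback", "provider": "azure"}]
budgets = {"provider_spend:openai": 100.0, "provider_spend:azure": 200.0}
cache = InMemoryCache({"provider_spend:openai": 120.0,  # OpenAI over budget
                       "provider_spend:azure": 50.0})
eligible = asyncio.run(filter_deployments_by_budget(deployments, budgets, cache))
```

Because the filter only removes deployments from `healthy_deployments` and raises when none survive, any downstream routing strategy can consume `eligible` unchanged.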
Pseudocode for spend tracking (on success callback):
```
ASYNC FUNCTION on_completion_success(response_cost, provider, deployment_id, tags):
    // Increment in-memory counters immediately (no Redis call on the hot path)
    increment_queue.append(("provider_spend:{provider}:{duration}", response_cost))
    increment_queue.append(("deployment_spend:{deployment_id}:{duration}", response_cost))
    FOR EACH tag IN tags:
        increment_queue.append(("tag_spend:{tag}:{duration}", response_cost))
    // A background task periodically flushes the queue to Redis,
    // batching Redis writes for performance
```
The key architectural insight is the separation between spend tracking (on the success-callback path, non-blocking) and budget enforcement (on the pre-call filter path, in the request's critical path). This keeps the request hot path efficient while maintaining accurate spend accounting.
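That separation can be sketched as follows. `SpendTracker` and its method names are hypothetical, and a plain dict stands in for Redis:

```python
import asyncio
from collections import defaultdict

class SpendTracker:
    """Non-blocking spend tracking: the success callback only updates an
    in-memory pending map; a background task batches the accumulated
    increments into the shared store (Redis in litellm; a dict here)."""

    def __init__(self, shared_store, flush_interval_s=5.0):
        self.shared_store = shared_store  # stands in for Redis
        self.flush_interval_s = flush_interval_s
        self.pending = defaultdict(float)

    def on_completion_success(self, response_cost, provider, deployment_id, tags):
        # Hot path: O(1) dict updates, no network I/O.
        self.pending[f"provider_spend:{provider}"] += response_cost
        self.pending[f"deployment_spend:{deployment_id}"] += response_cost
        for tag in tags:
            self.pending[f"tag_spend:{tag}"] += response_cost

    def flush(self):
        # One batched write per key instead of one round trip per request.
        for key, delta in self.pending.items():
            self.shared_store[key] = self.shared_store.get(key, 0.0) + delta
        self.pending.clear()

    async def run_flush_loop(self):
        while True:
            await asyncio.sleep(self.flush_interval_s)
            self.flush()

store = {}
tracker = SpendTracker(store)
tracker.on_completion_success(0.02, "openai", "gpt4-primary", ["team-search"])
tracker.on_completion_success(0.03, "openai", "gpt4-primary", [])
tracker.flush()  # in production, run_flush_loop() would do this periodically
```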