Principle: BerriAI LiteLLM Budget and Rate Limiting
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| litellm/router_strategy/budget_limiter.py | Cost Management, Rate Limiting | 2026-02-15 |
Overview
Budget and rate limiting is the practice of enforcing spend caps and throughput constraints per provider, deployment, and request tag to control LLM API costs and prevent resource exhaustion.
Description
When routing requests across multiple LLM providers and deployments, unconstrained spending can lead to unexpected costs. Budget and rate limiting addresses this at three budget levels, plus a throughput dimension:
- Provider-level budgets -- Set a maximum dollar spend per provider (e.g., $100/day for OpenAI, $200/week for Azure). When a provider's accumulated spend reaches its budget, all deployments from that provider are filtered out of the routing pool.
- Deployment-level budgets -- Set spend limits on individual deployments. This is useful when different deployments have different cost profiles or when you want to cap spend on expensive fine-tuned models.
- Tag-level budgets -- Set budgets scoped to request tags, allowing spend control by team, project, or use case.
- Throughput limits (TPM/RPM) -- Each deployment can declare its tokens-per-minute and requests-per-minute capacity, which the routing strategy uses to distribute load proportionally.
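These levels can be pictured as a small configuration structure. The dataclass and field names below are illustrative assumptions for this sketch, not litellm's actual config schema:

```python
from dataclasses import dataclass, field

@dataclass
class BudgetEntry:
    """One budget: a dollar cap over a rolling window (hypothetical shape)."""
    max_budget: float     # dollar cap, e.g. 100.0
    budget_duration: str  # window, e.g. "1d", "7d"

@dataclass
class BudgetConfig:
    """Budgets keyed by provider name, deployment id, and request tag."""
    provider_budgets: dict[str, BudgetEntry] = field(default_factory=dict)
    deployment_budgets: dict[str, BudgetEntry] = field(default_factory=dict)
    tag_budgets: dict[str, BudgetEntry] = field(default_factory=dict)

config = BudgetConfig(
    provider_budgets={
        "openai": BudgetEntry(max_budget=100.0, budget_duration="1d"),
        "azure": BudgetEntry(max_budget=200.0, budget_duration="7d"),
    },
    # Cap an expensive fine-tuned deployment independently of its provider
    deployment_budgets={"ft-gpt4-legal": BudgetEntry(max_budget=50.0, budget_duration="1d")},
    # Per-team spend control on a multi-tenant proxy
    tag_budgets={"team-search": BudgetEntry(max_budget=25.0, budget_duration="1d")},
)
```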
The budget limiter operates as a pre-call filter: it runs before deployment selection and removes any deployment whose provider, deployment, or tag budget has been exceeded. This means it composes cleanly with any routing strategy (simple-shuffle, latency-based, cost-based, etc.).
Spend tracking is maintained in a dual cache (in-memory + Redis) with time-windowed keys. A background task periodically syncs in-memory spend increments to Redis, ensuring that distributed proxy instances share a consistent view of spend without adding Redis latency to every request.
Usage
Use budget and rate limiting when:
- You need to enforce hard spending caps on LLM API providers to stay within organizational budgets.
- Different deployments have different cost allocations and you want per-deployment spend control.
- You run a multi-tenant proxy and need to limit spend per team or project via tags.
- You want budget enforcement to work seamlessly alongside any routing strategy.
Theoretical Basis
Budget limiting implements a Token Bucket variant where the "tokens" are dollars of spend, the "bucket capacity" is the budget limit, and the "refill interval" is the budget duration (e.g., daily, weekly).
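A minimal sketch of that variant, using an illustrative `DollarBucket` class (not part of litellm). One difference from a classic token bucket is the refill policy: rather than trickling tokens back continuously, the spend counter resets wholesale when the budget window rolls over:

```python
class DollarBucket:
    """Token-bucket analogue: 'tokens' are dollars, capacity is the budget,
    and the 'refill' is a full reset at each budget-window boundary."""

    def __init__(self, max_budget: float, window_s: int):
        self.max_budget = max_budget
        self.window_s = window_s
        self.spent = 0.0
        self.window_start = 0.0

    def _maybe_reset(self, now: float) -> None:
        # Daily/weekly budgets reset wholesale, unlike continuous refill.
        if now - self.window_start >= self.window_s:
            self.spent = 0.0
            self.window_start = now - (now % self.window_s)

    def allow(self, now: float) -> bool:
        self._maybe_reset(now)
        return self.spent < self.max_budget

    def record(self, cost: float, now: float) -> None:
        self._maybe_reset(now)
        self.spent += cost

bucket = DollarBucket(max_budget=100.0, window_s=86_400)
bucket.record(99.5, now=0.0)
still_ok = bucket.allow(now=1_000.0)          # under the $100 cap
bucket.record(1.0, now=2_000.0)
now_blocked = not bucket.allow(now=3_000.0)   # cap reached -> filtered out
next_day_ok = bucket.allow(now=90_000.0)      # window rolled over, spend reset
```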
Pseudocode for budget-based deployment filtering:
```
ASYNC FUNCTION filter_deployments_by_budget(healthy_deployments, budget_config, request_kwargs):
    // Step 1: Collect all cache keys needed for budget checks
    cache_keys = []
    provider_configs = {}
    deployment_configs = {}
    FOR EACH deployment IN healthy_deployments:
        provider = get_provider(deployment)  // e.g., "openai", "azure"
        IF provider IN budget_config.provider_budgets:
            config = budget_config.provider_budgets[provider]
            cache_keys.append("provider_spend:{provider}:{config.budget_duration}")
            provider_configs[provider] = config
        IF deployment.id IN budget_config.deployment_budgets:
            config = budget_config.deployment_budgets[deployment.id]
            cache_keys.append("deployment_spend:{deployment.id}:{config.budget_duration}")
            deployment_configs[deployment.id] = config
    // Tag keys must be collected too, or the tag lookups in Step 3 would always miss
    FOR EACH tag IN request_kwargs.tags:
        IF tag IN budget_config.tag_budgets:
            config = budget_config.tag_budgets[tag]
            cache_keys.append("tag_spend:{tag}:{config.budget_duration}")

    // Step 2: Batch-read all current spend values from cache
    current_spends = AWAIT cache.batch_get(cache_keys)
    spend_map = dict(zip(cache_keys, current_spends))

    // Step 3: Filter deployments
    eligible = []
    FOR EACH deployment IN healthy_deployments:
        within_budget = True

        // Check provider budget
        provider = get_provider(deployment)
        IF provider IN provider_configs:
            config = provider_configs[provider]
            spend = spend_map.get("provider_spend:{provider}:{config.budget_duration}", 0)
            IF spend >= config.max_budget:
                within_budget = False

        // Check deployment budget
        IF deployment.id IN deployment_configs:
            config = deployment_configs[deployment.id]
            spend = spend_map.get("deployment_spend:{deployment.id}:{config.budget_duration}", 0)
            IF spend >= config.max_budget:
                within_budget = False

        // Check tag budgets
        FOR EACH tag IN request_kwargs.tags:
            IF tag IN budget_config.tag_budgets:
                config = budget_config.tag_budgets[tag]
                spend = spend_map.get("tag_spend:{tag}:{config.budget_duration}", 0)
                IF spend >= config.max_budget:
                    within_budget = False

        IF within_budget:
            eligible.append(deployment)

    IF eligible IS EMPTY:
        RAISE "No deployments available - crossed budget"
    RETURN eligible
```
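The filtering pseudocode above can be condensed into runnable Python. The `InMemoryCache` stand-in and the flat `budgets` key-to-limit map (with durations folded into the keys) are simplifying assumptions for this sketch, not litellm's actual classes:

```python
import asyncio

class InMemoryCache:
    """Stand-in for the dual cache; only a batch read is sketched."""
    def __init__(self, data=None):
        self.data = dict(data or {})

    async def async_batch_get(self, keys):
        return [self.data.get(k) for k in keys]

async def filter_deployments_by_budget(healthy_deployments, budgets, cache, tags=()):
    """Drop deployments whose provider, deployment, or tag budget is exhausted.

    `budgets` maps a spend cache key -> max dollar budget; the keys mirror
    what the spend tracker writes (an assumption for this sketch).
    """
    # One batched cache read up front, instead of one read per deployment
    keys = list(budgets)
    spends = await cache.async_batch_get(keys)
    spend_map = dict(zip(keys, spends))

    def over(key):
        limit = budgets.get(key)
        return limit is not None and (spend_map.get(key) or 0.0) >= limit

    eligible = []
    for dep in healthy_deployments:
        if over(f"provider_spend:{dep['provider']}"):
            continue
        if over(f"deployment_spend:{dep['id']}"):
            continue
        if any(over(f"tag_spend:{t}") for t in tags):
            continue
        eligible.append(dep)
    if not eligible:
        raise RuntimeError("No deployments available - crossed budget")
    return eligible

deployments = [{"id": "gpt4-primary", "provider": "openai"},
               {"id": "azure-fallback", "provider": "azure"}]
budgets = {"provider_spend:openai": 100.0, "provider_spend:azure": 200.0}
cache = InMemoryCache({"provider_spend:openai": 120.0,  # OpenAI over budget
                       "provider_spend:azure": 50.0})
eligible = asyncio.run(filter_deployments_by_budget(deployments, budgets, cache))
```

Because the filter only removes deployments from `healthy_deployments` and raises when none survive, any downstream routing strategy can consume `eligible` unchanged.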
Pseudocode for spend tracking (on success callback):
```
ASYNC FUNCTION on_completion_success(response_cost, provider, deployment_id, tags):
    // Increment in-memory counters immediately (no Redis call on the hot path)
    increment_queue.append(("provider_spend:{provider}:{duration}", response_cost))
    increment_queue.append(("deployment_spend:{deployment_id}:{duration}", response_cost))
    FOR EACH tag IN tags:
        increment_queue.append(("tag_spend:{tag}:{duration}", response_cost))
    // A background task periodically flushes the queue to Redis,
    // batching Redis writes for performance
```
The key architectural insight is the separation between spend tracking (on the success-callback path, non-blocking) and budget enforcement (on the pre-call filter path, in the request's critical path). This keeps the request hot path efficient while maintaining accurate spend accounting.
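That separation can be sketched as follows. `SpendTracker` and its method names are hypothetical, and a plain dict stands in for Redis:

```python
import asyncio
from collections import defaultdict

class SpendTracker:
    """Non-blocking spend tracking: the success callback only updates an
    in-memory pending map; a background task batches the accumulated
    increments into the shared store (Redis in litellm; a dict here)."""

    def __init__(self, shared_store, flush_interval_s=5.0):
        self.shared_store = shared_store  # stands in for Redis
        self.flush_interval_s = flush_interval_s
        self.pending = defaultdict(float)

    def on_completion_success(self, response_cost, provider, deployment_id, tags):
        # Hot path: O(1) dict updates, no network I/O.
        self.pending[f"provider_spend:{provider}"] += response_cost
        self.pending[f"deployment_spend:{deployment_id}"] += response_cost
        for tag in tags:
            self.pending[f"tag_spend:{tag}"] += response_cost

    def flush(self):
        # One batched write per key instead of one round trip per request.
        for key, delta in self.pending.items():
            self.shared_store[key] = self.shared_store.get(key, 0.0) + delta
        self.pending.clear()

    async def run_flush_loop(self):
        while True:
            await asyncio.sleep(self.flush_interval_s)
            self.flush()

store = {}
tracker = SpendTracker(store)
tracker.on_completion_success(0.02, "openai", "gpt4-primary", ["team-search"])
tracker.on_completion_success(0.03, "openai", "gpt4-primary", [])
tracker.flush()  # in production, run_flush_loop() would do this periodically
```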