
Heuristic:BerriAI Litellm Streaming Loop Detection

From Leeroopedia
Knowledge Sources
Domains LLM_Gateway, Debugging
Last Updated 2026-02-15 16:00 GMT

Overview

LiteLLM detects streaming-response loops with a 100-repeated-chunk threshold, and paces cached streaming responses with a 20ms per-chunk delay, catching infinite model loops while avoiding false positives.

Description

During streaming responses, some LLM providers can enter a pathological state where they repeatedly emit the same token or chunk indefinitely. LiteLLM detects this by tracking consecutive repeated chunks and flagging when the count exceeds a threshold. Additionally, when serving cached streaming responses, a small artificial delay is inserted between chunks to simulate natural streaming behavior and prevent overwhelming the client.

Usage

This heuristic is automatically applied during all streaming responses. The 100-chunk threshold is intentionally high to avoid false positives (some models legitimately repeat tokens, especially for formatted output). Adjust `REPEATED_STREAMING_CHUNK_LIMIT` if you see false positives or want earlier detection.
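The detection described above can be sketched as a wrapper around a chunk iterator. This is a minimal illustration, not LiteLLM's actual implementation; the wrapper function and exception name are hypothetical, and only the limit value of 100 mirrors the documented `REPEATED_STREAMING_CHUNK_LIMIT` constant.

```python
REPEATED_STREAMING_CHUNK_LIMIT = 100  # documented default


class StreamingLoopError(RuntimeError):
    """Hypothetical error raised when a model loops on one chunk."""


def detect_streaming_loop(chunks, limit=REPEATED_STREAMING_CHUNK_LIMIT):
    """Yield chunks, raising once the same chunk appears `limit` times in a row."""
    last, repeats = None, 0
    for chunk in chunks:
        if chunk == last:
            repeats += 1
            if repeats >= limit:
                raise StreamingLoopError(
                    f"chunk repeated {repeats} times consecutively"
                )
        else:
            # Different chunk: reset the consecutive-repeat counter.
            last, repeats = chunk, 1
        yield chunk
```

Because the counter resets on any variation, legitimate near-identical output (table rows, list bullets) only trips the check if the chunks are byte-for-byte identical for the full run.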

The Insight (Rule of Thumb)

  • Loop Detection: `REPEATED_STREAMING_CHUNK_LIMIT=100` consecutive identical chunks before flagging.
  • Cached Streaming: `CACHED_STREAMING_CHUNK_DELAY=0.02` (20ms between cached chunks).
  • Audio Streaming: `AUDIO_SPEECH_CHUNK_SIZE=8192` (8KB chunks for text-to-speech).
  • Trade-off: A lower chunk limit catches loops faster but risks flagging legitimate repetition. The 20ms cached delay adds slight latency but provides a smooth user experience.
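Slicing an audio payload into the 8KB pieces mentioned above can be sketched as follows; the helper is hypothetical, and only the chunk size mirrors the documented `AUDIO_SPEECH_CHUNK_SIZE` constant.

```python
AUDIO_SPEECH_CHUNK_SIZE = 8192  # 8KB chunks for text-to-speech audio


def iter_audio_chunks(audio_bytes, size=AUDIO_SPEECH_CHUNK_SIZE):
    """Slice a raw audio payload into fixed-size chunks for streaming."""
    for offset in range(0, len(audio_bytes), size):
        yield audio_bytes[offset:offset + size]
```

A smaller chunk size lowers time-to-first-byte but increases per-chunk overhead; 8KB sits between the two, per the trade-off noted above.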

Reasoning

The 100-chunk threshold was chosen based on production observation:

  1. Legitimate repetition can occur in formatted output (e.g., table rows, bullet lists, code blocks with repeated patterns) and may produce 10-50 identical or near-identical chunks.
  2. True infinite loops typically produce hundreds or thousands of identical chunks with no variation.
  3. 100 is a safe midpoint: it catches real loops within seconds (at ~50 chunks/second, detection happens in ~2 seconds) while almost never triggering on legitimate output.
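The detection-latency arithmetic above checks out directly (the 50 chunks/second rate is the document's illustrative figure, not a measured constant):

```python
chunk_rate = 50   # chunks per second (illustrative figure from the text)
limit = 100       # REPEATED_STREAMING_CHUNK_LIMIT
detection_seconds = limit / chunk_rate
print(detection_seconds)  # → 2.0
```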

The 20ms cached chunk delay simulates real-time streaming for cached responses, preventing clients from receiving the entire response in a single burst, which can cause rendering issues in chat UIs.
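Replaying a cached response with that pacing can be sketched with `asyncio`; the replay function is hypothetical, and only the delay value mirrors the documented `CACHED_STREAMING_CHUNK_DELAY` constant.

```python
import asyncio

CACHED_STREAMING_CHUNK_DELAY = 0.02  # 20ms between cached chunks


async def replay_cached_stream(cached_chunks, delay=CACHED_STREAMING_CHUNK_DELAY):
    """Yield cached chunks with an artificial delay to mimic live streaming."""
    for chunk in cached_chunks:
        yield chunk
        # Pace the replay so the client renders it like a live stream.
        await asyncio.sleep(delay)


async def main():
    async for chunk in replay_cached_stream(["Hello", ", ", "world"]):
        print(chunk, end="")

asyncio.run(main())
```

At 20ms per chunk, a 200-chunk cached response replays in about four seconds instead of arriving in one burst.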

Code Evidence

Streaming constants from `litellm/constants.py:287-341`:

REPEATED_STREAMING_CHUNK_LIMIT = 100
# catch if model starts looping the same chunk while streaming.
# Uses high default to prevent false positives.

CACHED_STREAMING_CHUNK_DELAY = 0.02  # 20ms

AUDIO_SPEECH_CHUNK_SIZE = 8192
# Balance between latency and memory usage
