Heuristic: BerriAI LiteLLM Streaming Loop Detection
| Knowledge Sources | |
|---|---|
| Domains | LLM_Gateway, Debugging |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Streaming response loop detection uses a 100-repeated-chunk threshold to catch infinite model loops while avoiding false positives, plus a 20ms delay between cached chunks to simulate natural streaming.
Description
During streaming responses, some LLM providers can enter a pathological state where they repeatedly emit the same token or chunk indefinitely. LiteLLM detects this by tracking consecutive repeated chunks and flagging when the count exceeds a threshold. Additionally, when serving cached streaming responses, a small artificial delay is inserted between chunks to simulate natural streaming behavior and prevent overwhelming the client.
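The repeated-chunk check can be sketched as a counter over consecutive identical chunks. This is a minimal illustration, not LiteLLM's actual implementation; the function name `detect_streaming_loop` is hypothetical, and only the `REPEATED_STREAMING_CHUNK_LIMIT` constant comes from the source.

```python
REPEATED_STREAMING_CHUNK_LIMIT = 100  # from litellm/constants.py

def detect_streaming_loop(chunks):
    """Yield chunks, raising if one repeats past the limit (hypothetical sketch)."""
    last_chunk = None
    repeat_count = 0
    for chunk in chunks:
        if chunk == last_chunk:
            repeat_count += 1
            if repeat_count >= REPEATED_STREAMING_CHUNK_LIMIT:
                raise RuntimeError(
                    f"Loop detected: chunk repeated {repeat_count} times"
                )
        else:
            # New content resets the counter, so legitimate bursts of
            # repetition interleaved with fresh tokens never trip the limit.
            last_chunk = chunk
            repeat_count = 1
        yield chunk
```

Because the counter resets on any new chunk, only an unbroken run of identical chunks can reach the threshold.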
Usage
This heuristic is automatically applied during all streaming responses. The 100-chunk threshold is intentionally high to avoid false positives (some models legitimately repeat tokens, especially for formatted output). Adjust `REPEATED_STREAMING_CHUNK_LIMIT` if you see false positives or want earlier detection.
The Insight (Rule of Thumb)
- Loop Detection: `REPEATED_STREAMING_CHUNK_LIMIT=100` consecutive identical chunks before flagging.
- Cached Streaming: `CACHED_STREAMING_CHUNK_DELAY=0.02` (20ms between cached chunks).
- Audio Streaming: `AUDIO_SPEECH_CHUNK_SIZE=8192` (8KB chunks for text-to-speech).
- Trade-off: A lower chunk limit catches loops faster but risks flagging legitimate repetition. The 20ms cached delay adds slight latency but provides a smooth user experience.
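The audio chunk size above is a plain fixed-size split of the response body. A generic sketch, assuming a binary stream interface; `iter_audio_chunks` and the 20 KB payload are illustrative, only `AUDIO_SPEECH_CHUNK_SIZE` is from the source:

```python
import io

AUDIO_SPEECH_CHUNK_SIZE = 8192  # 8 KB per chunk, from litellm/constants.py

def iter_audio_chunks(stream, chunk_size=AUDIO_SPEECH_CHUNK_SIZE):
    """Yield fixed-size chunks from a binary audio stream (hypothetical sketch)."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Example: a fake 20 KB audio payload splits into two full chunks and a remainder.
audio = io.BytesIO(b"\x00" * 20000)
sizes = [len(c) for c in iter_audio_chunks(audio)]
# sizes → [8192, 8192, 3616]
```

Larger chunks mean fewer network writes but more memory held per chunk and a longer wait before the first byte reaches the client; 8 KB is the middle ground the constants file names.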
Reasoning
The 100-chunk threshold was chosen based on production observation:
- Legitimate repetition can occur in formatted output (e.g., table rows, bullet lists, code blocks with repeated patterns) and may produce 10-50 identical or near-identical chunks.
- True infinite loops typically produce hundreds or thousands of identical chunks with no variation.
- 100 sits safely between these: it catches real loops within seconds (at ~50 chunks/second, detection takes ~2 seconds) while almost never triggering on legitimate output.
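The detection-latency figure is simple arithmetic; the ~50 chunks/second rate is the assumed streaming speed from the text, not a measured value:

```python
# Detection latency = chunk limit / streaming rate.
chunk_limit = 100        # REPEATED_STREAMING_CHUNK_LIMIT
chunks_per_second = 50   # assumed typical streaming rate

detection_seconds = chunk_limit / chunks_per_second
# detection_seconds → 2.0

# Halving the limit would halve detection time but overlap the
# 10-50 chunk range that legitimate formatted output can produce.
lower_limit_seconds = 50 / chunks_per_second
# lower_limit_seconds → 1.0
```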
The 20ms cached chunk delay simulates real-time streaming for cached responses, preventing clients from receiving the entire response in a single burst, which can cause rendering issues in chat UIs.
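Pacing a cached response amounts to sleeping briefly between yields. A minimal async sketch, not LiteLLM's actual code; `stream_cached_response` is a hypothetical name, and only the delay constant comes from the source:

```python
import asyncio

CACHED_STREAMING_CHUNK_DELAY = 0.02  # 20ms, from litellm/constants.py

async def stream_cached_response(cached_chunks):
    """Yield cached chunks with a small delay to mimic live streaming (sketch)."""
    for chunk in cached_chunks:
        yield chunk
        await asyncio.sleep(CACHED_STREAMING_CHUNK_DELAY)

async def main():
    # A client consuming this stream sees chunks arrive ~20ms apart
    # instead of the whole cached body landing in one burst.
    out = []
    async for chunk in stream_cached_response(["Hello", ", ", "world"]):
        out.append(chunk)
    return out
```

At 20ms per chunk, a 100-chunk cached response takes about 2 seconds to replay, which is slow enough for chat UIs to render incrementally but still far faster than a fresh model call.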
Code Evidence
Streaming constants from `litellm/constants.py:287-341`:
```python
# Catch if model starts looping the same chunk while streaming.
# Uses a high default to prevent false positives.
REPEATED_STREAMING_CHUNK_LIMIT = 100

CACHED_STREAMING_CHUNK_DELAY = 0.02  # 20ms between cached chunks

# Balance between latency and memory usage
AUDIO_SPEECH_CHUNK_SIZE = 8192
```