Heuristic: BerriAI LiteLLM Streaming Loop Detection
| Knowledge Sources | |
|---|---|
| Domains | LLM_Gateway, Debugging |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Streaming response loop detection uses a 100-repeated-chunk threshold to catch infinite model loops while avoiding false positives, plus a 20ms delay between cached chunks to simulate natural streaming.
Description
During streaming responses, some LLM providers can enter a pathological state where they repeatedly emit the same token or chunk indefinitely. LiteLLM detects this by tracking consecutive repeated chunks and flagging when the count exceeds a threshold. Additionally, when serving cached streaming responses, a small artificial delay is inserted between chunks to simulate natural streaming behavior and prevent overwhelming the client.
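The repeated-chunk check can be sketched as a counter over consecutive identical chunks. This is a minimal illustration, not LiteLLM's actual implementation; the function name `detect_streaming_loop` is hypothetical, and only the `REPEATED_STREAMING_CHUNK_LIMIT` constant comes from the source.

```python
REPEATED_STREAMING_CHUNK_LIMIT = 100  # from litellm/constants.py

def detect_streaming_loop(chunks):
    """Yield chunks, raising if one repeats past the limit (hypothetical sketch)."""
    last_chunk = None
    repeat_count = 0
    for chunk in chunks:
        if chunk == last_chunk:
            repeat_count += 1
            if repeat_count >= REPEATED_STREAMING_CHUNK_LIMIT:
                raise RuntimeError(
                    f"Loop detected: chunk repeated {repeat_count} times"
                )
        else:
            # New content resets the counter, so legitimate bursts of
            # repetition interleaved with fresh tokens never trip the limit.
            last_chunk = chunk
            repeat_count = 1
        yield chunk
```

Because the counter resets on any new chunk, only an unbroken run of identical chunks can reach the threshold.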
Usage
This heuristic is automatically applied during all streaming responses. The 100-chunk threshold is intentionally high to avoid false positives (some models legitimately repeat tokens, especially for formatted output). Adjust `REPEATED_STREAMING_CHUNK_LIMIT` if you see false positives or want earlier detection.
The Insight (Rule of Thumb)
- Loop Detection: `REPEATED_STREAMING_CHUNK_LIMIT=100` consecutive identical chunks before flagging.
- Cached Streaming: `CACHED_STREAMING_CHUNK_DELAY=0.02` (20ms between cached chunks).
- Audio Streaming: `AUDIO_SPEECH_CHUNK_SIZE=8192` (8KB chunks for text-to-speech).
- Trade-off: A lower chunk limit catches loops faster but risks flagging legitimate repetition. The 20ms cached delay adds slight latency but provides a smooth user experience.
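The audio chunk size above is a plain fixed-size split of the response body. A generic sketch, assuming a binary stream interface; `iter_audio_chunks` and the 20 KB payload are illustrative, only `AUDIO_SPEECH_CHUNK_SIZE` is from the source:

```python
import io

AUDIO_SPEECH_CHUNK_SIZE = 8192  # 8 KB per chunk, from litellm/constants.py

def iter_audio_chunks(stream, chunk_size=AUDIO_SPEECH_CHUNK_SIZE):
    """Yield fixed-size chunks from a binary audio stream (hypothetical sketch)."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Example: a fake 20 KB audio payload splits into two full chunks and a remainder.
audio = io.BytesIO(b"\x00" * 20000)
sizes = [len(c) for c in iter_audio_chunks(audio)]
# sizes → [8192, 8192, 3616]
```

Larger chunks mean fewer network writes but more memory held per chunk and a longer wait before the first byte reaches the client; 8 KB is the middle ground the constants file names.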
Reasoning
The 100-chunk threshold was chosen based on production observation:
- Legitimate repetition can occur in formatted output (e.g., table rows, bullet lists, code blocks with repeated patterns) and may produce 10-50 identical or near-identical chunks.
- True infinite loops typically produce hundreds or thousands of identical chunks with no variation.
- 100 sits safely between these: it catches real loops within seconds (at ~50 chunks/second, detection takes ~2 seconds) while almost never triggering on legitimate output.
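The detection-latency figure is simple arithmetic; the ~50 chunks/second rate is the assumed streaming speed from the text, not a measured value:

```python
# Detection latency = chunk limit / streaming rate.
chunk_limit = 100        # REPEATED_STREAMING_CHUNK_LIMIT
chunks_per_second = 50   # assumed typical streaming rate

detection_seconds = chunk_limit / chunks_per_second
# detection_seconds → 2.0

# Halving the limit would halve detection time but overlap the
# 10-50 chunk range that legitimate formatted output can produce.
lower_limit_seconds = 50 / chunks_per_second
# lower_limit_seconds → 1.0
```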
The 20ms cached chunk delay simulates real-time streaming for cached responses, preventing clients from receiving the entire response in a single burst, which can cause rendering issues in chat UIs.
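Pacing a cached response amounts to sleeping briefly between yields. A minimal async sketch, not LiteLLM's actual code; `stream_cached_response` is a hypothetical name, and only the delay constant comes from the source:

```python
import asyncio

CACHED_STREAMING_CHUNK_DELAY = 0.02  # 20ms, from litellm/constants.py

async def stream_cached_response(cached_chunks):
    """Yield cached chunks with a small delay to mimic live streaming (sketch)."""
    for chunk in cached_chunks:
        yield chunk
        await asyncio.sleep(CACHED_STREAMING_CHUNK_DELAY)

async def main():
    # A client consuming this stream sees chunks arrive ~20ms apart
    # instead of the whole cached body landing in one burst.
    out = []
    async for chunk in stream_cached_response(["Hello", ", ", "world"]):
        out.append(chunk)
    return out
```

At 20ms per chunk, a 100-chunk cached response takes about 2 seconds to replay, which is slow enough for chat UIs to render incrementally but still far faster than a fresh model call.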
Code Evidence
Streaming constants from `litellm/constants.py:287-341`:
```python
# Catch if model starts looping the same chunk while streaming.
# Uses a high default to prevent false positives.
REPEATED_STREAMING_CHUNK_LIMIT = 100

CACHED_STREAMING_CHUNK_DELAY = 0.02  # 20ms between cached chunks

# Balance between latency and memory usage
AUDIO_SPEECH_CHUNK_SIZE = 8192
```