Implementation:BerriAI Litellm PagerDuty Alerting

From Leeroopedia
Attribute | Value
Sources | enterprise/litellm_enterprise/enterprise_callbacks/pagerduty/pagerduty.py
Domains | Alerting, Monitoring, Enterprise Callbacks
Last Updated | 2026-02-15 16:00 GMT

Overview

PagerDutyAlerting is an enterprise callback integration that sends critical alerts to PagerDuty when LLM API failure rates or hanging request counts exceed configurable thresholds within sliding time windows.

Description

The PagerDutyAlerting class extends SlackAlerting and provides two distinct alert types:

  • High LLM API Failure Rate -- Triggers a PagerDuty alert when the number of failed LLM API responses exceeds a configurable threshold within a time window (default: 60 failures in 60 seconds).
  • High Number of Hanging LLM Requests -- Triggers a PagerDuty alert when the number of requests that fail to complete within a configurable timeout exceeds a threshold within a time window (defaults: a request counts as hanging after 60 seconds; hanging requests are counted over a 600-second window).
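
The hang detection described above can be pictured as a watcher that sleeps for the hang timeout and then checks whether the request has finished. The following is a minimal sketch under stated assumptions, not the class's actual code: the function names, the `completed` dictionary, and the request ids are all hypothetical, and how the real callback tracks request completion is not shown on this page.

```python
import asyncio
from typing import Dict


async def watch_for_hang(
    request_id: str,
    completed: Dict[str, bool],
    hanging_threshold_seconds: float,
) -> bool:
    """Return True if the request is still unfinished after the threshold."""
    # Wait for the hang timeout, then look at the (hypothetical) status map.
    await asyncio.sleep(hanging_threshold_seconds)
    return not completed.get(request_id, False)


async def demo() -> Dict[str, bool]:
    completed: Dict[str, bool] = {}
    # Start one watcher per request when the request begins.
    fast = asyncio.create_task(watch_for_hang("fast", completed, 0.05))
    slow = asyncio.create_task(watch_for_hang("slow", completed, 0.05))
    # "fast" finishes before the threshold elapses; "slow" never does.
    completed["fast"] = True
    return {"fast": await fast, "slow": await slow}
```

Running `asyncio.run(demo())` flags only the request that never completed. In a real deployment each flagged request would be recorded as a hanging event and counted against the window.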

The class maintains separate in-memory event lists for failures and hanging requests. Events are pruned based on time windows before threshold evaluation. When a threshold is crossed, a critical-severity alert is dispatched to the PagerDuty Events API v2 and the event list is cleared to avoid repeated alerts.
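
The prune, evaluate, and clear cycle described above can be sketched as a small sliding-window counter. This is an illustrative sketch, not the source implementation: `SlidingWindowCounter` and its attributes are hypothetical names, and treating "exceeds" as `>=` is an assumption.

```python
import time
from typing import List, Optional


class SlidingWindowCounter:
    """Hypothetical helper: counts events inside a trailing time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events: List[float] = []  # timestamps of recorded events

    def record(self, now: Optional[float] = None) -> bool:
        """Record one event; return True if an alert should fire."""
        now = time.time() if now is None else now
        self.events.append(now)
        # Prune events that fell out of the window before evaluating.
        cutoff = now - self.window_seconds
        self.events = [t for t in self.events if t >= cutoff]
        # Assumption: the alert fires once the count reaches the threshold.
        if len(self.events) >= self.threshold:
            # Clear so the same burst does not alert repeatedly.
            self.events.clear()
            return True
        return False
```

With `threshold=3` and `window_seconds=60`, three events recorded at t=100, 110, and 120 trigger an alert, while events at t=0 and t=10 have already been pruned by then. One counter instance per alert type mirrors the separate failure and hanging lists.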

The PAGERDUTY_API_KEY environment variable must be set; it supplies the PagerDuty Events API v2 routing key.

Usage

Import and instantiate PagerDutyAlerting when you need to monitor LLM proxy health and receive PagerDuty incident notifications for API failures or hanging requests. It is registered as a custom callback in the LiteLLM proxy configuration.

Code Reference

Source Location

enterprise/litellm_enterprise/enterprise_callbacks/pagerduty/pagerduty.py

Signature

class PagerDutyAlerting(SlackAlerting):
    def __init__(
        self, alerting_args: Optional[Union[AlertingConfig, dict]] = None, **kwargs
    ): ...

    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time): ...
    async def async_pre_call_hook(
        self, user_api_key_dict: UserAPIKeyAuth, cache: DualCache, data: dict, call_type: CallTypesLiteral
    ) -> Optional[Union[Exception, str, dict]]: ...
    async def hanging_response_handler(self, request_data: Optional[dict], user_api_key_dict: UserAPIKeyAuth): ...
    async def send_alert_to_pagerduty(self, alert_message: str, custom_details: dict): ...

Import

from litellm_enterprise.enterprise_callbacks.pagerduty.pagerduty import PagerDutyAlerting

I/O Contract

Inputs

Parameter | Type | Description
alerting_args | Optional[Union[AlertingConfig, dict]] | Configuration for failure/hanging thresholds and time windows.
PAGERDUTY_API_KEY (env) | str | PagerDuty Events API v2 routing key (environment variable).

AlertingConfig fields:

Field | Type | Default | Description
failure_threshold | int | 60 | Number of failures that triggers an alert.
failure_threshold_window_seconds | int | 60 | Time window, in seconds, for counting failures.
hanging_threshold_seconds | int | 60 | Seconds before a request is considered hanging.
hanging_threshold_window_seconds | int | 600 | Time window, in seconds, for counting hanging requests.
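
To illustrate how partial overrides could combine with the defaults in the table above, here is a minimal sketch. `AlertingConfigSketch`, `DEFAULTS`, and `resolve_config` are hypothetical names introduced for illustration; the real AlertingConfig type and its merge logic may differ.

```python
from typing import Optional, TypedDict


class AlertingConfigSketch(TypedDict, total=False):
    """Hypothetical stand-in for AlertingConfig; every field is optional."""

    failure_threshold: int
    failure_threshold_window_seconds: int
    hanging_threshold_seconds: int
    hanging_threshold_window_seconds: int


# Defaults copied from the field table above.
DEFAULTS: AlertingConfigSketch = {
    "failure_threshold": 60,
    "failure_threshold_window_seconds": 60,
    "hanging_threshold_seconds": 60,
    "hanging_threshold_window_seconds": 600,
}


def resolve_config(
    overrides: Optional[AlertingConfigSketch] = None,
) -> AlertingConfigSketch:
    """Return the defaults with any caller-supplied fields layered on top."""
    merged: AlertingConfigSketch = dict(DEFAULTS)  # copy, never mutate DEFAULTS
    merged.update(overrides or {})
    return merged
```

For example, `resolve_config({"failure_threshold": 100})` raises only the failure count and leaves the three remaining fields at their defaults.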

Outputs

Output | Type | Description
PagerDuty API response | httpx.Response | HTTP response from https://events.pagerduty.com/v2/enqueue.
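
For context, a trigger event sent to the Events API v2 endpoint is a JSON document with a routing key, an event action, and a payload carrying the summary, source, and severity. The sketch below only builds that document. `build_trigger_event` and the default `source` value are hypothetical, and the exact fields the class sends are not confirmed by this page.

```python
import json
from typing import Any, Dict

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(
    routing_key: str,
    alert_message: str,
    custom_details: Dict[str, Any],
    source: str = "litellm-proxy",  # hypothetical source label
) -> Dict[str, Any]:
    """Build an Events API v2 trigger event with critical severity."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": alert_message,
            "source": source,
            "severity": "critical",
            "custom_details": custom_details,
        },
    }


event = build_trigger_event(
    routing_key="example-routing-key",  # placeholder, not a real key
    alert_message="High LLM API Failure Rate",
    custom_details={"failure_count": 60, "window_seconds": 60},
)
body = json.dumps(event)  # JSON body that would be POSTed to the endpoint
```

An HTTP client such as httpx could then POST `event` as JSON to `PAGERDUTY_EVENTS_URL`; the network call is left out here so the sketch stays side-effect free.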

Usage Examples

# In LiteLLM proxy config YAML
litellm_settings:
  callbacks:
    - pagerduty
  pagerduty_alerting_args:
    failure_threshold: 100
    failure_threshold_window_seconds: 120
    hanging_threshold_seconds: 30
    hanging_threshold_window_seconds: 300

# Programmatic usage (PAGERDUTY_API_KEY must already be set in the environment)
from litellm_enterprise.enterprise_callbacks.pagerduty.pagerduty import PagerDutyAlerting

alerter = PagerDutyAlerting(
    alerting_args={
        "failure_threshold": 50,
        "failure_threshold_window_seconds": 60,
        "hanging_threshold_seconds": 45,
        "hanging_threshold_window_seconds": 300,
    }
)
