Principle:Openclaw Openclaw Health Monitoring
Health Monitoring
Health Monitoring is the principle of continuously assessing the operational state of the OpenClaw gateway and its connected messaging channels. In a multi-channel AI agent gateway, the system must be able to answer a fundamental question at any point in time: "Is the gateway running, and can it communicate with its configured channels?"
Motivation
OpenClaw acts as a central hub routing messages between users on various platforms (Telegram, Discord, Slack, Signal, WhatsApp, and others) and AI agent backends. A failure in any single channel -- a revoked token, an unreachable API, a misconfigured webhook -- can silently degrade the user experience. Health monitoring provides the feedback loop that makes such failures visible before they compound.
Core Concepts
Gateway Reachability
The most basic health signal is whether the gateway process itself is running and accepting RPC calls. If the gateway cannot be reached, no channel communication is possible. Health monitoring therefore treats gateway reachability as the binary "ok" signal: if a health payload is returned at all, the gateway is considered alive.
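The reachability rule can be illustrated with a minimal sketch. The type names and the callHealth parameter are hypothetical stand-ins for whatever RPC call fetches the health payload; they are not taken from the OpenClaw codebase.

```typescript
// Hypothetical payload shape; the real health payload carries more fields.
type HealthPayload = { ts: number };

type Reachability =
  | { ok: true; payload: HealthPayload }
  | { ok: false; error: string };

// `callHealth` is an assumed stand-in for the gateway RPC call.
// Any returned payload counts as "alive"; any thrown error counts as "unreachable".
async function checkGateway(
  callHealth: () => Promise<HealthPayload>,
): Promise<Reachability> {
  try {
    const payload = await callHealth();
    return { ok: true, payload };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

The check never inspects the payload contents: under this rule, receiving a payload is itself the success signal.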
Channel Probing
Beyond gateway reachability, each configured channel can be individually probed. A probe is a lightweight connectivity test that verifies the channel's authentication credentials are valid and its external API is responsive. Probes return structured results including latency measurements, bot identity information, and webhook configuration status.
Each channel plugin exposes a probeAccount method that the health system invokes with a configurable timeout. Probes run per-account, since a single channel type (e.g., Telegram) can have multiple bot accounts configured for different agents.
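A timeout-bounded, per-account probe loop might look like the following sketch. The source names probeAccount but not its signature, so the ProbeFn type, the result shape, and the choice to probe accounts concurrently are all assumptions made for illustration.

```typescript
// Assumed result shape; real probes also report bot identity and webhook status.
type ProbeOutcome = { ok: boolean; elapsedMs: number; error?: string };

// Assumed signature standing in for a plugin's probeAccount method.
type ProbeFn = (accountId: string) => Promise<{ ok: boolean; error?: string }>;

// Reject if `work` has not settled within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`probe timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Probe every account independently; one failure or timeout never rejects the whole call.
async function probeAccounts(
  probe: ProbeFn,
  accountIds: string[],
  timeoutMs: number,
): Promise<Record<string, ProbeOutcome>> {
  const entries = await Promise.all(
    accountIds.map(async (id): Promise<[string, ProbeOutcome]> => {
      const started = Date.now();
      try {
        const res = await withTimeout(probe(id), timeoutMs);
        return [id, { ...res, elapsedMs: Date.now() - started }];
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return [id, { ok: false, elapsedMs: Date.now() - started, error }];
      }
    }),
  );
  const out: Record<string, ProbeOutcome> = {};
  for (const [id, outcome] of entries) out[id] = outcome;
  return out;
}
```

Converting timeouts and errors into ordinary result entries keeps each account's outcome visible without letting one slow account hide the others.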
Health Snapshot
A health snapshot is a point-in-time aggregate of all health signals. It captures:
- Timestamp and duration -- when the snapshot was taken and how long it took to collect.
- Channel summaries -- per-channel, per-account status including configuration state, link status, probe results, and authentication age.
- Agent summaries -- per-agent heartbeat intervals and session store information.
- Session statistics -- count and recency of active conversation sessions.
The snapshot is the canonical data structure for all health-related queries, consumed by both CLI output and the web control UI.
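As a rough illustration, the snapshot could be modeled with types like the ones below. Every field name here is hypothetical and chosen only to mirror the list above; the actual structure is defined by the implementation. The sketch also assumes the timestamp records when collection finished.

```typescript
// All names below are illustrative assumptions, not OpenClaw's real types.
interface AccountHealth {
  configured: boolean;
  linked: boolean;
  probe?: { ok: boolean; elapsedMs: number; error?: string };
  authAgeMs?: number;
}

interface AgentHealth {
  agentId: string;
  heartbeatEveryMs: number | null; // null: no heartbeat configured (assumption)
  sessionStorePath: string;
}

interface HealthSnapshot {
  ts: number;         // epoch ms when collection finished (assumption)
  durationMs: number; // how long collection took
  channels: Record<string, Record<string, AccountHealth>>; // channel -> account -> status
  agents: AgentHealth[];
  sessions: { count: number; mostRecentAt: number | null };
}

// Stamp the collected parts with a timestamp and the collection duration.
function buildSnapshot(
  startedAt: number,
  parts: Omit<HealthSnapshot, "ts" | "durationMs">,
  now: number = Date.now(),
): HealthSnapshot {
  return { ...parts, ts: now, durationMs: now - startedAt };
}
```

Nesting channel data by channel and then by account reflects the per-channel, per-account summaries described above.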
Heartbeat Intervals
Agents are configured with periodic heartbeat intervals that drive background processing (e.g., checking for new emails, running scheduled tasks). The health summary exposes these intervals so operators can verify agents are configured to wake at the expected cadence.
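One way an operator-side check could use the exposed intervals is sketched below. The AgentHeartbeat shape, the use of null for a disabled heartbeat, and the expected-cadence map are assumptions for illustration only.

```typescript
// Hypothetical per-agent summary; null means no heartbeat is configured (assumption).
type AgentHeartbeat = { agentId: string; everyMs: number | null };

// Compare reported intervals with the cadence the operator expects.
// Agents without an expectation are skipped.
function cadenceMismatches(
  agents: AgentHeartbeat[],
  expectedMs: Record<string, number>,
): string[] {
  const problems: string[] = [];
  for (const agent of agents) {
    const want = expectedMs[agent.agentId];
    if (want === undefined) continue;
    if (agent.everyMs !== want) {
      const got = agent.everyMs === null ? "disabled" : `${agent.everyMs}ms`;
      problems.push(`${agent.agentId}: expected ${want}ms, got ${got}`);
    }
  }
  return problems;
}
```

The function only reads the summary and reports differences; it changes no configuration, in keeping with the read-only nature of health checks.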
Design Principles
- Non-destructive observation -- Health checks must never modify state. They read configuration, probe external services, and aggregate results without side effects.
- Graceful degradation -- A failed channel probe does not make the overall health check fail. Gateway reachability is the success criterion; channel issues are reported as warnings within the snapshot.
- Timeout-bounded -- All external probes are capped by a configurable timeout (default 10 seconds) to prevent a slow or unresponsive channel from blocking the entire health check.
- Multi-account awareness -- The health system understands that channels can have multiple accounts bound to different agents, and probes each account independently.
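The graceful-degradation rule above can be sketched as a pure function: overall success depends only on gateway reachability, while failed probes become warnings. The HealthReport shape and the summarize helper are hypothetical names used for illustration.

```typescript
// Assumed minimal probe result and report shapes.
type ProbeStatus = { ok: boolean; error?: string };
type HealthReport = { ok: boolean; warnings: string[] };

// `probes` maps channel -> account -> probe result.
// Pure function: it reads its inputs and returns a new report, with no side effects.
function summarize(
  gatewayReachable: boolean,
  probes: Record<string, Record<string, ProbeStatus>>,
): HealthReport {
  const warnings: string[] = [];
  for (const [channel, accounts] of Object.entries(probes)) {
    for (const [account, status] of Object.entries(accounts)) {
      if (!status.ok) {
        warnings.push(`${channel}/${account}: ${status.error ?? "probe failed"}`);
      }
    }
  }
  // Channel failures never flip `ok`; only gateway reachability does.
  return { ok: gatewayReachable, warnings };
}
```

For example, a reachable gateway with one revoked Telegram token would yield ok set to true plus a single warning naming the affected account.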
Relationship to Other Concepts
Health Monitoring feeds into two related concepts. In Diagnostic Repair, the doctor command invokes health checks as one of its diagnostic steps. In Version Update, the update flow runs doctor and health checks after applying updates to verify that the system is still operational.
See Also
Implementation:Openclaw_Openclaw_HealthCommand_And_ProbeGateway