Heuristic:Apache Druid Cluster Health Diagnostic Thresholds

From Leeroopedia




Knowledge Sources
Domains: Cluster Health, Operations, Web Console, Diagnostics
Last Updated: 2026-02-10 10:00 GMT

Overview

The Druid web console's Doctor dialog implements a suite of automated health checks with specific thresholds for property agreement, Java version compatibility, Historical capacity warnings, and segment compaction recommendations.

Description

The DOCTOR_CHECKS array in doctor-checks.tsx defines a sequential set of diagnostic checks that evaluate cluster health by querying runtime properties, service status endpoints, and system tables. Each check produces either "issues" (errors) or "suggestions" (warnings), and can optionally terminate further checks if a fundamental problem is detected.
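The sequential "run until something fundamental fails" flow described above can be sketched roughly as follows. This is a simplified illustration, not the console's implementation: the `CheckControls`, `SketchCheck`, `Diagnosis`, and `runChecks` names are hypothetical, and the checks here are synchronous, whereas checks that query HTTP endpoints would presumably be asynchronous.

```typescript
// Hypothetical types for illustration; not copied from doctor-checks.tsx.
interface CheckControls {
  addIssue(message: string): void;
  addSuggestion(message: string): void;
  terminateChecks(): void;
}

interface SketchCheck {
  name: string;
  check: (controls: CheckControls) => void;
}

interface Diagnosis {
  type: 'issue' | 'suggestion';
  check: string;
  message: string;
}

// Runs checks in order and stops after the first check that calls terminateChecks().
function runChecks(checks: SketchCheck[]): Diagnosis[] {
  const diagnoses: Diagnosis[] = [];
  const state = { terminated: false };

  for (const current of checks) {
    current.check({
      addIssue: message => {
        diagnoses.push({ type: 'issue', check: current.name, message });
      },
      addSuggestion: message => {
        diagnoses.push({ type: 'suggestion', check: current.name, message });
      },
      terminateChecks: () => {
        state.terminated = true;
      },
    });
    if (state.terminated) break; // later checks would only produce noise
  }

  return diagnoses;
}
```

Terminating early keeps the report focused: if the Router itself is unreachable, for example, every downstream check would fail for the same underlying reason.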

The checks fall into five categories:

1. Self (Router) Checks:

  • Verify the Router responds to /status with a valid version string
  • Verify runtime properties: management proxy must be enabled, Java version must be supported, file.encoding should be UTF-8, user.timezone should be UTC

2. Coordinator/Overlord Agreement Checks:

  • MUST agree (issue if mismatch): user.timezone, druid.zk.service.host -- these properties must be identical across all nodes
  • SHOULD agree (suggestion if mismatch): druid.metadata.storage.type, druid.metadata.storage.connector.connectURI -- the Coordinator and Overlord should point to the same metadata store
  • Version mismatch between Router and Coordinator/Overlord triggers a suggestion (acceptable during rolling upgrades)

3. Sampler Verification:

  • Submits a minimal test payload to the sampler endpoint and verifies the response contains the expected data

4. SQL and Historical Checks:

  • Verifies native SQL works with SELECT 1 + 1 AS "two"
  • Queries sys.servers for Historical fill levels: suggestion (warning) above 90%, issue (error) above 95%
  • Known limitation: the fill percentage filter does not work server-side in the SQL query, so filtering must happen client-side

5. Compaction Recommendations:

  • Queries sys.segments for time chunks with multiple segments averaging under 100MB each
  • Excludes datasources that already have auto-compaction configured
  • Suggests enabling compaction for qualifying datasources

Usage

Apply these heuristics when:

  • Operating a Druid cluster and investigating health warnings in the console
  • Configuring new Druid nodes and ensuring property consistency
  • Setting up monitoring/alerting thresholds aligned with the console's own checks
  • Extending the Doctor dialog with additional health checks

The Insight (Rule of Thumb)

  • Action: All cluster nodes must agree on ZooKeeper and timezone properties; master nodes should agree on metadata storage; warn at 90% Historical fill and error at 95%; recommend compaction when segments average under 100MB per time chunk.
  • Value: Catches configuration drift (mismatched ZK hosts, different metadata stores) that causes subtle, hard-to-diagnose cluster behavior. The capacity thresholds give operators advance warning before Historicals run out of disk. Compaction recommendations improve query performance by consolidating small segments.
  • Trade-off: The checks only inspect properties exposed via HTTP status endpoints, so they cannot detect all misconfigurations (e.g., MiddleManager or Peon settings). The 100MB compaction threshold is a general heuristic that may not suit all workloads (some use cases intentionally produce small segments). The Historical fill query filter limitation means all Historicals are fetched and filtered client-side, which could be slow on large clusters.

Reasoning

Property agreement checks are essential because Druid is a distributed system where each node reads its own configuration. If the Coordinator and Overlord disagree on the metadata store URL, they will operate on different state, leading to phantom tasks or missing segments. ZooKeeper host disagreement means nodes cannot discover each other.
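A property agreement check of this kind can be sketched as a key-by-key comparison of two nodes' runtime properties. The property names below come from the arrays quoted under Code Evidence; the function names, the `AgreementReport` shape, and the message wording are assumptions made for illustration, and the console may compare different pairs of nodes than the Coordinator/Overlord pair shown here.

```typescript
type RuntimeProperties = Record<string, string | undefined>;

interface AgreementReport {
  issues: string[];
  suggestions: string[];
}

// Property lists as quoted from doctor-checks.tsx.
const MUST_AGREE: string[] = ['user.timezone', 'druid.zk.service.host'];
const SHOULD_AGREE: string[] = [
  'druid.metadata.storage.type',
  'druid.metadata.storage.connector.connectURI',
];

// Returns the keys whose values differ between the two property maps.
function disagreeingKeys(
  keys: string[],
  a: RuntimeProperties,
  b: RuntimeProperties,
): string[] {
  return keys.filter(key => a[key] !== b[key]);
}

// Hypothetical helper: MUST-agree mismatches become issues,
// SHOULD-agree mismatches become suggestions.
function compareCoordinatorAndOverlord(
  coordinator: RuntimeProperties,
  overlord: RuntimeProperties,
): AgreementReport {
  return {
    issues: disagreeingKeys(MUST_AGREE, coordinator, overlord).map(
      key => `Coordinator and Overlord disagree on "${key}"`,
    ),
    suggestions: disagreeingKeys(SHOULD_AGREE, coordinator, overlord).map(
      key => `Coordinator and Overlord should agree on "${key}"`,
    ),
  };
}
```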

The 90%/95% Historical fill thresholds are chosen to provide a two-tier alert system: 90% gives time to plan capacity expansion, while 95% indicates an urgent situation where new segments may fail to load.
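The two-tier classification, including the client-side filtering that the console falls back on, can be sketched as below. The thresholds and the strict `>` comparisons mirror the excerpt under Code Evidence; the `HistoricalRow` shape and the function names are hypothetical, and the zero-capacity guard is an added assumption.

```typescript
// Hypothetical row shape: one entry per Historical, sizes in bytes.
interface HistoricalRow {
  historical: string;
  currSize: number;
  maxSize: number;
}

type FillLevel = 'issue' | 'suggestion' | 'ok';

interface FlaggedHistorical {
  historical: string;
  fill: number;
  level: FillLevel;
}

// Fill percentage; a Historical reporting zero capacity is treated as 0% (assumption).
function fillPercent(row: HistoricalRow): number {
  return row.maxSize > 0 ? (row.currSize * 100) / row.maxSize : 0;
}

// Strictly above 95% is an issue; strictly above 90% is a suggestion.
function classifyFill(fill: number): FillLevel {
  if (fill > 95) return 'issue';
  if (fill > 90) return 'suggestion';
  return 'ok';
}

// Client-side filter: fetch every Historical, keep only the ones worth reporting.
function flagHistoricals(rows: HistoricalRow[]): FlaggedHistorical[] {
  return rows
    .map(row => {
      const fill = fillPercent(row);
      return { historical: row.historical, fill, level: classifyFill(fill) };
    })
    .filter(flagged => flagged.level !== 'ok');
}
```

Because the comparisons are strict, a Historical at exactly 90% produces nothing and one at exactly 95% produces only a suggestion.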

The 100MB segment threshold for compaction is consistent with the Druid best practice that segments should ideally be 300-700MB for optimal query performance. A time chunk whose segments average under 100MB is well below that range, which suggests that the data was ingested with suboptimal partitioning or that the time chunk has accumulated many small segments from streaming ingestion.
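The compaction rule can also be expressed as a small client-side filter. The three conditions mirror the HAVING clause quoted under Code Evidence, and the exclusion of datasources that already have auto-compaction follows the description above; the `TimeChunkStats` shape and the `compactionCandidates` function are hypothetical names introduced for this sketch.

```typescript
// Threshold in bytes, as used in the HAVING clause.
const AVG_SEGMENT_SIZE_THRESHOLD_BYTES = 100000000;

// Hypothetical per-time-chunk aggregate, e.g. derived from sys.segments.
interface TimeChunkStats {
  datasource: string;
  numSegments: number;
  totalSize: number; // bytes
}

// Returns the sorted, de-duplicated datasources that have at least one time chunk
// with multiple small segments and no auto-compaction configured.
function compactionCandidates(
  chunks: TimeChunkStats[],
  alreadyCompacting: string[],
): string[] {
  const candidates: string[] = [];
  for (const chunk of chunks) {
    const qualifies =
      chunk.numSegments > 1 &&
      chunk.totalSize > 1 &&
      chunk.totalSize / chunk.numSegments < AVG_SEGMENT_SIZE_THRESHOLD_BYTES;
    if (
      qualifies &&
      alreadyCompacting.indexOf(chunk.datasource) === -1 &&
      candidates.indexOf(chunk.datasource) === -1
    ) {
      candidates.push(chunk.datasource);
    }
  }
  return candidates.sort();
}
```

For example, a time chunk holding 10 segments totalling 200MB averages 20MB per segment and qualifies, while a chunk with a single 20MB segment does not, since there is nothing to merge.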

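The source comment about the server-side filter points to a Druid SQL limitation: adding the fill-percentage expression to the WHERE clause of the sys.servers query did not work as of April 8, 2024. The comment does not explain why, so the console fetches all Historicals and filters the results client-side.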

Code Evidence

Properties that all nodes MUST agree on (doctor-checks.tsx:40-43):

const RUNTIME_PROPERTIES_ALL_NODES_MUST_AGREE_ON: string[] = [
  'user.timezone',
  'druid.zk.service.host',
];

Properties that master nodes SHOULD agree on (doctor-checks.tsx:49-52):

const RUNTIME_PROPERTIES_MASTER_NODES_SHOULD_AGREE_ON: string[] = [
  'druid.metadata.storage.type', // overlord + coordinator
  'druid.metadata.storage.connector.connectURI',
];

Management proxy check (doctor-checks.tsx:93-97):

      // Check that the management proxy is on
      if (properties['druid.router.managementProxy.enabled'] !== 'true') {
        controls.addIssue(
          `The Router's "druid.router.managementProxy.enabled" is not reported as "true". This means that the Coordinator and Overlord will not be accessible from the Router (and this console).`,
        );
      }

Java version check (doctor-checks.tsx:99-109):

      // Check for Java 8u92+, 11, or 17
      if (
        properties['java.specification.version'] &&
        properties['java.specification.version'] !== '1.8' &&
        properties['java.specification.version'] !== '11' &&
        properties['java.specification.version'] !== '17'
      ) {
        controls.addSuggestion(
          `It looks like are running Java ${properties['java.runtime.version']}. Druid officially supports Java 8u92+, 11, or 17`,
        );
      }
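The same check can be written as a lookup against a list of supported specification versions, which is easier to extend when support for a new Java release is added. This is a sketch only: `SUPPORTED_JAVA_SPEC_VERSIONS` and `javaVersionSuggestion` are hypothetical names, and the message wording is not the console's.

```typescript
// Specification versions accepted by the check quoted above.
const SUPPORTED_JAVA_SPEC_VERSIONS: string[] = ['1.8', '11', '17'];

// Returns a suggestion message, or undefined when there is nothing to report.
// As in the original check, a missing java.specification.version is not flagged.
function javaVersionSuggestion(
  properties: Record<string, string | undefined>,
): string | undefined {
  const specVersion = properties['java.specification.version'];
  if (!specVersion || SUPPORTED_JAVA_SPEC_VERSIONS.indexOf(specVersion) !== -1) {
    return undefined;
  }
  const runtimeVersion = properties['java.runtime.version'] || specVersion;
  return `Java ${runtimeVersion} is outside the supported set (8u92+, 11, or 17)`;
}
```

Note that `java.specification.version` is `1.8` for every Java 8 update, so a check of this form cannot tell whether a Java 8 runtime actually meets the 8u92+ minimum.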

Historical capacity thresholds -- 90% warn, 95% error (doctor-checks.tsx:333-347):

      for (const historicalFill of historicalFills) {
        if (historicalFill.fill > 95) {
          controls.addIssue(
            `Historical "${historicalFill.historical}" appears to be over 95% full (is ${formatFill(
              historicalFill,
            )}%). Increase capacity.`,
          );
        } else if (historicalFill.fill > 90) {
          controls.addSuggestion(
            `Historical "${historicalFill.historical}" appears to be over 90% full (is ${formatFill(
              historicalFill,
            )}%)`,
          );
        }
      }

Server-side filter limitation comment (doctor-checks.tsx:318):

        // Note: for some reason adding ` AND "curr_size" * 100.0 / "max_size" > 90` to the
        // filter does not work as of this writing Apr 8, 2024

Compaction recommendation -- 100MB threshold (doctor-checks.tsx:370):

  HAVING "num_segments" > 1 AND "total_size" > 1 AND "avg_segment_size_in_time_chunk" < 100000000
