Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Heuristic:Apache Druid Explore Compare Query Strategy

From Leeroopedia




Knowledge Sources
Domains SQL Query Generation, Data Exploration, Web Console, Time Comparison
Last Updated 2026-02-10 10:00 GMT

Overview

The Druid explore view dynamically chooses between two SQL comparison strategies (JOIN-based for time splits, filtered aggregation for dimension splits) and implements resource-safe sub-query limits that divide the total budget evenly across comparison branches.

Description

The explore view's makeTableQueryAndHints() function in table-query.ts generates SQL queries that can include period-over-period comparisons (e.g., current week vs. previous week). The query strategy adapts based on the nature of the split columns and the cluster's resource limits.

Two comparison strategies:

  1. JOIN strategy (join): Used when split columns include a time dimension (e.g., __time). The current and previous time periods produce separate sub-queries that are FULL JOINed on the time-shifted grouping columns. This is necessary because time-shifting the __time column requires a fundamentally different WHERE clause for each period.
  2. Filtered strategy (filtered): Used for dimension-only splits (no time column). Instead of separate sub-queries, a single query uses filtered aggregations where each measure is computed with a different time-range filter. This is more efficient because it scans the data once.

The auto strategy (default) inspects the split columns: if any is a timestamp, it selects join; otherwise, it selects filtered.

Resource protection:

The DRUID_DEFAULT_TOTAL_SUB_QUERY_LIMIT is 100,000 rows. When multiple comparisons are active (e.g., comparing to 1 week ago and 1 month ago), the safe sub-query limit is calculated as:

safeSubQueryLimit = floor(totalSubQueryLimit / (compares.length + 1))

This ensures that the main query plus all comparison sub-queries together do not exceed the total sub-query budget. Without this, a single comparison could consume all resources and starve the others.

Multiple value handling:

When show-columns (non-aggregated columns displayed alongside aggregated measures) have multiple values per group, five modes control the displayed value:

  • null -- show NULL when multiple values exist
  • empty -- show empty string
  • latest -- show the latest value by __time (default passthrough)
  • latestNonNull -- show latest non-null value, using a filtered LATEST_BY
  • count -- show "Multiple values (N)" with the distinct count

Top values limiting:

DEFAULT_TOP_VALUES_K = 5000 limits the number of dimension values considered in comparisons, preventing unbounded cardinality from blowing up join sizes.

inFunctionThreshold override:

In the segments view, the inFunctionThreshold query context parameter is dynamically set to pageSize + 1 to prevent Druid from converting large IN clauses into JOINs on inline tables, which would change query semantics.

Usage

Apply these heuristics when:

  • Understanding why the explore view generates JOIN queries vs. single-table queries
  • Debugging slow comparison queries (check whether sub-query limits are being hit)
  • Adding new comparison types or extending the multiple value handling modes
  • Investigating why certain queries use inFunctionThreshold overrides
  • Tuning totalSubQueryLimit for clusters with different resource profiles

The Insight (Rule of Thumb)

  • Action: Use JOIN-based comparisons for time-bucketed splits and filtered aggregations for dimension splits; divide the total sub-query row limit evenly across all comparison branches (main + N comparisons); cap dimension cardinality at 5,000 values.
  • Value: Produces correct comparison results regardless of split type while respecting cluster resource limits. The strategy selection is automatic and transparent to users. The even division of sub-query limits prevents any single comparison from monopolizing resources.
  • Trade-off: JOIN comparisons are more expensive than filtered aggregations (multiple scans vs. one scan), but are necessary for time-shifted data. The 5,000 top-values limit may truncate high-cardinality dimensions. The even division of sub-query limits is a simple heuristic that does not account for asymmetric data distributions.

Reasoning

Time-based comparisons fundamentally require different time-range WHERE clauses for each period, making a JOIN between the current and shifted periods the only correct approach. Dimension-based comparisons, however, group by the same dimensions in all periods -- they differ only in the time filter applied to each aggregation -- so filtered aggregations within a single query are both correct and more efficient.

The sub-query limit division follows the principle of fair resource allocation: with a 100,000-row budget and 2 comparisons (main + 2 compares = 3 branches), each gets approximately 33,333 rows. This prevents a pathological case where one comparison with high cardinality consumes all available sub-query rows and causes others to return incomplete results.

The inFunctionThreshold override in the segments view is necessary because the view constructs IN clauses with up to pageSize interval values. If this exceeds the default threshold of 100, Druid's planner converts the IN to a JOIN on an inline table, which changes the query plan and can produce different performance characteristics.

Code Evidence

Default sub-query limit and top-values K (table-query.ts:52-56):

const DRUID_DEFAULT_TOTAL_SUB_QUERY_LIMIT = 100000;

const COMMON_NAME = 'common';
const TOP_VALUES_NAME = 'top_values';
export const DEFAULT_TOP_VALUES_K = 5000;

Auto strategy selection -- join for time, filtered for dimensions (table-query.ts:407-413):

    const effectiveCompareStrategy =
      compareStrategy === 'auto'
        ? splitColumns.some(isTimestamp)
          ? 'join'
          : 'filtered'
        : compareStrategy;

Safe sub-query limit calculation (table-query.ts:643):

  const safeSubQueryLimit = Math.floor(totalSubQueryLimit / (compares.length + 1));

Multiple value handling modes (table-query.ts:126-154):

  let elseEx: SqlExpression | undefined;
  switch (mode) {
    case 'null':
      elseEx = SqlLiteral.NULL;
      break;

    case 'empty':
      elseEx = SqlLiteral.create('');
      break;

    case 'latestNonNull':
      elseEx = SqlFunction.simple(
        'LATEST_BY',
        [showColumn.expression, C('__time')],
        showColumn.expression.isNotNull(),
      );
      break;

    case 'count':
      elseEx = SqlFunction.simple('CONCAT', [
        'Multiple values (',
        SqlFunction.countDistinct(showColumn.expression),
        ')',
      ]);
      break;

    default:
      // latest: nothing to do
      break;
  }

inFunctionThreshold override in segments view (segments-view.tsx:372-373):

            // This is needed because there might be an IN filter with {pageSize} intervals,
            // the number of which exceeds the default inFunctionThreshold, set it to something
            // greater than the {pageSize}
            sqlQueryContext.inFunctionThreshold = pageSize + 1;

MultipleValueMode type definition (table-query.ts:44):

export type MultipleValueMode = 'null' | 'empty' | 'latest' | 'latestNonNull' | 'count';

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment