Heuristic:Apache Druid Explore Compare Query Strategy
| Knowledge Sources | |
|---|---|
| Domains | SQL Query Generation, Data Exploration, Web Console, Time Comparison |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
The Druid explore view dynamically chooses between two SQL comparison strategies (JOIN-based for time splits, filtered aggregation for dimension splits) and implements resource-safe sub-query limits that divide the total budget evenly across comparison branches.
Description
The explore view's makeTableQueryAndHints() function in table-query.ts generates SQL queries that can include period-over-period comparisons (e.g., current week vs. previous week). The query strategy adapts based on the nature of the split columns and the cluster's resource limits.
Two comparison strategies:
- JOIN strategy (
join): Used when split columns include a time dimension (e.g.,__time). The current and previous time periods produce separate sub-queries that are FULL JOINed on the time-shifted grouping columns. This is necessary because time-shifting the__timecolumn requires a fundamentally different WHERE clause for each period. - Filtered strategy (
filtered): Used for dimension-only splits (no time column). Instead of separate sub-queries, a single query uses filtered aggregations where each measure is computed with a different time-range filter. This is more efficient because it scans the data once.
The auto strategy (default) inspects the split columns: if any is a timestamp, it selects join; otherwise, it selects filtered.
Resource protection:
The DRUID_DEFAULT_TOTAL_SUB_QUERY_LIMIT is 100,000 rows. When multiple comparisons are active (e.g., comparing to 1 week ago and 1 month ago), the safe sub-query limit is calculated as:
safeSubQueryLimit = floor(totalSubQueryLimit / (compares.length + 1))
This ensures that the main query plus all comparison sub-queries together do not exceed the total sub-query budget. Without this, a single comparison could consume all resources and starve the others.
Multiple value handling:
When show-columns (non-aggregated columns displayed alongside aggregated measures) have multiple values per group, five modes control the displayed value:
null-- show NULL when multiple values existempty-- show empty stringlatest-- show the latest value by__time(default passthrough)latestNonNull-- show latest non-null value, using a filteredLATEST_BYcount-- show "Multiple values (N)" with the distinct count
Top values limiting:
DEFAULT_TOP_VALUES_K = 5000 limits the number of dimension values considered in comparisons, preventing unbounded cardinality from blowing up join sizes.
inFunctionThreshold override:
In the segments view, the inFunctionThreshold query context parameter is dynamically set to pageSize + 1 to prevent Druid from converting large IN clauses into JOINs on inline tables, which would change query semantics.
Usage
Apply these heuristics when:
- Understanding why the explore view generates JOIN queries vs. single-table queries
- Debugging slow comparison queries (check whether sub-query limits are being hit)
- Adding new comparison types or extending the multiple value handling modes
- Investigating why certain queries use
inFunctionThresholdoverrides - Tuning
totalSubQueryLimitfor clusters with different resource profiles
The Insight (Rule of Thumb)
- Action: Use JOIN-based comparisons for time-bucketed splits and filtered aggregations for dimension splits; divide the total sub-query row limit evenly across all comparison branches (main + N comparisons); cap dimension cardinality at 5,000 values.
- Value: Produces correct comparison results regardless of split type while respecting cluster resource limits. The strategy selection is automatic and transparent to users. The even division of sub-query limits prevents any single comparison from monopolizing resources.
- Trade-off: JOIN comparisons are more expensive than filtered aggregations (multiple scans vs. one scan), but are necessary for time-shifted data. The 5,000 top-values limit may truncate high-cardinality dimensions. The even division of sub-query limits is a simple heuristic that does not account for asymmetric data distributions.
Reasoning
Time-based comparisons fundamentally require different time-range WHERE clauses for each period, making a JOIN between the current and shifted periods the only correct approach. Dimension-based comparisons, however, group by the same dimensions in all periods -- they differ only in the time filter applied to each aggregation -- so filtered aggregations within a single query are both correct and more efficient.
The sub-query limit division follows the principle of fair resource allocation: with a 100,000-row budget and 2 comparisons (main + 2 compares = 3 branches), each gets approximately 33,333 rows. This prevents a pathological case where one comparison with high cardinality consumes all available sub-query rows and causes others to return incomplete results.
The inFunctionThreshold override in the segments view is necessary because the view constructs IN clauses with up to pageSize interval values. If this exceeds the default threshold of 100, Druid's planner converts the IN to a JOIN on an inline table, which changes the query plan and can produce different performance characteristics.
Code Evidence
Default sub-query limit and top-values K (table-query.ts:52-56):
const DRUID_DEFAULT_TOTAL_SUB_QUERY_LIMIT = 100000;
const COMMON_NAME = 'common';
const TOP_VALUES_NAME = 'top_values';
export const DEFAULT_TOP_VALUES_K = 5000;
Auto strategy selection -- join for time, filtered for dimensions (table-query.ts:407-413):
const effectiveCompareStrategy =
compareStrategy === 'auto'
? splitColumns.some(isTimestamp)
? 'join'
: 'filtered'
: compareStrategy;
Safe sub-query limit calculation (table-query.ts:643):
const safeSubQueryLimit = Math.floor(totalSubQueryLimit / (compares.length + 1));
Multiple value handling modes (table-query.ts:126-154):
let elseEx: SqlExpression | undefined;
switch (mode) {
case 'null':
elseEx = SqlLiteral.NULL;
break;
case 'empty':
elseEx = SqlLiteral.create('');
break;
case 'latestNonNull':
elseEx = SqlFunction.simple(
'LATEST_BY',
[showColumn.expression, C('__time')],
showColumn.expression.isNotNull(),
);
break;
case 'count':
elseEx = SqlFunction.simple('CONCAT', [
'Multiple values (',
SqlFunction.countDistinct(showColumn.expression),
')',
]);
break;
default:
// latest: nothing to do
break;
}
inFunctionThreshold override in segments view (segments-view.tsx:372-373):
// This is needed because there might be an IN filter with {pageSize} intervals,
// the number of which exceeds the default inFunctionThreshold, set it to something
// greater than the {pageSize}
sqlQueryContext.inFunctionThreshold = pageSize + 1;
MultipleValueMode type definition (table-query.ts:44):
export type MultipleValueMode = 'null' | 'empty' | 'latest' | 'latestNonNull' | 'count';