# Heuristic: Apache Druid Sampler Limitations and Workarounds
| Knowledge Sources | |
|---|---|
| Domains | Data Ingestion, Web Console, Sampling |
| Last Updated | 2026-02-10 10:00 GMT |
## Overview
The Druid web console sampler API has several undocumented limitations that require workarounds (type conversion, lookup substitution, two-pass sampling) to prevent silent data loss and misleading error messages during the ingestion wizard flow.
## Description
The sampler endpoint (`/druid/indexer/v1/sampler`) is central to the Druid web console's ingestion wizard, but it operates under constraints that differ from actual ingestion. These constraints have accumulated as tribal knowledge in the codebase, marked with comments like "Hack Hack Hack" and self-described workaround functions. The key issues are:
- **Type incompatibility:** The sampler cannot process `index_parallel` spec types. All parallel specs must be silently downgraded to `index` before submission.
- **Null column loss:** Without explicitly setting `keepNullColumns: true` on the input format, the sampler drops columns that contain only null values, which causes schema detection to miss them.
- **Lookup unavailability:** Lookups are not loaded in the Overlord process where the sampler runs, so any transform expression referencing `lookup()` will fail. The console replaces these with placeholder concatenation expressions.
- **Raw data capture for streaming:** For non-fixed-format sources such as Kafka and Kinesis, a regex pattern `([\\s\\S]*)` is used to capture entire rows as raw data, since the actual format is not yet known at the connect step.
- **Two-pass transform sampling:** When transforms are present, the sampler cannot auto-detect dimensions that include transform outputs. A first "hack" pass detects base dimensions, then transform-derived dimensions are appended for a second pass.
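Taken together, these fixes amount to a client-side sanitization pass that runs before any spec is POSTed to `/druid/indexer/v1/sampler`. A minimal sketch of that idea follows; the function names and the loose `SampleSpec` shape are illustrative for this example, not the console's actual API:

```ts
// Illustrative sketch of chaining the sampler workarounds described above.
// The SampleSpec shape and all helper names here are assumptions.
type SampleSpec = Record<string, any>;

function downgradeParallelType(spec: SampleSpec): SampleSpec {
  // Workaround 1: the sampler cannot handle index_parallel.
  return spec.type === 'index_parallel' ? { ...spec, type: 'index' } : spec;
}

function forceKeepNullColumns(spec: SampleSpec): SampleSpec {
  // Workaround 2: keep null-only columns so they survive schema detection.
  const inputFormat = spec.ioConfig?.inputFormat;
  if (!inputFormat) return spec;
  return {
    ...spec,
    ioConfig: { ...spec.ioConfig, inputFormat: { ...inputFormat, keepNullColumns: true } },
  };
}

function sanitizeForSampler(spec: SampleSpec): SampleSpec {
  // Chain the rewrites; lookup replacement and the two-pass transform
  // handling would slot into the same pipeline.
  return forceKeepNullColumns(downgradeParallelType(spec));
}
```

Composing small, pure spec-rewriting functions keeps each workaround independently testable and easy to delete once the underlying sampler limitation is fixed.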
## Usage
Apply these heuristics when:
- Building or modifying the ingestion wizard flow in the web console
- Debugging why the sampler returns different results than actual ingestion
- Investigating missing columns, lookup errors, or type errors during the sampling phase
- Extending the sampler to support new input source or format types
## The Insight (Rule of Thumb)
- **Action:** Always convert `index_parallel` to `index` before sending specs to the sampler; always set `keepNullColumns: true` on input formats; replace lookup expressions with placeholder concatenations during sampling.
- **Value:** Prevents cryptic sampler errors and silent data loss that would confuse users during the ingestion wizard. The user sees a representative preview of their data even though the sampler operates in a degraded environment.
- **Trade-off:** The sampled preview may not perfectly match production behavior (lookups show placeholder text, parallelism is hidden), but this is preferable to outright errors or missing columns.
## Reasoning
The Overlord, which hosts the sampler endpoint, is a lightweight coordination process. It does not load lookups (those live on Brokers/Historicals), does not support `index_parallel` tasks internally, and has a simplified schema discovery path. These architectural facts force the web console to patch around the sampler's limitations client-side before submitting the sample request.
The `keepNullColumns` setting is critical because schema discovery in Druid normally drops null-only columns for storage efficiency; in the context of sampling, dropping those columns means the user never sees them and cannot configure them. The two-pass approach for transforms is required because the sampler's `useSchemaDiscovery` does not account for columns created by transforms: those must be explicitly listed in the `dimensionsSpec`.
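The snippets in the next section lean on `deepGet`/`deepSet` helpers for immutable path reads and writes. For readers outside the codebase, here is a simplified reimplementation; the console's real utilities also handle array indices and richer path syntax, which this sketch omits:

```ts
// Simplified path helpers in the spirit of the console's deepGet/deepSet.
// This sketch only supports dot-separated object keys.
function deepGet(obj: any, path: string): any {
  // Walk the path, bailing out with undefined on any missing link.
  return path.split('.').reduce((acc, key) => (acc == null ? undefined : acc[key]), obj);
}

function deepSet<T extends Record<string, any>>(obj: T, path: string, value: any): T {
  // Rebuild each object along the path so the original is never mutated.
  const [head, ...rest] = path.split('.');
  return {
    ...obj,
    [head]: rest.length ? deepSet(obj[head] ?? {}, rest.join('.'), value) : value,
  } as T;
}
```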
## Code Evidence
Type conversion, `index_parallel` to `index` (`sampler.ts:227-241`):
```ts
/**
 * This is a hack to deal with the fact that the sampler can not deal with the
 * index_parallel type
 */
function fixSamplerTypes(sampleSpec: SampleSpec): SampleSpec {
  let samplerType: string = getSpecType(sampleSpec);
  if (samplerType === 'index_parallel') {
    samplerType = 'index';
  }

  sampleSpec = deepSet(sampleSpec, 'type', samplerType);
  sampleSpec = deepSet(sampleSpec, 'spec.type', samplerType);
  sampleSpec = deepSet(sampleSpec, 'spec.ioConfig.type', samplerType);
  sampleSpec = deepSet(sampleSpec, 'spec.tuningConfig.type', samplerType);
  return sampleSpec;
}
```
`keepNullColumns` enforcement (`sampler.ts:220-224`):

```ts
// In order to prevent potential data loss null columns should be kept by the
// sampler and shown in the ingestion flow
if (ioConfig.inputFormat) {
  ioConfig = deepSet(ioConfig, 'inputFormat.keepNullColumns', true);
}
```
Default sampler config (`sampler.ts:54-57`):

```ts
const BASE_SAMPLER_CONFIG: SamplerConfig = {
  numRows: 500,
  timeoutMs: 15000,
};
```
Lookup replacement with placeholder (`sampler.ts:663-706`):

```ts
/**
 * Lookups do not work in the sampler because they are not loaded in the Overlord
 * to prevent the user from getting an error like "Unknown lookup [lookup name]" we
 * change the lookup expression to a placeholder
 *
 * lookup("x", 'lookup_name') => concat('lookup_name', '[', "x", '] -- This is a placeholder, lookups are not supported in sampling')
 * lookup("x", 'lookup_name', 'replaceValue') => nvl(concat('lookup_name', '[', "x", '] -- This is a placeholder, lookups are not supported in sampling'), 'replaceValue')
 */
export function changeLookupInExpressionsSampling(druidExpression: string): string {
  if (!druidExpression.includes('lookup')) return druidExpression;

  const parsedDruidExpression = SqlExpression.maybeParse(druidExpression);
  if (parsedDruidExpression) {
    return String(
      parsedDruidExpression.walk(ex => {
        if (ex instanceof SqlFunction && ex.getEffectiveFunctionName() === 'LOOKUP') {
          if (ex.numArgs() < 2 || ex.numArgs() > 3) return SqlExpression.parse('null');
          const concat = F(
            'concat',
            ex.getArg(1),
            '[',
            ex.getArg(0),
            '] -- This is a placeholder, lookups are not supported in sampling',
          );
          const replaceMissingValueWith = ex.getArg(2);
          if (!replaceMissingValueWith) return concat;
          return F('nvl', concat, replaceMissingValueWith);
        }
        return ex;
      }),
    );
  }

  // If we can not parse the expression as SQL then bash it with a regexp
  return druidExpression.replace(/lookup\s*\(([^)]+)\)/g, (_, argString: string) => {
    const args = argString.trim().split(/\s*,\s*/);
    if (args.length < 2 || args.length > 3) return 'null';
    const concat = `concat(${args[1]},'[',${args[0]},'] -- This is a placeholder, lookups are not supported in sampling')`;
    const replaceMissingValueWith = args[2];
    if (!replaceMissingValueWith) return concat;
    return `nvl(${concat},${replaceMissingValueWith})`;
  });
}
```
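The regex fallback branch at the end of that function can be exercised on its own, without the SQL toolkit that powers the parsed fast path. Below is a self-contained extraction of just that branch; `replaceLookupsWithRegex` is a name introduced for this sketch:

```ts
// Regex-only fallback for rewriting lookup() calls into placeholder concat()
// expressions, mirroring the tail of changeLookupInExpressionsSampling.
function replaceLookupsWithRegex(druidExpression: string): string {
  return druidExpression.replace(/lookup\s*\(([^)]+)\)/g, (_, argString: string) => {
    const args = argString.trim().split(/\s*,\s*/);
    // lookup() takes 2 or 3 arguments; anything else becomes a null literal.
    if (args.length < 2 || args.length > 3) return 'null';
    const concat = `concat(${args[1]},'[',${args[0]},'] -- This is a placeholder, lookups are not supported in sampling')`;
    const replaceMissingValueWith = args[2];
    if (!replaceMissingValueWith) return concat;
    // A 3-arg lookup keeps its replaceMissingValueWith via nvl().
    return `nvl(${concat},${replaceMissingValueWith})`;
  });
}
```

Because the placeholder is itself a valid Druid expression, the sampler evaluates it and the user sees recognizable `lookup_name[key]` text in the preview instead of an "Unknown lookup" error.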
Whole-row regex capture for non-fixed-format sources (`sampler.ts:243-248`):

```ts
const WHOLE_ROW_INPUT_FORMAT: InputFormat = {
  type: 'regex',
  pattern: '([\\s\\S]*)', // Match the entire line, every single character
  listDelimiter: '56616469-6de2-9da4-efb8-8f416e6e6965', // Just a UUID to disable the list delimiter
  columns: ['raw'],
};
```
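The pattern works because `[\s\S]` matches any character, including newlines, whereas `.` stops at a newline unless the dot-all flag is set. A quick standalone check of the capture behavior, using JavaScript `RegExp` (Druid evaluates the pattern with Java's regex engine, where these particular constructs behave the same way):

```ts
// Demonstrates that [\s\S]* captures every character of a row, newlines
// included, while a plain dot stops at the first newline.
const WHOLE_ROW_PATTERN = /([\s\S]*)/;

function captureRaw(row: string): string {
  // The pattern also matches the empty string, so exec never fails,
  // but we guard anyway to keep the sketch total.
  const m = WHOLE_ROW_PATTERN.exec(row);
  return m ? m[1] : '';
}
```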
Two-pass transform sampling with "Hack Hack Hack" (`sampler.ts:482-525`):

```ts
// Extra step to simulate auto-detecting dimension with transforms
let specialDimensionSpec: DimensionsSpec = {
  useSchemaDiscovery: true,
  forceSegmentSortByTime,
};

if (transforms && transforms.length) {
  const sampleSpecHack: SampleSpec = {
    type: samplerType,
    spec: {
      ioConfig: deepGet(spec, 'spec.ioConfig'),
      dataSchema: {
        dataSource: 'sample',
        timestampSpec,
        dimensionsSpec: {
          useSchemaDiscovery: true,
        },
        granularitySpec: {
          rollup: false,
        },
      },
    },
    samplerConfig: BASE_SAMPLER_CONFIG,
  };

  const sampleResponseHack = await postToSampler(
    applyCache(sampleSpecHack, cacheRows),
    'transform-pre',
  );

  specialDimensionSpec = deepSet(
    specialDimensionSpec,
    'dimensions',
    dedupe(
      (
        guessDimensionsFromSampleResponse(sampleResponseHack) as (DimensionSpec | string)[]
      ).concat(getDimensionNamesFromTransforms(transforms)),
      getDimensionSpecName,
    ),
  );
}

// ...
dimensionsSpec: specialDimensionSpec, // Hack Hack Hack
```
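The merge step at the end of the hack pass, where schema-discovered dimensions are concatenated with transform output names and deduplicated by name, can be sketched in isolation. The helpers below are simplified stand-ins for the console's `dedupe` and `getDimensionSpecName` utilities, not the actual implementations:

```ts
// Simplified stand-ins for the console's dimension-merging utilities.
interface DimensionSpec {
  name: string;
  type?: string;
}
type Dim = DimensionSpec | string;

function getDimensionSpecName(d: Dim): string {
  // A dimension is either a bare name or a spec object with a name field.
  return typeof d === 'string' ? d : d.name;
}

function dedupe(dims: Dim[], keyFn: (d: Dim) => string): Dim[] {
  // Keep the first occurrence of each name, drop later duplicates.
  const seen = new Set<string>();
  return dims.filter(d => {
    const key = keyFn(d);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

function mergeDimensions(base: Dim[], transformNames: string[]): Dim[] {
  // Base dimensions come from the first sampler pass; transform output
  // names are appended and only survive if discovery missed them.
  return dedupe(base.concat(transformNames), getDimensionSpecName);
}
```

Because `dedupe` keeps the first occurrence, a dimension that schema discovery already found retains its full spec; a transform output name only fills a slot when discovery missed that column.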