
Heuristic:Apache Druid Sampler Limitations And Workarounds

From Leeroopedia



Knowledge Sources
Domains Data Ingestion, Web Console, Sampling
Last Updated 2026-02-10 10:00 GMT

Overview

The Druid web console sampler API has several undocumented limitations that require workarounds (type conversion, lookup substitution, two-pass sampling) to prevent silent data loss and misleading error messages during the ingestion wizard flow.

Description

The sampler endpoint (/druid/indexer/v1/sampler) is central to the Druid web console's ingestion wizard, but it operates under constraints that differ from actual ingestion. These constraints have accumulated as tribal knowledge in the codebase, marked with comments like "Hack Hack Hack" and self-described workaround functions. The key issues are:

  1. Type incompatibility: The sampler cannot process index_parallel spec types. All parallel specs must be silently downgraded to index before submission.
  2. Null column loss: Without explicitly setting keepNullColumns: true on the input format, the sampler drops columns that contain only null values, which causes schema detection to miss columns.
  3. Lookup unavailability: Lookups are not loaded in the Overlord process where the sampler runs, so any transform expression referencing lookup() will fail. The console replaces these with placeholder concatenation expressions.
  4. Raw data capture for streaming: For non-fixed-format sources like Kafka and Kinesis, a regex pattern ([\\s\\S]*) is used to capture entire rows as raw data, since the actual format is not yet known at the connect step.
  5. Two-pass transform sampling: When transforms are present, the sampler cannot auto-detect dimensions that include transform outputs. A first "hack" pass detects base dimensions, then transform-derived dimensions are appended for a second pass.
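Taken together, the first two issues amount to a small client-side patch applied before the spec is posted to /druid/indexer/v1/sampler. The following sketch illustrates the shape of that patch; `patchSpecForSampler` is a hypothetical helper that uses plain object spreads in place of the console's `deepSet` utility:

```typescript
interface SampleSpec {
  type: string;
  spec: {
    type?: string;
    ioConfig: { type?: string; inputFormat?: Record<string, unknown> };
    tuningConfig?: { type?: string };
  };
}

// Hypothetical sketch of the client-side patches: downgrade index_parallel
// to index, and force keepNullColumns so null-only columns survive sampling.
function patchSpecForSampler(sampleSpec: SampleSpec): SampleSpec {
  // 1. The sampler cannot run index_parallel; silently downgrade to index.
  const samplerType = sampleSpec.type === 'index_parallel' ? 'index' : sampleSpec.type;

  const patched: SampleSpec = {
    ...sampleSpec,
    type: samplerType,
    spec: {
      ...sampleSpec.spec,
      type: samplerType,
      ioConfig: { ...sampleSpec.spec.ioConfig, type: samplerType },
      tuningConfig: { ...sampleSpec.spec.tuningConfig, type: samplerType },
    },
  };

  // 2. Keep null-only columns so the schema preview does not drop them.
  if (patched.spec.ioConfig.inputFormat) {
    patched.spec.ioConfig.inputFormat = {
      ...patched.spec.ioConfig.inputFormat,
      keepNullColumns: true,
    };
  }
  return patched;
}
```

The patched spec is what actually goes over the wire; the user-facing spec in the wizard keeps its original `index_parallel` type.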

Usage

Apply these heuristics when:

  • Building or modifying the ingestion wizard flow in the web console
  • Debugging why the sampler returns different results than actual ingestion
  • Investigating missing columns, lookup errors, or type errors during the sampling phase
  • Extending the sampler to support new input source or format types

The Insight (Rule of Thumb)

  • Action: Always convert index_parallel to index before sending specs to the sampler; always set keepNullColumns: true on input formats; replace lookup expressions with placeholder concatenations during sampling.
  • Value: Prevents cryptic sampler errors and silent data loss that would confuse users during the ingestion wizard. The user sees a representative preview of their data even though the sampler operates in a degraded environment.
  • Trade-off: The sampled preview may not perfectly match production behavior (lookups show placeholder text, parallelism is hidden), but this is preferable to outright errors or missing columns.

Reasoning

The Overlord, which hosts the sampler endpoint, is a lightweight coordination process. It does not load lookups (those live on Brokers/Historicals), does not support index_parallel tasks internally, and has a simplified schema discovery path. These architectural facts force the web console to patch around the sampler's limitations client-side before submitting the sample request.

The keepNullColumns setting is critical because schema discovery in Druid normally drops null-only columns for storage efficiency; in the context of sampling, dropping those columns means the user never sees them and cannot configure them. The two-pass approach for transforms is required because the sampler's useSchemaDiscovery does not account for columns created by transforms -- those must be explicitly listed in the dimensionsSpec.
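The second-pass dimension list is simply the first-pass discovery result concatenated with the transform output names, deduplicated by name. A minimal sketch of that merge, where `dedupeBy` and `dimName` are hypothetical stand-ins for the console's `dedupe` and `getDimensionSpecName`:

```typescript
type DimensionSpec = string | { name: string; type?: string };

// Hypothetical stand-in for getDimensionSpecName: a dimension is either
// a bare name or an object with a name field.
const dimName = (d: DimensionSpec): string => (typeof d === 'string' ? d : d.name);

// Hypothetical stand-in for dedupe: keep the first occurrence of each key.
function dedupeBy<T>(items: T[], key: (t: T) => string): T[] {
  const seen = new Set<string>();
  return items.filter(item => {
    const k = key(item);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

// First pass: dimensions discovered from the raw sample.
const discovered: DimensionSpec[] = ['country', { name: 'city', type: 'string' }];
// Names produced by the transformSpec (would otherwise be missed).
const fromTransforms: string[] = ['country_upper', 'city'];

const secondPassDimensions = dedupeBy<DimensionSpec>(
  [...discovered, ...fromTransforms],
  dimName,
);
// The duplicate 'city' from the transforms is dropped; 'country_upper'
// is appended so the second pass includes the transform output.
```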

Code Evidence

Type conversion -- index_parallel to index (sampler.ts:227-241):

/**
  This is a hack to deal with the fact that the sampler can not deal with the index_parallel type
 */
function fixSamplerTypes(sampleSpec: SampleSpec): SampleSpec {
  let samplerType: string = getSpecType(sampleSpec);
  if (samplerType === 'index_parallel') {
    samplerType = 'index';
  }

  sampleSpec = deepSet(sampleSpec, 'type', samplerType);
  sampleSpec = deepSet(sampleSpec, 'spec.type', samplerType);
  sampleSpec = deepSet(sampleSpec, 'spec.ioConfig.type', samplerType);
  sampleSpec = deepSet(sampleSpec, 'spec.tuningConfig.type', samplerType);
  return sampleSpec;
}

keepNullColumns enforcement (sampler.ts:220-224):

  // In order to prevent potential data loss null columns should be kept by the sampler and shown in the ingestion flow
  if (ioConfig.inputFormat) {
    ioConfig = deepSet(ioConfig, 'inputFormat.keepNullColumns', true);
  }

Default sampler config (sampler.ts:54-57):

const BASE_SAMPLER_CONFIG: SamplerConfig = {
  numRows: 500,
  timeoutMs: 15000,
};

Lookup replacement with placeholder (sampler.ts:663-706):

/**
 * Lookups do not work in the sampler because they are not loaded in the Overlord
 * to prevent the user from getting an error like "Unknown lookup [lookup name]" we
 * change the lookup expression to a placeholder
 *
 * lookup("x", 'lookup_name') => concat('lookup_name', '[', "x", '] -- This is a placeholder, lookups are not supported in sampling')
 * lookup("x", 'lookup_name', 'replaceValue') => nvl(concat('lookup_name', '[', "x", '] -- This is a placeholder, lookups are not supported in sampling'), 'replaceValue')
 */
export function changeLookupInExpressionsSampling(druidExpression: string): string {
  if (!druidExpression.includes('lookup')) return druidExpression;

  const parsedDruidExpression = SqlExpression.maybeParse(druidExpression);
  if (parsedDruidExpression) {
    return String(
      parsedDruidExpression.walk(ex => {
        if (ex instanceof SqlFunction && ex.getEffectiveFunctionName() === 'LOOKUP') {
          if (ex.numArgs() < 2 || ex.numArgs() > 3) return SqlExpression.parse('null');
          const concat = F(
            'concat',
            ex.getArg(1),
            '[',
            ex.getArg(0),
            '] -- This is a placeholder, lookups are not supported in sampling',
          );

          const replaceMissingValueWith = ex.getArg(2);
          if (!replaceMissingValueWith) return concat;
          return F('nvl', concat, replaceMissingValueWith);
        }
        return ex;
      }),
    );
  }

  // If we can not parse the expression as SQL then bash it with a regexp
  return druidExpression.replace(/lookup\s*\(([^)]+)\)/g, (_, argString: string) => {
    const args = argString.trim().split(/\s*,\s*/);
    if (args.length < 2 || args.length > 3) return 'null';
    const concat = `concat(${args[1]},'[',${args[0]},'] -- This is a placeholder, lookups are not supported in sampling')`;
    const replaceMissingValueWith = args[2];
    if (!replaceMissingValueWith) return concat;
    return `nvl(${concat},${replaceMissingValueWith})`;
  });
}
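For illustration, the regex fallback path can be exercised in isolation. In the sketch below, `replaceLookupsViaRegex` is a hypothetical standalone copy of that fallback branch only; the real function first attempts a proper SQL parse and uses the regex only when parsing fails:

```typescript
// Standalone copy of the regex fallback branch: rewrite lookup(...) calls
// into placeholder concat()/nvl() expressions without parsing the SQL.
function replaceLookupsViaRegex(druidExpression: string): string {
  return druidExpression.replace(/lookup\s*\(([^)]+)\)/g, (_, argString: string) => {
    const args = argString.trim().split(/\s*,\s*/);
    if (args.length < 2 || args.length > 3) return 'null';
    const concat = `concat(${args[1]},'[',${args[0]},'] -- This is a placeholder, lookups are not supported in sampling')`;
    if (!args[2]) return concat;
    return `nvl(${concat},${args[2]})`;
  });
}

replaceLookupsViaRegex(`lookup("x", 'countries')`);
// → concat('countries','[',"x",'] -- This is a placeholder, lookups are not supported in sampling')
```

Note the fallback is crude: `[^)]+` stops at the first closing parenthesis, so nested function calls inside the lookup arguments would not be handled, which is why the parsed path is preferred.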

Whole-row regex capture for non-fixed-format sources (sampler.ts:243-248):

const WHOLE_ROW_INPUT_FORMAT: InputFormat = {
  type: 'regex',
  pattern: '([\\s\\S]*)', // Match the entire line, every single character
  listDelimiter: '56616469-6de2-9da4-efb8-8f416e6e6965', // Just a UUID to disable the list delimiter
  columns: ['raw'],
};
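The choice of `[\s\S]` over `.` in the pattern matters: in JavaScript, as in the Java regex engine that Druid itself uses to apply the format, `.` does not match newline characters by default, so only a character class like `[\s\S]` reliably captures an entire multi-line record into the single `raw` column. A quick demonstration (the JSON-escaped `([\\s\\S]*)` above is the regex literal `/([\s\S]*)/`):

```typescript
// [\s\S] matches any character, including newlines, with no flags needed.
const wholeRowPattern = /([\s\S]*)/;
const record = 'line one\nline two';

const wholeMatch = record.match(wholeRowPattern);
// wholeMatch[1] === 'line one\nline two' — the full record, newlines included

// By contrast, `.` stops at the first newline:
const dotMatch = record.match(/(.*)/);
// dotMatch[1] === 'line one'
```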

Two-pass transform sampling with "Hack Hack Hack" (sampler.ts:482-525):

  // Extra step to simulate auto-detecting dimension with transforms
  let specialDimensionSpec: DimensionsSpec = {
    useSchemaDiscovery: true,
    forceSegmentSortByTime,
  };
  if (transforms && transforms.length) {
    const sampleSpecHack: SampleSpec = {
      type: samplerType,
      spec: {
        ioConfig: deepGet(spec, 'spec.ioConfig'),
        dataSchema: {
          dataSource: 'sample',
          timestampSpec,
          dimensionsSpec: {
            useSchemaDiscovery: true,
          },
          granularitySpec: {
            rollup: false,
          },
        },
      },
      samplerConfig: BASE_SAMPLER_CONFIG,
    };

    const sampleResponseHack = await postToSampler(
      applyCache(sampleSpecHack, cacheRows),
      'transform-pre',
    );

    specialDimensionSpec = deepSet(
      specialDimensionSpec,
      'dimensions',
      dedupe(
        (
          guessDimensionsFromSampleResponse(sampleResponseHack) as (DimensionSpec | string)[]
        ).concat(getDimensionNamesFromTransforms(transforms)),
        getDimensionSpecName,
      ),
    );
  }

  // ...
  dimensionsSpec: specialDimensionSpec, // Hack Hack Hack
