Principle:Apache Druid Streaming Schema Spec
| Knowledge Sources | |
|---|---|
| Domains | Streaming_Ingestion, Schema_Design |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A schema configuration principle for streaming ingestion that defines how continuously ingested data is parsed, timestamped, transformed, filtered, and mapped to a final schema.
Description
Streaming Schema and Spec Configuration reuses the same sampler-based pipeline as batch ingestion (parsing → timestamp → transform → filter → schema) but with streaming-specific additions:
- Streaming metadata columns: Optional columns derived from message metadata (Kafka timestamps, headers, keys; Kinesis partition keys)
- Supervisor idle config: Settings for handling gaps in stream activity
- Streaming tuning: Parameters specific to streaming indexing, such as task duration and completion timeout
The wizard steps are identical to batch Steps 3-7 (sampleForParser through sampleForSchema), using the same sampler API with cached streaming data.
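The idle and tuning additions listed above can be pictured as extra keys layered onto a streaming `ioConfig`. The following is a minimal sketch, not the console's implementation: the helper `withStreamingDefaults` and the simplified types are hypothetical, and the field names (`idleConfig`, `inactiveAfterMillis`, `taskDuration`, `completionTimeout`) and the illustrative values reflect one reading of Druid's supervisor spec, so they should be checked against the documentation for the Druid version in use.

```typescript
// Hypothetical, simplified types: a real supervisor ioConfig carries many more fields.
interface IdleConfig {
  enabled: boolean;
  inactiveAfterMillis?: number; // how long a stream may be silent before the supervisor idles
}

interface StreamingIoConfig {
  type: 'kafka' | 'kinesis';
  taskDuration?: string; // ISO 8601 period, e.g. 'PT1H'
  completionTimeout?: string; // ISO 8601 period, e.g. 'PT30M'
  idleConfig?: IdleConfig;
}

// Layer illustrative streaming settings underneath whatever the user supplied.
// Keys present in `ioConfig` win because they are spread last.
function withStreamingDefaults(ioConfig: StreamingIoConfig): StreamingIoConfig {
  return {
    taskDuration: 'PT1H',
    completionTimeout: 'PT30M',
    idleConfig: { enabled: false },
    ...ioConfig,
  };
}

const sketchIoConfig = withStreamingDefaults({
  type: 'kafka',
  idleConfig: { enabled: true, inactiveAfterMillis: 600000 },
});

console.log(JSON.stringify(sketchIoConfig, null, 2));
```

Spreading the user's values last keeps the helper non-destructive: it only fills gaps and never overwrites a setting chosen in the wizard.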
Usage
Use this principle after the connection to the streaming source succeeds. The configuration steps mirror the batch workflow but produce a supervisor spec instead of a task spec.
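The difference between the two outputs can be sketched as a choice of wrapper and submission endpoint. This is an illustrative sketch under stated assumptions: `IngestionSpec`, `isStreamingSpec`, and `toSubmission` are hypothetical names, the batch task type `index_parallel` is assumed for the non-streaming branch, and the endpoint paths are Druid's Overlord supervisor and task APIs as commonly documented.

```typescript
// Hypothetical, simplified shape of the spec assembled by the wizard steps.
type IngestionSpec = {
  dataSchema: Record<string, unknown>;
  ioConfig: { type: string } & Record<string, unknown>;
  tuningConfig?: Record<string, unknown>;
};

const STREAMING_TYPES: string[] = ['kafka', 'kinesis'];

// A spec is treated as streaming when its ioConfig type names a stream source.
function isStreamingSpec(spec: IngestionSpec): boolean {
  return STREAMING_TYPES.indexOf(spec.ioConfig.type) !== -1;
}

// Streaming specs are wrapped as supervisors; everything else as a batch task.
function toSubmission(spec: IngestionSpec): {
  endpoint: string;
  payload: { type: string; spec: IngestionSpec };
} {
  if (isStreamingSpec(spec)) {
    return {
      endpoint: '/druid/indexer/v1/supervisor',
      payload: { type: spec.ioConfig.type, spec },
    };
  }
  return {
    endpoint: '/druid/indexer/v1/task',
    payload: { type: 'index_parallel', spec }, // assumed batch task type
  };
}

const streamingSubmission = toSubmission({
  dataSchema: { dataSource: 'events' },
  ioConfig: { type: 'kafka', topic: 'events' },
});
const batchSubmission = toSubmission({
  dataSchema: { dataSource: 'events' },
  ioConfig: { type: 'index_parallel' },
});

console.log(streamingSubmission.endpoint, batchSubmission.endpoint);
```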
Theoretical Basis
Streaming schema configuration follows the same incremental sampler refinement pattern as batch, with streaming extensions:
sampleForParser(spec, cacheRows) → Parsed streaming messages
sampleForTimestamp(spec, cacheRows) → __time extraction
sampleForTransform(spec, cacheRows) → Derived columns
sampleForFilter(spec, cacheRows) → Row filtering
sampleForSchema(spec, cacheRows) → Final schema with dims + metrics
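The refinement pattern above can be illustrated with a small local simulation in which each stage consumes the previous stage's output. This is a sketch only: it does not call Druid's sampler API, and `SketchSpec` and the `*Stage` functions are hypothetical stand-ins for the `sampleFor*` calls, operating on cached raw messages held in memory.

```typescript
type Row = Record<string, unknown>;

// Hypothetical stand-in for the parts of the spec each stage reads.
interface SketchSpec {
  timestampColumn: string;
  transforms: { name: string; fn: (row: Row) => unknown }[];
  filter: (row: Row) => boolean;
}

// Stage 1: parse cached raw messages (assumed here to be JSON strings).
function parseStage(cacheRows: string[]): Row[] {
  return cacheRows.map(raw => JSON.parse(raw) as Row);
}

// Stage 2: derive __time (epoch millis) from the configured timestamp column.
function timestampStage(spec: SketchSpec, rows: Row[]): Row[] {
  return rows.map(row => ({
    ...row,
    __time: Date.parse(String(row[spec.timestampColumn])),
  }));
}

// Stage 3: add derived columns.
function transformStage(spec: SketchSpec, rows: Row[]): Row[] {
  return rows.map(row => {
    const out: Row = { ...row };
    for (const t of spec.transforms) {
      out[t.name] = t.fn(out);
    }
    return out;
  });
}

// Stage 4: drop rows that fail the filter.
function filterStage(spec: SketchSpec, rows: Row[]): Row[] {
  return rows.filter(spec.filter);
}

// Stage 5: collect the column names that survive into the final schema.
function schemaStage(rows: Row[]): string[] {
  const columns: string[] = [];
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (columns.indexOf(key) === -1) columns.push(key);
    }
  }
  return columns;
}

const cachedMessages = [
  '{"ts":"2024-01-01T00:00:00Z","user":"a","bytes":2048}',
  '{"ts":"2024-01-01T00:01:00Z","user":"b","bytes":0}',
];
const sketchSpec: SketchSpec = {
  timestampColumn: 'ts',
  transforms: [{ name: 'kb', fn: row => Number(row['bytes']) / 1024 }],
  filter: row => Number(row['bytes']) > 0,
};

const parsed = parseStage(cachedMessages);
const timed = timestampStage(sketchSpec, parsed);
const transformed = transformStage(sketchSpec, timed);
const filtered = filterStage(sketchSpec, transformed);
const schemaColumns = schemaStage(filtered);

console.log(filtered.length, schemaColumns);
```

Because every stage reads from the same cached messages, changing one part of the spec only requires re-running that stage and the ones after it, which is what makes the refinement incremental.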
Streaming-specific fields:
ioConfig.type: 'kafka' | 'kinesis'
inputFormat.type: 'kafka' | 'kinesis' (wrapper format)
inputFormat.valueFormat: actual data format (json, csv, etc.)
inputFormat.headerFormat, inputFormat.keyFormat: optional parsing of Kafka message headers and keys
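To make the wrapper relationship concrete, the sketch below nests a data format inside a stream-specific wrapper. It is an illustration under assumptions: `wrapInputFormat` and the type names are hypothetical, the header format value `{ type: 'string', encoding: 'UTF-8' }` is an example rather than a required setting, and key and header parsing are attached only for Kafka, in line with the metadata columns listed in the Description.

```typescript
// Hypothetical, simplified types for the nested (actual) data format.
interface DataFormat {
  type: string; // e.g. 'json', 'csv'
  [key: string]: unknown;
}

interface KafkaWrapperFormat {
  type: 'kafka';
  valueFormat: DataFormat;
  keyFormat?: DataFormat;
  headerFormat?: { type: string; encoding?: string };
}

interface KinesisWrapperFormat {
  type: 'kinesis';
  valueFormat: DataFormat;
}

type WrapperFormat = KafkaWrapperFormat | KinesisWrapperFormat;

// Wrap the payload format; attach key/header parsing only when requested and only for Kafka.
function wrapInputFormat(
  streamType: 'kafka' | 'kinesis',
  valueFormat: DataFormat,
  metadata: { keyFormat?: DataFormat; headerFormat?: { type: string; encoding?: string } } = {},
): WrapperFormat {
  if (streamType === 'kinesis') {
    return { type: 'kinesis', valueFormat };
  }
  const wrapped: KafkaWrapperFormat = { type: 'kafka', valueFormat };
  if (metadata.keyFormat) wrapped.keyFormat = metadata.keyFormat;
  if (metadata.headerFormat) wrapped.headerFormat = metadata.headerFormat;
  return wrapped;
}

const kafkaFormat = wrapInputFormat(
  'kafka',
  { type: 'json' },
  { headerFormat: { type: 'string', encoding: 'UTF-8' } },
);
const kinesisFormat = wrapInputFormat('kinesis', { type: 'json' });

console.log(JSON.stringify(kafkaFormat), JSON.stringify(kinesisFormat));
```

Keeping the payload format nested under `valueFormat` lets the earlier parsing steps keep working with an ordinary format definition while the wrapper contributes only the metadata columns.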