
Workflow:Apache Druid Batch Data Ingestion

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, ETL, Real_Time_Analytics
Last Updated: 2026-02-10 10:00 GMT

Overview

End-to-end process for loading batch data into Apache Druid using the web console's multi-step ingestion wizard, from source connection through schema definition to task submission.

Description

This workflow covers the classic batch data ingestion path in the Druid web console. The Load Data wizard guides users through an 11-step process that transforms raw data from various sources (local files, S3, HDFS, HTTP, or Druid segments) into queryable Druid datasources. Each step previews a sample of up to 500 rows via the Druid sampler API, allowing interactive configuration of parsing, timestamp extraction, transformations, filtering, schema definition, partitioning, tuning, and publication settings before submitting the final ingestion spec as a batch task.

Key capabilities:

  • Support for multiple input sources (local disk, Amazon S3, Google Cloud Storage, Azure, HDFS, HTTP, inline data, Druid reindexing)
  • Automatic format detection for JSON, CSV, TSV, Avro, Parquet, ORC, and regex patterns
  • Interactive data sampling with real-time preview at every step
  • Auto-detection of dimensions and metrics from sample data
  • Configurable partitioning strategies (dynamic, hash-based, range-based)
  • Full ingestion spec review and manual editing before submission

Usage

Execute this workflow when you need to load a finite batch of data from files or object storage into a Druid datasource. This is the recommended path for first-time users, one-off data loads, or when you need fine-grained visual control over every aspect of the ingestion specification. It is not suitable for continuous streaming ingestion (use the Streaming Ingestion Management workflow instead) or when you prefer SQL-based ingestion syntax (use the SQL Based Data Ingestion workflow).

Execution Steps

Step 1: Source Type Selection

Select the type of data source and ingestion method. The wizard presents options for batch ingestion (local disk, Amazon S3, Azure, Google Cloud Storage, HDFS, HTTP, inline data, or Druid reindexing) and streaming ingestion (Apache Kafka, Amazon Kinesis). For batch workflows, choosing a source type determines the input source configuration form displayed in the next step.

Key considerations:

  • Batch sources produce one-time index tasks; streaming sources create supervisors
  • Local disk requires the data to be accessible on the Druid server filesystem
  • S3, GCS, and Azure require appropriate IAM or credential configuration
  • Reindexing from Druid allows re-processing existing segments with new settings
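Each source type selected here corresponds to an inputSource object in the final ingestion spec. As one illustration, a Druid reindexing source looks roughly like the following sketch (the datasource name and interval are illustrative, not from this workflow):

```json
{
  "inputSource": {
    "type": "druid",
    "dataSource": "web_events",
    "interval": "2026-01-01/2026-02-01"
  }
}
```

A local-disk source would instead use `"type": "local"` with `baseDir` and `filter` fields, and cloud sources use `"s3"`, `"google"`, or `"azure"` with URIs or prefixes.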

Step 2: Source Connection

Configure the connection parameters for the selected input source. Provide file paths, URIs, bucket names, or inline data as appropriate. The wizard calls the Druid sampler API with the input source configuration to retrieve a preview of raw data rows, validating connectivity and access.

Key considerations:

  • File path patterns and wildcards are supported for multi-file ingestion
  • The sampler retrieves up to 500 sample rows for preview
  • Connection errors surface immediately with actionable error messages
  • For S3/cloud sources, temporary credentials or assumed roles may be configured
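Behind each preview, the console posts a partial spec to the Overlord's sampler endpoint (`POST /druid/indexer/v1/sampler`). A minimal sketch of such a request body, assuming an S3 source (the bucket and prefix are illustrative):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://example-bucket/events/2026-02/"]
      },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "sample",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": {}
    }
  },
  "samplerConfig": { "numRows": 500, "timeoutMs": 15000 }
}
```

The `samplerConfig.numRows` value of 500 matches the preview limit described above; the sampler returns both raw and parsed representations of each sampled row.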

Step 3: Data Parsing

Configure the data format parser. The wizard auto-detects the format (JSON, CSV, TSV, Avro OCF, Parquet, ORC, or regex) and displays parsed columns from the sample data. Users can adjust parser settings such as delimiters, header handling, and multi-value dimensions.

Key considerations:

  • Auto-detection works for most standard formats; manual override is available
  • CSV/TSV parsing supports custom delimiters, list delimiters, and header row skipping
  • JSON parsing handles both flat and nested structures
  • The parsed preview shows column names, types, and sample values
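The parser settings configured here map onto the spec's inputFormat object. A sketch for a CSV file with a header row (the list delimiter shown is an illustrative choice for multi-value fields):

```json
{
  "inputFormat": {
    "type": "csv",
    "findColumnsFromHeader": true,
    "skipHeaderRows": 0,
    "listDelimiter": "|"
  }
}
```

A `"tsv"` inputFormat additionally accepts a `"delimiter"` field for custom column separators; `"json"`, `"parquet"`, and `"orc"` formats typically need no extra settings beyond optional flattening configuration.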

Step 4: Timestamp Configuration

Select or configure the primary timestamp column (__time). Every Druid datasource requires a time column for segment partitioning. The wizard offers automatic detection of timestamp columns, manual column selection, or use of a constant timestamp value.

Key considerations:

  • Druid uses ISO 8601 or Joda time format patterns for timestamp parsing
  • If no natural timestamp exists, a constant value (e.g., 2000-01-01) can be used
  • Incorrect timestamp configuration leads to data landing in wrong time partitions
  • The preview validates that timestamps parse correctly across all sample rows
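These choices produce the spec's timestampSpec. A hedged sketch, assuming a column named `event_time` (illustrative) holding ISO 8601 values, with a fallback constant for rows missing a timestamp:

```json
{
  "timestampSpec": {
    "column": "event_time",
    "format": "iso",
    "missingValue": "2000-01-01T00:00:00Z"
  }
}
```

The `format` field accepts `"iso"`, `"millis"`, `"posix"`, `"auto"`, or a Joda-time pattern such as `"yyyy-MM-dd HH:mm:ss"`.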

Step 5: Data Transformation

Define optional column transformations using Druid expressions. Transformations create new derived columns or modify existing ones before indexing. The wizard provides an expression editor and previews transformation results on the sample data.

Key considerations:

  • Transformations use Druid's native expression language
  • Common use cases include string concatenation, type casting, and conditional logic
  • Transformations execute before filtering, so filtered rows still incur transformation cost
  • Each transformation must specify an output column name and expression
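In the spec, these appear as a list of expression transforms inside the transformSpec. A sketch with two hypothetical derived columns (all column names are illustrative):

```json
{
  "transformSpec": {
    "transforms": [
      {
        "type": "expression",
        "name": "full_name",
        "expression": "concat(first_name, ' ', last_name)"
      },
      {
        "type": "expression",
        "name": "bytes_kb",
        "expression": "bytes / 1024"
      }
    ]
  }
}
```

Each entry shows the required output `name` and `expression` pair noted above.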

Step 6: Data Filtering

Apply optional row-level filters to exclude unwanted data before indexing. Filters reduce the volume of ingested data by applying boolean conditions on column values. The wizard previews which sample rows pass or fail each filter.

Key considerations:

  • Filters support selector, regex, range, and boolean combinations (AND/OR/NOT)
  • Filtering happens after transformation, so transformed columns are available
  • Aggressive filtering at ingestion time reduces storage and improves query performance
  • Preview shows matched vs. filtered-out row counts
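Filters live alongside transforms in the transformSpec. A sketch combining a selector with a negation (dimension names and values are illustrative):

```json
{
  "transformSpec": {
    "filter": {
      "type": "and",
      "fields": [
        { "type": "selector", "dimension": "country", "value": "US" },
        {
          "type": "not",
          "field": { "type": "selector", "dimension": "agent_type", "value": "bot" }
        }
      ]
    }
  }
}
```

Only rows matching the filter are indexed; everything else is dropped before segments are written.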

Step 7: Schema Definition

Configure the datasource schema including dimensions, metrics, and rollup settings. The wizard auto-detects dimension types (string, long, float, double) and suggests appropriate metric aggregators. Users can enable or disable rollup, add/remove dimensions, and configure metric aggregations.

Key considerations:

  • Rollup combines rows with identical dimension values, reducing storage
  • String dimensions support multi-value if configured
  • Metric aggregators (count, sum, min, max, hyperUnique, thetaSketch) define pre-aggregated columns
  • Schema changes after ingestion require reindexing existing data
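The resulting schema is expressed as a dimensionsSpec plus a metricsSpec in the dataSchema. A sketch with illustrative column names (note that sketch aggregators such as `thetaSketch` require the druid-datasketches extension to be loaded):

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "country",
      "page",
      { "type": "long", "name": "session_id" }
    ]
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "name": "sum_bytes", "fieldName": "bytes" },
    { "type": "thetaSketch", "name": "user_sketch", "fieldName": "user_id" }
  ]
}
```

Bare string entries in `dimensions` default to string type; typed entries use the object form shown for `session_id`.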

Step 8: Partitioning Configuration

Set the segment granularity (time-based partitioning) and secondary partitioning strategy. Segment granularity determines how data is split across time-based segments. Secondary partitioning further divides segments within each time chunk.

Key considerations:

  • Segment granularity options: SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, YEAR, ALL
  • Secondary partitioning strategies: dynamic (default), hash-based, or range-based
  • Smaller granularity produces more segments but enables finer time-based pruning
  • Hash and range partitioning improve query performance for specific access patterns
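Note that these two settings land in different parts of the spec: segment granularity goes into the dataSchema's granularitySpec, while secondary partitioning goes into the tuningConfig's partitionsSpec. A sketch combining daily segments with hash partitioning (the target row count and partition dimension are illustrative):

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": true
  },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "hashed",
      "targetRowsPerSegment": 5000000,
      "partitionDimensions": ["country"]
    }
  }
}
```

The default `{"type": "dynamic"}` partitionsSpec needs no further fields; range partitioning uses `"type": "range"` with `partitionDimensions`.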

Step 9: Tuning Parameters

Configure performance tuning parameters for the ingestion task. Settings include maximum rows in memory, maximum bytes in memory, maximum rows per segment, task count for parallel ingestion, and index specification (bitmap type, compression).

Key considerations:

  • maxRowsInMemory controls memory usage during ingestion (default varies by task type)
  • Parallel indexing with forceGuaranteedRollup produces perfectly rolled-up, optimally sized segments, but requires hash- or range-based partitioning (not dynamic)
  • Index compression (LZ4 vs. zstd) affects storage size vs. decompression speed
  • Task resource allocation depends on available cluster capacity
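These knobs map to the tuningConfig. A sketch with commonly adjusted fields (the values shown are illustrative starting points, not recommendations):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsInMemory": 1000000,
    "maxBytesInMemory": 0,
    "maxNumConcurrentSubTasks": 4,
    "forceGuaranteedRollup": false,
    "indexSpec": {
      "bitmap": { "type": "roaring" },
      "dimensionCompression": "lz4",
      "metricCompression": "lz4"
    }
  }
}
```

A `maxBytesInMemory` of 0 tells Druid to use its default (a fraction of the task's heap); `maxNumConcurrentSubTasks` controls parallel ingestion fan-out.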

Step 10: Publication Settings

Configure the target datasource name and append/overwrite mode. Set whether the ingestion should append to an existing datasource or overwrite (replace) existing data for the specified time intervals.

Key considerations:

  • Using an existing datasource name appends to or overwrites that datasource's data; a new name creates a new datasource
  • Overwrite mode replaces all segments within the ingested time range
  • Append mode adds new segments alongside existing ones
  • The datasource name becomes the primary identifier for all future queries
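In the spec, the datasource name sits in the dataSchema while the append/overwrite choice is the ioConfig's `appendToExisting` flag. A sketch (the datasource name is illustrative):

```json
{
  "ioConfig": {
    "type": "index_parallel",
    "appendToExisting": false
  },
  "dataSchema": {
    "dataSource": "web_events"
  }
}
```

With `appendToExisting` set to false, segments in the time intervals covered by the input are replaced; set to true, new segments are added alongside existing ones.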

Step 11: Spec Review and Submission

Review the complete ingestion specification as JSON. The wizard displays the full spec with syntax highlighting and allows manual editing. Once satisfied, submit the task to the Druid Overlord for execution.

What happens:

  • The full ingestion spec JSON is displayed for review
  • Manual edits are validated in real time
  • Submission posts the spec to the Druid indexer task API
  • The user is redirected to the Tasks view to monitor execution progress
  • Task status transitions: RUNNING → SUCCESS or FAILED
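The pieces configured in the previous steps assemble into a single spec, which submission posts to the Overlord's task endpoint (`POST /druid/indexer/v1/task`). A minimal sketch of an assembled spec (all paths, names, and columns are illustrative):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/incoming", "filter": "*.json" },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "web_events",
      "timestampSpec": { "column": "event_time", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "page"] },
      "metricsSpec": [{ "type": "count", "name": "count" }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true
      }
    },
    "tuningConfig": { "type": "index_parallel", "maxNumConcurrentSubTasks": 2 }
  }
}
```

The response contains the task ID, which the Tasks view then uses to track the RUNNING → SUCCESS/FAILED lifecycle.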

Execution Diagram
