
Workflow:Apache Druid SQL Based Data Ingestion

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, SQL, Real_Time_Analytics
Last Updated 2026-02-10 10:00 GMT

Overview

End-to-end process for ingesting data into Apache Druid using SQL INSERT/REPLACE statements powered by the Multi-Stage Query (MSQ) engine, providing a SQL-native alternative to classic JSON-based ingestion specs.

Description

This workflow covers the modern SQL-based data ingestion path in the Druid web console. Instead of building a JSON ingestion spec through a multi-step wizard, users write or generate SQL INSERT INTO or REPLACE INTO statements that read from external data sources using the EXTERN function. The SQL Data Loader view provides a guided three-step process (source selection, format configuration, schema definition) that generates the SQL query automatically, while the Workbench view allows advanced users to write ingestion SQL directly. The MSQ engine executes these queries as distributed multi-stage tasks.

Key capabilities:

  • SQL-native syntax for data ingestion (INSERT INTO, REPLACE INTO)
  • EXTERN function for reading external data (S3, local files, HTTP, inline)
  • Automatic SQL query generation from guided wizard steps
  • Multi-stage distributed execution with real-time progress monitoring
  • Configurable task parallelism (maxNumTasks context parameter)
  • Rollup analysis for compression ratio estimation
  • Resume capability for interrupted ingestion flows
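Put together, an MSQ ingestion query has the shape sketched below. The datasource name, URI, and columns are hypothetical placeholders; the syntax follows Druid's SQL-based ingestion form (EXTERN with an EXTEND clause).

```sql
-- Hypothetical example: replace all data in "wikipedia" with rows read from an external JSON file.
REPLACE INTO "wikipedia"
OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",  -- required primary timestamp
  "channel",
  "page",
  "added"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikiticker-sample.json"]}',  -- input source
    '{"type": "json"}'                                                           -- input format
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "page" VARCHAR, "added" BIGINT)
PARTITIONED BY DAY
CLUSTERED BY "channel"
```

The wizard steps described below each contribute one piece of this statement: the input source and format become the EXTERN() arguments, the schema step becomes the SELECT list and EXTEND signature, and the destination step supplies the INSERT/REPLACE target and the PARTITIONED BY / CLUSTERED BY clauses.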

Usage

Execute this workflow when you prefer SQL syntax for data ingestion, need to leverage SQL transformations during ingestion, or want to use the simplified SQL Data Loader wizard. This is the recommended path for users familiar with SQL who want concise, readable ingestion definitions. It requires the MSQ extension to be loaded on the Druid cluster.

Execution Steps

Step 1: Input Source Selection

Choose the external data source to ingest from. The SQL Data Loader provides a guided interface for selecting input source types (local file, Amazon S3, Google Cloud Storage, Azure, HDFS, HTTP) and configuring connection parameters. Sample data is retrieved via the Druid sampler API to validate connectivity.

Key considerations:

  • Each input source maps to an EXTERN() function call in the generated SQL
  • Local file sources reference paths accessible on the Druid server
  • Cloud storage sources require credential configuration
  • The wizard persists state to localStorage for session recovery
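The source choice made here becomes the first EXTERN() argument, a JSON input-source spec. A minimal sketch, with hypothetical bucket and path names:

```sql
-- Sample query probing an S3 source; the wizard issues a similar preview via the sampler API.
-- A local source would instead use e.g. '{"type": "local", "files": ["/tmp/events.json"]}'.
SELECT *
FROM TABLE(
  EXTERN(
    '{"type": "s3", "uris": ["s3://example-bucket/data/events.json"]}',  -- input source spec
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "user" VARCHAR)
LIMIT 10
```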

Step 2: Input Format Configuration

Configure the data format for the selected source. The wizard auto-detects the format (JSON, CSV, TSV, Parquet, Avro, ORC) and displays parsed sample data. Users refine parser settings and verify that columns are correctly identified.

Key considerations:

  • Format detection examines sample data headers and structure
  • CSV/TSV settings include delimiter, quote character, and header detection
  • Parquet and ORC files carry their own schema, reducing configuration needed
  • The configured format translates to EXTERN() function format parameters
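The parser settings chosen in this step map onto the second EXTERN() argument. A sketch for a CSV file with a header row (file path hypothetical):

```sql
-- CSV parser settings from the wizard become fields of the input-format spec.
SELECT *
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.csv"]}',
    '{"type": "csv", "findColumnsFromHeader": true}'  -- header detection on
  )
) EXTEND ("timestamp" VARCHAR, "status" VARCHAR, "bytes" BIGINT)
LIMIT 10
```

For self-describing formats such as Parquet or ORC, the format spec shrinks to little more than `{"type": "parquet"}`, since column names and types come from the file itself.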

Step 3: Schema and Column Configuration

Define the target schema by selecting, renaming, retyping, or transforming columns. The wizard displays a preview table with the parsed data, allowing column-level editing. Users can add computed columns via SQL expressions, remove unnecessary columns, and configure the primary timestamp column.

Key considerations:

  • Column expressions use standard Druid SQL syntax
  • The __time column must be defined (explicitly or via TIME_PARSE)
  • Rollup analysis estimates the compression ratio if rollup is enabled
  • Column type changes affect storage format and query performance
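Column-level edits made in the wizard become expressions in the SELECT list of the generated query. A sketch with hypothetical column names:

```sql
SELECT
  TIME_PARSE("timestamp") AS "__time",       -- primary timestamp, parsed from a string column
  UPPER("country") AS "country",             -- transformed column
  "bytes_in" + "bytes_out" AS "total_bytes"  -- computed column added via SQL expression
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.json"]}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "country" VARCHAR, "bytes_in" BIGINT, "bytes_out" BIGINT)
```

Dropping a column in the wizard simply omits it from the SELECT list; retyping it changes the cast applied in the expression.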

Step 4: Destination Configuration

Set the target datasource name, ingestion mode (INSERT for append, REPLACE for overwrite), segment granularity, and clustering dimensions. This step finalizes the SQL statement structure.

Key considerations:

  • INSERT INTO appends data; REPLACE INTO overwrites the time ranges named in its OVERWRITE clause (or the entire datasource with OVERWRITE ALL)
  • Segment granularity (PARTITIONED BY) determines time-based segment boundaries
  • CLUSTERED BY dimensions optimize segment layout for common query patterns
  • The generated SQL is fully editable before submission
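The destination settings supply the opening and closing clauses of the statement. A sketch in append mode, with hypothetical names:

```sql
-- INSERT appends; swap the first line for
--   REPLACE INTO "web_events" OVERWRITE ALL
-- to overwrite instead.
INSERT INTO "web_events"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "path",
  "status"
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.json"]}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "path" VARCHAR, "status" VARCHAR)
PARTITIONED BY HOUR       -- segment granularity
CLUSTERED BY "path"       -- secondary partitioning for query locality
```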

Step 5: SQL Review and Task Submission

Review the generated INSERT INTO or REPLACE INTO SQL statement. The complete query is displayed in an editor where users can make final adjustments. The query context (maxNumTasks, execution mode) is configured, then the task is submitted to the MSQ engine via the async SQL statements API.

Key considerations:

  • The async API returns a query ID for tracking execution status
  • maxNumTasks controls the degree of distributed parallelism
  • Cluster capacity is validated before submission (CapacityAlert)
  • Task execution is monitored via polling the statements status endpoint
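In the Workbench, context parameters can also be attached to the query text itself as specially formatted comments (support for this comment syntax may vary by Druid version; the parameter names below are real MSQ context keys):

```sql
--:context maxNumTasks: 4   -- total task slots: 1 controller + up to 3 workers
INSERT INTO "web_events"
SELECT TIME_PARSE("timestamp") AS "__time", "path"
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.json"]}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "path" VARCHAR)
PARTITIONED BY DAY
```

Note that maxNumTasks counts the controller task as well, so the value must be at least 2 for any ingestion to run.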

Step 6: Ingestion Monitoring

Monitor the multi-stage query execution in real time. The progress dialog shows task status, stage progression, rows processed, and any errors encountered. Users can cancel the task if needed.

What happens:

  • The MSQ engine breaks the query into multiple stages (scan, sort, merge, publish)
  • Each stage reports progress as a percentage of input processed
  • On success, new segments are published and become queryable
  • On failure, detailed error messages identify the root cause

Execution Diagram

GitHub URL

Workflow Repository