
Workflow:Apache Druid SQL Based Data Ingestion

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, SQL, Real_Time_Analytics
Last Updated 2026-02-10 10:00 GMT

Overview

End-to-end process for ingesting data into Apache Druid using SQL INSERT/REPLACE statements powered by the Multi-Stage Query (MSQ) engine, providing a SQL-native alternative to classic JSON-based ingestion specs.

Description

This workflow covers the modern SQL-based data ingestion path in the Druid web console. Instead of building a JSON ingestion spec through a multi-step wizard, users write or generate SQL INSERT INTO or REPLACE INTO statements that read from external data sources using the EXTERN function. The SQL Data Loader view provides a guided three-step process (source selection, format configuration, schema definition) that generates the SQL query automatically, while the Workbench view allows advanced users to write ingestion SQL directly. The MSQ engine executes these queries as distributed multi-stage tasks.

Key capabilities:

  • SQL-native syntax for data ingestion (INSERT INTO, REPLACE INTO)
  • EXTERN function for reading external data (S3, local files, HTTP, inline)
  • Automatic SQL query generation from guided wizard steps
  • Multi-stage distributed execution with real-time progress monitoring
  • Configurable task parallelism (maxNumTasks context parameter)
  • Rollup analysis for compression ratio estimation
  • Resume capability for interrupted ingestion flows
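Put together, an MSQ ingestion query has the shape sketched below. The datasource name, URI, and columns are hypothetical placeholders; the syntax follows Druid's SQL-based ingestion form (EXTERN with an EXTEND clause).

```sql
-- Hypothetical example: replace all data in "wikipedia" with rows read from an external JSON file.
REPLACE INTO "wikipedia"
OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",  -- required primary timestamp
  "channel",
  "page",
  "added"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikiticker-sample.json"]}',  -- input source
    '{"type": "json"}'                                                           -- input format
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "page" VARCHAR, "added" BIGINT)
PARTITIONED BY DAY
CLUSTERED BY "channel"
```

The wizard steps described below each contribute one piece of this statement: the input source and format become the EXTERN() arguments, the schema step becomes the SELECT list and EXTEND signature, and the destination step supplies the INSERT/REPLACE target and the PARTITIONED BY / CLUSTERED BY clauses.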

Usage

Execute this workflow when you prefer SQL syntax for data ingestion, need to leverage SQL transformations during ingestion, or want to use the simplified SQL Data Loader wizard. This is the recommended path for users familiar with SQL who want concise, readable ingestion definitions. It requires the MSQ extension to be loaded on the Druid cluster.

Execution Steps

Step 1: Input Source Selection

Choose the external data source to ingest from. The SQL Data Loader provides a guided interface for selecting input source types (local file, Amazon S3, Google Cloud Storage, Azure, HDFS, HTTP) and configuring connection parameters. Sample data is retrieved via the Druid sampler API to validate connectivity.

Key considerations:

  • Each input source maps to an EXTERN() function call in the generated SQL
  • Local file sources reference paths accessible on the Druid server
  • Cloud storage sources require credential configuration
  • The wizard persists state to localStorage for session recovery
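The source choice made here becomes the first EXTERN() argument, a JSON input-source spec. A minimal sketch, with hypothetical bucket and path names:

```sql
-- Sample query probing an S3 source; the wizard issues a similar preview via the sampler API.
-- A local source would instead use e.g. '{"type": "local", "files": ["/tmp/events.json"]}'.
SELECT *
FROM TABLE(
  EXTERN(
    '{"type": "s3", "uris": ["s3://example-bucket/data/events.json"]}',  -- input source spec
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "user" VARCHAR)
LIMIT 10
```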

Step 2: Input Format Configuration

Configure the data format for the selected source. The wizard auto-detects the format (JSON, CSV, TSV, Parquet, Avro, ORC) and displays parsed sample data. Users refine parser settings and verify that columns are correctly identified.

Key considerations:

  • Format detection examines sample data headers and structure
  • CSV/TSV settings include delimiter, quote character, and header detection
  • Parquet and ORC files carry their own schema, reducing configuration needed
  • The configured format translates to EXTERN() function format parameters
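The parser settings chosen in this step map onto the second EXTERN() argument. A sketch for a CSV file with a header row (file path hypothetical):

```sql
-- CSV parser settings from the wizard become fields of the input-format spec.
SELECT *
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.csv"]}',
    '{"type": "csv", "findColumnsFromHeader": true}'  -- header detection on
  )
) EXTEND ("timestamp" VARCHAR, "status" VARCHAR, "bytes" BIGINT)
LIMIT 10
```

For self-describing formats such as Parquet or ORC, the format spec shrinks to little more than `{"type": "parquet"}`, since column names and types come from the file itself.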

Step 3: Schema and Column Configuration

Define the target schema by selecting, renaming, retyping, or transforming columns. The wizard displays a preview table with the parsed data, allowing column-level editing. Users can add computed columns via SQL expressions, remove unnecessary columns, and configure the primary timestamp column.

Key considerations:

  • Column expressions use standard Druid SQL syntax
  • The __time column must be defined (explicitly or via TIME_PARSE)
  • Rollup analysis estimates the compression ratio if rollup is enabled
  • Column type changes affect storage format and query performance
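Column-level edits made in the wizard become expressions in the SELECT list of the generated query. A sketch with hypothetical column names:

```sql
SELECT
  TIME_PARSE("timestamp") AS "__time",       -- primary timestamp, parsed from a string column
  UPPER("country") AS "country",             -- transformed column
  "bytes_in" + "bytes_out" AS "total_bytes"  -- computed column added via SQL expression
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.json"]}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "country" VARCHAR, "bytes_in" BIGINT, "bytes_out" BIGINT)
```

Dropping a column in the wizard simply omits it from the SELECT list; retyping it changes the cast applied in the expression.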

Step 4: Destination Configuration

Set the target datasource name, ingestion mode (INSERT for append, REPLACE for overwrite), segment granularity, and clustering dimensions. This step finalizes the SQL statement structure.

Key considerations:

  • INSERT INTO appends data; REPLACE INTO overwrites the time ranges named in its OVERWRITE clause (or the entire datasource with OVERWRITE ALL)
  • Segment granularity (PARTITIONED BY) determines time-based segment boundaries
  • CLUSTERED BY dimensions optimize segment layout for common query patterns
  • The generated SQL is fully editable before submission
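The destination settings supply the opening and closing clauses of the statement. A sketch in append mode, with hypothetical names:

```sql
-- INSERT appends; swap the first line for
--   REPLACE INTO "web_events" OVERWRITE ALL
-- to overwrite instead.
INSERT INTO "web_events"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "path",
  "status"
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.json"]}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "path" VARCHAR, "status" VARCHAR)
PARTITIONED BY HOUR       -- segment granularity
CLUSTERED BY "path"       -- secondary partitioning for query locality
```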

Step 5: SQL Review and Task Submission

Review the generated INSERT INTO or REPLACE INTO SQL statement. The complete query is displayed in an editor where users can make final adjustments. The query context (maxNumTasks, execution mode) is configured, then the task is submitted to the MSQ engine via the async SQL statements API.

Key considerations:

  • The async API returns a query ID for tracking execution status
  • maxNumTasks controls the degree of distributed parallelism
  • Cluster capacity is validated before submission (CapacityAlert)
  • Task execution is monitored via polling the statements status endpoint
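In the Workbench, context parameters can also be attached to the query text itself as specially formatted comments (support for this comment syntax may vary by Druid version; the parameter names below are real MSQ context keys):

```sql
--:context maxNumTasks: 4   -- total task slots: 1 controller + up to 3 workers
INSERT INTO "web_events"
SELECT TIME_PARSE("timestamp") AS "__time", "path"
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.json"]}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "path" VARCHAR)
PARTITIONED BY DAY
```

Note that maxNumTasks counts the controller task as well, so the value must be at least 2 for any ingestion to run.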

Step 6: Ingestion Monitoring

Monitor the multi-stage query execution in real time. The progress dialog shows task status, stage progression, rows processed, and any errors encountered. Users can cancel the task if needed.

What happens:

  • The MSQ engine breaks the query into multiple stages (scan, sort, merge, publish)
  • Each stage reports progress as a percentage of input processed
  • On success, new segments are published and become queryable
  • On failure, detailed error messages identify the root cause

Execution Diagram

GitHub URL

Workflow Repository