Principle: Apache Druid Source Connection
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, Data_Sampling |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A data connectivity verification principle that validates source accessibility and retrieves sample data for previewing before full ingestion.
Description
Source Connection establishes and validates connectivity to an external data source by sending a sampling request to the Druid Sampler API. The sampler attempts to read a small number of rows from the configured input source (S3, HTTP, local file, Kafka topic, etc.) and returns raw data that serves as the foundation for all subsequent ingestion configuration steps.
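As a sketch, the sampling request body pairs the input source, the input format, and a sampler config. The endpoint path is the Druid Sampler API; the HTTP URI, row limit, and timeout below are illustrative, and the exact spec shape can vary by Druid version:

```python
import json

DRUID_SAMPLER_PATH = "/druid/indexer/v1/sampler"  # Druid Sampler API endpoint

def build_sample_request(input_source, input_format, num_rows=500, timeout_ms=15000):
    """Assemble a sampler request body: inputSource + inputFormat + samplerConfig."""
    return {
        "type": "index",
        "spec": {
            "ioConfig": {
                "type": "index",
                "inputSource": input_source,
                "inputFormat": input_format,
            },
        },
        "samplerConfig": {
            "numRows": num_rows,      # cap on how many rows the sampler reads
            "timeoutMs": timeout_ms,  # give up on unreachable or slow sources
        },
    }

# Example: sample a JSON file served over HTTP (URI is illustrative).
request_body = build_sample_request(
    {"type": "http", "uris": ["https://example.com/data/events.json"]},
    {"type": "json"},
)
payload = json.dumps(request_body)  # body for POST /druid/indexer/v1/sampler
```

A non-error response containing rows in `data` is what satisfies the validation gate described below; an empty or failed response means the source or credentials need correcting before the workflow can continue.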
This principle ensures that:
- The data source is reachable and credentials are valid
- Sample data is available for format detection and schema inference
- A cache key is returned that enables subsequent sampler calls to reuse the fetched data without re-reading the source
For Druid reindexing sources, the connection step additionally queries the existing datasource's metadata (column names, aggregators, rollup settings) using scan and segmentMetadata queries.
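A minimal sketch of those two native queries, as they might be submitted to Druid's native query endpoint (`/druid/v2`). The datasource name and the broad `"1000/3000"` interval are illustrative:

```python
def reindex_metadata_queries(datasource, interval="1000/3000"):
    """Build the two native queries used to introspect an existing datasource."""
    scan_query = {
        "queryType": "scan",  # fetch a few raw rows -> column names
        "dataSource": datasource,
        "intervals": [interval],
        "limit": 10,
    }
    segment_metadata_query = {
        "queryType": "segmentMetadata",  # aggregators + rollup settings
        "dataSource": datasource,
        "intervals": [interval],
        "merge": True,
        "analysisTypes": ["aggregators", "rollup", "queryGranularity"],
    }
    return scan_query, segment_metadata_query

# Example: introspect a hypothetical "wikipedia" datasource before reindexing.
scan, seg_meta = reindex_metadata_queries("wikipedia")
```

The scan result reveals the column names present in the data, while the merged segmentMetadata result recovers the aggregators and rollup configuration the original ingestion used, so the reindex spec can be pre-populated.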
Usage
Use this principle immediately after source type selection. It is the mandatory validation gate: no ingestion workflow can proceed without a successful source connection that returns sample data.
Theoretical Basis
The source connection follows a sample-and-cache pattern:
SampleRequest(inputSource, inputFormat, samplerConfig) → POST /druid/indexer/v1/sampler
Response → { data: SampleEntry[], cacheKey?: string }
For reindexing, two additional native queries supplement the sample:
scan query → column names
segmentMetadata query → aggregators, rollup info
The cache key enables incremental refinement: subsequent sampling calls (parsing, timestamp, transform, filter, schema) can use cached data instead of re-reading from the source, making the wizard responsive even with slow external sources.
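One way to sketch the cache-key handshake. Where the `cacheKey` field rides in the follow-up request varies across Druid versions; here it is assumed to sit inside `samplerConfig`, and the response shape follows the `{ data, cacheKey? }` form above:

```python
def with_cache_key(request_body, prev_response):
    """Reuse cached sample data: copy the cacheKey from the previous sampler
    response into the next request so the sampler can skip re-reading the source."""
    cache_key = prev_response.get("cacheKey")
    if cache_key is None:
        return request_body  # no cache available; sampler reads the source again
    updated = dict(request_body)
    updated["samplerConfig"] = {
        **request_body.get("samplerConfig", {}),
        "cacheKey": cache_key,  # assumed field placement; version-dependent
    }
    return updated

# Example: the parsing step refines the spec but reuses the connection step's data.
prev = {"data": [{"input": {"ts": "2020-01-01", "v": 1}}], "cacheKey": "abc123"}
next_req = with_cache_key({"samplerConfig": {"numRows": 500}}, prev)
```

Because each wizard step only changes the interpretation of the sample (format, timestamp, transforms, filters, schema) rather than the sample itself, threading the cache key through every call keeps the slow source read to a single round trip.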