Principle: Apache Druid Source Connection
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, Data_Sampling |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A data connectivity verification principle that validates source accessibility and retrieves sample data for previewing before full ingestion.
Description
Source Connection establishes and validates connectivity to an external data source by sending a sampling request to the Druid Sampler API. The sampler attempts to read a small number of rows from the configured input source (S3, HTTP, local file, Kafka topic, etc.) and returns raw data that serves as the foundation for all subsequent ingestion configuration steps.
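As a sketch, the sampling request body pairs the input source, the input format, and a sampler config. The endpoint path is the Druid Sampler API; the HTTP URI, row limit, and timeout below are illustrative, and the exact spec shape can vary by Druid version:

```python
import json

DRUID_SAMPLER_PATH = "/druid/indexer/v1/sampler"  # Druid Sampler API endpoint

def build_sample_request(input_source, input_format, num_rows=500, timeout_ms=15000):
    """Assemble a sampler request body: inputSource + inputFormat + samplerConfig."""
    return {
        "type": "index",
        "spec": {
            "ioConfig": {
                "type": "index",
                "inputSource": input_source,
                "inputFormat": input_format,
            },
        },
        "samplerConfig": {
            "numRows": num_rows,      # cap on how many rows the sampler reads
            "timeoutMs": timeout_ms,  # give up on unreachable or slow sources
        },
    }

# Example: sample a JSON file served over HTTP (URI is illustrative).
request_body = build_sample_request(
    {"type": "http", "uris": ["https://example.com/data/events.json"]},
    {"type": "json"},
)
payload = json.dumps(request_body)  # body for POST /druid/indexer/v1/sampler
```

A non-error response containing rows in `data` is what satisfies the validation gate described below; an empty or failed response means the source or credentials need correcting before the workflow can continue.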
This principle ensures that:
- The data source is reachable and credentials are valid
- Sample data is available for format detection and schema inference
- A cache key is returned that enables subsequent sampler calls to reuse the fetched data without re-reading the source
For Druid reindexing sources, the connection step additionally queries the existing datasource's metadata (column names, aggregators, rollup settings) using scan and segmentMetadata queries.
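A minimal sketch of those two native queries, as they might be submitted to Druid's native query endpoint (`/druid/v2`). The datasource name and the broad `"1000/3000"` interval are illustrative:

```python
def reindex_metadata_queries(datasource, interval="1000/3000"):
    """Build the two native queries used to introspect an existing datasource."""
    scan_query = {
        "queryType": "scan",  # fetch a few raw rows -> column names
        "dataSource": datasource,
        "intervals": [interval],
        "limit": 10,
    }
    segment_metadata_query = {
        "queryType": "segmentMetadata",  # aggregators + rollup settings
        "dataSource": datasource,
        "intervals": [interval],
        "merge": True,
        "analysisTypes": ["aggregators", "rollup", "queryGranularity"],
    }
    return scan_query, segment_metadata_query

# Example: introspect a hypothetical "wikipedia" datasource before reindexing.
scan, seg_meta = reindex_metadata_queries("wikipedia")
```

The scan result reveals the column names present in the data, while the merged segmentMetadata result recovers the aggregators and rollup configuration the original ingestion used, so the reindex spec can be pre-populated.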
Usage
Use this principle immediately after source type selection. It is the mandatory validation gate: no ingestion workflow can proceed without a successful source connection that returns sample data.
Theoretical Basis
The source connection follows a sample-and-cache pattern:
SampleRequest(inputSource, inputFormat, samplerConfig) → POST /druid/indexer/v1/sampler
Response → { data: SampleEntry[], cacheKey?: string }
For reindexing, two additional native queries supplement the sample:
scan query → column names
segmentMetadata query → aggregators, rollup info
The cache key enables incremental refinement: subsequent sampling calls (parsing, timestamp, transform, filter, schema) can use cached data instead of re-reading from the source, making the wizard responsive even with slow external sources.
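One way to sketch the cache-key handshake. Where the `cacheKey` field rides in the follow-up request varies across Druid versions; here it is assumed to sit inside `samplerConfig`, and the response shape follows the `{ data, cacheKey? }` form above:

```python
def with_cache_key(request_body, prev_response):
    """Reuse cached sample data: copy the cacheKey from the previous sampler
    response into the next request so the sampler can skip re-reading the source."""
    cache_key = prev_response.get("cacheKey")
    if cache_key is None:
        return request_body  # no cache available; sampler reads the source again
    updated = dict(request_body)
    updated["samplerConfig"] = {
        **request_body.get("samplerConfig", {}),
        "cacheKey": cache_key,  # assumed field placement; version-dependent
    }
    return updated

# Example: the parsing step refines the spec but reuses the connection step's data.
prev = {"data": [{"input": {"ts": "2020-01-01", "v": 1}}], "cacheKey": "abc123"}
next_req = with_cache_key({"samplerConfig": {"numRows": 500}}, prev)
```

Because each wizard step only changes the interpretation of the sample (format, timestamp, transforms, filters, schema) rather than the sample itself, threading the cache key through every call keeps the slow source read to a single round trip.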