Principle: Apache Druid Data Parsing
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, Data_Parsing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A data interpretation principle that applies a configured input format to transform raw bytes into structured rows with named columns.
Description
Data Parsing takes the raw sample data retrieved during source connection and applies an inputFormat specification to parse it into structured rows with named columns. Druid supports multiple input formats including JSON, CSV, TSV, Parquet, ORC, Avro, and regular expression-based formats.
The parsing step uses the Druid Sampler API with cached data to avoid re-reading the source. The parser is configured in the ioConfig.inputFormat section of the ingestion spec. Schema discovery mode (setting useSchemaDiscovery: true in the dimensionsSpec) automatically detects dimension names and types from the parsed data.
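As a minimal sketch, the relevant parts of a native batch ingestion spec sit under `ioConfig.inputFormat` and `dataSchema.dimensionsSpec`. The data source name, inline sample data, and timestamp column below are illustrative, and required fields unrelated to parsing are elided:

```python
# Illustrative fragment of a Druid native batch ingestion spec,
# showing only the parsing-related sections.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "inline",
                "data": '{"ts": "2024-01-01T00:00:00Z", "channel": "#en"}',
            },
            # The inputFormat tells Druid how to turn raw bytes into rows.
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "example",
            "timestampSpec": {"column": "ts", "format": "iso"},
            # Schema discovery: let Druid infer dimension names and types.
            "dimensionsSpec": {"useSchemaDiscovery": True},
        },
    },
}
```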
Usage
Use this principle after source connection succeeds and raw data is available. It is the second configuration step in the ingestion wizard, required before timestamp extraction, transforms, or schema definition can occur.
Theoretical Basis
Data parsing follows a format detection and application pattern:
RawData + InputFormat → ParsedRows[{column: value}]
InputFormat types:
- `json` → JSON object per line
- `csv` → Comma-separated with optional header
- `tsv` → Tab-separated with optional header
- `parquet` → Columnar binary format
- `orc` → Optimized Row Columnar format
- `avro` → Schema-based binary format
- `regex` → Regular expression extraction
The sampler applies the format server-side and returns both the raw input and parsed output for each row, enabling the user to verify correctness.
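A request to the sampler can be sketched with the standard library. The endpoint path is `/druid/indexer/v1/sampler`; the router address and the empty spec payload are placeholder assumptions, and the request is built but not sent:

```python
import json
import urllib.request

SAMPLER_PATH = "/druid/indexer/v1/sampler"


def build_sampler_request(sampler_spec: dict,
                          router_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a POST to the Druid sampler API.

    router_url is an assumed local address; adjust for your cluster.
    """
    return urllib.request.Request(
        router_url + SAMPLER_PATH,
        data=json.dumps(sampler_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request with urllib.request.urlopen(req) returns a JSON body
# whose "data" list pairs each sampled row's raw input with its parsed output.
req = build_sampler_request({"type": "index_parallel", "spec": {}})
```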