Principle: Apache Druid Data Parsing
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, Data_Parsing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A data interpretation principle that applies a configured input format to transform raw bytes into structured rows with named columns.
Description
Data Parsing takes the raw sample data retrieved during source connection and applies an inputFormat specification to parse it into structured rows with named columns. Druid supports multiple input formats including JSON, CSV, TSV, Parquet, ORC, Avro, and regular expression-based formats.
The parsing step uses the Druid Sampler API with cached data to avoid re-reading the source. The parser is configured in the ioConfig.inputFormat section of the ingestion spec. Schema discovery mode (setting useSchemaDiscovery: true in the dimensionsSpec) automatically detects dimension names and types from the parsed data.
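As a minimal sketch, the relevant parts of a native batch ingestion spec sit under `ioConfig.inputFormat` and `dataSchema.dimensionsSpec`. The data source name, inline sample data, and timestamp column below are illustrative, and required fields unrelated to parsing are elided:

```python
# Illustrative fragment of a Druid native batch ingestion spec,
# showing only the parsing-related sections.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "inline",
                "data": '{"ts": "2024-01-01T00:00:00Z", "channel": "#en"}',
            },
            # The inputFormat tells Druid how to turn raw bytes into rows.
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "example",
            "timestampSpec": {"column": "ts", "format": "iso"},
            # Schema discovery: let Druid infer dimension names and types.
            "dimensionsSpec": {"useSchemaDiscovery": True},
        },
    },
}
```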
Usage
Use this principle after source connection succeeds and raw data is available. It is the second configuration step in the ingestion wizard, required before timestamp extraction, transforms, or schema definition can occur.
Theoretical Basis
Data parsing follows a format detection and application pattern:
RawData + InputFormat → ParsedRows[{column: value}]
InputFormat types:
- `json` → JSON object per line
- `csv` → Comma-separated with optional header
- `tsv` → Tab-separated with optional header
- `parquet` → Columnar binary format
- `orc` → Optimized Row Columnar format
- `avro` → Schema-based binary format
- `regex` → Regular expression extraction
The sampler applies the format server-side and returns both the raw input and parsed output for each row, enabling the user to verify correctness.
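A request to the sampler can be sketched with the standard library. The endpoint path is `/druid/indexer/v1/sampler`; the router address and the empty spec payload are placeholder assumptions, and the request is built but not sent:

```python
import json
import urllib.request

SAMPLER_PATH = "/druid/indexer/v1/sampler"


def build_sampler_request(sampler_spec: dict,
                          router_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a POST to the Druid sampler API.

    router_url is an assumed local address; adjust for your cluster.
    """
    return urllib.request.Request(
        router_url + SAMPLER_PATH,
        data=json.dumps(sampler_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request with urllib.request.urlopen(req) returns a JSON body
# whose "data" list pairs each sampled row's raw input with its parsed output.
req = build_sampler_request({"type": "index_parallel", "spec": {}})
```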