
Principle:Apache Druid Data Parsing

From Leeroopedia


Knowledge Sources
Domains Data_Ingestion, Data_Parsing
Last Updated 2026-02-10 00:00 GMT

Overview

A data interpretation principle that applies a configured input format to transform raw bytes into structured rows with named columns.

Description

Data Parsing takes the raw sample data retrieved during source connection and applies an inputFormat specification to parse it into structured rows with named columns. Druid supports multiple input formats including JSON, CSV, TSV, Parquet, ORC, Avro, and regular expression-based formats.

The parsing step uses the Druid Sampler API with cached data to avoid re-reading the source. The parser is configured in the ioConfig.inputFormat section of the ingestion spec. Schema discovery mode (useSchemaDiscovery: true in the dimensionsSpec) automatically detects dimension names and types from the parsed data.
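As a minimal sketch of where these settings live, the fragment below builds an ingestion spec with the parser under ioConfig.inputFormat and schema discovery enabled in the dimensionsSpec. The datasource name and the choice of JSON input are illustrative assumptions, not taken from this page:

```python
import json

# Sketch of an ingestion spec fragment; "example_datasource" and the
# JSON input format are illustrative assumptions.
spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            # The parser is configured here: one inputFormat per spec.
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "example_datasource",
            "dimensionsSpec": {
                # Schema discovery: dimension names and types are
                # inferred from the parsed sample rows.
                "useSchemaDiscovery": True,
            },
        },
    },
}

print(json.dumps(spec["spec"]["ioConfig"]["inputFormat"]))
```

Only the inputFormat object changes when switching between formats; the surrounding spec structure stays the same.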

Usage

Use this principle after source connection succeeds and raw data is available. It is the second configuration step in the ingestion wizard, required before timestamp extraction, transforms, or schema definition can occur.

Theoretical Basis

Data parsing follows a format detection and application pattern:

RawData + InputFormat → ParsedRows[{column: value}]

InputFormat types:
  json    → JSON object per line
  csv     → Comma-separated with optional header
  tsv     → Tab-separated with optional header
  parquet → Columnar binary format
  orc     → Optimized Row Columnar format
  avro    → Schema-based binary format
  regex   → Regular expression extraction
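Each entry in the table above corresponds to an inputFormat object in the spec. A hedged sketch of a few of them follows; the option names match Druid's documented shapes, but the column names, delimiter handling, and regex pattern are illustrative assumptions:

```python
# Sketches of inputFormat objects for several format types.
json_format = {"type": "json"}

csv_format = {
    "type": "csv",
    "findColumnsFromHeader": True,  # take column names from the header row
}

tsv_format = {
    "type": "tsv",
    "findColumnsFromHeader": False,
    "columns": ["ts", "user", "latency_ms"],  # required when there is no header
}

regex_format = {
    "type": "regex",
    "pattern": r"^(\S+) (\S+)$",     # one capture group per column
    "columns": ["ts", "message"],
}

for fmt in (json_format, csv_format, tsv_format, regex_format):
    print(fmt["type"])
```

Binary formats (parquet, orc, avro) follow the same pattern but typically need no column list, since the schema is embedded in the file.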

The sampler applies the format server-side and returns both the raw input and parsed output for each row, enabling the user to verify correctness.
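The round-trip described above can be sketched as a POST to the sampler endpoint. The endpoint path and request wrapper follow Druid's sampler API; the host/port, the inner spec contents, and the samplerConfig values are assumptions for illustration:

```python
import json
import urllib.request

# Hypothetical cluster address; replace with your Router or Overlord URL.
SAMPLER_URL = "http://localhost:8888/druid/indexer/v1/sampler"

def build_sampler_request(inner_spec: dict, num_rows: int = 500) -> dict:
    """Wrap an ingestion spec body in a sampler request."""
    return {
        "type": "index_parallel",
        "spec": inner_spec,
        "samplerConfig": {"numRows": num_rows},
    }

def sample(inner_spec: dict) -> dict:
    """POST the spec to the sampler and return the parsed response.

    The response pairs each raw input row with its parsed form,
    e.g. {"numRowsRead": ..., "data": [{"input": ..., "parsed": ...}, ...]},
    which is what lets the user verify parsing correctness row by row.
    """
    body = json.dumps(build_sampler_request(inner_spec)).encode()
    req = urllib.request.Request(
        SAMPLER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Build (but do not send) an example request payload.
payload = build_sampler_request({"ioConfig": {"inputFormat": {"type": "json"}}})
print(payload["samplerConfig"]["numRows"])
```

Because the sampler caches the raw sample, re-submitting with a different inputFormat re-parses the same bytes rather than re-reading the source.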

Related Pages

Implemented By
