Heuristic: Heibaiying BigData Notes Hive ORC/Parquet Storage Tip
| Knowledge Sources | |
|---|---|
| Domains | Data_Warehouse, Optimization |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Use ORC or Parquet columnar storage formats instead of plain text files in Hive for significantly better query performance and storage efficiency.
Description
Hive supports multiple storage formats: TextFile (the default), SequenceFile, RCFile, ORC, and Parquet. Among these, ORC (Optimized Row Columnar) and Parquet are columnar formats that deliver dramatically better performance for analytical queries. ORC includes lightweight built-in indexing (row-group min/max indexes and optional Bloom filters) and supports predicate pushdown. Parquet is widely adopted across the Hadoop ecosystem (Spark, Impala, Presto) and offers excellent compression ratios. Plain text files offer no such optimizations and should be avoided in production data warehouses.
Usage
Use this heuristic when designing Hive table schemas or optimizing query performance. Apply when:
- Creating new Hive tables for analytical workloads
- Migrating existing text-based tables to columnar format
- Optimizing storage costs (ORC/Parquet compress 3-10x better than text)
- Queries scan only a subset of columns (columnar formats skip unneeded columns)
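For the migration case above, a minimal sketch: assuming an existing text-backed table (here called `logs_text`, a hypothetical name), a columnar copy can be created with CREATE TABLE AS SELECT:

```sql
-- Hypothetical migration: copy a text-backed table into ORC via CTAS.
-- Table names are illustrative; adjust to your schema.
CREATE TABLE logs_orc
STORED AS ORC
AS
SELECT * FROM logs_text;

-- Sanity-check row counts before dropping or renaming the old table.
SELECT COUNT(*) FROM logs_orc;
```

Note that plain CTAS produces an unpartitioned target; partitioned migrations need an explicit `CREATE TABLE` followed by `INSERT ... SELECT`.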
The Insight (Rule of Thumb)
- Action: Specify `STORED AS ORC` or `STORED AS PARQUET` in Hive `CREATE TABLE` statements.
- Value: ORC for Hive-native workloads with built-in indexing. Parquet for cross-ecosystem compatibility (Spark, Impala).
- Trade-off: Slight overhead on write (encoding/compression) but major savings on read. Text files are write-simple but read-slow.
- Optimization: Combine with partitioning (`PARTITIONED BY`) and bucketing (`CLUSTERED BY`) for maximum query performance.
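The Action and Optimization bullets combine into a single DDL statement. A sketch, with hypothetical table and column names:

```sql
-- New analytical table: columnar storage plus partitioning and bucketing.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)               -- prunes whole partitions at query time
CLUSTERED BY (user_id) INTO 32 BUCKETS   -- bucketing aids joins and sampling
STORED AS ORC;
```

In Hive DDL the clause order matters: `PARTITIONED BY` precedes `CLUSTERED BY ... INTO n BUCKETS`, and `STORED AS` comes last.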
Additional Hive optimization tips:
- Set `hive.map.aggr=true` for map-side pre-aggregation (trades memory for efficiency)
- Use `DISTRIBUTE BY` for controlled partitioning of intermediate data
- Avoid natural joins (use explicit `JOIN ON` syntax)
- Choose a partition count that balances too few partitions (large scans) against too many (filesystem/metastore overhead)
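The first two tips above can be sketched in a session, assuming the hypothetical `page_views` table; `DISTRIBUTE BY` is conventionally paired with `SORT BY` to order rows within each reducer:

```sql
-- Map-side pre-aggregation: hash-aggregate in mappers before the shuffle.
SET hive.map.aggr = true;

-- DISTRIBUTE BY routes all rows with the same key to the same reducer;
-- SORT BY then orders rows within each reducer (not globally).
SELECT user_id, ts
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, ts;
```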
Reasoning
Columnar formats store data by column rather than by row, which means analytical queries that only access a few columns out of many can skip reading irrelevant data entirely. ORC and Parquet both use run-length encoding, dictionary encoding, and block-level compression, achieving 3-10x compression over text. ORC's built-in indexing allows the query engine to skip entire row groups that don't match filter predicates, further reducing I/O. The BigData-Notes Hive documentation explicitly recommends ORC and Parquet as having the "best comprehensive performance" among all supported formats.