Heuristic: Heibaiying BigData Notes Hive ORC/Parquet Storage Tip
| Knowledge Sources | |
|---|---|
| Domains | Data_Warehouse, Optimization |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Use ORC or Parquet columnar storage formats instead of plain text files in Hive for significantly better query performance and storage efficiency.
Description
Hive supports multiple storage formats: TextFile (the default), SequenceFile, RCFile, ORC, and Parquet. Among these, ORC (Optimized Row Columnar) and Parquet are columnar formats that deliver dramatically better performance for analytical queries. ORC includes lightweight built-in indexing (row-group min/max indexes and optional Bloom filters) and supports predicate pushdown. Parquet is widely adopted across the Hadoop ecosystem (Spark, Impala, Presto) and offers excellent compression ratios. Plain text files offer no such optimizations and should be avoided in production data warehouses.
Usage
Use this heuristic when designing Hive table schemas or optimizing query performance. Apply when:
- Creating new Hive tables for analytical workloads
- Migrating existing text-based tables to columnar format
- Optimizing storage costs (ORC/Parquet compress 3-10x better than text)
- Queries scan only a subset of columns (columnar formats skip unneeded columns)
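For the migration case above, a minimal sketch: assuming an existing text-backed table (here called `logs_text`, a hypothetical name), a columnar copy can be created with CREATE TABLE AS SELECT:

```sql
-- Hypothetical migration: copy a text-backed table into ORC via CTAS.
-- Table names are illustrative; adjust to your schema.
CREATE TABLE logs_orc
STORED AS ORC
AS
SELECT * FROM logs_text;

-- Sanity-check row counts before dropping or renaming the old table.
SELECT COUNT(*) FROM logs_orc;
```

Note that plain CTAS produces an unpartitioned target; partitioned migrations need an explicit `CREATE TABLE` followed by `INSERT ... SELECT`.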
The Insight (Rule of Thumb)
- Action: Specify `STORED AS ORC` or `STORED AS PARQUET` in Hive `CREATE TABLE` statements.
- Value: ORC for Hive-native workloads with built-in indexing. Parquet for cross-ecosystem compatibility (Spark, Impala).
- Trade-off: Slight overhead on write (encoding/compression) but major savings on read. Text files are write-simple but read-slow.
- Optimization: Combine with partitioning (`PARTITIONED BY`) and bucketing (`CLUSTERED BY`) for maximum query performance.
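The Action and Optimization bullets combine into a single DDL statement. A sketch, with hypothetical table and column names:

```sql
-- New analytical table: columnar storage plus partitioning and bucketing.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)               -- prunes whole partitions at query time
CLUSTERED BY (user_id) INTO 32 BUCKETS   -- bucketing aids joins and sampling
STORED AS ORC;
```

In Hive DDL the clause order matters: `PARTITIONED BY` precedes `CLUSTERED BY ... INTO n BUCKETS`, and `STORED AS` comes last.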
Additional Hive optimization tips:
- Set `hive.map.aggr=true` for map-side pre-aggregation (trades memory for efficiency)
- Use `DISTRIBUTE BY` for controlled partitioning of intermediate data
- Avoid natural joins (use explicit `JOIN ON` syntax)
- Choose a partition count that balances too few partitions (large scans) against too many (filesystem/metastore overhead)
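The first two tips above can be sketched in a session, assuming the hypothetical `page_views` table; `DISTRIBUTE BY` is conventionally paired with `SORT BY` to order rows within each reducer:

```sql
-- Map-side pre-aggregation: hash-aggregate in mappers before the shuffle.
SET hive.map.aggr = true;

-- DISTRIBUTE BY routes all rows with the same key to the same reducer;
-- SORT BY then orders rows within each reducer (not globally).
SELECT user_id, ts
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, ts;
```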
Reasoning
Columnar formats store data by column rather than by row, which means analytical queries that only access a few columns out of many can skip reading irrelevant data entirely. ORC and Parquet both use run-length encoding, dictionary encoding, and block-level compression, achieving 3-10x compression over text. ORC's built-in indexing allows the query engine to skip entire row groups that don't match filter predicates, further reducing I/O. The BigData-Notes Hive documentation explicitly recommends ORC and Parquet as having the "best comprehensive performance" among all supported formats.