Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Heibaiying BigData Notes Hive Partitioning and Bucketing

From Leeroopedia


Knowledge Sources
Domains Data_Warehouse, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Partitioning and bucketing are Hive's two primary physical data organization strategies that enable query optimization through data pruning and efficient data distribution.

Description

Partitioning divides a table's data into discrete subdirectories on HDFS based on the values of one or more designated columns. Each unique combination of partition column values produces a separate directory. When a query includes a filter on a partition column, Hive performs partition pruning -- it skips all directories that do not match the filter, dramatically reducing the amount of data scanned.

For example, a table partitioned by year and month would have directories like: /warehouse/table/year=2025/month=01/, /warehouse/table/year=2025/month=02/, etc.

Bucketing (also called clustering) distributes data within a table or partition into a fixed number of files by applying a hash function to a specified column. Each hash bucket is stored as a separate file. Bucketing provides:

  • Efficient sampling: Queries can read a single bucket instead of the entire table.
  • Map-side joins: When two tables are bucketed on the same column with the same number of buckets, Hive can perform joins by matching corresponding bucket files, avoiding a full shuffle.
  • Sorted merge joins: When buckets are also sorted, Hive can exploit sort-merge join algorithms for even greater efficiency.

Partitioning and bucketing can be combined: a table can be partitioned by date and bucketed by user ID within each partition, achieving both coarse-grained pruning and fine-grained data distribution.

Usage

Use partitioning when:

  • Queries frequently filter on specific columns (e.g., date, region, status).
  • The partition column has moderate cardinality (hundreds to low thousands of distinct values).
  • Data naturally arrives in time-based or category-based batches.

Use bucketing when:

  • Tables are frequently joined on a specific column and you want to enable map-side joins.
  • You need efficient random sampling of the dataset.
  • The column used for filtering has very high cardinality (millions of values), making partitioning impractical.

Avoid partitioning on high-cardinality columns, as this creates an excessive number of small directories (the "small files problem"), degrading HDFS NameNode performance and query planning time.

Theoretical Basis

Partitioning in Hive implements a form of horizontal partitioning from relational database theory, where rows are divided into disjoint subsets based on column predicates. The query optimizer exploits partition metadata to eliminate irrelevant subsets before any data is read, achieving what is known as partition elimination or partition pruning.

Bucketing applies hash partitioning principles. Given a column value v and n buckets, the bucket assignment is:

bucket_id = hash(v) mod n

This guarantees uniform distribution (assuming a good hash function) and co-location of identical keys in the same file. The theoretical basis for map-side joins on bucketed tables relies on the property that all rows with the same join key reside in corresponding buckets across both tables, enabling a local (per-bucket) join without a global shuffle.

-- Partitioned table
CREATE TABLE events (
    event_id    INT,
    user_id     INT,
    event_type  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- Bucketed and sorted table
CREATE TABLE user_actions (
    action_id   INT,
    user_id     INT,
    action_type STRING,
    timestamp   BIGINT
)
CLUSTERED BY (user_id) SORTED BY (timestamp) INTO 32 BUCKETS
STORED AS ORC;

The choice of bucket count should consider the expected data volume: each bucket file should be large enough to benefit from HDFS block-level parallelism (typically 128-256 MB per file) but numerous enough to support parallel map tasks.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment