Implementation:Heibaiying BigData Notes Hive Partition and Bucket Config

Knowledge Sources	BigData-Notes Hive Language Manual
Domains	Data_Warehouse, Big_Data
Last Updated	2026-02-10 10:00 GMT

Overview

Concrete tool for configuring table partitioning and bucketing strategies in Apache Hive to optimize query performance and data organization.

Description

Hive provides two complementary DDL mechanisms for physically organizing table data on HDFS:

Partitioning uses the PARTITIONED BY clause to split table data into subdirectories based on column values. Partitions can be added statically (specifying values explicitly) or dynamically (derived from query data during INSERT). Partition management includes adding, dropping, and listing partitions via ALTER TABLE and SHOW PARTITIONS.

Bucketing uses the CLUSTERED BY clause to hash-distribute rows into a fixed number of files within each partition (or within the table if unpartitioned). The optional SORTED BY clause ensures that data within each bucket file is sorted, enabling sort-merge join optimizations. Bucketing requires setting hive.enforce.bucketing=true to ensure the correct number of reducers during INSERT operations.

Usage

Use partitioning for time-series data (daily, monthly), geographic regions, or any column that is commonly used in WHERE clauses. Use bucketing for columns used in JOIN conditions or when efficient sampling is needed. Combine both for large analytical tables that benefit from both partition pruning and bucket-level join optimization.

Code Reference

Source Location

Repository: BigData-Notes
File: notes/Hive分区表和分桶表.md (Lines 1-171)

Signature

-- Create a partitioned table
CREATE TABLE table_name (col_name data_type, ...)
PARTITIONED BY (partition_col data_type [COMMENT 'comment'], ...)
[STORED AS file_format];

-- Create a bucketed table
CREATE TABLE table_name (col_name data_type, ...)
CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS
[STORED AS file_format];

-- Add a partition
ALTER TABLE table_name ADD [IF NOT EXISTS]
  PARTITION (partition_col=value, ...) [LOCATION 'hdfs_path'];

-- Drop a partition
ALTER TABLE table_name DROP [IF EXISTS]
  PARTITION (partition_col=value, ...);

-- List partitions
SHOW PARTITIONS table_name;

-- Enable dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Enable bucketing enforcement
SET hive.enforce.bucketing=true;

Import

N/A (HiveQL statements executed via Hive CLI or Beeline)

I/O Contract

Inputs

Name	Type	Required	Description
table_name	STRING	Yes	Name of the table to partition or bucket
PARTITIONED BY	Column definitions	No (for bucketing)	One or more columns that define the partitioning scheme; these columns are not part of the data files but are encoded in directory names
CLUSTERED BY	Column names	No (for partitioning)	One or more columns used as the hash key for bucket assignment
SORTED BY	Column names + order	No	Columns and sort direction within each bucket file
INTO N BUCKETS	INT	Required with CLUSTERED BY	Number of bucket files to create per partition
Partition values	Typed values	Yes (for ADD/DROP PARTITION)	Specific values for each partition column
LOCATION	STRING (HDFS path)	No	Custom HDFS path for a specific partition

Outputs

Name	Type	Description
Partition directories	HDFS directories	Subdirectories named `col=value` under the table directory, one per distinct partition value combination
Bucket files	HDFS files	Numbered files (000000_0, 000001_0, ...) within each partition directory, one per bucket
Partition list	Result set	Output of SHOW PARTITIONS listing all partition specifications
Metastore entries	Metadata	Partition and bucketing metadata registered in the Hive metastore

Usage Examples

-- Create a partitioned table by year and month
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INT,
    product_id  INT,
    amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_year INT, sale_month INT)
STORED AS ORC;

-- Add partitions manually
ALTER TABLE sales ADD IF NOT EXISTS
  PARTITION (sale_year=2025, sale_month=1)
  PARTITION (sale_year=2025, sale_month=2);

-- Load data into a specific partition
LOAD DATA INPATH '/staging/sales_2025_01.orc'
INTO TABLE sales PARTITION (sale_year=2025, sale_month=1);

-- Dynamic partition insert from a source table
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE sales PARTITION (sale_year, sale_month)
SELECT sale_id, customer_id, product_id, amount,
       YEAR(sale_date) AS sale_year,
       MONTH(sale_date) AS sale_month
FROM raw_sales;

-- List all partitions
SHOW PARTITIONS sales;

-- Create a bucketed and sorted table
SET hive.enforce.bucketing=true;

CREATE TABLE user_sessions (
    session_id  STRING,
    user_id     INT,
    page_views  INT,
    duration    INT,
    start_time  TIMESTAMP
)
CLUSTERED BY (user_id) SORTED BY (start_time ASC) INTO 64 BUCKETS
STORED AS ORC;

-- Insert data respecting bucketing
INSERT OVERWRITE TABLE user_sessions
SELECT session_id, user_id, page_views, duration, start_time
FROM raw_sessions;

-- Combine partitioning and bucketing
CREATE TABLE event_log (
    event_id    BIGINT,
    user_id     INT,
    event_type  STRING,
    payload     STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Drop a partition
ALTER TABLE sales DROP IF EXISTS PARTITION (sale_year=2024, sale_month=1);

Related Pages

Implements Principle

Principle:Heibaiying_BigData_Notes_Hive_Partitioning_and_Bucketing

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment