Implementation:Heibaiying BigData Notes Hive Partition and Bucket Config
| Knowledge Sources | |
|---|---|
| Domains | Data_Warehouse, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Concrete tool for configuring table partitioning and bucketing strategies in Apache Hive to optimize query performance and data organization.
Description
Hive provides two complementary DDL mechanisms for physically organizing table data on HDFS:
Partitioning uses the PARTITIONED BY clause to split table data into subdirectories based on column values. Partitions can be added statically (specifying values explicitly) or dynamically (derived from query data during INSERT). Partition management includes adding, dropping, and listing partitions via ALTER TABLE and SHOW PARTITIONS.
Bucketing uses the CLUSTERED BY clause to hash-distribute rows into a fixed number of files within each partition (or within the table if unpartitioned). The optional SORTED BY clause ensures that data within each bucket file is sorted, enabling sort-merge join optimizations. Bucketing requires setting hive.enforce.bucketing=true to ensure the correct number of reducers during INSERT operations.
Usage
Use partitioning for time-series data (daily, monthly), geographic regions, or any column that is commonly used in WHERE clauses. Use bucketing for columns used in JOIN conditions or when efficient sampling is needed. Combine both for large analytical tables that benefit from both partition pruning and bucket-level join optimization.
Code Reference
Source Location
- Repository: BigData-Notes
- File:
notes/Hive分区表和分桶表.md(Lines 1-171)
Signature
-- Create a partitioned table
CREATE TABLE table_name (col_name data_type, ...)
PARTITIONED BY (partition_col data_type [COMMENT 'comment'], ...)
[STORED AS file_format];
-- Create a bucketed table
CREATE TABLE table_name (col_name data_type, ...)
CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS
[STORED AS file_format];
-- Add a partition
ALTER TABLE table_name ADD [IF NOT EXISTS]
PARTITION (partition_col=value, ...) [LOCATION 'hdfs_path'];
-- Drop a partition
ALTER TABLE table_name DROP [IF EXISTS]
PARTITION (partition_col=value, ...);
-- List partitions
SHOW PARTITIONS table_name;
-- Enable dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Enable bucketing enforcement
SET hive.enforce.bucketing=true;
Import
N/A (HiveQL statements executed via Hive CLI or Beeline)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table_name | STRING | Yes | Name of the table to partition or bucket |
| PARTITIONED BY | Column definitions | No (for bucketing) | One or more columns that define the partitioning scheme; these columns are not part of the data files but are encoded in directory names |
| CLUSTERED BY | Column names | No (for partitioning) | One or more columns used as the hash key for bucket assignment |
| SORTED BY | Column names + order | No | Columns and sort direction within each bucket file |
| INTO N BUCKETS | INT | Required with CLUSTERED BY | Number of bucket files to create per partition |
| Partition values | Typed values | Yes (for ADD/DROP PARTITION) | Specific values for each partition column |
| LOCATION | STRING (HDFS path) | No | Custom HDFS path for a specific partition |
Outputs
| Name | Type | Description |
|---|---|---|
| Partition directories | HDFS directories | Subdirectories named col=value under the table directory, one per distinct partition value combination
|
| Bucket files | HDFS files | Numbered files (000000_0, 000001_0, ...) within each partition directory, one per bucket |
| Partition list | Result set | Output of SHOW PARTITIONS listing all partition specifications |
| Metastore entries | Metadata | Partition and bucketing metadata registered in the Hive metastore |
Usage Examples
-- Create a partitioned table by year and month
CREATE TABLE sales (
sale_id BIGINT,
customer_id INT,
product_id INT,
amount DECIMAL(10,2)
)
PARTITIONED BY (sale_year INT, sale_month INT)
STORED AS ORC;
-- Add partitions manually
ALTER TABLE sales ADD IF NOT EXISTS
PARTITION (sale_year=2025, sale_month=1)
PARTITION (sale_year=2025, sale_month=2);
-- Load data into a specific partition
LOAD DATA INPATH '/staging/sales_2025_01.orc'
INTO TABLE sales PARTITION (sale_year=2025, sale_month=1);
-- Dynamic partition insert from a source table
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (sale_year, sale_month)
SELECT sale_id, customer_id, product_id, amount,
YEAR(sale_date) AS sale_year,
MONTH(sale_date) AS sale_month
FROM raw_sales;
-- List all partitions
SHOW PARTITIONS sales;
-- Create a bucketed and sorted table
SET hive.enforce.bucketing=true;
CREATE TABLE user_sessions (
session_id STRING,
user_id INT,
page_views INT,
duration INT,
start_time TIMESTAMP
)
CLUSTERED BY (user_id) SORTED BY (start_time ASC) INTO 64 BUCKETS
STORED AS ORC;
-- Insert data respecting bucketing
INSERT OVERWRITE TABLE user_sessions
SELECT session_id, user_id, page_views, duration, start_time
FROM raw_sessions;
-- Combine partitioning and bucketing
CREATE TABLE event_log (
event_id BIGINT,
user_id INT,
event_type STRING,
payload STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
-- Drop a partition
ALTER TABLE sales DROP IF EXISTS PARTITION (sale_year=2024, sale_month=1);