Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Heibaiying BigData Notes Hive Partition and Bucket Config

From Leeroopedia


Knowledge Sources
Domains Data_Warehouse, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Concrete tool for configuring table partitioning and bucketing strategies in Apache Hive to optimize query performance and data organization.

Description

Hive provides two complementary DDL mechanisms for physically organizing table data on HDFS:

Partitioning uses the PARTITIONED BY clause to split table data into subdirectories based on column values. Partitions can be added statically (specifying values explicitly) or dynamically (derived from query data during INSERT). Partition management includes adding, dropping, and listing partitions via ALTER TABLE and SHOW PARTITIONS.

Bucketing uses the CLUSTERED BY clause to hash-distribute rows into a fixed number of files within each partition (or within the table if unpartitioned). The optional SORTED BY clause ensures that data within each bucket file is sorted, enabling sort-merge join optimizations. Bucketing requires setting hive.enforce.bucketing=true to ensure the correct number of reducers during INSERT operations.

Usage

Use partitioning for time-series data (daily, monthly), geographic regions, or any column that is commonly used in WHERE clauses. Use bucketing for columns used in JOIN conditions or when efficient sampling is needed. Combine both for large analytical tables that benefit from both partition pruning and bucket-level join optimization.

Code Reference

Source Location

  • Repository: BigData-Notes
  • File: notes/Hive分区表和分桶表.md (Lines 1-171)

Signature

-- Create a partitioned table
CREATE TABLE table_name (col_name data_type, ...)
PARTITIONED BY (partition_col data_type [COMMENT 'comment'], ...)
[STORED AS file_format];

-- Create a bucketed table
CREATE TABLE table_name (col_name data_type, ...)
CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS
[STORED AS file_format];

-- Add a partition
ALTER TABLE table_name ADD [IF NOT EXISTS]
  PARTITION (partition_col=value, ...) [LOCATION 'hdfs_path'];

-- Drop a partition
ALTER TABLE table_name DROP [IF EXISTS]
  PARTITION (partition_col=value, ...);

-- List partitions
SHOW PARTITIONS table_name;

-- Enable dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Enable bucketing enforcement
SET hive.enforce.bucketing=true;

Import

N/A (HiveQL statements executed via Hive CLI or Beeline)

I/O Contract

Inputs

Name Type Required Description
table_name STRING Yes Name of the table to partition or bucket
PARTITIONED BY Column definitions No (for bucketing) One or more columns that define the partitioning scheme; these columns are not part of the data files but are encoded in directory names
CLUSTERED BY Column names No (for partitioning) One or more columns used as the hash key for bucket assignment
SORTED BY Column names + order No Columns and sort direction within each bucket file
INTO N BUCKETS INT Required with CLUSTERED BY Number of bucket files to create per partition
Partition values Typed values Yes (for ADD/DROP PARTITION) Specific values for each partition column
LOCATION STRING (HDFS path) No Custom HDFS path for a specific partition

Outputs

Name Type Description
Partition directories HDFS directories Subdirectories named col=value under the table directory, one per distinct partition value combination
Bucket files HDFS files Numbered files (000000_0, 000001_0, ...) within each partition directory, one per bucket
Partition list Result set Output of SHOW PARTITIONS listing all partition specifications
Metastore entries Metadata Partition and bucketing metadata registered in the Hive metastore

Usage Examples

-- Create a partitioned table by year and month
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INT,
    product_id  INT,
    amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_year INT, sale_month INT)
STORED AS ORC;

-- Add partitions manually
ALTER TABLE sales ADD IF NOT EXISTS
  PARTITION (sale_year=2025, sale_month=1)
  PARTITION (sale_year=2025, sale_month=2);

-- Load data into a specific partition
LOAD DATA INPATH '/staging/sales_2025_01.orc'
INTO TABLE sales PARTITION (sale_year=2025, sale_month=1);

-- Dynamic partition insert from a source table
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE sales PARTITION (sale_year, sale_month)
SELECT sale_id, customer_id, product_id, amount,
       YEAR(sale_date) AS sale_year,
       MONTH(sale_date) AS sale_month
FROM raw_sales;

-- List all partitions
SHOW PARTITIONS sales;

-- Create a bucketed and sorted table
SET hive.enforce.bucketing=true;

CREATE TABLE user_sessions (
    session_id  STRING,
    user_id     INT,
    page_views  INT,
    duration    INT,
    start_time  TIMESTAMP
)
CLUSTERED BY (user_id) SORTED BY (start_time ASC) INTO 64 BUCKETS
STORED AS ORC;

-- Insert data respecting bucketing
INSERT OVERWRITE TABLE user_sessions
SELECT session_id, user_id, page_views, duration, start_time
FROM raw_sessions;

-- Combine partitioning and bucketing
CREATE TABLE event_log (
    event_id    BIGINT,
    user_id     INT,
    event_type  STRING,
    payload     STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Drop a partition
ALTER TABLE sales DROP IF EXISTS PARTITION (sale_year=2024, sale_month=1);

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment