Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Heibaiying BigData Notes Hive Load and Insert Data

From Leeroopedia


Knowledge Sources
Domains Data_Warehouse, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Concrete tool for loading data from files and inserting query results into Hive tables provided by Apache Hive.

Description

Hive provides two primary DML mechanisms for populating tables with data:

LOAD DATA performs a filesystem-level operation that moves (from HDFS) or copies (from local) files into a table's data directory. No MapReduce or Tez job is triggered, making it a fast bulk ingestion method. The data files must already be formatted to match the table's SerDe and storage format.

INSERT INTO / INSERT OVERWRITE executes a SELECT query and writes the results into a target table or partition. This triggers a distributed computation job (MapReduce or Tez) and supports arbitrary transformations, filtering, and aggregation during the load process. The multi-table insert variant allows a single source scan to populate multiple target tables in one pass.

INSERT OVERWRITE DIRECTORY exports query results to an HDFS or local filesystem directory, enabling data extraction from Hive for external consumption.

Usage

Use LOAD DATA for initial bulk ingestion of pre-formatted files. Use INSERT INTO/OVERWRITE for ETL transformations, populating derived tables, or loading partitioned tables with dynamically determined partition values. Use INSERT OVERWRITE DIRECTORY to export data for downstream systems.

Code Reference

Source Location

  • Repository: BigData-Notes
  • File: notes/Hive常用DML操作.md (Lines 1-332)

Signature

-- Load data from local filesystem
LOAD DATA LOCAL INPATH 'local_path'
[OVERWRITE] INTO TABLE table_name
[PARTITION (col=value, ...)];

-- Load data from HDFS
LOAD DATA INPATH 'hdfs_path'
[OVERWRITE] INTO TABLE table_name
[PARTITION (col=value, ...)];

-- Insert query results into a table
INSERT INTO TABLE table_name [PARTITION (col=value, ...)]
SELECT ... FROM source_table [WHERE ...];

-- Overwrite table data with query results
INSERT OVERWRITE TABLE table_name [PARTITION (col=value, ...)]
SELECT ... FROM source_table [WHERE ...];

-- Multi-table insert from a single source
FROM source_table
INSERT INTO TABLE target1 [PARTITION (...)]
    SELECT col1, col2 WHERE condition1
INSERT INTO TABLE target2 [PARTITION (...)]
    SELECT col3, col4 WHERE condition2;

-- Export to HDFS directory
INSERT OVERWRITE DIRECTORY 'hdfs_path'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT ... FROM source_table;

-- Export to local directory
INSERT OVERWRITE LOCAL DIRECTORY 'local_path'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT ... FROM source_table;

Import

N/A (HiveQL statements executed via Hive CLI or Beeline)

I/O Contract

Inputs

Name Type Required Description
LOCAL INPATH / INPATH STRING (file path) Yes (for LOAD) Path to the source data file or directory; LOCAL means the local filesystem of the client machine, otherwise HDFS
table_name STRING Yes Target table to load data into
PARTITION Column=value pairs No Specific partition to load into; required for partitioned tables unless using dynamic partitioning
OVERWRITE Keyword No When specified, existing data in the target table/partition is replaced; without it, data is appended
SELECT query HiveQL Yes (for INSERT) Query defining the data to insert, including any transformations, filters, or aggregations
DIRECTORY path STRING Yes (for export) Target HDFS or local directory for exported data
ROW FORMAT SerDe config No (for export) Output format for exported data

Outputs

Name Type Description
Data files HDFS files Files written to the target table's HDFS directory (or partition subdirectory)
Exported files HDFS/local files Query results written to the specified directory when using INSERT OVERWRITE DIRECTORY
Job statistics Console output MapReduce/Tez job counters and statistics (for INSERT operations only)

Usage Examples

-- Load a CSV file from local filesystem
LOAD DATA LOCAL INPATH '/home/data/products.csv'
INTO TABLE products;

-- Load an ORC file from HDFS, overwriting existing data in a partition
LOAD DATA INPATH '/staging/orders_2025_01.orc'
OVERWRITE INTO TABLE orders PARTITION (order_month='2025-01');

-- Insert transformed data into a summary table
INSERT INTO TABLE daily_revenue
SELECT
    order_date,
    region,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
WHERE order_date = '2025-01-15'
GROUP BY order_date, region;

-- Overwrite a partition with fresh data
INSERT OVERWRITE TABLE monthly_stats PARTITION (month='2025-01')
SELECT
    category,
    COUNT(*) AS item_count,
    SUM(sales) AS total_sales
FROM transactions
WHERE trans_month = '2025-01'
GROUP BY category;

-- Multi-table insert: split raw events into typed tables
FROM raw_events
INSERT INTO TABLE click_events PARTITION (event_date)
    SELECT user_id, page_url, click_time, event_date
    WHERE event_type = 'click'
INSERT INTO TABLE purchase_events PARTITION (event_date)
    SELECT user_id, product_id, amount, event_date
    WHERE event_type = 'purchase'
INSERT INTO TABLE login_events PARTITION (event_date)
    SELECT user_id, ip_address, login_time, event_date
    WHERE event_type = 'login';

-- Export query results to HDFS as CSV
INSERT OVERWRITE DIRECTORY '/exports/customer_report'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT customer_id, name, total_spent
FROM customer_summary
WHERE total_spent > 5000
ORDER BY total_spent DESC;

-- Export to local filesystem
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM products WHERE category = 'Electronics';

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment