Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Heibaiying BigData Notes Hive Data Loading

From Leeroopedia
Revision as of 18:17, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Heibaiying_BigData_Notes_Hive_Data_Loading.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Data_Warehouse, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Hive data loading encompasses the mechanisms for ingesting data from local filesystems and HDFS into Hive tables, as well as inserting query results into target tables.

Description

Apache Hive provides multiple pathways for loading data into tables, each with different semantics and performance characteristics:

LOAD DATA is a pure file-level operation. It moves (or copies) files from a source path into the target table's HDFS directory. No data transformation, validation, or MapReduce execution occurs during a LOAD operation. This makes it extremely fast but also means that the data must already be in the format expected by the table's SerDe and storage configuration.

  • LOAD DATA LOCAL INPATH: Copies files from the local filesystem of the machine running the Hive client into the table's HDFS directory.
  • LOAD DATA INPATH: Moves files from an HDFS path into the table's HDFS directory. The source files are removed from their original location.
  • The OVERWRITE keyword replaces existing data in the table (or partition); without it, files are appended.

INSERT INTO/OVERWRITE executes a query (typically a SELECT statement) and writes the results into the target table. This triggers a MapReduce or Tez job to process the data. INSERT operations support:

  • INSERT INTO: Appends results to existing data.
  • INSERT OVERWRITE: Replaces all existing data in the target table or partition.
  • Multi-table insert: A single source scan can populate multiple target tables in one pass, reducing I/O overhead.

INSERT OVERWRITE DIRECTORY exports query results to an HDFS or local filesystem directory, enabling data extraction from Hive.

Usage

Use LOAD DATA when:

  • Ingesting pre-formatted files (CSV, TSV, JSON) that match the table schema.
  • Performing initial bulk data loads where speed is critical and no transformation is needed.
  • Moving data from staging HDFS directories into Hive-managed tables.

Use INSERT INTO/OVERWRITE when:

  • Transforming, filtering, or aggregating data during the load process.
  • Populating derived tables from existing Hive tables.
  • Loading data into partitioned tables based on column values derived from the source.

Use Multi-table INSERT when:

  • A single source dataset feeds multiple target tables (e.g., routing raw data into fact and dimension tables simultaneously).

Theoretical Basis

The two loading mechanisms represent fundamentally different approaches in the ETL (Extract, Transform, Load) paradigm:

LOAD DATA follows an ELT (Extract, Load, Transform) pattern -- data is first loaded as-is and transformations happen in subsequent queries. This leverages Hive's schema-on-read: the physical data does not need to conform to any schema at load time; interpretation happens at query time. The operation is O(1) with respect to data volume since it only manipulates filesystem metadata (moving file pointers).

INSERT ... SELECT follows a traditional ETL pattern -- transformation and loading happen in a single step. The computational cost is proportional to the input data size, as a distributed job processes every record.

-- Load a local file into a table
LOAD DATA LOCAL INPATH '/home/data/employees.csv'
INTO TABLE employees;

-- Load an HDFS file, overwriting existing data
LOAD DATA INPATH '/staging/sales_2025.orc'
OVERWRITE INTO TABLE sales PARTITION (year='2025');

-- Insert query results into a target table
INSERT INTO TABLE daily_summary
SELECT region, COUNT(*), SUM(amount)
FROM transactions
WHERE trans_date = '2025-01-15'
GROUP BY region;

-- Multi-table insert from a single scan
FROM raw_events
INSERT INTO TABLE user_events
    SELECT user_id, event_type, ts WHERE event_category = 'user'
INSERT INTO TABLE system_events
    SELECT system_id, event_type, ts WHERE event_category = 'system';

The OVERWRITE semantics provide idempotent load behavior: re-running the same load produces the same result regardless of how many times it is executed. This property is essential for reliable data pipelines that may need to retry failed operations.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment