Principle: Hive Table Definition (Heibaiying BigData Notes)
| Knowledge Sources | |
|---|---|
| Domains | Data_Warehouse, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Table definition in Hive encompasses specifying column schemas, row formats, storage formats, and choosing between internal (managed) and external table types.
Description
A table in Hive is a structured dataset stored on HDFS whose schema is registered in the Hive metastore. Defining a table requires specifying its columns and data types, the serialization/deserialization format (SerDe), the storage file format, and the table type.
Hive distinguishes between two fundamental table types:
- Internal (Managed) Tables: Hive owns the data. When the table is dropped, Hive deletes the underlying HDFS data directory. These are suitable for intermediate or derived data that Hive fully controls.
- External Tables: Hive only manages the metadata. The underlying HDFS data is not deleted when the table is dropped. These are ideal for shared datasets or raw ingestion directories where data is produced and consumed by multiple systems.
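The table type is not fixed at creation time: a managed table can be converted to external (and back) by flipping the `EXTERNAL` table property. A minimal sketch, assuming a hypothetical existing table named `staging_events` (note that the property value must be the uppercase string `'TRUE'`/`'FALSE'`):

```sql
-- Inspect the current type (Table Type: MANAGED_TABLE or EXTERNAL_TABLE)
DESCRIBE FORMATTED staging_events;

-- Convert managed -> external: DROP TABLE will no longer delete the HDFS data
ALTER TABLE staging_events SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- Convert back to managed: Hive resumes ownership of the data lifecycle
ALTER TABLE staging_events SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```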
Hive supports a rich set of data types:
- Primitive types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, STRING, VARCHAR, CHAR, BOOLEAN, TIMESTAMP, DATE, BINARY
- Complex types: ARRAY (ordered collection), MAP (key-value pairs), STRUCT (named fields), UNIONTYPE (heterogeneous types)
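Complex-type columns are declared with angle-bracket syntax and read with index, key, and dot access. A sketch using a hypothetical `employees` table:

```sql
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                      -- ordered collection
  scores  MAP<STRING, INT>,                   -- key-value pairs
  address STRUCT<city: STRING, zip: STRING>   -- named fields
);

-- Element access: arrays by index, maps by key, structs by field name
SELECT
  name,
  skills[0]      AS primary_skill,
  scores['hive'] AS hive_score,
  address.city   AS city
FROM employees;
```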
Row format defines how columns are delimited within each record. The most common approach uses DELIMITED format with field, collection, and map key terminators. For more complex serialization, custom SerDe classes (e.g., JSON SerDe, RegEx SerDe) can be specified.
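As an illustration of the SerDe route, the sketch below declares an external table over newline-delimited JSON using the `JsonSerDe` that ships with the hive-hcatalog module (the table name and location are assumptions; the SerDe class must be on Hive's classpath):

```sql
-- Each line of the underlying files is one JSON object;
-- the JsonSerDe maps JSON keys to the declared column names.
CREATE EXTERNAL TABLE events_json (
  event_id   STRING,
  event_time STRING,
  payload    MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/raw/events_json';
```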
Storage format determines how data is physically stored on HDFS:
- TextFile: Plain text, human-readable, but inefficient for large-scale analytics.
- ORC (Optimized Row Columnar): Columnar format with built-in compression, predicate pushdown, and ACID support.
- Parquet: Columnar format popular in the Hadoop ecosystem, efficient for nested data structures.
- SequenceFile: Hadoop's binary key-value format with block compression support.
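The storage format is selected with the STORED AS clause, and the compression codec can be tuned per table via TBLPROPERTIES. A sketch with hypothetical table names, using the standard `orc.compress` and `parquet.compression` property keys:

```sql
-- ORC with an explicit compression codec (ZLIB is the ORC default)
CREATE TABLE metrics_orc (
  metric_name STRING,
  metric_val  DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Parquet with Snappy compression
CREATE TABLE metrics_parquet (
  metric_name STRING,
  metric_val  DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```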
Usage
Use table definition when:
- Creating a new data landing area for ingested files (external table).
- Defining derived or aggregated datasets fully managed by Hive (internal table).
- Specifying how Hive should interpret the row format of delimited, JSON, or other structured files.
- Choosing a columnar storage format (ORC, Parquet) for analytical query performance.
Theoretical Basis
Hive table definition follows the relational model's concept of a relation schema: a named set of attributes, each with a defined domain (data type). However, Hive extends this with schema-on-read semantics: the physical file does not enforce the schema; instead, the schema is applied when data is read. This allows flexible evolution and tolerates semi-structured data.
The distinction between managed and external tables maps to the concept of data ownership. In managed mode, the system (Hive) is the sole authority over the data lifecycle. In external mode, Hive acts as a metadata overlay on data managed externally, following the principle of separation of concerns.
-- Internal (managed) table with ORC storage
-- (ORC defines its own serialization, so no ROW FORMAT DELIMITED clause is needed;
-- field/collection/map delimiters only apply to text-based formats)
CREATE TABLE orders (
  order_id     INT,
  customer_id  INT,
  order_date   TIMESTAMP,
  total_amount DECIMAL(10,2),
  items        ARRAY<STRING>,
  attributes   MAP<STRING, STRING>
)
STORED AS ORC;
-- External table pointing to existing HDFS data
CREATE EXTERNAL TABLE raw_logs (
  log_line STRING
)
STORED AS TEXTFILE
LOCATION '/data/raw/logs';
The choice of storage format involves a trade-off between write performance (TextFile is fastest to write) and read performance: columnar formats like ORC and Parquet enable column pruning (reading only the columns a query references) and predicate pushdown (skipping row groups using stored statistics), significantly reducing I/O.
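In practice, raw text data is often converted once into a columnar table for repeated analysis. A sketch using CREATE TABLE ... AS SELECT (CTAS) against the `raw_logs` table defined above (the target table name is an assumption):

```sql
-- One-off conversion: materialize a text-format table into ORC
CREATE TABLE logs_orc
STORED AS ORC
AS
SELECT log_line FROM raw_logs;
```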