Implementation:Heibaiying BigData Notes Hive Select Query

Knowledge Sources	BigData-Notes Hive Language Manual
Domains	Data_Warehouse, Big_Data
Last Updated	2026-02-10 10:00 GMT

Overview

Concrete tool for querying data using HiveQL SELECT statements, including filtering, aggregation, sorting, and joins provided by Apache Hive.

Description

The SELECT statement is the primary data retrieval mechanism in HiveQL. It supports the full range of SQL query capabilities adapted for distributed execution: column projection, row filtering with WHERE, grouping and aggregation with GROUP BY and HAVING, result ordering, and multi-table joins.

Hive provides four distinct sorting/distribution clauses that control how results are ordered and distributed across reducers:

ORDER BY: Global total ordering using a single reducer.
SORT BY: Local ordering within each reducer.
DISTRIBUTE BY: Hash-based distribution of rows across reducers without sorting.
CLUSTER BY: Combined DISTRIBUTE BY and SORT BY on the same columns.

Hive supports six join types: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, CROSS JOIN, and LEFT SEMI JOIN. The query optimizer automatically selects the join strategy (shuffle join, map-side join, bucket map join, or sort-merge-bucket join) based on table sizes, bucketing configuration, and session settings.

Usage

Use SELECT queries for ad-hoc data analysis, building ETL pipelines, generating reports, and performing data validation checks. The choice of sorting clause and join type significantly impacts performance and should be aligned with the specific query requirements.

Code Reference

Source Location

Repository: BigData-Notes
File: notes/Hive数据查询详解.md (Lines 1-397)

Signature

-- Basic SELECT syntax
SELECT [ALL | DISTINCT] col1 [AS alias1], col2, ...
FROM table_name [alias]
[WHERE condition]
[GROUP BY col1, col2, ...]
[HAVING group_condition]
[ORDER BY col [ASC|DESC], ...]
[SORT BY col [ASC|DESC], ...]
[DISTRIBUTE BY col, ...]
[CLUSTER BY col, ...]
[LIMIT n];

-- JOIN syntax
SELECT ...
FROM table1 [alias1]
[INNER | LEFT OUTER | RIGHT OUTER | FULL OUTER | CROSS] JOIN table2 [alias2]
  ON alias1.col = alias2.col
[WHERE ...];

-- LEFT SEMI JOIN (EXISTS equivalent)
SELECT ...
FROM table1
LEFT SEMI JOIN table2 ON table1.col = table2.col;

Import

N/A (HiveQL statements executed via Hive CLI or Beeline)

I/O Contract

Inputs

Name	Type	Required	Description
table_name	STRING	Yes	Source table(s) to query; supports aliasing
Column list	Column names or expressions	Yes	Columns to project; supports `*` for all columns, expressions, and aggregate functions (COUNT, SUM, AVG, MIN, MAX)
WHERE condition	Boolean expression	No	Row-level filter applied before aggregation
GROUP BY columns	Column names	No	Columns to group by for aggregation
HAVING condition	Boolean expression	No	Group-level filter applied after aggregation
ORDER BY / SORT BY	Column names + direction	No	Sorting specification; ORDER BY for global sort, SORT BY for per-reducer sort
DISTRIBUTE BY	Column names	No	Hash-based row distribution across reducers
CLUSTER BY	Column names	No	Combined distribution and per-reducer sorting
JOIN ... ON	Table + condition	No	Join specification with join type and equi-join condition
LIMIT	INT	No	Maximum number of rows to return

Outputs

Name	Type	Description
Result set	Rows and columns	Query results matching the projection, filtering, and aggregation criteria
Job statistics	Console output	MapReduce/Tez job counters including rows read, rows written, and execution time

Usage Examples

-- Basic query with filtering and aggregation
SELECT
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY region
HAVING SUM(amount) > 50000
ORDER BY total_revenue DESC
LIMIT 10;

-- DISTRIBUTE BY and SORT BY for controlled distribution
SELECT user_id, action_type, action_time
FROM user_actions
DISTRIBUTE BY user_id
SORT BY action_time ASC;

-- CLUSTER BY (equivalent to DISTRIBUTE BY + SORT BY on same column)
SELECT department, employee_id, salary
FROM employees
CLUSTER BY department;

-- Inner join between two tables
SELECT
    o.order_id,
    c.customer_name,
    o.amount,
    o.order_date
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01';

-- Left outer join with multiple tables
SELECT
    o.order_id,
    c.customer_name,
    p.product_name,
    o.quantity,
    o.unit_price
FROM orders o
LEFT OUTER JOIN customers c ON o.customer_id = c.customer_id
LEFT OUTER JOIN products p ON o.product_id = p.product_id
WHERE o.status = 'completed';

-- Full outer join
SELECT
    COALESCE(a.user_id, b.user_id) AS user_id,
    a.total_purchases,
    b.total_returns
FROM purchase_summary a
FULL OUTER JOIN return_summary b ON a.user_id = b.user_id;

-- Left semi join (EXISTS equivalent)
SELECT e.employee_id, e.name, e.department
FROM employees e
LEFT SEMI JOIN active_departments d ON e.department = d.dept_name;

-- Map-side join hint for small table
SELECT /*+ MAPJOIN(d) */
    f.order_id, f.amount, d.region_name
FROM fact_orders f
INNER JOIN dim_regions d ON f.region_id = d.region_id;

-- Subquery in WHERE clause
SELECT employee_id, name, salary
FROM employees
WHERE department IN (
    SELECT dept_name FROM departments WHERE location = 'New York'
);

Related Pages

Implements Principle

Principle:Heibaiying_BigData_Notes_Hive_Query_Execution

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment