Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Heibaiying BigData Notes Hive Select Query

From Leeroopedia


Knowledge Sources
Domains Data_Warehouse, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Concrete tool for querying data using HiveQL SELECT statements, including filtering, aggregation, sorting, and joins provided by Apache Hive.

Description

The SELECT statement is the primary data retrieval mechanism in HiveQL. It supports the full range of SQL query capabilities adapted for distributed execution: column projection, row filtering with WHERE, grouping and aggregation with GROUP BY and HAVING, result ordering, and multi-table joins.

Hive provides four distinct sorting/distribution clauses that control how results are ordered and distributed across reducers:

  • ORDER BY: Global total ordering using a single reducer.
  • SORT BY: Local ordering within each reducer.
  • DISTRIBUTE BY: Hash-based distribution of rows across reducers without sorting.
  • CLUSTER BY: Combined DISTRIBUTE BY and SORT BY on the same columns.

Hive supports six join types: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, CROSS JOIN, and LEFT SEMI JOIN. The query optimizer automatically selects the join strategy (shuffle join, map-side join, bucket map join, or sort-merge-bucket join) based on table sizes, bucketing configuration, and session settings.

Usage

Use SELECT queries for ad-hoc data analysis, building ETL pipelines, generating reports, and performing data validation checks. The choice of sorting clause and join type significantly impacts performance and should be aligned with the specific query requirements.

Code Reference

Source Location

  • Repository: BigData-Notes
  • File: notes/Hive数据查询详解.md (Lines 1-397)

Signature

-- Basic SELECT syntax
SELECT [ALL | DISTINCT] col1 [AS alias1], col2, ...
FROM table_name [alias]
[WHERE condition]
[GROUP BY col1, col2, ...]
[HAVING group_condition]
[ORDER BY col [ASC|DESC], ...]
[SORT BY col [ASC|DESC], ...]
[DISTRIBUTE BY col, ...]
[CLUSTER BY col, ...]
[LIMIT n];

-- JOIN syntax
SELECT ...
FROM table1 [alias1]
[INNER | LEFT OUTER | RIGHT OUTER | FULL OUTER | CROSS] JOIN table2 [alias2]
  ON alias1.col = alias2.col
[WHERE ...];

-- LEFT SEMI JOIN (EXISTS equivalent)
SELECT ...
FROM table1
LEFT SEMI JOIN table2 ON table1.col = table2.col;

Import

N/A (HiveQL statements executed via Hive CLI or Beeline)

I/O Contract

Inputs

Name Type Required Description
table_name STRING Yes Source table(s) to query; supports aliasing
Column list Column names or expressions Yes Columns to project; supports * for all columns, expressions, and aggregate functions (COUNT, SUM, AVG, MIN, MAX)
WHERE condition Boolean expression No Row-level filter applied before aggregation
GROUP BY columns Column names No Columns to group by for aggregation
HAVING condition Boolean expression No Group-level filter applied after aggregation
ORDER BY / SORT BY Column names + direction No Sorting specification; ORDER BY for global sort, SORT BY for per-reducer sort
DISTRIBUTE BY Column names No Hash-based row distribution across reducers
CLUSTER BY Column names No Combined distribution and per-reducer sorting
JOIN ... ON Table + condition No Join specification with join type and equi-join condition
LIMIT INT No Maximum number of rows to return

Outputs

Name Type Description
Result set Rows and columns Query results matching the projection, filtering, and aggregation criteria
Job statistics Console output MapReduce/Tez job counters including rows read, rows written, and execution time

Usage Examples

-- Basic query with filtering and aggregation
SELECT
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY region
HAVING SUM(amount) > 50000
ORDER BY total_revenue DESC
LIMIT 10;

-- DISTRIBUTE BY and SORT BY for controlled distribution
SELECT user_id, action_type, action_time
FROM user_actions
DISTRIBUTE BY user_id
SORT BY action_time ASC;

-- CLUSTER BY (equivalent to DISTRIBUTE BY + SORT BY on same column)
SELECT department, employee_id, salary
FROM employees
CLUSTER BY department;

-- Inner join between two tables
SELECT
    o.order_id,
    c.customer_name,
    o.amount,
    o.order_date
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01';

-- Left outer join with multiple tables
SELECT
    o.order_id,
    c.customer_name,
    p.product_name,
    o.quantity,
    o.unit_price
FROM orders o
LEFT OUTER JOIN customers c ON o.customer_id = c.customer_id
LEFT OUTER JOIN products p ON o.product_id = p.product_id
WHERE o.status = 'completed';

-- Full outer join
SELECT
    COALESCE(a.user_id, b.user_id) AS user_id,
    a.total_purchases,
    b.total_returns
FROM purchase_summary a
FULL OUTER JOIN return_summary b ON a.user_id = b.user_id;

-- Left semi join (EXISTS equivalent)
SELECT e.employee_id, e.name, e.department
FROM employees e
LEFT SEMI JOIN active_departments d ON e.department = d.dept_name;

-- Map-side join hint for small table
SELECT /*+ MAPJOIN(d) */
    f.order_id, f.amount, d.region_name
FROM fact_orders f
INNER JOIN dim_regions d ON f.region_id = d.region_id;

-- Subquery in WHERE clause
SELECT employee_id, name, salary
FROM employees
WHERE department IN (
    SELECT dept_name FROM departments WHERE location = 'New York'
);

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment