Principle:Heibaiying BigData Notes HBase Data Reading


Knowledge Sources
Domains: NoSQL, Big_Data
Last Updated: 2026-02-10 10:00 GMT

Overview

HBase provides two primary read mechanisms: Get for retrieving a single row by its exact row key, and Scan for iterating over a range of rows with optional server-side filters for predicate pushdown.

Description

Reading data from HBase involves two fundamentally different access patterns:

1. Get (Point Read):

A Get retrieves a single row identified by its exact row key. It returns a Result object containing all cells (or a subset of cells if column family/qualifier restrictions are applied) for that row. Get operations are:

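  • Efficient, though not strictly O(1) -- the request goes to the single region that owns the row key; within that region, HBase uses the HFile block index to find the relevant block and Bloom filters (when enabled) to skip HFiles that cannot contain the row. The cost grows with the number of HFiles that must be checked.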
  • Atomic -- The returned Result represents a consistent snapshot of the row.
  • Well-suited for random access patterns such as key-value lookups.
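The point-read behavior described above can be illustrated with a toy in-memory model. This is a sketch in plain Python, not the HBase client API: the `ToyTable` class, its `put` and `get` methods, and the sample row keys are hypothetical names chosen for illustration. It shows that a Get addresses exactly one row key and can be narrowed to a subset of columns.

```python
# Toy model of an HBase point read (Get). Not the HBase client API:
# ToyTable, put, and get are hypothetical names used only for illustration.

class ToyTable:
    def __init__(self):
        # row key -> {(family, qualifier): value}
        self._rows = {}

    def put(self, row_key, family, qualifier, value):
        self._rows.setdefault(row_key, {})[(family, qualifier)] = value

    def get(self, row_key, columns=None):
        """Return the cells of exactly one row, or an empty dict if absent.

        columns, if given, is a set of (family, qualifier) pairs that
        restricts which cells are returned.
        """
        row = self._rows.get(row_key, {})
        if columns is None:
            return dict(row)
        return {col: val for col, val in row.items() if col in columns}


table = ToyTable()
table.put("user#1001", "info", "name", "alice")
table.put("user#1001", "info", "email", "alice@example.com")
table.put("user#1002", "info", "name", "bob")

whole_row = table.get("user#1001")
name_only = table.get("user#1001", columns={("info", "name")})
missing = table.get("user#9999")  # unknown key yields an empty result
```

The model deliberately ignores versions, timestamps, and regions; it only captures the "one exact key in, one row out" contract.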

2. Scan (Range Read):

A Scan iterates over a range of rows, optionally bounded by start and end row keys. It returns a ResultScanner that lazily fetches rows in sorted order. Scans support:

  • Row key range -- Specify startRow (inclusive) and stopRow (exclusive) to limit the scan range.
  • Full table scan -- Omit start and stop rows to iterate over the entire table.
  • Server-side filters -- Attach a Filter (or a FilterList that combines several) to push predicates to the RegionServer, reducing data transferred over the network.
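The range semantics above (inclusive start, exclusive stop, lazy iteration in sorted order) can be sketched with a sorted list and binary search. This is a toy model in plain Python, not the HBase client API; `toy_scan` and the sample rows are hypothetical.

```python
# Toy model of an HBase range read (Scan). Not the HBase client API:
# toy_scan and the sample data are hypothetical.
from bisect import bisect_left


def toy_scan(sorted_rows, start_row=None, stop_row=None):
    """Lazily yield (row_key, value) pairs in key order.

    sorted_rows is a list of (row_key, value) sorted by row_key.
    start_row is inclusive, stop_row is exclusive; None means unbounded.
    """
    keys = [key for key, _ in sorted_rows]
    begin = 0 if start_row is None else bisect_left(keys, start_row)
    end = len(keys) if stop_row is None else bisect_left(keys, stop_row)
    for index in range(begin, end):
        yield sorted_rows[index]


rows = sorted({"row1": "a", "row2": "b", "row3": "c", "row4": "d"}.items())

# Bounded scan: row2 is included, row4 is excluded.
bounded = [key for key, _ in toy_scan(rows, start_row="row2", stop_row="row4")]

# Full table scan: no bounds, every row is visited.
full = [key for key, _ in toy_scan(rows)]
```

Row keys are compared lexicographically, so a key such as "row10" sorts before "row2". Zero-padding numeric parts of a key is a common way to keep ranges contiguous.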

Server-side filtering (predicate pushdown):

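Filters are evaluated on the RegionServer before results are sent to the client. The RegionServer generally still has to read the rows in the scan range, so a filter complements a well-chosen row key range rather than replacing it. Even so, it is a critical optimization for large datasets because it: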

  • Reduces network bandwidth consumption.
  • Decreases client-side memory usage.
  • Leverages the RegionServer's proximity to the data on HDFS.

Common filter types include RowFilter, SingleColumnValueFilter, PrefixFilter, PageFilter, and FilterList (for composing multiple filters with AND/OR logic).
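The bandwidth benefit of predicate pushdown can be made concrete with a small simulation. This is a hypothetical model in plain Python, not HBase code: `server_scan` stands in for the RegionServer, and the length of its return value stands in for the number of rows sent over the network.

```python
# Toy comparison of client-side vs. server-side filtering.
# Hypothetical model: the "server" is a function, and the size of what it
# returns represents the rows that would cross the network.

def server_scan(rows, row_filter=None):
    """Simulate a RegionServer returning rows, applying a filter if given."""
    if row_filter is None:
        return list(rows)
    return [row for row in rows if row_filter(row)]


rows = [{"key": f"row{i:03d}", "age": i % 50} for i in range(1000)]


def is_age_42(row):
    return row["age"] == 42


# Client-side filtering: every row is transferred, then filtered locally.
transferred_all = server_scan(rows)
client_side_matches = [row for row in transferred_all if is_age_42(row)]

# Server-side filtering: only matching rows are transferred.
server_side_matches = server_scan(rows, row_filter=is_age_42)
```

Both approaches produce the same matches, but in this sketch the client-side variant moves 1000 rows while the server-side variant moves 20. The server still examines all 1000 rows in either case.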

Usage

  • Use Get when you know the exact row key and need a single row.
  • Use Scan with a row key range when you need a contiguous range of rows.
  • Use Scan with filters when you need to select rows based on column values or complex criteria.
  • Use a full table Scan sparingly, as it reads all data and can be expensive on large tables.

Theoretical Basis

The read path in HBase merges data from multiple sources:

Client Get/Scan request
    |
    v
RegionServer
    |-- Read from MemStore (in-memory, most recent writes)
    |-- Read from BlockCache (LRU cache of recently read HFile blocks)
    |-- Read from HFiles on HDFS (persistent sorted files)
    |
    v
Merge results (most recent version wins by timestamp)
    |
    v
Apply server-side filters (if any)
    |
    v
Return Result / ResultScanner to client
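The merge step in the diagram can be sketched as follows. This is a simplified toy model in plain Python, not HBase internals: each source is a hypothetical list of `(row_key, column, timestamp, value)` cells, and the sketch ignores delete markers, multiple retained versions, and the streaming merge over sorted files that a real store would use.

```python
# Toy model of the read-path merge. Hypothetical structures: each source is
# a list of cells (row_key, column, timestamp, value).

def merge_latest(*sources):
    """Merge cells from several sources, keeping the newest version per column."""
    latest = {}
    for source in sources:
        for row_key, column, timestamp, value in source:
            coordinate = (row_key, column)
            if coordinate not in latest or timestamp > latest[coordinate][0]:
                latest[coordinate] = (timestamp, value)
    # Return cells in sorted (row_key, column) order, as a scan would.
    return [(rk, col, ts, val) for (rk, col), (ts, val) in sorted(latest.items())]


memstore = [("row1", "info:name", 300, "alice-v3")]
hfile_new = [("row1", "info:name", 200, "alice-v2"),
             ("row2", "info:name", 150, "bob")]
hfile_old = [("row1", "info:name", 100, "alice-v1"),
             ("row1", "info:city", 100, "paris")]

merged = merge_latest(memstore, hfile_new, hfile_old)

# Following the order in the diagram, the filter runs after the merge,
# so it only sees the winning version of each cell.
filtered = [cell for cell in merged if cell[1] == "info:name"]
```

The MemStore value for row1 wins because it carries the highest timestamp, while the older HFile still contributes the `info:city` cell that no newer source overrides.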

Get vs Scan performance characteristics:

Operation          Latency            Throughput            Use Case
Get                Low (single row)   Low volume per call   Point lookups by row key
Scan (bounded)     Medium             High (streaming)      Range queries, analytics
Scan (full table)  High               Very high (full I/O)  Export, migration

The FilterList supports two composition modes:

  • MUST_PASS_ALL -- equivalent to logical AND; a row must satisfy all filters.
  • MUST_PASS_ONE -- equivalent to logical OR; a row must satisfy at least one filter.
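The two composition modes can be illustrated with plain predicates. This is a toy model in Python, not the HBase FilterList class: `filter_list` is a hypothetical helper, filters are ordinary functions over a row dict, and only the mode names are taken from the text above.

```python
# Toy model of FilterList composition. Hypothetical: filters are plain
# Python predicates over a row dict; the mode names mirror the two
# operators named in the text.

MUST_PASS_ALL = "MUST_PASS_ALL"
MUST_PASS_ONE = "MUST_PASS_ONE"


def filter_list(mode, *filters):
    """Combine predicates with AND (MUST_PASS_ALL) or OR (MUST_PASS_ONE)."""
    if mode == MUST_PASS_ALL:
        return lambda row: all(f(row) for f in filters)
    if mode == MUST_PASS_ONE:
        return lambda row: any(f(row) for f in filters)
    raise ValueError(f"unknown mode: {mode}")


rows = [
    {"key": "u1", "age": 25, "city": "Paris"},
    {"key": "u2", "age": 40, "city": "Paris"},
    {"key": "u3", "age": 25, "city": "Rome"},
    {"key": "u4", "age": 60, "city": "Oslo"},
]


def age_is_25(row):
    return row["age"] == 25


def city_is_paris(row):
    return row["city"] == "Paris"


both = filter_list(MUST_PASS_ALL, age_is_25, city_is_paris)
either = filter_list(MUST_PASS_ONE, age_is_25, city_is_paris)

and_keys = [row["key"] for row in rows if both(row)]
or_keys = [row["key"] for row in rows if either(row)]
```

Because `filter_list` returns a predicate, a combined filter can itself be passed to another `filter_list` call, which models nesting an AND group inside an OR group or the reverse.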
