Principle:Heibaiying BigData Notes HBase Data Reading
| Knowledge Sources | |
|---|---|
| Domains | NoSQL, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
HBase provides two primary read mechanisms: Get for retrieving a single row by its exact row key, and Scan for iterating over a range of rows with optional server-side filters for predicate pushdown.
Description
Reading data from HBase involves two fundamentally different access patterns:
1. Get (Point Read):
A Get retrieves a single row identified by its exact row key. It returns a Result object containing all cells (or a subset of cells if column family/qualifier restrictions are applied) for that row. Get operations are:
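- Fast for point lookups -- HBase locates the region that owns the row key, then uses the HFile block index and, where enabled, Bloom filters to skip files and blocks that cannot contain the row. The cost is not strictly O(1): it grows with the number of HFiles that have to be consulted.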
- Atomic -- The returned Result represents a consistent snapshot of the row.
- Well-suited for random access patterns such as key-value lookups.
2. Scan (Range Read):
A Scan iterates over a range of rows, optionally bounded by start and end row keys. It returns a ResultScanner that lazily fetches rows in sorted order. Scans support:
- Row key range -- Specify startRow (inclusive) and stopRow (exclusive) to limit the scan range.
- Full table scan -- Omit start and stop rows to iterate over the entire table.
- Server-side filters -- Attach a Filter (or a FilterList that combines several) to push predicates to the RegionServer, reducing data transferred over the network.
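The difference between the two access patterns can be illustrated with a small toy model. This is a minimal sketch in plain Python, not the HBase client API: `ToyTable` and its methods are hypothetical names, and Python string ordering stands in for HBase's byte-wise row key ordering (the two agree for the ASCII keys used here).

```python
import bisect


class ToyTable:
    """Toy stand-in for an HBase table: rows are kept sorted by row key."""

    def __init__(self, rows):
        # rows: dict mapping row key -> dict of "family:qualifier" -> value
        self._keys = sorted(rows)
        self._rows = dict(rows)

    def get(self, row_key):
        """Point read: return the row for an exact key, or None if absent."""
        return self._rows.get(row_key)

    def scan(self, start_row=None, stop_row=None):
        """Range read: lazily yield (key, row) pairs in row key order.

        start_row is inclusive and stop_row is exclusive; omitting both
        produces a full table scan.
        """
        lo = 0 if start_row is None else bisect.bisect_left(self._keys, start_row)
        hi = len(self._keys) if stop_row is None else bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable({
    "row1": {"cf:name": "alice"},
    "row2": {"cf:name": "bob"},
    "row3": {"cf:name": "carol"},
    "row4": {"cf:name": "dave"},
})

single = table.get("row2")                                # point read
bounded = [key for key, _ in table.scan("row2", "row4")]  # row4 is excluded
full = [key for key, _ in table.scan()]                   # full table scan
```

The generator in `scan` mirrors the lazy behaviour of a ResultScanner only loosely: a real scanner fetches rows from the RegionServer in batches rather than one at a time.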
Server-side filtering (predicate pushdown):
Filters are evaluated on the RegionServer before results are sent to the client. This is a critical optimization for large datasets because it:
- Reduces network bandwidth consumption.
- Decreases client-side memory usage.
- Leverages the RegionServer's proximity to the data on HDFS.
Common filter types include RowFilter, SingleColumnValueFilter, PrefixFilter, PageFilter, and FilterList (for composing multiple filters with AND/OR logic).
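The effect of predicate pushdown on the amount of data shipped to the client can be sketched with a toy model. The function `region_server_scan` and the sample data are assumptions made for illustration; nothing here calls the real HBase filter classes.

```python
def region_server_scan(rows, predicate=None):
    """Toy RegionServer: yield only the rows that pass the predicate."""
    for key in sorted(rows):
        row = rows[key]
        if predicate is None or predicate(key, row):
            yield key, row


# 100 hypothetical rows with a single numeric column.
rows = {f"user{i:03d}": {"cf:age": i} for i in range(100)}

# Client-side filtering: every row is shipped, then most are discarded.
shipped_unfiltered = list(region_server_scan(rows))
client_side = [(k, r) for k, r in shipped_unfiltered if r["cf:age"] >= 90]

# Server-side filtering: the predicate runs before rows are shipped.
shipped_filtered = list(region_server_scan(rows, lambda k, r: r["cf:age"] >= 90))
```

Both approaches produce the same result set, but the unfiltered scan ships 100 rows while the filtered scan ships 10. Note that the server still has to read every row in the scan range to evaluate the predicate; a filter saves network and client memory, not server-side I/O.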
Usage
- Use Get when you know the exact row key and need a single row.
- Use Scan with a row key range when you need a contiguous range of rows.
- Use Scan with filters when you need to select rows based on column values or complex criteria.
- Use a full table Scan sparingly, as it reads all data and can be expensive on large tables.
Theoretical Basis
The read path in HBase merges data from multiple sources:
Client Get/Scan request
|
v
RegionServer
|-- Read from MemStore (in-memory, most recent writes)
|-- Read from BlockCache (LRU cache of recently read HFile blocks)
|-- Read from HFiles on HDFS (persistent sorted files)
|
v
Merge results (most recent version wins by timestamp)
|
v
Apply server-side filters (if any)
|
v
Return Result / ResultScanner to client
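The merge step in the diagram above can be sketched as follows. This is a simplified model under stated assumptions: each source is a plain list of `(row_key, column, timestamp, value)` cells, only the newest version of each column is returned, and delete markers are ignored. The BlockCache is left out because it holds HFile blocks rather than a separate copy of the data. The function name `read_row` is hypothetical.

```python
def read_row(row_key, memstore, hfiles):
    """Merge cells for one row from several sources; newest timestamp wins.

    Each source is a list of (row_key, column, timestamp, value) cells.
    """
    latest = {}  # column -> (timestamp, value)
    for source in [memstore, *hfiles]:
        for key, column, ts, value in source:
            if key != row_key:
                continue
            # Keep the cell only if it is newer than what we have so far.
            if column not in latest or ts > latest[column][0]:
                latest[column] = (ts, value)
    return {column: value for column, (ts, value) in latest.items()}


# Hypothetical data: the same column was written three times.
memstore = [("row1", "cf:city", 300, "Berlin")]
hfiles = [
    [("row1", "cf:city", 100, "Paris"), ("row1", "cf:name", 100, "alice")],
    [("row1", "cf:city", 200, "Rome"), ("row2", "cf:name", 150, "bob")],
]

merged = read_row("row1", memstore, hfiles)
```

Here `cf:city` resolves to the MemStore value because it carries the highest timestamp, while `cf:name` comes from the only HFile that contains it.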
Get vs Scan performance characteristics:
| Operation | Latency | Throughput | Use Case |
|---|---|---|---|
| Get | Low (single row) | Low volume per call | Point lookups by row key |
| Scan (bounded) | Medium | High (streaming) | Range queries, analytics |
| Scan (full table) | High | Very high (full I/O) | Export, migration |
FilterList supports two composition modes, selected through FilterList.Operator:
- MUST_PASS_ALL -- equivalent to logical AND; a row must satisfy all filters.
- MUST_PASS_ONE -- equivalent to logical OR; a row must satisfy at least one filter.
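The two modes behave like `all` and `any` over the individual filters. The sketch below models that in plain Python; `filter_list` and the two sample predicates are hypothetical helpers, loosely analogous to a PrefixFilter and a SingleColumnValueFilter, and are not the HBase classes themselves.

```python
MUST_PASS_ALL = "MUST_PASS_ALL"
MUST_PASS_ONE = "MUST_PASS_ONE"


def filter_list(operator, *filters):
    """Combine row predicates with AND (MUST_PASS_ALL) or OR (MUST_PASS_ONE)."""
    if operator not in (MUST_PASS_ALL, MUST_PASS_ONE):
        raise ValueError(f"unknown operator: {operator}")
    combine = all if operator == MUST_PASS_ALL else any
    return lambda key, row: combine(f(key, row) for f in filters)


def has_user_prefix(key, row):
    # Row key predicate, in the spirit of a prefix filter.
    return key.startswith("user")


def age_is_40(key, row):
    # Column value predicate, in the spirit of a single-column value filter.
    return row["cf:age"] == 40


rows = {
    "admin1": {"cf:age": 40},
    "user1": {"cf:age": 25},
    "user2": {"cf:age": 40},
}

both = filter_list(MUST_PASS_ALL, has_user_prefix, age_is_40)
either = filter_list(MUST_PASS_ONE, has_user_prefix, age_is_40)

and_keys = sorted(k for k, r in rows.items() if both(k, r))
or_keys = sorted(k for k, r in rows.items() if either(k, r))
```

With MUST_PASS_ALL only `user2` matches, since it is the one row that satisfies both predicates; with MUST_PASS_ONE all three rows match because each satisfies at least one.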