Principle:Heibaiying BigData Notes HBase Data Insertion

From Leeroopedia


Knowledge Sources
Domains: NoSQL, Big_Data
Last Updated: 2026-02-10 10:00 GMT

Overview

HBase inserts data using Put operations that specify a row key, column family, qualifier, and value, with writes first persisted to the Write-Ahead Log (WAL) and then buffered in a MemStore for eventual flush to disk.

Description

Data insertion in HBase follows a cell-oriented model. Each cell is uniquely identified by the combination of:

  • Row key -- the primary identifier for the row, stored as a byte array.
  • Column family -- the logical grouping of columns (must exist in the table schema).
  • Column qualifier -- the specific column name within the family (does not need to be predefined).
  • Timestamp -- automatically assigned if not specified; used for versioning.

A Put object represents one or more cell mutations for a single row. Multiple column-qualifier/value pairs can be added to a single Put, allowing efficient batch insertion of multiple cells within the same row.
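The shape of a Put can be illustrated with a small stand-alone model. This is a sketch only: `RowPut` and `Cell` are hypothetical types invented for illustration, not HBase client classes. In the HBase Java client the corresponding calls are `new Put(rowKey)` and `put.addColumn(family, qualifier, value)`. The client-side timestamp default used here is also a simplification.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only: RowPut and Cell are hypothetical stand-ins, not HBase client classes.
class PutModelSketch {

    /** One cell mutation: column family, qualifier, timestamp, and value, all as raw bytes. */
    static final class Cell {
        final byte[] family;
        final byte[] qualifier;
        final long timestamp;
        final byte[] value;

        Cell(byte[] family, byte[] qualifier, long timestamp, byte[] value) {
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** A Put-like container: exactly one row key, any number of cells for that row. */
    static final class RowPut {
        final byte[] rowKey;
        final List<Cell> cells = new ArrayList<>();

        RowPut(String rowKey) {
            this.rowKey = utf8(rowKey);
        }

        /** Adds a cell without an explicit timestamp (assumption: current time is used). */
        RowPut addColumn(String family, String qualifier, String value) {
            return addColumn(family, qualifier, System.currentTimeMillis(), value);
        }

        /** Adds a cell with an explicit timestamp; returns this so calls can be chained. */
        RowPut addColumn(String family, String qualifier, long timestamp, String value) {
            cells.add(new Cell(utf8(family), utf8(qualifier), timestamp, utf8(value)));
            return this;
        }
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Two cells in the same row travel together in one Put-like object. */
    static RowPut example() {
        return new RowPut("user#1001")
                .addColumn("info", "name", "Alice")
                .addColumn("info", "email", "alice@example.com");
    }
}
```

Because both cells share one row key, they are carried in a single object and can be sent in a single request rather than one request per cell.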

Write path mechanics:

  1. The client creates a Put object and populates it with cell data.
  2. The client calls table.put(put), which sends the mutation to the RegionServer hosting the region that contains the row key.
  3. The RegionServer writes the mutation to the Write-Ahead Log (WAL) on HDFS for durability.
  4. The mutation is then written to the MemStore, an in-memory sorted buffer.
  5. When the MemStore reaches its configured flush threshold (hbase.hregion.memstore.flush.size), it is flushed to a new immutable HFile on HDFS.
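Steps 3 to 5 can be sketched with an in-memory toy model. Everything here is an assumption made for illustration: the class and field names are hypothetical, the "WAL" is a plain list rather than a file on HDFS, and the flush threshold counts entries instead of bytes. The sketch only shows the ordering of the steps, not how a RegionServer is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy in-memory model of the write path; all names are hypothetical.
class WritePathSketch {
    final List<String> wal = new ArrayList<>();                        // stand-in for the append-only WAL
    final TreeMap<String, String> memStore = new TreeMap<>();          // sorted in-memory buffer
    final List<SortedMap<String, String>> hFiles = new ArrayList<>();  // flushed, immutable snapshots
    final int flushThreshold;                                          // counted in entries for simplicity

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // step 3: append to the log first, for durability
        memStore.put(key, value);     // step 4: then insert into the sorted buffer
        if (memStore.size() >= flushThreshold) {
            flush();                  // step 5: threshold reached, write out a sorted file
        }
    }

    /** Snapshots the MemStore as an immutable sorted map and empties the buffer. */
    void flush() {
        hFiles.add(Collections.unmodifiableSortedMap(new TreeMap<>(memStore)));
        memStore.clear();
    }
}
```

With a threshold of 2, inserting "row2" and then "row1" leaves two entries in the log, an empty MemStore, and one flushed snapshot whose first key is "row1". Log cleanup after a flush is not modeled.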

All values in HBase are stored as byte arrays. The Bytes utility class provides conversion methods such as Bytes.toBytes(String) for converting Java types to byte arrays.
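To show what such conversions involve, here is a minimal stand-in written with only the Java standard library. `BytesSketch` is a hypothetical helper, not the HBase `Bytes` class. It assumes UTF-8 for strings and 8-byte big-endian encoding for longs, which is, to my understanding, what `Bytes.toBytes` produces for those types.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the kind of conversions a Bytes-style utility performs.
class BytesSketch {

    /** Encodes a string as UTF-8 bytes. */
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes a long as 8 bytes, most significant byte first (ByteBuffer's default order). */
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Decodes UTF-8 bytes back into a string. */
    static String toStringValue(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    /** Decodes 8 big-endian bytes back into a long. */
    static long toLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }
}
```

The string "42" encodes to 2 bytes while the long 42 encodes to 8 bytes, so a reader has to know which type was written in order to decode a cell correctly.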

Usage

Put operations are used whenever data needs to be written to HBase:

  • Single cell writes -- inserting or updating one column value for a row.
  • Multi-cell writes -- inserting multiple columns in one row in a single Put to minimize RPC calls.
  • Bulk writes -- Puts can load many rows, but for very large datasets the HBase bulk load mechanism (which writes HFiles directly) is preferred over individual Puts.

HBase does not distinguish between insert and update operations: a Put to an existing cell simply creates a new version of that cell, and reads return the newest version by default.
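The "update is just another insert" behavior can be modeled as a map from timestamp to value for one cell. This is a sketch under stated assumptions: `VersionedCellSketch` is a hypothetical class, timestamps are supplied by the caller, and version retention limits are not modeled.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy model of a single cell's version history; the class name is hypothetical.
class VersionedCellSketch {
    // timestamp -> value, ordered newest first
    final TreeMap<Long, String> versions = new TreeMap<>(Comparator.<Long>reverseOrder());

    /** A write never checks for an existing value; it only adds a timestamped version. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Returns the value with the highest timestamp, or null if nothing was written. */
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }
}
```

Writing "old" at timestamp 100 and "new" at timestamp 200 leaves two versions stored, and `latest()` returns "new".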

Theoretical Basis

The HBase write path is designed for write-heavy workloads. Each Put flows through the following stages:

Client Put
    |
    v
RegionServer
    |-- 1. Write to WAL (append-only log on HDFS) -> durability guarantee
    |-- 2. Write to MemStore (in-memory sorted buffer) -> read availability
    |
    v (when MemStore is full)
Flush to HFile (immutable sorted file on HDFS) -> persistent storage

Key properties:

  • Atomicity -- All mutations within a single Put (even across multiple columns) are atomic at the row level.
  • Durability -- The WAL ensures that acknowledged writes survive RegionServer failures: after a crash, edits that had not yet been flushed are replayed from the log.
  • No read-before-write -- Puts do not need to read existing data, so a write costs only a log append and an in-memory insert.
  • Sorted storage -- Data is maintained in row key order (row keys compared as raw bytes) both in the MemStore and in HFiles, enabling efficient range scans.
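The effect of sorted storage on range scans can be sketched with a sorted map keyed by byte arrays. This is an illustrative model only: `SortedRowsSketch` and its methods are hypothetical, and it assumes row keys are compared byte by byte as unsigned values.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Toy model of row-key-ordered storage; all names are hypothetical.
class SortedRowsSketch {

    /** Compares byte arrays lexicographically, treating each byte as unsigned. */
    static final Comparator<byte[]> LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length); // a shorter prefix sorts first
    };

    final TreeMap<byte[], String> rows = new TreeMap<>(LEXICOGRAPHIC);

    void put(String rowKey, String value) {
        rows.put(utf8(rowKey), value);
    }

    /** Returns row keys in [start, stop) in stored order, without touching other rows. */
    List<String> scan(String start, String stop) {
        List<String> out = new ArrayList<>();
        for (byte[] key : rows.subMap(utf8(start), true, utf8(stop), false).keySet()) {
            out.add(new String(key, StandardCharsets.UTF_8));
        }
        return out;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

After inserting "row3", "row10", "row1", and "row2" in any order, `scan("row1", "row3")` returns [row1, row10, row2]. Note that "row10" sorts before "row2" because the comparison is on bytes, not on numeric value, which is why numeric row key components are often zero-padded.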

The Bytes.toBytes() conversion is necessary because HBase stores everything as raw bytes, providing schema flexibility at the cost of requiring explicit serialization.
