Workflow:Heibaiying BigData Notes HBase Java CRUD Operations

From Leeroopedia


Knowledge Sources
Domains: Big_Data, NoSQL, HBase
Last Updated: 2026-02-10 10:00 GMT

Overview

End-to-end process for performing CRUD operations on Apache HBase tables using the Java API, from establishing cluster connections to reading, writing, and deleting data with filters.

Description

This workflow covers the complete lifecycle of HBase data operations via the Java API. It starts with configuring the HBase connection using the ZooKeeper quorum address and client port, creating a reusable Connection singleton, and obtaining Admin and Table interfaces. It then covers table creation with column family specifications, data insertion with Put, data retrieval with Get (single row) and Scan (row ranges) with optional filters, and data deletion with Delete. The workflow notes where the HBase 1.x and 2.x APIs differ and emphasizes proper resource management and thread-safety patterns.

Usage

Execute this workflow when you need to build a Java application that interacts with an HBase cluster for real-time random read/write access to large datasets. This is appropriate for applications requiring low-latency CRUD operations on column-family-oriented data, such as user profile stores, time-series databases, or event logging systems.

Execution Steps

Step 1: Configure HBase Connection

Create a Configuration object with HBaseConfiguration.create() and set the ZooKeeper quorum address and client port. These settings tell the HBase client where to find ZooKeeper, which it uses to locate the hbase:meta region and, from there, the RegionServers that host each table region.

Key considerations:

  • Set hbase.zookeeper.quorum to a comma-separated list of the ZooKeeper ensemble hosts
  • Set hbase.zookeeper.property.clientPort (default 2181)
  • Configuration can also be loaded from hbase-site.xml on the classpath
  • Connection creation is expensive; maintain a singleton per application
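The configuration step might look like the following minimal sketch. The class name, method name, and host names are hypothetical placeholders, not values from the original workflow; only the two property keys come from the text above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical helper class; the name is illustrative only.
final class HBaseConfigSketch {
    private HBaseConfigSketch() {}

    // Builds a client configuration. HBaseConfiguration.create() first loads
    // hbase-default.xml and any hbase-site.xml found on the classpath; the
    // explicit set() calls below then override those values.
    static Configuration buildConfig(String quorum, String clientPort) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", quorum);
        conf.set("hbase.zookeeper.property.clientPort", clientPort);
        return conf;
    }

    public static void main(String[] args) {
        // Placeholder host names; replace with the real ZooKeeper ensemble.
        Configuration conf = buildConfig("zk1.example.com,zk2.example.com", "2181");
        System.out.println(conf.get("hbase.zookeeper.quorum"));
    }
}
```

If hbase-site.xml on the classpath already carries the correct quorum, the two set() calls can be dropped and HBaseConfiguration.create() used on its own.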

Step 2: Create Connection and Obtain Interfaces

Establish a Connection to the HBase cluster using ConnectionFactory. From the Connection, obtain an Admin interface for DDL operations and Table interfaces for DML operations. The Connection is thread-safe and long-lived, while Admin and Table are lightweight and not thread-safe.

What happens:

  • ConnectionFactory.createConnection(config) creates a cluster connection
  • connection.getAdmin() returns an Admin for table management
  • connection.getTable(TableName) returns a Table for data operations
  • Connection internally manages a pool of connections to RegionServers
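One way to hold the shared Connection is a lazily initialized holder, sketched below under the assumption of the HBase 2.x client. The class name HBaseConnectionHolder and the table name demo_table are hypothetical; the workflow itself does not prescribe a particular singleton style.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

// Hypothetical holder for one shared, thread-safe Connection per application.
final class HBaseConnectionHolder {
    // Placeholder table name used only for illustration.
    static final TableName DEMO_TABLE = TableName.valueOf("demo_table");

    private static volatile Connection connection;

    private HBaseConnectionHolder() {}

    // Double-checked locking so the expensive connection is created only once.
    static Connection get() throws IOException {
        Connection local = connection;
        if (local == null) {
            synchronized (HBaseConnectionHolder.class) {
                local = connection;
                if (local == null) {
                    local = ConnectionFactory.createConnection(HBaseConfiguration.create());
                    connection = local;
                }
            }
        }
        return local;
    }

    // Admin and Table are lightweight and not thread-safe: obtain them per
    // use and close them promptly, while the Connection stays open.
    static boolean demoTableExists() throws IOException {
        Connection conn = get();
        try (Admin admin = conn.getAdmin();
             Table table = conn.getTable(DEMO_TABLE)) {
            return admin.tableExists(table.getName());
        }
    }
}
```

Each thread that needs table access would call get() and open its own short-lived Table rather than sharing one Table instance.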

Step 3: Create Tables with Column Families

Use the Admin interface to create tables with specified column families. Define the TableName, create ColumnFamilyDescriptor objects for each column family, build a TableDescriptor, and call createTable(). Check for table existence before creation.

Key considerations:

  • Column families must be defined at table creation time
  • Keep column family count small (typically 1-3) for performance
  • Column family properties (compression, bloom filter, TTL) can be configured
  • HBase 2.x uses a builder pattern (TableDescriptorBuilder, ColumnFamilyDescriptorBuilder) in place of the deprecated HTableDescriptor and HColumnDescriptor from 1.x
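The table-creation step can be sketched with the HBase 2.x builders as follows. The class name, the chosen max-versions and TTL values, and the example family names are assumptions made for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class; the name is illustrative only.
final class CreateTableSketch {
    private CreateTableSketch() {}

    // Builds a table descriptor with one column family per given name.
    static TableDescriptor describe(TableName name, String... families) {
        TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(name);
        for (String family : families) {
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes(family))
                    .setMaxVersions(3)               // assumed value: keep up to 3 versions per cell
                    .setTimeToLive(7 * 24 * 60 * 60) // assumed value: expire cells after 7 days (seconds)
                    .build();
            builder.setColumnFamily(cf);
        }
        return builder.build();
    }

    // Creates the table only if it does not exist yet; returns true if created.
    static boolean createIfAbsent(Connection conn, TableDescriptor desc) throws IOException {
        try (Admin admin = conn.getAdmin()) {
            if (admin.tableExists(desc.getTableName())) {
                return false;
            }
            admin.createTable(desc);
            return true;
        }
    }
}
```

A call such as createIfAbsent(conn, describe(TableName.valueOf("user_profile"), "info", "stats")) would create a two-family table; the table and family names here are placeholders.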

Step 4: Insert Data with Put Operations

Create Put objects specifying the row key, then add column values with column family, qualifier, and value. Submit single puts or batch puts to the Table interface for writing to HBase.

What happens:

  • Create Put with row key bytes
  • Add cells: put.addColumn(family, qualifier, value)
  • Submit with table.put(put) for single row or table.put(list) for batch
  • All values are stored as byte arrays using Bytes utility class
  • HBase versions each cell by timestamp; if the Put does not supply one, the RegionServer assigns its current time
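A minimal sketch of the write path, assuming the HBase 2.x client. The class and method names are hypothetical, and string values are used only to keep the example short; any byte-serializable value works the same way.

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class; the name is illustrative only.
final class PutSketch {
    private PutSketch() {}

    // Builds a Put for one cell; every component is converted to bytes.
    static Put buildPut(String rowKey, String family, String qualifier, String value) {
        return new Put(Bytes.toBytes(rowKey))
                .addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value));
    }

    // Writes a single row.
    static void write(Connection conn, TableName name, Put put) throws IOException {
        try (Table table = conn.getTable(name)) {
            table.put(put);
        }
    }

    // Writes several rows in one client call.
    static void writeBatch(Connection conn, TableName name, List<Put> puts) throws IOException {
        try (Table table = conn.getTable(name)) {
            table.put(puts);
        }
    }
}
```

Several addColumn calls can be chained on the same Put to write multiple cells of one row together.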

Step 5: Read Data with Get and Scan

Retrieve single rows using Get operations or scan row ranges using Scan operations. Apply optional filters (SingleColumnValueFilter, PrefixFilter, etc.) to narrow results. Process results by iterating over cells to extract row keys, column families, qualifiers, timestamps, and values.

What happens:

  • Get: retrieve a specific row by row key
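  • Scan: iterate over a range of rows (start row inclusive, stop row exclusive by default)
  • Filters are evaluated on the RegionServers, so non-matching data is not sent to the client
  • FilterList combines multiple filters with AND (MUST_PASS_ALL) or OR (MUST_PASS_ONE) logic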
  • ResultScanner provides an iterator over matching rows
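The read path could be sketched as below, assuming the HBase 2.x client (CompareOperator is the 2.x comparison enum). The class and method names are hypothetical, and the combination of PrefixFilter with SingleColumnValueFilter is just one illustrative pairing.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class; the name is illustrative only.
final class ReadSketch {
    private ReadSketch() {}

    // Get: reads one cell of one row; returns null if the cell is absent.
    static String readCell(Connection conn, TableName name, String rowKey,
                           String family, String qualifier) throws IOException {
        try (Table table = conn.getTable(name)) {
            Result result = table.get(new Get(Bytes.toBytes(rowKey)));
            byte[] value = result.getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier));
            return value == null ? null : Bytes.toString(value);
        }
    }

    // Scan: rows whose key starts with rowPrefix AND whose column equals expected.
    static Scan buildScan(String rowPrefix, String family, String qualifier, String expected) {
        SingleColumnValueFilter valueFilter = new SingleColumnValueFilter(
                Bytes.toBytes(family), Bytes.toBytes(qualifier),
                CompareOperator.EQUAL, Bytes.toBytes(expected));
        valueFilter.setFilterIfMissing(true); // skip rows that lack the column entirely
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                new PrefixFilter(Bytes.toBytes(rowPrefix)), valueFilter);
        return new Scan().withStartRow(Bytes.toBytes(rowPrefix)).setFilter(filters);
    }

    // Iterates the scanner and prints row key, family, qualifier, timestamp, value.
    static void printRows(Connection conn, TableName name, Scan scan) throws IOException {
        try (Table table = conn.getTable(name);
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                for (Cell cell : result.rawCells()) {
                    System.out.printf("%s %s:%s @%d = %s%n",
                            Bytes.toString(CellUtil.cloneRow(cell)),
                            Bytes.toString(CellUtil.cloneFamily(cell)),
                            Bytes.toString(CellUtil.cloneQualifier(cell)),
                            cell.getTimestamp(),
                            Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }
}
```

Closing the ResultScanner (here via try-with-resources) releases the server-side scanner state as soon as iteration ends.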

Step 6: Delete Data and Close Resources

Remove rows, column families, or specific columns using Delete operations. Disable and drop tables using the Admin interface when no longer needed. Close Table and Admin interfaces after use, and close the Connection on application shutdown.

Key considerations:

  • Delete operations can target entire rows, column families, or specific qualifiers
  • Tables must be disabled before they can be deleted
  • Close Table and Admin in try-with-resources or finally blocks
  • Connection should be closed on application shutdown
  • Proper resource cleanup prevents leaked connections to ZooKeeper and the RegionServers
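The three delete granularities and the disable-then-drop sequence might be sketched as follows, again assuming the HBase 2.x client. The class and method names are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class; the name is illustrative only.
final class DeleteSketch {
    private DeleteSketch() {}

    // Targets the entire row (no family or qualifier specified).
    static Delete wholeRow(String rowKey) {
        return new Delete(Bytes.toBytes(rowKey));
    }

    // Targets every column of one column family in the row.
    static Delete wholeFamily(String rowKey, String family) {
        return new Delete(Bytes.toBytes(rowKey)).addFamily(Bytes.toBytes(family));
    }

    // Targets all versions of one column; addColumn (singular) would
    // remove only the latest version instead.
    static Delete oneColumn(String rowKey, String family, String qualifier) {
        return new Delete(Bytes.toBytes(rowKey))
                .addColumns(Bytes.toBytes(family), Bytes.toBytes(qualifier));
    }

    // Applies a Delete and closes the Table afterwards.
    static void apply(Connection conn, TableName name, Delete delete) throws IOException {
        try (Table table = conn.getTable(name)) {
            table.delete(delete);
        }
    }

    // A table must be disabled before it can be dropped.
    static void dropTable(Connection conn, TableName name) throws IOException {
        try (Admin admin = conn.getAdmin()) {
            if (!admin.tableExists(name)) {
                return;
            }
            if (admin.isTableEnabled(name)) {
                admin.disableTable(name);
            }
            admin.deleteTable(name);
        }
    }
}
```

The shared Connection is deliberately not closed in these helpers; it would be closed once, for example from an application shutdown hook.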
