Workflow:Heibaiying BigData Notes HBase Java CRUD Operations
| Knowledge Sources | |
|---|---|
| Domains | Big_Data, NoSQL, HBase |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
End-to-end process for performing CRUD operations on Apache HBase tables using the Java API, from establishing cluster connections to reading, writing, and deleting data with filters.
Description
This workflow covers the complete lifecycle of HBase data operations via the Java API. It starts with configuring the HBase connection using ZooKeeper coordinates, creating a reusable Connection singleton, and obtaining Admin and Table interfaces. The process then covers table creation with column family specifications, data insertion using Put operations, data retrieval using Get (single row) and Scan (range queries) with optional filters, and data deletion using Delete operations. The workflow notes where the HBase 1.x and 2.x APIs differ and emphasizes proper resource management and thread-safety patterns.
Usage
Execute this workflow when you need to build a Java application that interacts with an HBase cluster for real-time random read/write access to large datasets. This is appropriate for applications requiring low-latency CRUD operations on column-family-oriented data, such as user profile stores, time-series databases, or event logging systems.
Execution Steps
Step 1: Configure HBase Connection
Create a Configuration object with HBaseConfiguration.create() and set the ZooKeeper quorum address and client port. The HBase client uses these settings to locate the cluster (the active master and the RegionServers) through ZooKeeper.
Key considerations:
- Set hbase.zookeeper.quorum to a comma-separated list of the ZooKeeper ensemble hosts
- Set hbase.zookeeper.property.clientPort (default 2181)
- Configuration can also be loaded from hbase-site.xml on the classpath
- Connection creation is expensive; maintain a singleton per application
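The configuration step might look like the following minimal sketch, assuming the HBase 2.x client library is on the classpath. The class name HBaseConfigSketch and the method buildConfig are hypothetical helpers introduced for illustration, as are the host names in the usage note.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical helper class, not part of the HBase API.
final class HBaseConfigSketch {

    private HBaseConfigSketch() {}

    // Builds a client Configuration that points at the given ZooKeeper ensemble.
    static Configuration buildConfig(String quorum, int clientPort) {
        // create() loads hbase-default.xml and any hbase-site.xml found on the classpath
        Configuration config = HBaseConfiguration.create();
        // Values set here override whatever the XML files provided
        config.set("hbase.zookeeper.quorum", quorum);
        config.setInt("hbase.zookeeper.property.clientPort", clientPort);
        return config;
    }
}
```

A call such as `HBaseConfigSketch.buildConfig("zk1.example.com,zk2.example.com", 2181)` only populates the configuration object; the cluster is not contacted until a Connection is created from it.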
Step 2: Create Connection and Obtain Interfaces
Establish a Connection to the HBase cluster using ConnectionFactory. From the Connection, obtain an Admin interface for DDL operations and Table interfaces for DML operations. The Connection is thread-safe and long-lived, while Admin and Table are lightweight and not thread-safe.
What happens:
- ConnectionFactory.createConnection(config) creates a cluster connection
- connection.getAdmin() returns an Admin for table management
- connection.getTable(TableName) returns a Table for data operations
- Connection internally manages a pool of connections to RegionServers
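One possible way to hold the shared Connection is sketched below, assuming the HBase 2.x client. HBaseConnectionHolder and its methods are hypothetical names; only ConnectionFactory, Connection, Admin, and TableName come from the HBase API.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Hypothetical holder: one heavyweight Connection shared by the whole application.
final class HBaseConnectionHolder {

    private static Connection connection;

    private HBaseConnectionHolder() {}

    // Lazily creates the Connection on first use and reuses it afterwards.
    static synchronized Connection get(Configuration config) throws IOException {
        if (connection == null || connection.isClosed()) {
            connection = ConnectionFactory.createConnection(config);
        }
        return connection;
    }

    // True only if a Connection has been created and not yet closed.
    static synchronized boolean isOpen() {
        return connection != null && !connection.isClosed();
    }

    // Admin is lightweight and not thread-safe: obtain it per use and close it.
    static boolean tableExists(Configuration config, String tableName) throws IOException {
        try (Admin admin = get(config).getAdmin()) {
            return admin.tableExists(TableName.valueOf(tableName));
        }
    }

    // Call once on application shutdown.
    static synchronized void close() throws IOException {
        if (connection != null) {
            connection.close();
            connection = null;
        }
    }
}
```

Synchronized accessors keep the sketch short; an application that already uses a dependency-injection container would more likely register the Connection as a singleton bean there.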
Step 3: Create Tables with Column Families
Use the Admin interface to create tables with specified column families. Define the TableName, create ColumnFamilyDescriptor objects for each column family, build a TableDescriptor, and call createTable(). Check for table existence before creation.
Key considerations:
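- Column families are part of the table schema and are declared in the table descriptor, with at least one at creation time; column qualifiers, by contrast, are created on write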
- Keep column family count small (typically 1-3) for performance
- Column family properties (compression, bloom filter, TTL) can be configured
- HBase 2.x uses builders (TableDescriptorBuilder, ColumnFamilyDescriptorBuilder) in place of the HTableDescriptor and HColumnDescriptor classes of 1.x, which are deprecated in 2.x
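The table-creation step could be sketched as follows with the HBase 2.x builder API. TableCreationSketch, describe, and createIfAbsent are hypothetical names, and the setMaxVersions(1) setting is just an example of a column family property, not a recommendation from the source.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class, not part of the HBase API.
final class TableCreationSketch {

    private TableCreationSketch() {}

    // Builds a descriptor for the table with one entry per column family name.
    static TableDescriptor describe(String tableName, String... families) {
        TableDescriptorBuilder builder =
                TableDescriptorBuilder.newBuilder(TableName.valueOf(tableName));
        for (String family : families) {
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes(family))
                    .setMaxVersions(1) // example property: keep one version per cell
                    .build();
            builder.setColumnFamily(cf);
        }
        return builder.build();
    }

    // Creates the table unless it already exists; returns true if it was created.
    static boolean createIfAbsent(Admin admin, TableDescriptor descriptor) throws IOException {
        if (admin.tableExists(descriptor.getTableName())) {
            return false;
        }
        admin.createTable(descriptor);
        return true;
    }
}
```

Building the descriptor is a purely local operation, so it can be unit-tested without a cluster; only createIfAbsent talks to the master.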
Step 4: Insert Data with Put Operations
Create Put objects specifying the row key, then add column values with column family, qualifier, and value. Submit single puts or batch puts to the Table interface for writing to HBase.
What happens:
- Create Put with row key bytes
- Add cells: put.addColumn(family, qualifier, value)
- Submit with table.put(put) for single row or table.put(list) for batch
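- Row keys, families, qualifiers, and values are all stored as byte arrays; the Bytes utility class converts to and from Java types
- Each cell is versioned by timestamp; if the Put does not supply one, the RegionServer assigns its current time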
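The write path above might be sketched like this, assuming the HBase 2.x client. PutSketch, buildPut, and writeAll are hypothetical names; the table, family, and column names in the usage note are illustrative only.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class, not part of the HBase API.
final class PutSketch {

    private PutSketch() {}

    // Builds one Put for a row, adding a cell per (qualifier, value) entry in one family.
    static Put buildPut(String rowKey, String family, Map<String, String> columns) {
        Put put = new Put(Bytes.toBytes(rowKey));
        for (Map.Entry<String, String> column : columns.entrySet()) {
            // No timestamp is given, so the server assigns one at write time
            put.addColumn(Bytes.toBytes(family),
                    Bytes.toBytes(column.getKey()),
                    Bytes.toBytes(column.getValue()));
        }
        return put;
    }

    // Sends a batch of Puts; the Table is obtained per call and closed afterwards.
    static void writeAll(Connection connection, String tableName, List<Put> puts)
            throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            table.put(puts);
        }
    }
}
```

For a single row, `table.put(put)` takes the Put directly; the list form is convenient when several rows are written together.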
Step 5: Read Data with Get and Scan
Retrieve single rows using Get operations or scan row ranges using Scan operations. Apply optional filters (SingleColumnValueFilter, PrefixFilter, etc.) to narrow results. Process results by iterating over cells to extract row keys, column families, qualifiers, timestamps, and values.
What happens:
- Get: retrieve a specific row by row key
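- Scan: iterate over a range of rows (startRow inclusive to stopRow exclusive by default)
- Filters are evaluated on the RegionServers, so non-matching rows and cells are not sent to the client
- FilterList combines multiple filters with AND (MUST_PASS_ALL) or OR (MUST_PASS_ONE) logic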
- ResultScanner provides an iterator over matching rows
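A minimal sketch of the read path, assuming the HBase 2.x client (CompareOperator replaces the 1.x CompareFilter.CompareOp). ReadSketch and its methods are hypothetical names, and the particular combination of PrefixFilter and SingleColumnValueFilter is only one example of a FilterList.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class, not part of the HBase API.
final class ReadSketch {

    private ReadSketch() {}

    // Single-row read restricted to one column family.
    static Get buildGet(String rowKey, String family) {
        return new Get(Bytes.toBytes(rowKey)).addFamily(Bytes.toBytes(family));
    }

    // Range scan whose rows must match a key prefix AND a column value.
    static Scan buildScan(String startRow, String stopRow, String rowPrefix,
                          String family, String qualifier, String value) {
        SingleColumnValueFilter valueFilter = new SingleColumnValueFilter(
                Bytes.toBytes(family), Bytes.toBytes(qualifier),
                CompareOperator.EQUAL, Bytes.toBytes(value));
        // Without this, rows that lack the column would pass the filter
        valueFilter.setFilterIfMissing(true);
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                new PrefixFilter(Bytes.toBytes(rowPrefix)), valueFilter);
        return new Scan()
                .withStartRow(Bytes.toBytes(startRow)) // inclusive
                .withStopRow(Bytes.toBytes(stopRow))   // exclusive
                .setFilter(filters);
    }

    // Iterates the scan and prints every cell of every matching row.
    static void printRows(Table table, Scan scan) throws IOException {
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                for (Cell cell : result.rawCells()) {
                    System.out.printf("%s %s:%s @%d = %s%n",
                            Bytes.toString(CellUtil.cloneRow(cell)),
                            Bytes.toString(CellUtil.cloneFamily(cell)),
                            Bytes.toString(CellUtil.cloneQualifier(cell)),
                            cell.getTimestamp(),
                            Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }
}
```

A Get built by buildGet would be executed with `table.get(get)`, which returns a single Result that can be walked with the same cell loop.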
Step 6: Delete Data and Close Resources
Remove rows, column families, or specific columns using Delete operations. Disable and drop tables using the Admin interface when no longer needed. Close Table and Admin interfaces after use, and close the Connection on application shutdown.
Key considerations:
- Delete operations can target entire rows, column families, or specific qualifiers
- Tables must be disabled before they can be deleted
- Close Table and Admin in try-with-resources or finally blocks
- Connection should be closed on application shutdown
- Proper resource cleanup prevents leaked ZooKeeper sessions and RegionServer connections
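The delete and drop operations could be sketched as follows, assuming the HBase 2.x client. DeleteSketch and its methods are hypothetical names. The sketch uses addColumns, which removes all versions of a column; addColumn would remove only the latest version.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class, not part of the HBase API.
final class DeleteSketch {

    private DeleteSketch() {}

    // A Delete with no families or columns added removes the entire row.
    static Delete deleteRow(String rowKey) {
        return new Delete(Bytes.toBytes(rowKey));
    }

    // Removes every column of one family in the row.
    static Delete deleteFamily(String rowKey, String family) {
        return new Delete(Bytes.toBytes(rowKey)).addFamily(Bytes.toBytes(family));
    }

    // Removes all versions of one column in the row.
    static Delete deleteColumn(String rowKey, String family, String qualifier) {
        return new Delete(Bytes.toBytes(rowKey))
                .addColumns(Bytes.toBytes(family), Bytes.toBytes(qualifier));
    }

    // Disables the table if needed, then drops it; returns false if it did not exist.
    static boolean dropTable(Admin admin, String tableName) throws IOException {
        TableName name = TableName.valueOf(tableName);
        if (!admin.tableExists(name)) {
            return false;
        }
        if (admin.isTableEnabled(name)) {
            admin.disableTable(name);
        }
        admin.deleteTable(name);
        return true;
    }
}
```

Any of these Delete objects is applied with `table.delete(delete)` on a Table opened in a try-with-resources block, so the Table is closed even if the call fails.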