Page 54 - Building Big Data Applications
[Figure: diagram with the labels API, Master, RegionServer, Write Ahead Log, MemStore, HFile, Zookeeper, and HDFS]
FIGURE 2.14 HBASE components. Image source: George Lars, @HUG Talk.
The HBase client is a programming API that can be called from languages such as
Java or C++ to access HBASE.
Zookeeper: HBASE uses Zookeeper to coordinate all the activities between the master
and region servers.
How does HBASE internally manage all the communication between Zookeeper,
master servers, and region servers? HBASE maintains two special catalog tables named
ROOT and META. It maintains the current list, state, and location of all regions afloat on
the cluster in these two catalogs. ROOT table contains the list of META table regions, and
META table contains the list of all userspace regions. Entries in ROOT and META tables
are keyed by region names, where a region name is made of the table name the region
belongs to, the region’s start row, its time of creation, and a hash key value. Rowkeys are
sorted by default, so finding the region that hosts a particular row is a matter of a lookup
to find the last entry whose key is less than or equal to that of the requested
rowkey. As regions are split, deleted, or disabled, the ROOT and META tables are
updated so that the changes are reflected in subsequent user requests.
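
The catalog lookup can be sketched in a few lines of Python. This is a minimal illustration, not HBASE's own code: it assumes each catalog entry is keyed by its region's start row, so the hosting region is the entry with the largest start row that does not exceed the requested rowkey. The table name, server addresses, and the helper names `region_name` and `locate` are hypothetical, and the use of MD5 for the hash component is an assumption made for illustration.

```python
import bisect
import hashlib


def region_name(table, start_row, created_ms):
    """Compose an illustrative region name from the table name, the region's
    start row, its creation time, and a hash of those three parts."""
    prefix = f"{table},{start_row},{created_ms}"
    digest = hashlib.md5(prefix.encode("utf-8")).hexdigest()  # assumed hash
    return f"{prefix}.{digest}."


# Hypothetical catalog for one table, sorted by region start row.
# The first region starts at the empty key, so it covers the lowest rows.
CATALOG = [
    ("", "rs1:16020"),
    ("g", "rs2:16020"),
    ("p", "rs3:16020"),
]
START_ROWS = [start for start, _ in CATALOG]


def locate(row_key):
    """Return the (start_row, server) entry whose range covers row_key."""
    # bisect_right counts the start rows <= row_key; the last of them wins.
    idx = bisect.bisect_right(START_ROWS, row_key) - 1
    if idx < 0:
        raise KeyError(row_key)
    return CATALOG[idx]


if __name__ == "__main__":
    print(region_name("users", "g", 1700000000000))
    for key in ("apple", "g", "zebra"):
        print(key, "->", locate(key)[1])
```

Because the entries are sorted, the lookup is a binary search rather than a scan, which is why keeping rowkeys ordered matters for locating regions quickly.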
Clients connect to ZooKeeper to obtain the location of the ROOT table. The
ROOT table provides the location of the META regions, and META points to the region whose scope
covers that of the requested row. The client then gets the details of that region, including the user
space, the column family, and the location, by doing a lookup on the META table.
After this initial lookup, the client works directly with the
hosting region server.
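
The three-step lookup above can be simulated with plain in-memory structures. The sketch below is only an illustration under stated assumptions: `ZOOKEEPER`, `ROOT`, and `META` are hypothetical stand-ins rather than real ZooKeeper or HBASE calls, the server addresses are invented, and catalog keys are simplified to the string `table,start_row`.

```python
import bisect

# Hypothetical stand-ins for the cluster state (not real HBASE APIs).
ZOOKEEPER = {"root-location": "rs0:16020"}

# ROOT: sorted (start key of a META region, server hosting that META region).
ROOT = [("", "rs1:16020")]

# META regions by hosting server: sorted ("table,start_row", user region server).
META = {
    "rs1:16020": [
        ("users,", "rs2:16020"),
        ("users,m", "rs3:16020"),
    ],
}


def floor_entry(entries, key):
    """Return the last entry whose key is <= the requested key."""
    keys = [k for k, _ in entries]
    idx = bisect.bisect_right(keys, key) - 1
    if idx < 0:
        raise KeyError(key)
    return entries[idx]


def locate_region_server(table, row_key):
    """Walk ZooKeeper -> ROOT -> META and return (region server, hops taken)."""
    hops = []
    root_server = ZOOKEEPER["root-location"]  # step 1: ask ZooKeeper for ROOT
    hops.append("zookeeper")
    lookup_key = f"{table},{row_key}"
    _, meta_server = floor_entry(ROOT, lookup_key)  # step 2: ROOT names the META region
    hops.append(f"ROOT@{root_server}")
    _, region_server = floor_entry(META[meta_server], lookup_key)  # step 3: META names the user region
    hops.append(f"META@{meta_server}")
    return region_server, hops


if __name__ == "__main__":
    print(locate_region_server("users", "alice"))
    print(locate_region_server("users", "zed"))
```

After the final hop the client holds the address of the hosting region server and can send its reads and writes there directly.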
HBASE clients cache the information they gather while traversing ROOT and META,
including region locations, the userspace, and each region's start and stop rows. The cached
data provides the details about the regions and the data available there, avoiding
round trips to read the META table. In normal operation, clients continue to
use the cached entries as they perform tasks, until there is a failure or abort. When a
failure happens, it is normally due to the movement of the region itself causing the cache