Devs are from Venus, Ops are from Mars, Big Data: HBase

January 21st, 2015

If you’re just joining this column, it is one part of a response to the gap between how development and operations view technology and measure their success: it is wholly possible for development and operations to each be individually successful while the organization as a whole fails.

So, what can we do to better align development and operations so that they speak the same language and work toward the success of the organization as a whole? This article series attempts to address a portion of this problem by giving operations teams insight into how specific architecture and development decisions affect the day-to-day operational requirements of an application.

The current article series is reviewing Big Data and the various solutions that have been built to capture, manage, and analyze very large amounts of data. Unlike the relational databases of the past, Big Data is not one-size-fits-all, but rather individual solutions have been built that address specific problem domains. The last handful of articles reviewed Hadoop, its infrastructure, and how to analyze data in Hadoop using MapReduce. This article changes subjects, but only slightly: it reviews HBase, which is also known as the Hadoop database.

Introduction to HBase

As mentioned at the onset of this series, the Hadoop family of technologies was built as a set of open source implementations of three Google papers:

  • The Hadoop Distributed File System (HDFS) is an open source implementation of the Google File System (GFS) and defines how data is distributed across a cluster of commodity machines
  • MapReduce is a functional programming paradigm for analyzing data that is distributed across an HDFS cluster
  • HBase is an implementation of Google’s BigTable and is the subject of this article

Google defined BigTable as a distributed storage system for managing structured data that is designed to scale to very large sizes: petabytes of data across thousands of commodity machines. HBase, then, is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store, supporting both real-time key/value queries and offline MapReduce jobs. Because HBase runs on top of Hadoop, it can leverage Hadoop’s distributed storage as well as its ability to execute work across thousands of machines.

HBase is not a relational database and, as such, it manages its data quite differently. It supports a four-dimensional data model in which each “cell” is represented by the following four coordinates, shown in figure 1:

  • Row Key: each row has a unique row key; while it is treated internally as a byte array, it does not have any formally defined data type.
  • Column Family: the data contained in a row is partitioned into column families; every row in a table has the same set of column families, but different rows need not have the same column qualifiers within a column family. This may be confusing at first: the column family “demographics” in one row may have column qualifiers such as “name” and “address”, while the same column family “demographics” in another row may have column qualifiers such as “longitude” and “latitude”, yet they are all part of the same table and column family. This is not common, but it is permissible in HBase.
  • Column Qualifier: column families are composed of column qualifiers, or in relational database vernacular, the “columns” themselves.
  • Version: each column can have a configurable number of versions; if you request the data contained in a column without specifying a version then you will receive the latest version, but you can request older versions as you need them.

Figure 1 HBase’s Four-Dimensional Data Model

To think about this four-dimensional data model in relational terms, a row is accessible by its row key, which is similar to its primary key. Each row is somewhat analogous to a schema and each column family is roughly equivalent to a table in that schema. But then each column can contain multiple versions, which does not have a relational equivalent.
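To make these four coordinates concrete, here is a minimal sketch that writes and reads a single cell using the HBase Java client API of this era (HTable, Put, and Get); the table name “users”, the row key, and the column names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FourDimensionalExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users"); // hypothetical table

            // Write one cell: row key + column family + column qualifier
            // (the fourth dimension, the version, defaults to the current timestamp)
            Put put = new Put(Bytes.toBytes("user-1234"));
            put.add(Bytes.toBytes("demographics"), Bytes.toBytes("name"), Bytes.toBytes("Jane Doe"));
            table.put(put);

            // Read it back, requesting up to three versions of the cell
            Get get = new Get(Bytes.toBytes("user-1234"));
            get.addColumn(Bytes.toBytes("demographics"), Bytes.toBytes("name"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("demographics"), Bytes.toBytes("name"))));

            table.close();
        }
    }

Note that every coordinate ultimately ends up as a byte array, which is why Bytes.toBytes() appears throughout: HBase imposes no data types of its own.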

The key to designing a good HBase data model is to think about how the data is going to be accessed. As mentioned above, data in HBase can be accessed in two ways:

  • Data can be accessed through its row key, as in a key/value store, or through a table scan for a subset of row keys
  • Data can be accessed in an offline/batch manner by executing MapReduce jobs

This dual approach to accessing data is probably the most compelling reason I have adopted HBase in my architectures: I have the ability to perform traditional Hadoop MapReduce jobs against my data, but I also have the ability to access data in real-time. But with that said, being successful at accessing data in real-time requires that we put forethought into the design of our row keys!

Real-time Access

Real-time data access in HBase is accomplished in one of two ways:

  • Retrieve the single row for a specified row key
  • Retrieve all rows for a range of row keys

HBase behaves as a key/value store in the first case: if you know the row key then you can request the column families, columns, and versions associated with that row key. Note that you can refine your search to only return specific columns from specific column families. But what do you do if you do not know your specific row key?

Table scans allow you to access a range of rows, but you do need to know the row key to start from and the row key to end at. The best strategy in this scenario is to define row keys such that the row key itself conveys some meaning. You also need to be cognizant that row keys should distribute load across the entire HBase cluster, so they should probably not be a raw user name; combining a hash of the user identifier with a timestamp provides just such flexibility.

In this case, prefixing the key with a hash of the user identifier means that you can request a start key of USERID-0 and an end key of USERID+1-0 (the hash incremented by one) and retrieve all records for that user, as in the sketch below. Other tables can implement a similar strategy with identifiers applicable to their data.
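As a sketch of that strategy (the table name “events” and the key scheme are hypothetical), the following scan retrieves every row whose key begins with the MD5 hash of a user identifier; rather than incrementing the hash to build the end key, it appends a “~”, which sorts after any timestamp digit:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.MD5Hash;

    public class UserRangeScan {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "events"); // hypothetical table

            // Hypothetical row-key scheme: MD5(userId) + "-" + timestamp
            String prefix = MD5Hash.getMD5AsHex(Bytes.toBytes("user-1234")) + "-";

            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(prefix));      // first possible key for this user
            scan.setStopRow(Bytes.toBytes(prefix + "~")); // "~" sorts after all timestamps

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }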

The challenge with HBase is that you can only query for data based on the row key, but if you can live with that constraint then you can gain the capability of storing petabytes of data across thousands of commodity machines.

MapReduce

In addition to accessing data in real time through row keys, you also have the ability to access data via MapReduce jobs. Everything presented in the past couple of articles about MapReduce is applicable to HBase.

HBase provides helper utilities for executing MapReduce jobs that allow HBase to assume any of the following three roles:

  • HBase can be the data source (the source of the data you’re analyzing)
  • HBase can be a data sink (the destination to which your output will be written)
  • HBase can be both a data source and a data sink

The specific MapReduce process in HBase is as follows:

  1. Provide a method that executes a table scan over the rows to analyze; the scan can return all rows or a subset of rows.
  2. The mapper is passed the row key for the row, and a Result object provides the mapper with access to all of the column families, qualifiers, and versions for that row.
  3. The mapper performs its business logic and emits a key and a value.
  4. Hadoop groups all of the same keys together and constructs a list of values associated with each key.
  5. The reducer is passed a key emitted by the mapper and the list of values.
  6. The reducer performs its business logic and emits its key and the results of its analysis.

The only real difference between traditional Hadoop MapReduce jobs and HBase MapReduce jobs is that you write a table scan as input (assuming that HBase is being used as a data source) that feeds your mappers; the mapping and reducing logic remains the same. A sketch of such a job follows.
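Here is a sketch, with hypothetical table and column names, that uses HBase’s TableMapReduceUtil helper to wire a table scan to a TableMapper (HBase as the data source); a companion initTableReducerJob() helper covers the data-sink role:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NameCountJob {

        // Steps 2-3: the mapper receives the row key and a Result holding the row's cells
        static class NameMapper extends TableMapper<Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                    throws IOException, InterruptedException {
                byte[] name = columns.getValue(Bytes.toBytes("demographics"), Bytes.toBytes("name"));
                if (name != null) {
                    context.write(new Text(Bytes.toString(name)), ONE);
                }
            }
        }

        // Steps 5-6: sum the occurrences of each name
        static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "name-count");
            job.setJarByClass(NameCountJob.class);

            // Step 1: the table scan that feeds the mappers (all rows in this case)
            Scan scan = new Scan();
            scan.setCaching(500);        // fetch rows in batches to reduce RPC round trips
            scan.setCacheBlocks(false);  // don't pollute the region server block cache

            TableMapReduceUtil.initTableMapperJob(
                    "users", scan, NameMapper.class, Text.class, IntWritable.class, job);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }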

Deployment

HBase, because it runs on top of Hadoop, has many of the same deployment considerations. Specifically, Hadoop includes the following components:

  • Hadoop Name Node and Secondary Name Node: the Hadoop infrastructure that identifies the location of all data across all data nodes. Despite its name, the secondary name node is not a hot standby; it periodically checkpoints the name node’s metadata, and that checkpoint can be used to manually recover the name node in the event that it goes down.
  • Hadoop Job Tracker: the Hadoop infrastructure for executing MapReduce jobs. Jobs that are sent to the Job Tracker are distributed to the appropriate Task Trackers for execution.
  • ZooKeeper: maintains online configuration state
  • Data Nodes: Hadoop nodes that maintain and manage HDFS data and that execute MapReduce jobs
  • Task Trackers: Hadoop nodes that execute MapReduce tasks; note that you do not need to include Task Trackers if you do not intend to execute MapReduce jobs in your use case.

Additionally, HBase adds the following components:

  • HBase Master: monitors all of the Region Servers and is the interface for all meta-data changes
  • Region Servers: entry-point for HBase clients to query for data

There are multiple options to consider when setting up an HBase environment, which change based on your needs. Chapter 9 of Manning’s HBase in Action has a very good discussion about the deployment requirements for different types of environments: I diagrammed their solutions and present a brief summary here.

Figure 2 Cluster Deployment for a prototype (under 10 servers) environment

In a small environment, you can collocate the Hadoop processes (Name Node and Job Tracker) on the same machine as ZooKeeper and the HBase Master. The secondary name node should be on a separate machine in case the machine hosting the name node fails. Each slave node contains a Hadoop data node and a task tracker (if you opt to enable MapReduce) as well as an HBase Region Server. Clients can then execute queries against the region servers or submit MapReduce jobs to the Job Tracker.
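For reference, the hbase-site.xml that puts HBase into this fully distributed mode is short; the host names below are hypothetical, but the property names are the standard ones:

    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <!-- the HDFS path under which HBase stores its data -->
        <value>hdfs://namenode.example.com:8020/hbase</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <!-- false would run everything in a single JVM (standalone mode) -->
        <value>true</value>
      </property>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <!-- comma-separated list of ZooKeeper hosts -->
        <value>master.example.com</value>
      </property>
    </configuration>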

Figure 3 shows a sample deployment environment for a small cluster of 10-20 servers.

Figure 3 Cluster Deployment for a small (10-20 servers) environment

As the size of your environment increases, you are going to want to separate the HBase Master from the Hadoop nodes (name node and job tracker). You also want to add a second HBase Master in case the main server goes down. Finally, the HBase Master and ZooKeeper are not resource-intensive processes, so you can feel free to collocate them on the same machines. Note that we only have one ZooKeeper in this environment because ZooKeeper requires an odd number of servers and we don’t quite need three instances yet.

Figure 4 shows a sample deployment environment for a medium cluster of up to 50 servers.

Figure 4 Cluster Deployment for a medium (up to 50 servers) environment

In a medium environment we separate the Hadoop name node from the job tracker node and increase the ZooKeeper deployment from one node to three; because ZooKeeper requires an odd number of servers, two is not an option. We don’t necessarily need three HBase Master nodes, two would suffice, but if we have an additional machine that is already running ZooKeeper, why not?
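For reference, a three-node ensemble is declared in each server’s zoo.cfg roughly as follows (host names hypothetical); a majority of the ensemble, here two of three, must agree, which is why even server counts buy no additional fault tolerance:

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.N=host:peerPort:leaderElectionPort
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888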

You may also notice that, if we are running MapReduce, we took this opportunity to separate the task tracker from the data node. This hurts our MapReduce performance because we lose the locality of the task tracker to the node that contains the data it operates on, but the task tracker can consume enough system resources to hinder the region server and thereby impact your service-level agreements for real-time queries.

Finally, figure 5 shows a sample deployment environment for a large cluster of more than 50 servers.

Figure 5 Cluster Deployment for a large (more than 50 servers) environment

In a large deployment we increase the number of HBase Master and ZooKeeper nodes to five, and we invest in more capable hardware for the Hadoop name node and job tracker.

Conclusion

This article reviewed the tenets underpinning Google’s BigTable paper and presented HBase as an open source implementation of BigTable. It reviewed the four-dimensional data model that HBase exposes and then discussed the two ways that you can access data in HBase: as a key/value store or by using MapReduce. Finally, this article reviewed deployment considerations and suggested some cluster configurations based on the size of your environment.

If you are going to manage a production HBase environment, I highly suggest that you review Manning’s HBase in Action, in particular chapter 9. And I’m going to urge my partner in this series, Eric Wright, to share any advice that he may have from his own personal experience.
