If you’re just joining this series, it is one part of a response to the gap between how development and operations view technology and measure their success, a gap that makes it possible for dev and ops to succeed individually while the organization fails. So how can we better align development and operations so that they speak the same language and work toward the success of the organization as a whole? This series addresses a portion of that problem by giving ops teams insight into how specific architecture and development decisions affect the day-to-day operational requirements of an application.
This article starts a new series about Big Data. You’ve probably heard it said that “Data is King”: the more meaningful data we can capture and analyze, the greater the positive impact on our business and our users. A study conducted by the McKinsey Global Institute (MGI) estimated that, by fully leveraging big data, retailers could increase their operating margin by up to 60%, US healthcare could drive efficiency and quality to the tune of $300B in value every year, governments in Europe could save $149B in operational efficiency, and users of “services enabled by personal-location data could capture $600B in consumer surplus.” These are big numbers!
The question for us, however, is what big data means to developers and to those responsible for operating the applications and infrastructure required to maintain this data. A.T. Kearney cites an estimate that 2.5 zettabytes of data were generated in 2012 alone, and that by 2020 we will reach nearly 45 zettabytes of stored data. One zettabyte is equal to a billion terabytes!
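To put the zettabyte figures above in more familiar units, the conversion can be sketched in a few lines. This is only illustrative arithmetic, assuming decimal (SI) units where a terabyte is 10^12 bytes and a zettabyte is 10^21 bytes; the function name is made up for this example.

```python
# Decimal (SI) units: assumed here; binary units (2**40, 2**70) differ slightly.
TERABYTE = 10**12   # bytes
ZETTABYTE = 10**21  # bytes


def zettabytes_to_terabytes(zb: float) -> float:
    """Convert a quantity in zettabytes to terabytes (hypothetical helper)."""
    return zb * ZETTABYTE / TERABYTE


# The three quantities mentioned in the text: 1 ZB, 2.5 ZB (2012), 45 ZB (2020).
for zb in (1, 2.5, 45):
    print(f"{zb} ZB = {zettabytes_to_terabytes(zb):,.0f} TB")
```

In other words, the 2012 estimate alone works out to 2.5 billion terabytes.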
Big Data is real, and the problem of effectively analyzing this data is complex. This series begins with a survey of the different Big Data providers and the problems each is intended to solve, then looks at each of the major Big Data solutions in turn: its problem domain and a cursory view of the operational requirements to manage it. At that point I’ll turn it over to my partner in this series, Eric Wright, to get into the operational meat of these behemoth data stores!
Big Data Overview
Gartner defines Big Data as “high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” Wikipedia describes Big Data with the following characteristics:
- Volume: the quantity of data that is produced
- Variety: the different categories of data being produced
- Velocity: how quickly data is captured and saved
- Variability: how consistent or inconsistent the data is
- Veracity: the quality of the data
- Complexity: how much processing the data requires before value can be derived from it
From these characteristics we can ascertain that different business needs are going to require different strategies and different solutions for managing and analyzing data.
Big Data Solutions
In the early days of data storage we settled on a one-size-fits-all model: relational databases. Relational databases were built on the concept of entities and the relationships between those entities. Relationships could be defined as one-to-one, one-to-many, or many-to-many, and there was a prescribed way of normalizing data (reducing or eliminating duplicate data) and optimizing relationships. Different vendors emerged with different strategies for managing the performance of data queries and for scaling the amount of data that could be stored, but the core data representation was consistent.
As Big Data emerged, however, different vendors approached data storage differently to address different problem domains. The following list summarizes some of the major Big Data vendors and their solutions:
In the following sections we’ll provide a brief history and overview of these Big Data solutions and describe the problems they were designed to solve.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005, initially to support the open source Nutch search engine project; Yahoo became a major backer after Cutting joined the company the following year. It was built as an open source implementation of two Google papers: MapReduce and Google File System. Google File System (GFS) is a distributed file system that Google developed for its search engine to distribute files across large clusters of commodity machines. MapReduce is a functional programming model for analyzing data stored in files in GFS; it emphasizes sending the analysis code across the network to the data it is intended to analyze, rather than moving the data to the code. In short, GFS and MapReduce provide a very fast and efficient means of analyzing very large sets of data.
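To make the MapReduce model described above a little more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) using the classic word-count example. It is not Hadoop's API: the function names are invented for illustration, and a real cluster would run the map step on the machines that hold each block of the file and perform the shuffle across the network.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    # Each string stands in for a chunk of a much larger distributed file.
    print(word_count(["big data is big", "data is king"]))
    # {'big': 2, 'data': 2, 'is': 2, 'king': 1}
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own key, both steps can be spread across many machines, which is what lets the approach scale.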
Hadoop can execute MapReduce jobs on demand, but its most common use case is offline analysis. As such, it is found in many data centers as the engine for business intelligence and analytics.
While Hadoop implements Google File System and MapReduce, HBase, also known as the Hadoop database, implements a third Google paper: Bigtable. The paper defines Bigtable as “a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers”. HBase is a database that runs on top of the Hadoop Distributed File System (HDFS, Hadoop’s counterpart to GFS) and supports analysis of structured data stored as records, rather than as raw files. It can efficiently execute MapReduce jobs against its data for offline analysis, but it also supports real-time queries using a key/value paradigm: a record can be accessed directly via its row key, or a set of records can be returned by scanning a range of row keys. HBase organizes its data in column families: a row key identifies a row, each row holds one or more column families, a column family contains columns, and each column holds a value (of which a configurable number of versions can be retained). You can loosely think of column families as being similar to tables in an RDBMS.
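The data model described above (row keys, column families, columns, and versioned values, with direct lookups and range scans) can be sketched as a small in-memory structure. This is a toy for illustration only, not the HBase client API: the ToyTable class, its method names, and the sample keys are all assumptions made for this example, and plain Python strings stand in for HBase's byte-array row keys.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one HBase-style table (illustrative only)."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)     # column families are fixed up front
        self.max_versions = max_versions  # versions retained per cell
        self._row_keys = []               # kept sorted, so range scans are cheap
        # row key -> family -> column -> [(timestamp, value), ...] newest first
        self._rows = {}

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self._rows:
            insort(self._row_keys, row_key)
            self._rows[row_key] = defaultdict(dict)
        versions = self._rows[row_key][family].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row_key):
        """Direct lookup by row key; returns the newest value per column."""
        row = self._rows.get(row_key, {})
        return {f"{fam}:{col}": vs[0][1]
                for fam, cols in row.items() for col, vs in cols.items()}

    def get_versions(self, row_key, family, column):
        """All retained (timestamp, value) versions of one cell, newest first."""
        return list(self._rows.get(row_key, {}).get(family, {}).get(column, []))

    def scan(self, start_row, stop_row):
        """Range scan over sorted row keys: start inclusive, stop exclusive."""
        lo = bisect_left(self._row_keys, start_row)
        hi = bisect_left(self._row_keys, stop_row)
        return [(key, self.get(key)) for key in self._row_keys[lo:hi]]


table = ToyTable(families=["info"], max_versions=2)
table.put("user#001", "info", "name", "Ada", timestamp=1)
table.put("user#001", "info", "name", "Ada L.", timestamp=2)
table.put("user#002", "info", "name", "Grace", timestamp=1)
table.put("video#001", "info", "title", "Intro", timestamp=1)

print(table.get("user#001"))          # newest value for each column in the row
print(table.scan("user#", "user#~"))  # every row whose key starts with "user#"
```

The point to take away is that rows are kept in row-key order, so a lookup by key and a scan over a contiguous range of keys are both cheap, while anything else requires reading the whole table.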
There are many uses for HBase, but one that I have implemented myself is using HBase when you want both online and offline analysis of the same data. Intelligent design of HBase row keys makes this possible!
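As one hedged illustration of what row key design can look like (a commonly described pattern, not necessarily the design used in the project mentioned above), a composite key of an entity ID followed by a reversed timestamp keeps each entity's records adjacent and sorted newest first. An online query can then read a narrow key range for one entity, while an offline job scans the whole table. The helper names, the "#" separator, and the 13-digit padding are all assumptions made for this sketch.

```python
from datetime import datetime, timezone

# Assumption: 13 decimal digits is enough to hold epoch milliseconds.
MAX_MILLIS = 10**13 - 1


def make_row_key(entity_id: str, event_time: datetime) -> str:
    """Build '<entity>#<reversed timestamp>' so newer events sort first."""
    millis = int(event_time.timestamp() * 1000)
    reversed_ts = MAX_MILLIS - millis
    return f"{entity_id}#{reversed_ts:013d}"  # zero-padded for stable ordering


def prefix_range(entity_id: str) -> tuple[str, str]:
    """Start (inclusive) and stop (exclusive) keys covering one entity."""
    # '$' is the character immediately after '#', so it bounds the prefix.
    return f"{entity_id}#", f"{entity_id}$"


older = make_row_key("sensor-7", datetime(2014, 1, 1, tzinfo=timezone.utc))
newer = make_row_key("sensor-7", datetime(2014, 1, 2, tzinfo=timezone.utc))

print(sorted([older, newer])[0] == newer)  # True: the newest event sorts first
print(prefix_range("sensor-7"))            # key range for an online lookup
```

This sketch assumes entity IDs never contain the separator character; a real design would also need to consider how keys are distributed across the cluster.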
In part II of this article we’ll look at the rest of the technologies we listed, beginning with MongoDB. The next several articles will review each in greater detail, describe how they solve their intended problem, and describe, from a cursory view, how they are deployed and managed in production.