If you’re just joining this column, it is one aspect of a response to the gap between how development and how operations view technology and measure their success – it is wholly possible for development and operations to be individually successful, but for the organization to fail.
So, what can we do to better align development and operations so that they can speak the same language and work towards the success of the organization as a whole? This article series attempts to address a portion of this problem by giving operations teams insight into how specific architecture and development decisions affect the day-to-day operational requirements of an application.
The current article series is reviewing Big Data and the various solutions that have been built to capture, manage, and analyze very large amounts of data. Unlike the relational databases of the past, Big Data is not one-size-fits-all, but rather individual solutions have been built that address specific problem domains.
Last time we started a two-part review of Hadoop. The previous article presented an overview of Hadoop, its architecture, how data is imported into its Hadoop Distributed File System (HDFS), and how data is analyzed. This article dives deeper into data analysis using MapReduce.
Introduction to MapReduce
As you might surmise, analyzing very large amounts of data requires a different programming paradigm than loading a handful of records from a database. To support this, Google defined a new programming paradigm called MapReduce. MapReduce is divided into two steps:
- Mapper: given an input, emits a key and a value
- Reducer: analyzes all of the data associated with a specific key
As we saw in the previous article about Hadoop, Hadoop provides all of the infrastructure required to distribute your mappers to the machines that hold the data they should analyze, performs the correlation that groups keys together, and delivers each key with its grouped values to the reducer. As a reminder, figure 1 shows the organization of Hadoop, its data, and how it executes work against that data.
Figure 1 Hadoop Internal Structure
The Hadoop Master Node contains two core processes, the JobTracker and the NameNode, and each slave node contains two processes, the TaskTracker and the DataNode. The JobTracker is responsible for distributing map and reduce jobs to TaskTrackers for execution. The NameNode knows where all data is located across the Hadoop cluster; on each slave node, that data is managed by the DataNode.
Many years ago, Kernighan and Ritchie wrote a book entitled “The C Programming Language”, and it started with an example called “Hello, World” that showed how to write the words “Hello, World” to the system output. The concept was that when learning a new programming language, you should learn the minimum amount of work required to do something meaningful. That strategy has since been adopted by nearly every introductory programming textbook, so I thought it might be a good idea to do the same thing here (although my goal is to show you how MapReduce works, not to teach you how to write MapReduce applications).
One of the simplest examples for which Hadoop is known, and the one considered its “Hello, World” example, is the word count application. For those of you who would like to take the code for a spin, you can find this example in an article that I wrote for InformIT.com. The word count application is given a large piece of text and outputs the number of times each word in the text appears. For example, the code in the companion Hadoop article took the book Moby Dick, which is available in text form on Project Gutenberg, and counted all of the words in it. In this case, the word “a” appeared 4687 times while the word “your” appeared 251 times.
Let’s design such an application. The input will be a large text file, but Hadoop will break it up into individual lines for us before passing it to our mapper. The mapper will convert the line of text to lower case (so that we match words regardless of case) and break the line up into individual words. Then it will output each word that it sees along with the number of times it sees it (which will be “1” in every case). It might seem strange to output “1” every time, but because of how Hadoop will correlate things for us, it won’t really help to make any optimization on a line-by-line basis.
Next, Hadoop will correlate all occurrences of each word, create a list of values for it (all of the values will be “1”), and pass each word and its list of values to a reduce operation. The reducer will then iterate over the values and sum up all of the numbers in the value list. All of this is shown in figure 2.
Figure 2 MapReduce Word Count Application
Figure 2 takes the opening words of Charles Dickens’ A Tale of Two Cities: “it was the best of times, it was the worst of times”. This sentence is sent to the mapper, which outputs each word with a value of “1”. Hadoop then correlates the words together and builds a list of values for each word. The result is that we have each word as the “key” and the value is a list of “1”s. The reducer then iterates over the list of “1”s and computes their sum. From this we can see that “best” and “worst” each appeared only once, while the other words appeared twice.
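For readers who would like to see the flow in figure 2 spelled out, here is a minimal sketch in plain Python. It is not Hadoop code and not the code from the InformIT article; it only simulates the three steps on a single machine. The function names (map_line, shuffle, reduce_word, word_count) are my own, and stripping punctuation from each word is an assumption I added so that “times,” and “times” count as the same word.

```python
from collections import defaultdict

def map_line(line):
    """Mapper: lower-case the line, split it into words, emit (word, 1)."""
    for word in line.lower().split():
        # Assumption: trim surrounding punctuation so "times," matches "times".
        word = word.strip('.,;:!?"\'()')
        if word:
            yield (word, 1)

def shuffle(pairs):
    """Stand-in for Hadoop's correlation step: build a value list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_word(word, values):
    """Reducer: keep a running sum of the values for one key."""
    total = 0
    for value in values:
        total += value
    return (word, total)

def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_word(word, values)
                for word, values in shuffle(pairs).items())

counts = word_count(["it was the best of times, it was the worst of times"])
for word in sorted(counts):
    print(word, counts[word])
```

In a real cluster the mappers run in parallel on the machines that hold the data and Hadoop performs the grouping for you; the sketch only mirrors the order of the steps.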
You might be wondering: if we know that all of the values in the value list are “1”, why don’t we just take the number of elements in the list rather than iterating through the list to create a running sum? That is a very good question, but there is a very good reason for not making this optimization: local reductions. As you might surmise, if we were to execute this code across hundreds or even thousands of machines, sending every individual “1” across the network would generate a great deal of traffic.
An optimization that we often make is called a “local reduction”, and the task that performs it is called a “combiner”. A combiner performs a reduce operation on a local machine before sending the combined results back to Hadoop to be sent to the reducer. For example, a combiner that has seen “it” twice on its machine sends a single “it” with a value of “2” rather than two “1”s, so a reducer that merely counted the elements in its list would undercount. Most of the time we can leverage our reducer and use it as a combiner, but only if we are cognizant that our reducer may run on the same key more than once; otherwise our results may not be accurate.
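To make the point concrete, the following sketch (plain Python, not Hadoop code) runs a hypothetical combiner on the mapper output of two made-up machines and then compares a reducer that sums its values against one that only counts them. The names combine, machine_a, and machine_b are illustrative assumptions.

```python
from collections import defaultdict

def combine(pairs):
    """Local reduction: sum the values per key on a single machine."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

# Hypothetical mapper output on two machines.
machine_a = [("it", 1), ("was", 1), ("it", 1)]
machine_b = [("it", 1), ("was", 1)]

# Each machine combines locally, so fewer pairs cross the network.
combined = combine(machine_a) + combine(machine_b)

# Stand-in for Hadoop's correlation step.
grouped = defaultdict(list)
for key, value in combined:
    grouped[key].append(value)

# A reducer that sums is still correct after a combiner has run...
summed = {key: sum(values) for key, values in grouped.items()}
# ...but a reducer that counts list elements is not.
counted = {key: len(values) for key, values in grouped.items()}

print(dict(grouped))
print(summed)
print(counted)
```

Because addition gives the same answer no matter how the values are grouped, the summing reducer can double as the combiner here. Counting elements does not have that property, which is why the shortcut is unsafe.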
Intermediate MapReduce: Parsing Log Files
The previous word count example was about as simple a MapReduce application as you could write: a simple key, a simple value, and minimal processing. Let’s step it up a little bit and design an application that you might actually use, one that can give us the number of page views on our website per hour.
This problem is another counting problem, but the difference is that we want to partition our data differently, namely by the date and hour of day. In my third article on Hadoop on InformIT.com, I built a MapReduce application to do this, but let me summarize the strategy here.
The following is an excerpt from my Apache Web Server’s log file:
220.127.116.11 - - [16/Dec/2012:05:32:50 -0500] "GET / HTTP/1.1" 200 14791 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
What we need to do is create a key by reducing the timestamp “16/Dec/2012:05:32:50 -0500” to its date and hour (December 16, 2012 at 5am) and then emit that key with a value of “1”, meaning that we saw one log entry for this hour. Hadoop will then handle correlating the keys and sending our reducer each unique key (a date and hour) with its list of values (“1”s). The reducer does exactly what it did in the word count application: it iterates over the list of values and creates a running sum. The output will look something like the following:
2012-11-18 T 16:00:00.000 1
2012-11-18 T 17:00:00.000 21
2012-11-18 T 18:00:00.000 3
2012-11-18 T 19:00:00.000 4
2012-11-18 T 20:00:00.000 5
2012-11-18 T 21:00:00.000 21
2012-12-17 T 14:00:00.000 30
2012-12-17 T 15:00:00.000 60
2012-12-17 T 16:00:00.000 40
2012-12-17 T 17:00:00.000 20
2012-12-17 T 18:00:00.000 8
2012-12-17 T 19:00:00.000 31
2012-12-17 T 20:00:00.000 5
2012-12-17 T 21:00:00.000 21
If you read through the code in the InformIT article, you’ll see that the challenging part was in parsing the log entry and extracting the date and then in building a custom Date key class to use to group page views together. The rest of the processing was almost identical to the word count example.
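As a rough illustration of that key extraction, here is a short Python sketch. It is not the custom Date key class from the InformIT article; the function names (hour_key, map_log_line) and the string key format are my own assumptions, and the regular expression simply takes the first bracketed field of the log entry as the timestamp.

```python
import re
from datetime import datetime

# Assumption: the first [...] field in the entry is the request timestamp.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def hour_key(log_line):
    """Return a 'YYYY-MM-DD HH:00' key for a log entry, or None if unparsable."""
    match = TIMESTAMP.search(log_line)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    # Drop minutes and seconds so every entry in the same hour shares a key.
    return stamp.strftime("%Y-%m-%d %H:00")

def map_log_line(log_line):
    """Mapper: emit (date-and-hour key, 1) for each parsable log entry."""
    key = hour_key(log_line)
    if key is not None:
        yield (key, 1)

sample = '10.0.0.1 - - [16/Dec/2012:05:32:50 -0500] "GET / HTTP/1.1" 200 14791'
print(list(map_log_line(sample)))
```

The sample line uses a made-up address. Entries that cannot be parsed are silently skipped in this sketch; a production job would more likely count them so that you can tell how much of the log was ignored.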
The point to take away from this section is that when parsing log files, the important planning is in your key design, or more specifically, identifying the data points that you want to capture. But counting is only one thing that Hadoop can do well. The next section introduces you to MapReduce design patterns and points you to resources you can leverage to learn how to build more complex MapReduce applications.
MapReduce Design Patterns
As MapReduce has been used in the industry to solve different problems, the domains of those problems have been identified and design patterns, or best practices, have been developed to solve common problems. O’Reilly has a great book, entitled MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, that summarizes these patterns and approaches. I am not going to attempt to summarize a book in 500 words, but I wanted you to be aware of the types of problems that MapReduce can solve, and MapReduce Design Patterns does a good job of summarizing those problems:
- Summarization: this set of design patterns calculates numerical summarizations of your data. We already saw how we might count things to derive data such as page views per hour, but we might want to do more complex summarizations, such as determining the minimum or maximum time that a user spent on your website, the average number of page views per user per visit, or a count of the number of unique page views per hour. The summarization patterns help you by providing strategies for handling averages, minimums and maximums, medians and standard deviations, and so forth.
- Filtering: filtering patterns help you identify a subset of your data. For example, you might extract small subsets of data like finding the top 10 visitors to your site or you might perform a full de-duplication effort and identify all duplicate records.
- Data Organization: data is captured and stored in Hadoop, but many times the organization or partitioning of that data is not exactly what you need in order to properly analyze it. These patterns define ways that you can restructure or reorganize your data so that it can be better analyzed. Additionally, over time, your analytic needs may change so analysis that worked last year may not necessarily work this year. Data organization design patterns can help you address this evolution in your analytic needs.
- Join: your data may be very interesting in and of itself, but it may be more interesting when combined with other data. For example, your web logs provide a rich amount of information about what your users are doing, but it might be better combined with your user SQL database to understand what specific users are doing with respect to their historical behavior. The join design patterns help you combine data from multiple sources to derive insight.
- Metapatterns: as you might surmise from the collection of pattern descriptions provided above, there are many “point” solutions to solving specific problems. But if your problem is particularly complex, you might find it beneficial to integrate multiple patterns together. The metapatterns are patterns of patterns that describe ways to combine multiple patterns to solve complex problems.
- Input and Output: the final set of patterns are strategies to customize the way that Hadoop loads and stores data. The default way that Hadoop leverages the HDFS for data load and data storage may work for you, but if it does not meet your needs then there are strategies to override and customize this behavior.
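To give a flavor of why the summarization patterns deserve some thought, consider averages. Unlike a sum, an average of averages is generally not the overall average, so a reducer that computes an average cannot simply be reused as a combiner. One common approach, sketched below in plain Python, is to carry a (total, count) pair and divide only at the very end. This is my own illustration rather than code from the book, and the user name and visit durations are invented.

```python
def map_visit(user, seconds):
    """Mapper: emit (user, (total_seconds, visit_count)) for one visit."""
    return (user, (seconds, 1))

def merge(partials):
    """Combiner/reducer step: add up partial totals and partial counts."""
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return (total, count)

def average(partials):
    """Final step: divide only after every partial has been merged."""
    total, count = merge(partials)
    return total / count

# Hypothetical visit durations (in seconds) for one user on two machines.
machine_a = [map_visit("alice", s)[1] for s in (10, 20, 30)]
machine_b = [map_visit("alice", s)[1] for s in (100,)]

# Merging (total, count) pairs gives the true average: 160 / 4.
correct = average([merge(machine_a), merge(machine_b)])
# Averaging the two local averages gives a different number: (20 + 100) / 2.
naive = (average(machine_a) + average(machine_b)) / 2

print(correct, naive)
```

The same idea, carrying enough intermediate state that partial results can be merged safely, shows up in several of the other summarizations as well.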
We have covered quite a bit of ground in this article, but I thought it was important for you to understand how MapReduce applications work and the types of problems that they solve so that you can assess your environmental requirements. Furthermore, MapReduce is not limited to Hadoop applications: other Big Data implementations, such as HBase and MongoDB, make use of MapReduce, so this foundational understanding will benefit you in other situations.
This article completes our cursory review of Hadoop. The first articles described Hadoop from an infrastructural and mechanical approach (how it does what it does) and this article reviewed Hadoop from an application approach by describing how to analyze data contained in Hadoop. The next article will review Hadoop’s database cousin: HBase.