MapReduce is the name of both (i) a distributed processing programming model, popularized by Google and implemented, for example, in Apache Hadoop, and (ii) a functional processing technique. It consists of two steps: the map() function and the reduce() function. The map() function converts each element in a data set into a modified element. The reduce() function takes subsets of modified elements and aggregates each subset into a single element. The idea is that the map() and reduce() functions can be distributed easily among multiple processors, which facilitates the processing of large data sets.
Let’s dive into the MapReduce concept in greater detail.
MapReduce — The Concept
The idea of MapReduce is to help you process data. Both functions, map() and reduce(), are well-known in functional programming. Both are set operations: they manipulate an existing set of elements and create a new set with the result of the manipulation. The map() function transforms one set into another by modifying each element individually, while the reduce() function transforms one set into another by aggregating groups of elements.
So, how do both functions manipulate an existing set?
- map(S, f) — The function takes a set of (key, value) pairs S and another function f as input. It then transforms each element (k, v) in S into a new tuple by applying the function f(k, v). Mathematically, we transform each element as follows: for all (k, v) in S: (k, v) -> f(k, v).
- reduce(S, g) — The reduce function takes a set of (key, value) pairs and transforms all values associated with a given key into a single (key’, value’) tuple. It serves as an aggregator function that collapses a bunch of (key, value) pairs sharing the same key into a single (key’, value’) pair. Mathematically, it transforms a number of (key, value) pairs with the same key as follows: (k, valueSet) from S -> (k, g(valueSet)).
On a high level, we first manipulate each element of the original set using the map() function, then group the resulting elements (for example, by key), and finally aggregate each group of values into a single element.
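To make this concrete, here’s a minimal Python sketch of the two set operations. The helper names map_set and reduce_set are my own (chosen to avoid clashing with Python’s built-in map() and functools.reduce()), not part of any official MapReduce API:

```python
from itertools import groupby
from operator import itemgetter

def map_set(S, f):
    """map(S, f): apply f to every pair, (k, v) -> f(k, v)."""
    return [f(k, v) for (k, v) in S]

def reduce_set(S, g):
    """reduce(S, g): group pairs by key and aggregate each group's
    values with g, (k, valueSet) -> (k, g(valueSet))."""
    ordered = sorted(S, key=itemgetter(0))
    return [(k, g([v for _, v in pairs]))
            for k, pairs in groupby(ordered, key=itemgetter(0))]

# Double every value, then sum all values that share a key.
S = [("a", 1), ("b", 2), ("a", 3)]
mapped = map_set(S, lambda k, v: (k, 2 * v))  # [('a', 2), ('b', 4), ('a', 6)]
print(reduce_set(mapped, sum))                # [('a', 8), ('b', 4)]
```

Note how reduce_set first sorts and groups the pairs by key before aggregating; this grouping step is exactly what MapReduce systems call the “shuffle” phase.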
Does this all sound a bit abstract? You’re not alone! Let’s dive into a practical example next!
MapReduce — The Example
Here’s how the MapReduce concept plays out when aggregating word counts across a number of documents:
Google processes huge data sets. For example, the search engine needs to know how often words occur in web documents.
The data sets are too large for a single computer. A single machine has only a handful of CPU cores, so processing the data would take forever. Hence, Google uses hundreds of computers in parallel.
But writing parallel programs is hard. You must tell each of the hundreds of computers exactly what data to process and what to send to other computers. Writing these parallel programs, again and again, is expensive. What would you do if human experts cost you more than $100k per year?
Right, you would fix it. And so did Google. They developed a parallel processing system called MapReduce, published in a 2004 paper. With MapReduce, writing parallel programs is simple: you define two functions, map() and reduce(), and the MapReduce system takes care of distributing the data and the function executions. The MapReduce system is even resilient against the crash of many computers.
- The map() function transforms input data into output data. For example, it transforms a web document into a collection of (word, frequency) tuples. A document that contains the word “Google” ten times is mapped to the tuple (“Google”, 10).
- The reduce() function takes tuples with the same key and maps them to a new value. For example, it takes all (“Google”, x) tuples and creates a new tuple that sums over all values x. In other words, it reduces the keyword “Google” to a single value. This single value is the number of occurrences of the word “Google” in all web documents (a runnable sketch follows after this list).
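Here’s how this word-count example might look in plain, single-process Python. The toy documents and helper names are invented for illustration; this shows the functional idea, not Google’s distributed implementation:

```python
from collections import Counter
from functools import reduce

# Toy stand-ins for web documents (invented for illustration).
documents = [
    "google maps google search",
    "google mail",
]

def map_doc(doc):
    """Map one document to (word, frequency) tuples."""
    return list(Counter(doc.split()).items())

def merge_counts(a, b):
    """Reduce step: sum the per-word frequencies of two partial results."""
    return Counter(a) + Counter(b)

partial = [map_doc(doc) for doc in documents]
# partial[0] contains ('google', 2); partial[1] contains ('google', 1).
total = reduce(merge_counts, (dict(p) for p in partial))
print(total["google"])  # 3 -- occurrences of 'google' across all documents
```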
MapReduce — The System (Idea)
The idea of the MapReduce system proposed by Google in a 2004 paper is simple: provide the user a straightforward interface, the map() and reduce() functions, and thus allow them to execute arbitrary algorithms in a distributed environment on a potentially large number of computers.
There are two significant advantages of distributed processing:
- Speed: If you have 1000 computers among which you can divide the work, you can potentially accelerate the computation by a factor of 1000. No distributed system achieves this ideal level of scalability, but even a 100x speedup on 1000 computers can be justified because it saves the time of the programmer or of the users waiting for the result. In distributed systems terminology, you’d say that distribution can minimize latency.
- Failure tolerance: If you run your computation on one computer and this computer crashes, the user won’t see any result. As it turns out, the likelihood of a single computer failing can become quite high, say 0.1% or so, especially if the computation takes a long time. Say the user of your application is a doctor in surgery: you cannot afford the risk that a single computer failure delays a result the surgery depends on. Thus, the simple idea of distribution is to use multiple machines. If you have 1000 machines and your system keeps working until more than 500 of them have failed, the likelihood of the system becoming unresponsive is vanishingly small: even the probability that 500 specific machines all fail is on the order of 0.001**500, an astronomically small number (a quick calculation follows below).
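Here’s a quick back-of-the-envelope check of that failure-tolerance claim in Python. The 0.1% failure probability and the independence of machine failures are simplifying assumptions carried over from the discussion above, not measured values:

```python
import math

# Assumption: each machine fails independently with p = 0.001 (0.1%)
# during the computation.
p = 0.001

# One machine: the whole computation fails with probability 0.1%.
# 1000 machines, where the system survives until more than 500 fail:
# even the chance that 500 *specific* machines all fail is p**500.
# Computed in log space, because p**500 underflows a float to 0.0:
log10_prob = 500 * math.log10(p)
print(f"P(500 machines all fail) ~ 10^{log10_prob:.0f}")  # ~ 10^-1500
```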
However, creating distributed systems is hard. You need to ensure that all computers can communicate with each other via TCP and the like. Any computer may fail at any time, so you need to handle all the possible failures that may occur. You must checkpoint the computation to stable storage so that if a computer fails, another one can recover the result from there. And many more challenges await you.
To help programmers create robust, failure-tolerant distributed systems, Google’s engineers developed the MapReduce system, which extends the programming model described above with a few features that facilitate the distribution of computation.
Their 2004 paper includes a system overview figure. Here’s a rough sketch of the steps involved:
- Multiple workers run on different machines. One worker serves as the master of the system.
- The master assigns the computation of the map() function to some workers and the computation of the reduce() function to other workers.
- The input files are split and distributed among the map() workers.
- The map() workers compute the map() function results and write them to stable storage (intermediate files) to ensure failure tolerance. If any of these workers fails, another worker can take over by reading the intermediate files from stable storage.
- The reduce() workers read the intermediate results from stable storage (files), compute the reduce() function, and write the final result to the output files.
- The user can then read the final output from there.
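To tie these steps together, here’s a toy, single-process Python simulation of that data flow. All names are my own, and the in-memory lists merely stand in for the intermediate files and distributed workers of the real system:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    """Toy walk-through of the MapReduce data flow on one machine."""
    # Map phase: each split would be assigned to a map() worker.
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))  # stands in for intermediate files

    # Shuffle phase: group the intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: each key's value list would go to a reduce() worker.
    return {key: reduce_fn(values) for key, values in groups.items()}

# User-supplied functions for the word-count example:
def map_fn(document):
    return [(word, 1) for word in document.split()]

def reduce_fn(values):
    return sum(values)

splits = ["google maps google", "google search"]
print(run_mapreduce(splits, map_fn, reduce_fn))
# {'google': 3, 'maps': 1, 'search': 1}
```

The crucial difference to the real system: there, each phase runs on many machines in parallel, and the master restarts failed tasks from the persisted intermediate files.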
Thus, the user of the MapReduce system must only provide the input files and the map() and reduce() functions. They don’t need to care about the distribution aspect, which greatly improves the usability of the system.