Google processes huge data sets. For example, the search engine needs to know how often words occur in web documents.
These data sets are far too large for a single computer. Even a machine with many CPU cores would take far too long to process them alone. Hence, Google uses hundreds of computers in parallel.
But writing parallel programs is hard. You must tell each of the hundreds of computers exactly what data to process and what to send to other computers. Writing these parallel programs, again and again, is expensive. What would you do if a human expert cost you more than $100k per year?
Right, you would fix it. And so did Google. They developed a parallel processing system called MapReduce, first described publicly in 2004. With MapReduce, writing parallel programs is simple. You define two functions, map() and reduce(). Then, the MapReduce system takes care of distributing the data and executing the functions. The MapReduce system even keeps running when many of the computers crash mid-computation.
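To see how little the programmer has to write, here is a minimal single-machine sketch of the MapReduce data flow. The function name run_mapreduce and its parameters are illustrative, not Google's actual API; a real MapReduce system runs the map and reduce phases on many machines and handles distribution and failures for you.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-machine sketch: map, group by key, then reduce."""
    # Map phase: apply map_fn to every input, collecting (key, value) pairs.
    grouped = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            grouped[key].append(value)
    # Reduce phase: combine all values that share a key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}
```

The key idea is the grouping step in the middle: the system, not the programmer, shuffles all values with the same key to one place before reduce() runs.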
The map() function transforms input data into output data. For example, it transforms a web document into a collection of (word, frequency) tuples. A document that contains the word “Google” ten times is mapped to the tuple (“Google”, 10).
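For the word-count example, a map() over one document could look like the sketch below. The function name and the simple tokenization are illustrative assumptions, not Google's actual code.

```python
import re
from collections import Counter

def map_document(document: str):
    """Map one web document to (word, frequency) tuples."""
    # Split on non-word characters; a production tokenizer is more careful.
    words = re.findall(r"\w+", document)
    return list(Counter(words).items())
```

A document containing the word "Google" twice would yield the tuple ("Google", 2) among its outputs.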
The reduce() function takes all tuples with the same key and combines them into a single value. For example, it takes all ("Google", x) tuples and creates a new tuple that sums over all the values x. In other words, it reduces the key "Google" to a single value. This single value is the number of occurrences of the word "Google" across all web documents.
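The summing step above is a one-liner. The function name reduce_counts is an illustrative assumption; real frameworks hand the reducer a key and an iterator over that key's values.

```python
def reduce_counts(word: str, counts):
    """Reduce all (word, x) tuples for one key to a single total."""
    return (word, sum(counts))

# e.g. reduce_counts("Google", [10, 3, 7]) returns ("Google", 20)
```

Because addition is associative, the system can even run partial sums on each machine before the final reduce, which cuts down network traffic.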