MapReduce Framework - How Does It Work?

A four-part Big Data series:

Part 1: Big Data, GFS and MapReduce – Google’s Historic Contributions
Part 2: Hadoop - The Open Source Approach To Big Data Processing You Ought To Know
Part 3: MapReduce - The Big Data Crunching Framework
Part 4: MapReduce Framework - How Does It Work?

We understand that MapReduce eats up the Big Data behemoth one bite at a time - but how does it really take those bites?

To further explain the working of MapReduce and HDFS, let us take an illustrative example in which we need to count the number of occurrences of each word in a given input set. Let the two input text files be:

File 1: it was a moonless night.
File 2: the night was cold.

Since we have to count the occurrences of each word, the two files can be processed independently. The job, therefore, divides the word count problem into two smaller chunks, viz. file 1 and file 2.

File 1 goes to map 1 and file 2 goes to map 2. Map 1 and map 2 execute in parallel.

Map 1 gives the intermediate output as:

it          1
was         1
a           1
moonless    1
night       1

Map 2 gives the intermediate record as:

the         1
night       1
was         1
cold        1

These intermediate records are then processed by the reduce task.

**The final output produced by the reduce task is:**

it          1
was         2
a           1
moonless    1
night       2
the         1
cold        1
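The whole flow above can be sketched in a few lines of Python. This is a minimal, single-machine illustration of the map and reduce steps, not actual Hadoop code; the function names and the punctuation stripping are our own simplifications:

```python
from collections import defaultdict

def map_task(text):
    # Emit an intermediate (word, 1) pair for every word in the input split.
    # Trailing periods are stripped so "night." and "night" count as one word.
    return [(word, 1) for word in text.replace(".", "").split()]

def reduce_task(intermediate_pairs):
    # Sum the counts for each word across all intermediate records.
    counts = defaultdict(int)
    for word, count in intermediate_pairs:
        counts[word] += count
    return dict(counts)

file1 = "it was a moonless night."
file2 = "the night was cold."

# The two map tasks are independent; Hadoop would run them in parallel
# on different nodes. Here we simply run them one after the other.
intermediate = map_task(file1) + map_task(file2)

final_output = reduce_task(intermediate)
print(final_output)
```

Running this prints exactly the table above: "was" and "night" appear twice, every other word once.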

Now imagine that we have a 1000-page book. Each page can be treated as a file, so that we have file 1, file 2, ….. file 1000.

These 1000 files are processed in parallel by the map function, generating a thousand sets of (key, value) pairs, that is, (word, number of occurrences) pairs.

The 1000 record sets produced by the map function are intermediate records. They are processed further by the reduce function, which groups them by key and combines them into the final output, comprising all the words as keys and their numbers of occurrences as values.

If we treat a single line as a file, and each of the 1000 pages has 20 lines, we have 20000 lines in total, and therefore 20000 parallel tasks are generated by the job. These 20000 tasks can then be processed in parallel by 20000 map functions running on 20000 nodes (CPUs) of a Hadoop cluster, giving out 20000 intermediate record sets, which are then reduced to give us the final output as before.
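This many-map-tasks scenario can be mimicked on a single machine with a worker pool standing in for the cluster nodes. A hedged Python sketch follows - a real Hadoop job would ship each split to a separate node, whereas here a local thread pool merely imitates that parallelism, and our two sample lines repeated many times stand in for the book:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_line(line):
    # One "map task" per line: emit a (word, 1) pair for every word,
    # with trailing periods stripped as before.
    return [(word, 1) for word in line.replace(".", "").split()]

def reduce_pairs(pair_lists):
    # The reduce step: merge all intermediate records into final counts.
    counts = Counter()
    for pairs in pair_lists:
        for word, n in pairs:
            counts[word] += n
    return counts

# Stand-in for the 20000 lines of the book: the two sample lines repeated.
lines = ["it was a moonless night.", "the night was cold."] * 10000

with ThreadPoolExecutor() as pool:  # workers play the role of cluster nodes
    intermediate = list(pool.map(map_line, lines))

final = reduce_pairs(intermediate)
print(final.most_common(3))
```

The map tasks are embarrassingly parallel - no task needs another's output - which is exactly why the job scales out so easily.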

(By the way, that would be quite an underutilized bunch of CPUs, but for simplicity's sake, let's be happy with that at this point.)

In this case, however, the processing time will be much smaller, as each map task is limited to a single line of text and can therefore be processed in, say, 0.1 second. If the reducer takes another 1.9 seconds, the CPU processing will be completed in about 2 seconds, ignoring the overheads involved.

That is the power and beauty of Hadoop - doing huge jobs at mind-blowing speeds on economical, commodity machines, usually running free Linux, at web scale!

Today, Hadoop has become a foundational technology. A number of new projects at the Apache Software Foundation have spun out of Apache Hadoop - these include YARN, Apache HCatalog, Apache Ambari, Apache Hive (courtesy Facebook), Apache HBase (courtesy Powerset, a small startup) and Apache Pig (courtesy Yahoo), besides many others. Hadoop is going strong, and its ecosystem is healthy, upbeat and thriving.

A child's stuffed toy elephant called Hadoop, a spark of genius from the child's father, Doug Cutting, tons of human perspiration - and the tsunami of Big Data is tamed. Salute to the team spirit and the human courage.