MapReduce - The Big Data Crunching Framework

A four-part Big Data series:

Part 1: Big Data, GFS and MapReduce – Google’s Historic Contributions
Part 2: Hadoop - The Open Source Approach To Big Data Processing You Ought To Know
Part 3: MapReduce - The Big Data Crunching Framework
Part 4: MapReduce Framework - How Does It Work?

So how do you eat an elephant? One bite at a time, of course, and that is exactly what the MapReduce framework does to the behemoth of Big Data.

Introduced by Google, MapReduce essentially takes a big job and breaks it down into smaller, mutually independent chunks of work. Since these tasks are mutually independent, they can be processed on different machines at the same time, that is, in parallel. The outputs of these machines are then combined to give the final output for the original job.
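The split, process-in-parallel and combine idea can be sketched in a few lines of Python. This is only a toy illustration on a single machine, not how MapReduce itself is implemented: the function names (split_into_chunks, process_chunk, run_job) are made up for this example, and a thread pool stands in for the separate machines of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, chunk_size):
    """Break one big input into smaller, independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def process_chunk(chunk):
    """Work on a single chunk; it needs nothing from any other chunk."""
    return sum(chunk)

def run_job(items, chunk_size=4, workers=3):
    chunks = split_into_chunks(items, chunk_size)
    # Each worker thread stands in for a separate machine (an assumption
    # made for illustration; a real cluster distributes work over a network).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial outputs into the final output of the original job.
    return sum(partial_results)

total = run_job(list(range(1, 101)))
print(total)
```

Because no chunk depends on another, the order in which the workers finish does not affect the combined result.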

A distinguishing feature of this framework is that it runs on commodity machines, which makes it economically attractive compared to bigger, more expensive custom parallel solutions.

Its open source equivalent, inspired by Google’s proprietary MapReduce, was developed by the Apache Software Foundation and is an essential part of the Hadoop project.

Hadoop MapReduce has three major components that perform the core functions, alongside many other components that perform important support functions such as scheduling, monitoring and failure recovery.

The core components are:

1.  MapReduce Job

The job component does its work at the very first stage. It divides the input data set into smaller, independent chunks of work. These chunks have to be independent; otherwise they can’t be processed in parallel.

2.  MapReduce Map Task

The output of the job, i.e., the smaller, independent chunks of work, is processed by map tasks. As these chunks are independent, map tasks can process them in a completely parallel manner. Map tasks transform the input records into intermediate records.

3.  MapReduce Reduce Task

The output of the map tasks, i.e., the intermediate records, is sorted by the framework. This sorted output is then fed to the reduce tasks, which transform the sorted intermediate records into the final output. The final output is then stored in the file system.

This is just the gist of the MapReduce framework.
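To make the three stages concrete, here is a minimal word-count sketch in plain Python. It does not use the Hadoop API; the names split_input, map_task, reduce_task and run_word_count are invented for this illustration, and everything runs sequentially in one process, whereas a real cluster would run the map tasks on different machines.

```python
from itertools import groupby
from operator import itemgetter

def split_input(lines, lines_per_split):
    """Job stage: divide the input into independent chunks."""
    return [lines[i:i + lines_per_split]
            for i in range(0, len(lines), lines_per_split)]

def map_task(chunk):
    """Map stage: turn input records (lines) into intermediate (word, 1) records."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def reduce_task(word, counts):
    """Reduce stage: collapse all intermediate records for one key into one record."""
    return (word, sum(counts))

def run_word_count(lines, lines_per_split=2):
    chunks = split_input(lines, lines_per_split)
    # Map tasks are independent, so a real cluster would run them in parallel.
    intermediate = [record for chunk in chunks for record in map_task(chunk)]
    # The framework sorts the intermediate records by key before reduce sees them.
    intermediate.sort(key=itemgetter(0))
    return [
        reduce_task(word, [count for _, count in group])
        for word, group in groupby(intermediate, key=itemgetter(0))
    ]

lines = ["big data is big", "data is everywhere", "big elephant"]
result = run_word_count(lines)
print(result)
```

The sort step is what lets each reduce call see all the records for one word together, which is why the framework sorts the map output before handing it to the reduce tasks.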


In Hadoop, the MapReduce framework is layered on top of HDFS. Typically, the compute nodes and the storage nodes are the same, that is, MapReduce and HDFS run on the same set of machines. This way computation is taken to where the data is, and the end result is high aggregate bandwidth across the cluster.
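The idea of taking computation to the data can be sketched as a simple greedy assignment: give each map task to a node that already stores its block, and fall back to another node only when no such node is free. This is a simplified illustration under stated assumptions, not Hadoop's actual scheduler; the function assign_tasks and the node and block names are hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node -> number of tasks the node can still accept
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignments = {}
    for block, replicas in block_locations.items():
        # First choice: a node that already holds the data (no network copy).
        local = next((n for n in replicas if slots.get(n, 0) > 0), None)
        if local is not None:
            node = local
        else:
            # Fallback: any node with a free slot; the block is then read remotely.
            node = next((n for n, free in slots.items() if free > 0), None)
        if node is None:
            raise RuntimeError(f"no free slot for block {block}")
        slots[node] -= 1
        assignments[block] = (node, local is not None)
    return assignments

# Hypothetical cluster: three nodes with one free task slot each.
block_locations = {"b1": ["node1", "node2"], "b2": ["node1", "node3"], "b3": ["node1"]}
free_slots = {"node1": 1, "node2": 1, "node3": 1}
plan = assign_tasks(block_locations, free_slots)
print(plan)
```

In this example two of the three tasks read their data from local disk, and only the third has to pull its block over the network. The more tasks run locally, the less the network limits the cluster's total throughput.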

HDFS and MapReduce work in synergy. HDFS provides distributed, fault-tolerant storage, and MapReduce running on top of it provides an economical parallel computing cluster built on commodity hardware.