Big Data, GFS and MapReduce - Google's Historic Contributions

A four-part Big Data series:

Part 1: Big Data, GFS and MapReduce – Google’s Historic Contributions
Part 2: Hadoop - The Open Source Approach To Big Data Processing You Ought To Know
Part 3: MapReduce - The Big Data Crunching Framework
Part 4: MapReduce Framework - How Does It Work?

Big Data is the new buzzword: a name given to the enormous amount of data that is piling up at an unprecedented, and still accelerating, rate. This ever-increasing volume of data is difficult to store, manage and analyze using common database tools.

This data falls into two major categories: human-produced and machine-generated.

Human-produced data includes:

  • emails
  • blog posts
  • comments
  • social media interactions
  • music 
  • videos, etc.  

Machine generated data comes from:

  • various kinds of server logs
  • medical instrumentation such as X-rays, ECG, EEG, MRI, CT scans, etc.
  • weather monitoring sensors
  • ocean mapping instruments
  • seismic sensors
  • various environmental and pollution monitoring systems
  • sub-soil data logs related to oil and mineral explorations
  • business transactions at all levels, from retail to wholesale
  • banking and finance sectors
  • personal and activity monitoring devices and gadgets 
  • space stations and space exploration vehicles
  • radio telescopes
  • research facilities such as the Large Hadron Collider
  • self-quantification
  • wearable technologies
  • Internet of Things
  • plant and industrial networks, etc.

Taken together, these sources are moving us deeper into the data age: whether directly or through machines and sensors, we are producing a data tsunami the likes of which mankind has never witnessed before.

Big Data Characteristics

Big Data is not only about enormous size; the data also varies widely in type and grows at a very high rate. Collectively, these characteristics are known in industry parlance as the three Vs of Big Data, viz.:

  • large Volume
  • high Velocity (data arriving and changing rapidly), and
  • wide Variety (data of many different types)

The Big Data Question

The big question that continuously challenges the storage industry and its best minds is twofold: first, how to build an infrastructure flexible enough to economically store this ever-increasing volume of rapidly changing data of widely varied types; and second, how to extract valuable insights, such as trends and patterns, that can help businesses, research, industry and mankind at large.

The real challenge is to accomplish this storage and search at web scale, a gigantic task indeed.

Google - GFS and MapReduce

The challenge of taming this eruption of big data was recognized and accepted by the data engineers at Google as early as the late 1990s and early 2000s. As a company on the path to becoming synonymous with global search, they were not only trying to tame this tsunami of big data; they were shooting for an even bigger goal: to do so using COTS (commercial off-the-shelf) machines, so as to keep the cost as low as practically possible.

To solve the storage problem, the engineers at Google developed GFS, the Google File System. It is a scalable distributed file system suited to large, distributed, data-intensive applications. In 2003, Google engineers Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung presented their research at the 19th “ACM Symposium on Operating Systems Principles”, held in Bolton Landing, New York.

To help search this data and extract insights from it, Google engineers Jeffrey Dean and Sanjay Ghemawat presented the MapReduce framework at “OSDI’04: Sixth Symposium on Operating Systems Design and Implementation”, held in San Francisco in December 2004.

Both of these technologies, GFS and MapReduce, were proprietary. They existed, but to use them one had to either work for the big G or write the code oneself, a challenge too big for anyone without matching resources.