Hadoop - The Open Source Approach to Big Data Processing You Ought to Know

A four-part Big Data series:

Part 1: Big Data, GFS and MapReduce – Google’s Historic Contributions
Part 2: Hadoop - The Open Source Approach To Big Data Processing You Ought To Know
Part 3: MapReduce - The Big Data Crunching Framework
Part 4: MapReduce Framework - How Does It Work?

Imagine yourself around the turn of the century, sometime between 1999 and 2002. The Internet is evolving, Java is hot, XML is roaring, and more and more people are joining the swelling Internet ocean. So what is the big technology challenge? It is not difficult to fathom that one of the biggest technology challenges of this time is handling the ever increasing volume of data continuously being generated on the Internet - the first waves of what would soon become the Big Data tsunami.

Google was not the only player - there were many others hunting for the silver bullet to win the big game of Big Data.

The Nutch project was evolving at the Apache Software Foundation. The goal was a web-scale, crawler-based search engine - and, unlike Google's, an open source one. It was based on Lucene and Java. In 2004 Nutch demoed its capabilities, successfully processing a hundred million web pages on four nodes - still a long way from web scale. That was 2004, the year Google presented MapReduce; GFS had already been introduced a year earlier.

Inspired by the GFS and MapReduce research papers, Nutch added a distributed file system (DFS) and a MapReduce implementation. With just two part-time developers - Doug Cutting, the then Internet Archive Search Director and Nutch spearhead, and Mike Cafarella, a graduate student at the University of Washington - it took over two years.

The results were quite encouraging - the system scaled up to several hundred million web pages and was easier to program and run - but it was still a far cry from web scale. That was 2006.

These were hectic times. Yahoo, being a search company like Google, was also facing the huge challenge of storing and processing data at web scale in a scalable and economical way.

In 2005, Yahoo’s Eric Baldeschwieler (the famous E14), Owen O’Malley and others set forth to crack the secret of big data processing. It was thanks to the vision and technology leadership of E14 and Raymie Stata, the then CTO at Yahoo, that the team turned to the Apache Software Foundation. Doug Cutting, along with Arun Murthy and many other smart heads, was brought into the team at Yahoo.

The year 2006 proved to be a year of rapid developments. The new, reinforced team at Yahoo began to notch up successes. The Hadoop project was split off from Nutch, with Cutting himself responsible for the Apache open source liaison and E14 responsible for engineers, clusters, users and so on.

With the much needed resources supplied by Yahoo and the innovative leadership of Cutting, Nutch - now morphed into Hadoop - finally reached web scale. That was in 2008.

Hadoop Downstream

Hadoop has encouraged a number of startups along with some established companies to take this big data processing platform forward. Some of the big names from the startup side are:

  • Cloudera
  • Hortonworks (a Yahoo spin-off launched in 2011, with E14 as CTO and Owen O’Malley as co-founder)
  • EMC
  • Greenplum
  • MapR

From the established side, we have:

  • Intel
  • IBM
  • Amazon Web Services
  • Facebook
  • Twitter
  • Yahoo (actually the real force behind Hadoop)
  • LinkedIn
  • eBay, etc. 

Cutting has the unique distinction of being the founder of Hadoop as well as chairman of the Apache Software Foundation and Chief Architect at Cloudera - quite a few feathers in one cap.

HDFS - Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer. It has its roots in the Apache Nutch web search engine project and was, at a later stage, inspired by the Google File System. It runs successfully on commodity hardware, keeping the total cost of ownership (TCO) low.

The major characteristics of HDFS are:

A.  Fault Tolerant

In a web scale distributed system, failures are common. As such, any system targeting this kind of scale has to be fault tolerant. HDFS achieves fault tolerance through replication: each block of data is stored three times by default.
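The replication factor is a per-cluster setting (and can also be overridden per file). As an illustrative configuration fragment (`dfs.replication` is the actual HDFS property; the value shown is simply the default mentioned above):

```xml
<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- each block is stored on three DataNodes -->
</property>
```

Raising the value buys more fault tolerance at the cost of extra disk; lowering it does the reverse.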

B.  Deployable on Low Cost Commodity Hardware

This was one of the major objectives behind the Google File System, and HDFS achieves it as well. It not only keeps the hardware cost low, it also keeps the deployment, servicing, maintenance, and upgrade/scale-up costs at a minimum compared to specialized big data appliances. This has a strong positive impact on keeping the total cost of ownership (TCO) within attractive limits from a purely business point of view.

C.  Provides Speedy Access to Application Data

In any web scale architecture, speed is a matter of high import. HDFS gives processing nodes high-speed access to application data, which suits batch processing systems well.

D.  Suitable For Applications With Large Data Sets

HDFS has been designed for applications with large data sets: typically, file sizes in an HDFS application range from several gigabytes to terabytes, while total storage usually runs into petabytes. With data of this size, it is cheaper to move computation closer to the data than to move the data to the computation, as is the general norm. This minimizes network congestion and thereby increases throughput - a much valued outcome. HDFS provides interfaces for moving applications closer to the relevant data.

E.  Highly Portable

Portability is of utmost importance for a new technology, as it enables deployment on, and migration to, the different available platforms. It also helps system adoption without the fear of becoming a platform hostage, and it keeps later scaling or extension smooth when computation requirements outgrow the existing hardware. HDFS, being written in Java, runs on any platform with a Java runtime.

Hadoop Architecture

In a typical deployment there is one NameNode running on a dedicated commodity server (the master) and a number of DataNodes running on inexpensive commodity machines (the slave nodes), usually one DataNode instance per machine. The NameNode manages the file system namespace and metadata, while the DataNodes store the actual data blocks and serve read and write requests.

HDFS supports the operations familiar from other file systems: making and removing directories, creating and removing files in those directories, moving a file from one directory to another, renaming a file, and so on.
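These operations map directly onto the Hadoop file system shell. A minimal sketch, assuming a running HDFS cluster; the `/user/demo` paths and file names are hypothetical:

```
# Everyday file operations through the Hadoop file system shell
hadoop fs -mkdir /user/demo                         # make a directory
hadoop fs -put access.log /user/demo/               # copy a local file in
hadoop fs -mv /user/demo/access.log /user/demo/web.log  # rename/move it
hadoop fs -ls /user/demo                            # list the directory
hadoop fs -rm /user/demo/web.log                    # remove the file
```

Each command is a thin client that talks to the NameNode for metadata and to the DataNodes for the actual block data.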

Files in HDFS are stored as a sequence of blocks. All blocks except the last one are the same size. To achieve fault tolerance, these blocks are replicated. Both the replication factor and the block size are configurable, with typical values of 64 MB for the block size and 3 for the replication factor. Files in HDFS are write-once, and only one writer is permitted at any time.
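The block arithmetic implied here is easy to sketch. A minimal, self-contained illustration - not Hadoop code; the 64 MB block size and replication factor of 3 are simply the typical values quoted above:

```java
// Sketch: how a file's size translates into HDFS blocks and raw disk usage.
// BLOCK_SIZE and REPLICATION use the typical defaults quoted in the text.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB
    static final int REPLICATION = 3;

    // Number of blocks needed: all full-size except possibly the last.
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw disk consumed once every block has been replicated.
    static long rawStorage(long fileBytes) {
        return fileBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGB)); // 16 blocks
        System.out.println(rawStorage(oneGB)); // 3221225472 bytes (3 GB) of raw disk
    }
}
```

On a real cluster these two values come from the `dfs.blocksize` and `dfs.replication` settings rather than constants, and the 3x raw-storage cost is the price paid for surviving node failures.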

Deployment and operation of a Hadoop distributed file system also involve more intricate and complex activities. These include, but are not limited to, data replication, persistence, cluster rebalancing, data integrity, and robustness.