What is Hadoop anyway?

Just to get this out of the way, Hadoop is more of a grid computing ecosystem than a platform.

HDFS is the underlying cluster filesystem. It keeps track of where each piece of data resides within the grid and makes that information available to grid processes. This allows the map/reduce processes to run close to where the data lives. In theory, you could use GlusterFS… but I wouldn’t recommend it. At least not yet.
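To make the "process close to where the data lives" idea concrete, here is a toy sketch in Python. It is not Hadoop's scheduler or any real HDFS API; `pick_node`, `block_locations` and the node names are all made up for illustration. It only shows the preference: if a free node already holds a copy of the block, use it, otherwise fall back to any free node.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Choose a node to process a block, preferring one that stores it.

    block_locations: hypothetical dict mapping block id -> list of nodes
                     holding a copy of that block.
    free_nodes:      set of node names currently able to take work.
    """
    # First choice: a free node that already has the data on local disk.
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node
    # Fallback: any free node (the data must then travel over the network).
    if free_nodes:
        return sorted(free_nodes)[0]
    # Nothing is free, so the task has to wait.
    return None


if __name__ == "__main__":
    locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
    print(pick_node("blk_1", locations, {"node-b", "node-d"}))
```

A real cluster weighs more than this (rack placement, load, and so on), but the basic trade of moving the work instead of moving the data is the same.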

This leads us to the second part of the discussion: map/reduce. The details are beyond my scope here, but the summary is that a data processing job is split up and distributed to many “grid nodes”, each of which processes its share of the data and returns results that are then combined.
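If that summary is too abstract, the classic word count shows the shape of it. This is a single-machine sketch in plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, and on a real cluster each chunk would be handed to a different node rather than looped over.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    pairs = []
    for chunk in chunks:  # stand-in for "send each chunk to a grid node"
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the end"]))
```

The point is that the map step never needs to see more than its own chunk, which is what makes it easy to spread across many nodes.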

So, in its purest form, Hadoop is a cluster map/reduce process sitting atop a location-aware cluster filesystem (HDFS). However, there’s a lot more to it.

There are various projects that sit atop Hadoop to provide different ways to access and use the cluster. Pig, Hive, HBase, ZooKeeper, etc. are all additions to core Hadoop. Besides new components being added, the base itself is changing rapidly. For example, in Hadoop 1.0, the HDFS NameNode was a single point of failure. In Hadoop 2.0, there is an HA solution available. In Hadoop 1.0, the JobTracker and TaskTrackers are the job controls; Hadoop 2.0 uses a thing called YARN. It should allow for some interesting additions to the ecosystem as we inject new modules for processing, etc.

Not sure where to start? Distributions to the rescue. Cloudera, Hortonworks and MapR are the ones that I know about. Much like Linux distributions, these companies take the available components and package them together in one convenient bundle. They also contribute updates back to the community and usually include a management tool to install and maintain your cluster.

Yes, this is very high level and it only scratches the surface of “what is Hadoop.” Sorry. It is a deep subject and a constantly evolving ecosystem. HDFS and all of the map/reduce-type applications are easy to separate. As Hadoop continues to grow and evolve, expect more options for the processing side of things; this will of course continue to cloud the definition of Hadoop. Enjoy! 🙂


Grease Monkey ~~ GM

About Grease Monkey

Computer nerd since the 80's. Data nerd since the 90's. Generic nerd for a lifetime.