
Has Hadoop gone the way of COBOL? Still used by some crusty installations, but past its prime for relevance?
I was a cloud skeptic. I still am, but… the world has embraced cloud. For reasons beyond my understanding, it is now Generally Acceptable to run your business on any of the 3 major cloud providers. Generally Acceptable here has a legal meaning that implies you aren’t incompetent or negligent in moving your system there.
As a former systems administrator, I think it’s entirely crazy to put all of your stuff in one cloud, because of single points of failure. One company, one SPOF.
What does this have to do with Hadoop? Only that the first round of treatment for ailing Hadoop installs was Hadoop in the Cloud. EMR, HDI and whatever the Google version is promised the same things Hortonworks and Cloudera did for on-prem installs, but eliminated “the problem” of building and maintaining it yourself. Just spin it up and start Hadooping, they said.
“So AWS, how do I submit jobs to my new EMR cluster?” Just like always! Just create these ssh tunnels and… Right… https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html
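For the curious, here’s roughly what “just create these ssh tunnels” looks like in practice. This is a sketch based on the AWS docs linked above; the key path and master node hostname are placeholders, not real values.

```shell
# Forward a local port to the YARN ResourceManager UI (port 8088) on the
# EMR master node, so you can reach it at http://localhost:8088.
# -N: don't run a remote command, just hold the tunnel open.
# Replace the key file and ec2-### hostname with your actual values.
ssh -i ~/my-key-pair.pem -N \
    -L 8088:ec2-###-##-##-###.compute-1.amazonaws.com:8088 \
    hadoop@ec2-###-##-##-###.compute-1.amazonaws.com
```

So much for “just spin it up and start Hadooping.”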
So the hype busted again and CEOs started hearing about Spark and Kafka. They don’t know how it’s any different, but it sounds cool and everyone is talking about it.
The reality is that Spark and Kafka are genuinely different from Hadoop, and they solve different problems. The old elephant was designed for large, batchy problems. You (Netflix) can work some magic to do more than batch, but it requires actual knowledge and design. Most enterprise Hadoop customers don’t like that; they just want silver bullets.
So Kafka is fast and so is Spark and they can “stream” things together and be the cool kids at the table for a while. It’s much more complex than most people need, but it’s buzzy.
Meanwhile, our good buddies at the Bezos Circus have been stealing from Google again. (Google actually gave this away… probably to track your movements.) EKS is born and now you can run Spark on k8s against your cloud object store. And this actually works!
It’s dynamic compute (on demand even) with low cost, reasonable performance storage. Learning how to spec out a Spark job will take a while, but that’s on the Dev and/or Data Science side. Because Spark does parallel by design, it scales out pretty quickly. Because Cloud providers own the network connectivity, we don’t have to worry about building out data and compute racks with high speed interconnects.
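To make the “Spark on k8s against your cloud object store” pattern concrete, here’s a rough sketch of a submission. The cluster endpoint, namespace, container image, and bucket name are all hypothetical placeholders; credential wiring varies by provider.

```shell
# Submit a Spark job to a Kubernetes cluster in cluster deploy mode.
# The driver and executors run as pods; input comes from object storage
# via the s3a connector rather than HDFS.
spark-submit \
  --master k8s://https://my-cluster.example.com:6443 \
  --deploy-mode cluster \
  --name pi-demo \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=my-registry/spark:3.5.0 \
  --conf spark.executor.instances=4 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
```

The `spark.executor.instances` knob is where the “spec out a Spark job” learning curve lives: scaling out is just asking the k8s scheduler for more pods, which is exactly the dynamic compute story above.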
There are still pain points: Security Groups, IAM roles, internode communications and that pesky “How do I connect again?” But once you solve them, they’re easy to copy. A few software companies will even build your cluster for you on demand. Databricks and Dataiku come to mind.
GCP seems to be the least painful to implement this design. AWS comes in a distant second with much pain, gnashing of teeth and circular documentation that doesn’t really answer the question you ask. Microsoft Azure comes in dead last in ease of use. You’ll consider breaking your own hand as a reason you couldn’t work on that project. You’ll consider it again. Microsoft really does suck.
At the end of the cycle, Hadoop will still have some devout users who understood what it was meant for and had success using it correctly. Those users will probably also have Spark on k8s, because it solves a different problem. So my prediction is that yes, Hadoop will survive, but much the way COBOL has. Probably not for as long, though…
Followup: https://www.theguardian.com/technology/2021/dec/15/amazon-down-web-services-outage-netflix-slack-ring-doordash-latest
https://www.zdnet.com/article/aws-misfires-once-more-just-days-after-a-massive-failure/
