Bringing the power of Hadoop to the enterprise is a tricky matter. While we all know the wonderful virtues of distributed storage and compute, and how they’re solving Big Data problems in the web world, it is an entirely different matter to deal with the challenges of a large enterprise.
I’m actually somewhat envious of some companies’ laser-beam approach to Big Data. Most of those environments are challenged by volume and velocity; our focus leans more toward variety. Our ETL team currently maintains over 5,000 unique workflows. Many of these are redundant, useless, pathetic attempts at moving data, left behind by myopic projects over time. Ineffective MDM and data modeling practices have also fed this beast over the years. I’m not suggesting we move all of these over to Hadoop on day one, but the writing is on the wall: if we’re not careful, at some point we’ll run into the same mess.
How does one manage this variety of unique data sources? Perhaps we should approach it in much the same way Grease Monkey approached the challenge of server management: I dub thee Data Orchestration.
We’re looking at various options, from Talend to custom code, to let us manage this. Our approach will evolve over time, but we’re trying to look ahead.
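As a rough sketch of what the custom-code route might look like (all names, sources, and targets below are hypothetical, not Talend’s API or our actual catalog), even a minimal workflow registry can start to surface the redundant data movement described above by flagging workflows that ship the same source to the same target:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """One cataloged data-movement job (illustrative fields only)."""
    name: str
    source: str   # e.g. "crm.accounts"
    target: str   # e.g. "warehouse.dim_account"


class Orchestrator:
    """Catalog workflows and report redundant (source, target) routes."""

    def __init__(self):
        self.workflows = []

    def register(self, wf: Workflow) -> None:
        self.workflows.append(wf)

    def redundant_routes(self) -> dict:
        # Group workflow names by route; any route with more than one
        # workflow is a candidate for consolidation.
        routes: dict = {}
        for wf in self.workflows:
            routes.setdefault((wf.source, wf.target), []).append(wf.name)
        return {route: names for route, names in routes.items() if len(names) > 1}


orch = Orchestrator()
orch.register(Workflow("load_accounts_v1", "crm.accounts", "warehouse.dim_account"))
orch.register(Workflow("load_accounts_v2", "crm.accounts", "warehouse.dim_account"))
orch.register(Workflow("load_orders", "erp.orders", "warehouse.fact_order"))
print(orch.redundant_routes())
# {('crm.accounts', 'warehouse.dim_account'): ['load_accounts_v1', 'load_accounts_v2']}
```

This won’t untangle 5,000 workflows by itself, but keeping the catalog as data (rather than buried in individual jobs) is what makes any orchestration layer, bought or built, able to reason about the mess.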
So what do you use to manage your data ingest & emit?