GlusterFS and Hadoop, not replacing HDFS

Enterprise Hadoop must cooperate with many other forms of data transmission and ingestion. Any form of MFT, Mqueue or file landing zone requires disk space. Not HDFS disk, just disk that we can mount and point MFT, SFTP, etc. to until we actually ingest the data into Hadoop (where life is beautiful all the time).

Traditional “Enterprise” disk space is provided by SAN or NAS mounts. There are reasons for this: snapshots, flashcopies, highly available nodes, redundant disks and de-duplication, oh my! There are many valid reasons for using these technologies. Most – if not all – of those reasons do not apply to Hadoop landing zones.

Enter GlusterFS: a striped, redundant, multiple-access-point solution. My SPOF Hadoop v1.x NameNode can write to a GlusterFS mount; I can boot my DataNodes from a GlusterFS mount that has a backup server baked right into the mount command; I can point MFT, SFTP, Mqueue, etc. at a mount that has redundancy baked right in. This is sounding redundant.
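If you want to try this at home, here's a rough sketch of what I mean. The host names, volume name and paths are made up, but the backupvolfile-server option is the "backup server baked right into the mount command" bit:

```shell
# Build a two-node replicated volume (hypothetical hosts and brick paths)
gluster volume create landing replica 2 gfs01:/bricks/landing gfs02:/bricks/landing
gluster volume start landing

# Client mount with a fallback: if gfs01 is unreachable at mount time,
# the client fetches the volume layout from gfs02 instead
mount -t glusterfs -o backupvolfile-server=gfs02 gfs01:/landing /mnt/landing
```

Point your MFT/SFTP landing zone at /mnt/landing and the redundancy comes along for free.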

My point is that GlusterFS meets the multi-node, replicated storage requirements enterprises demand, but on local SATA disk at roughly 1/35th the cost of SAN. That SWAG is based on our internal cost of SAN at $7.50/GB vs. $0.22/GB.

Good, Fast & Cheap — It’s a brave new world.

Posted in Administration, Deployment, Tuning | Leave a comment

Consuming JSON Strings in SQL Server

This article describes a TSQL JSON parser and its evil twin, a JSON outputter, and provides the source. It is also designed to illustrate a number of string-manipulation techniques in TSQL, letting you do things like extract the data from a JSON document.

Read the full article here.

Posted in Uncategorized | Leave a comment

You Paid for Support?! Bwah-ha-ha

We’re using Open Source Software extensively in our Big Enterprise. It really irritates me that we pay millions of dollars for “Support” from our vendors and we get endless circles of “try this,” “that should work” and “oh, that’s an upstream bug, we’ll file a bug report.” Seriously? For 10% of what we’re paying these guys, I’ll do it myself.

I currently have 3 bugs open with 3 vendors; 2 of those are open source. Let’s talk about them.

1) OS won’t PXE boot across an LACP bond. The documentation says it should. Everything “looks right,” but after 3 business days of the vendor telling me to try things I’ve already tried, I finally solved this myself. I can boot my DataNode image to my servers, but I wanted to install an OS on some of the control nodes. As soon as the install agent starts up, it loses network connectivity. I told it how to configure the bond on the kernel boot line, but it fails to see it and use it. Trying to use a single interface doesn’t work because the switch is expecting to distribute the packets (per the LACP 802.3ad spec) across 4 NICs. It turns out that I can tell the kernel to use eth0 and NOT probe other network devices, which solves 99% of my problem. It’s not perfect, but it’s a helluva lot better than trying to hand-install. Here’s a hint if you have this problem: nonet.
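For posterity, here's roughly what the working pxelinux entry looked like. The label, paths and kickstart URL are illustrative, and exact installer flags vary by distro; the load-bearing parts are pinning the installer to eth0 and the nonet hint so it doesn't go probing the other bonded NICs:

```shell
# pxelinux.cfg fragment -- illustrative; adjust for your distro/installer
LABEL controlnode
  KERNEL vmlinuz
  APPEND initrd=initrd.img ksdevice=eth0 nonet ks=http://installhost/ks.cfg
```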

2) Proprietary software vendor can’t pull the Avro schema from HDFS. This seems to be squarely in their court for resolution; however, they claim it’s a bug in Hive and opened a bug report. Come on kids, if you’re finding hdfs:// and expecting file://, something is wrong on your side.

3) Open source Hadoop vendor opened a bug report because Pig doesn’t correctly support Avro in our version. We supplied the Apache bug report and its fix, but they made us chase our tails for 10 days before they agreed and opened a new bug report.

After losing some 600 blocks of data in our Dev cluster, we found out there is a “fix” for under-replicated blocks coming in HDFS 0.20, but 0.1x doesn’t have this “feature.” Support DID help us find that issue, but ONLY after they ran us thru hoops looking for non-existent configuration problems.

My advice: Eschew paid support and dig into the details on your own. You’ll learn more, be more valuable and solve your own problems faster.

Posted in Uncategorized | Leave a comment

Replication FAIL

We’ve had our clusters running for a few months without significant issues. Or at least so we thought.
I’m not sure of the why and how yet, but it seems that even with rack topology scripts running, a replication factor of 3 and nightly rebalancing, we had some 600 blocks failing to be replicated across racks. After digging thru documentation, consulting vendors and generally feeling frustrated, I discovered that this is somewhat known. Apparently the “fix” is something I found back in December of last year. See here.

So my new nightly routine — automated of course — is to fsck the cluster looking for “Replica placement policy” violations, alter the replication factor +1, then set it back after it’s replicated. I am somewhat irritated by this need.
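Here's a sketch of that nightly job. The violation string is what our 0.20-era fsck prints, and the path extraction assumes fsck puts the offending file name at the start of the same line, so check your own fsck output before trusting it:

```shell
# Count placement-policy violations in whatever fsck output is piped in
count_violations() {
  grep -c "Replica placement policy is violated"
}

# Nightly sketch: for each violating file, bump replication 3 -> 4 so the
# NameNode places a replica on another rack, then drop it back to 3
fix_placement() {
  hadoop fsck / 2>/dev/null |
    grep "Replica placement policy is violated" |
    cut -d: -f1 | sort -u |
    while read -r path; do
      hadoop fs -setrep -w 4 "$path"
      hadoop fs -setrep 3 "$path"
    done
}
```

Cron it, log it, and grumble.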

Posted in Administration, Deployment, Tuning | Leave a comment

Intelligent Design – A hindsight lesson.

Our boot-from-network DataNode design was conceived in ignorance of real-world application. Serial Number vs. MAC Address debates ensued in Ivory Tower minds and a schema was built. I’m currently in between designs. I’ve consulted our resident data genius and he devised a superior schema that I have not yet been able to implement.

Today was spent trying to figure out why I built a view to contain my base information and how in the Hell to add new DataNodes so everything works as expected. Knowing the future and working with the past can be painful, especially when multiplied by the 0.0.2 changes to the existing schema that “seemed like a good idea” at the time. The bonus multiplier for getting this done is that I have 3 clusters to add in the next 60 days, a new admin to get trained and 6 developers to support. Did I mention that I get to figure out how to re-process failed files in production? The developer who created the process is no longer available and didn’t have time for a knowledge transfer. That’s why we pay $200/hour for Professional Services! So they can go away and leave us trying to understand their mistakes! (Just a little bitter)

What’s the lesson here? Just because you have incrementally better designs doesn’t mean you should half-ass implement them. Live with what you have until you can fully move to a better design. Full-on releases are much better than temporary fixes that get forgotten when you’re distracted for 3 days. I’m planning to create a new database to house all of the changes coming in release 1.1 of Indostan. Otherwise we’ll be stuck in hybrid upgrade Hell forever. Yes, lazy admin, that means you have to do something the “hard/wrong” way EVEN WHEN you know there is a better/easier one. Until the new stuff is fully baked, use the old stuff.

I know this flies in the face of “continuous deployment” models, but it works much better for foundation-level applications. It allows me to compartmentalize change, and when I have 3-5 hair-on-fire events per day, that is essential. Today seemed slow, and I think there were 4 crises that had to be solved “right now.”

Being able to strike a balance between growth and stability is a valuable skill for any administrator. It’s sometimes harder for Hadoop admins because there are new and better options weekly if not daily, which makes it much more of a requirement. Choose an intelligent design and run it for a while. 3 weeks, 3 months or maybe 6 months, but take the time to let it evolve in your mind and within your environment. Then re-evaluate and make changes. 6 months is a lifetime in Hadoop add-ons.

Posted in Deployment, Development, Hindsight, Opinions | Leave a comment

Insights for Articles from the Hadoop Summit 2013

I just left the Hadoop Summit 2013, so my next series of articles is going to be on some insights I learned. For this post I’m going to just post a long list of future topics – let me know which ones are the most interesting and I’ll prioritize:

  1. The biggest complaint around Hadoop is that it is pretty immature and needs more Enterprise capabilities
  2. Only 30% of companies are doing Hadoop today
  3. Definition of big data – high volume, velocity, and variety of data
  4. Definition of Hadoop – a collection of things in a framework to process across a distributed network
  5. Amazon (AWS) started 5 ½ million Hadoop clusters last year
  6. Traditional IT vs. Big Data Style
  7. Tractor vendors are becoming analytics vendors
  8. It takes a community to raise an elephant
  9. Yahoo is massive in the Hadoop space (365 PB and 40k nodes) – here is what I learned from them
 10. APM, GPU, Atom – not ready for big data
 11. Solid State Storage, In-Memory, SATA, SAS – when to use which in Hadoop clusters
 12. YARN – Yet Another Resource Negotiator, or a necessary component for solidifying Hadoop and moving it to the next level
 13. Top 10 Things to Make Your Cluster Run Better (this one is my favorite)
 14. Why LinkedIn and Yahoo are going to kick butt with big data
 15. How to create good performance tests for a cluster
 16. Hadoop Cluster Security – authorization, authentication, encryption, etc.
 17. Automating Hadoop Clusters
 18. Hadoop and OpenStack, and why you need to consider using them together
 19. Email Archiving with Hadoop – the perfect use case…maybe…maybe not
 20. File Ingesting
 21. Lustre
 22. Apache Falcon and data life cycle management
 23. Storm
 24. In-Memory DB as a component of Hadoop
 25. Tez – game changer – I think so…
 26. Knox Security Gateway
 27. NFS into Hadoop directly
 28. HDFS Snapshots
 29. Why we don’t use Ambari and why we should use the metrics
 30. HBase and all of the things you need to consider before deploying (it’s different)
 31. Excel Data Explorer and GeoFlow and how they might displace more expensive data mining solutions
 32. Hadoop Scaling to Internal Cloud
 33. YARN – is this just virtualization on top of Hadoop?
 34. Hadoop Infrastructure rethought
 35. Cluster Segmentation – when one big cluster just won’t do…

Let me know which topics are most intriguing and I’ll post those first.


Posted in Uncategorized | 1 Comment

Hadoop and the honeycomb

I love the kind of honey where they leave a piece of the honeycomb in the jar. It’s great to chew on when you’ve used up all the honey. Reminds me of this big old oak tree we used to pull honey out of in the woods as a kid. Where was I?

Oh yeah. I was showing my daughter how the honeycomb is made up of perfectly shaped hexagons, and then it hit me: honeycombs are great illustrations of the 3x block replication factor in Hadoop, which inspired us to create this logo.

Posted in Uncategorized | Leave a comment

Hadoop Hindsight #1 Start Small

I thought we would start a weekly series on some lessons we’ve learned. Many of these we learned the hard way, so we thought it might be helpful for those a few steps behind us. YMMV, but we wish this ideology had been firmly ensconced when we started.

Identify a business problem that Hadoop is uniquely suited for.
Just because you found this cool new hammer doesn’t mean everything is a nail. Find challenges that your existing tech can’t answer easily. One of our first projects involved moving 300 gigs of EDI transaction files. A business unit was having BAs grep through 26,000 files for customer strings to find 4 or 5 files, then FTP’ing those to their desktop for manual parsing and review. They might spend a few HOURS doing this for each request. It was a natural and simple use of Hadoop. We learned a lot about design patterns, scheduling, and data cleanup.
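For flavor: the distributed-grep example that ships with Hadoop can do that whole hunt in one job. The paths and customer string below are made up:

```shell
# Run the bundled grep example over the EDI landing directory; it counts
# occurrences of the pattern across every input file
hadoop jar hadoop-examples.jar grep /data/edi/in /data/edi/out 'CUST0012345'

# Inspect the handful of hits instead of FTP'ing 300 gigs around
hadoop fs -cat /data/edi/out/part-00000
```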

Solve this one business challenge well.
Notice I didn’t say nail it perfectly.  There are many aspects of Big Data that will challenge the way you’ve looked at things the last 20 years.  The solution should be good, but not necessarily perfect.  Accepting this gives time to establish PM strategy and basic design patterns.

Put together a small team that has worked well together in the past.
This is critical to your success! Please, please, please take note! Inter-team communication is the foundation upon which your Hadoop practice will grow. In The Mythical Man-Month, my man Frederick Brooks said:

To avoid disaster, all the teams working on a project should remain in contact with each other in as many ways as possible…

Ideally a team should consist of the following:

1 Salesman (aka VPs)
1 Agile-trained PM
1 Architect
2 Former DBAs
1-3 skilled java developers
1 Cluster Admin

Obviously this is very simplified and some roles can overlap. My point is you should have no more than 10 people starting out!

Support your solution.
This very same team should also live thru at least 3 months of support of the solution they’ve created.  Valuable insight is gained once you have to fix a few production problems.  Let the solution mature in production a bit to understand support considerations. This gives you time to adjust your design patterns. Trust me, you’ll want time to reflect on your work and correct flaws.

Smash your solution and rebuild (Optional – If time permits)
Good luck getting the time, but if you’re serious about a sustainable Enterprise Hadoop solution this should be rightly considered.

Go forth and multiply.
By this time your patterns and procedures should form the DNA of your new Hadoop cell. Your team should naturally develop into the evangelists and leaders upon which the mitosis of a new project occurs, carrying with it the newly replicated chromosomes. As your project cells divide and multiply, you’ll be able to take on more formidable challenges.

That’s all I have to say about that.

Posted in Administration, Development, Opinions | Leave a comment

Interview Questions for Hadoop Developers

(via Dice News in Tech)

Hadoop is an open source, distributed software framework that enables programmers to run an enormous number of nodes handling terabytes of data. One of its most significant abilities is allowing a system to continue to operate even if a significant number of nodes fail. Since Hadoop is continuing to mature…

Continue reading

Posted in Uncategorized | Leave a comment

Cinderella has left the Hadoop Cluster

It’s Friday evening before our Hadoop Administrator leaves for a week of vacation in New Hampshire, and about an hour before he leaves he says “it’s turning into a pumpkin in an hour.” Of course we wanted to go live with a new project on Friday afternoon before he leaves. About a week before he left we reminded everyone he was leaving and made it clear he didn’t have a “real” backup.

So now I – the “Ugly Step Sister” – am filling in for Cinderella, and of course things don’t go as planned…they never do. I’m fond of saying that if things “just worked” we wouldn’t need high-priced engineers/architects, so I suppose that’s a good thing. Now, after a week of filling in for Cinderella, we have had a number of last-minute things that needed fixing, and I have figured out that I have been living in an ivory tower a bit and have forgotten how to sweep floors.

When moving into architecture ten years ago, my biggest fear was that my IT capabilities would atrophy and I would be less employable. I know that I’m actually very good at strategy and I am far more valuable strategically, driving millions of dollars in savings and large strategic shifts in IT, but boy, this week was a humbling experience. Now that I’m through the week and have overcome a number of obstacles (some of them slowly), here is what I have learned:

First – I need to get back to my roots.  Every architect needs a sandbox to get hands on keyboard and they need to allocate time to get their hands dirty.  I’m creating a sandbox for myself to tinker and I’m going to spend some time doing some Hadoop Automation work – I really like getting my hands dirty.

Second – We need more automation – many of the things we do in Hadoop are manual today. Right now the Hadoop Engineering team (Grease Monkey) is so busy fighting fires…or hair-on-fire developers…that he isn’t automating much. While he was out I put together a rough user-add script: it adds users, integrates with Active Directory, creates Hadoop environments, adjusts permissions, rolls changes from access nodes to name nodes, creates pub, sub and bus directories, creates Hive tables, and tests…and more tests. (Grease Monkey took it to the next level when he returned)
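The HDFS half of that script is mostly boilerplate. Here's a trimmed sketch; the group name and the pub/sub/bus layout are our conventions, not anything standard, and the AD and Hive pieces are site-specific so I've left them out:

```shell
#!/bin/sh
# add_hadoop_user.sh <user> -- HDFS side only (sketch)
USER="$1"

# Home directory, owned by the user, group-readable only
hadoop fs -mkdir /user/"$USER"
hadoop fs -chown "$USER":hadoopusers /user/"$USER"
hadoop fs -chmod 750 /user/"$USER"

# Our pub/sub/bus landing-zone convention
for d in pub sub bus; do
  hadoop fs -mkdir /data/"$USER"/"$d"
  hadoop fs -chown "$USER":hadoopusers /data/"$USER"/"$d"
done
```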

Third – Developers are crazy…hair-on-fire crazy, and giving them a deadline drives crazy, blame-everyone behavior. We need some queuing mechanism that gives us the ability to track, plan, and prioritize work. We put a preliminary, easy-to-use one in place, but nobody is using it. We need to go back and address this in a non-corporate, fast-moving-startup sort of fashion…but we NEED IT!

Fourth – While we are doing some amazing, innovative, game-changing things in the admin space, we are also very early in the operationalization of the platform, and we have a ton of solidification work to do before we can breathe easy. As fast as Hadoop is evolving, I’m not sure we will ever breathe easy though.

Posted in Administration, Career | 1 Comment