Weaponizing Hadoop

We are usually left to bash for scripting Hadoop functions. It’s the default in Linux and it’s usually good enough.

There are enough “bash-isms” to cause your Java/Pig/database people serious heartache. If you’re new to Hadoop, go ahead and let the developers develop. After a few months you will have solved some common problems, and that is the time to regroup. Take a couple of weeks to “sharpen the saw” by finding the best of the good and standardizing on your solution. Life is so much better when every Hadoop developer does not have to solve common problems such as:
– How do I know which Cluster I’m in?
– How do I set up config files so I’m not hard-coding my paths, nodes, etc.?
– How do I notify on failure/success?
– When do I notify?
– How should I structure my processing, processed and archive directories?
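
To make the first, second and last questions concrete, here is a minimal sketch of the kind of shared helper a team might standardize on. Everything specific in it is an assumption for illustration: the hostname prefixes, the cluster names, the directory roots and the notification addresses are hypothetical, and in practice the settings would more likely live in one config file per cluster than in code.

```python
import socket

# Hypothetical hostname prefixes and cluster names; substitute your own.
CLUSTER_BY_PREFIX = {"hdpdev": "dev", "hdpqa": "qa", "hdpprd": "prod"}

# Hypothetical per-cluster settings (paths and addresses are made up).
CLUSTER_CONFIG = {
    "dev": {"landing_root": "/data/dev", "notify": "hadoop-dev@example.com"},
    "qa": {"landing_root": "/data/qa", "notify": "hadoop-qa@example.com"},
    "prod": {"landing_root": "/data/prod", "notify": "hadoop-oncall@example.com"},
}


def detect_cluster(hostname=None):
    """Map this host (or a given hostname) to a cluster name by prefix."""
    host = (hostname or socket.gethostname()).lower()
    for prefix, cluster in CLUSTER_BY_PREFIX.items():
        if host.startswith(prefix):
            return cluster
    raise ValueError("host %r does not belong to a known cluster" % host)


def job_dirs(job_name, hostname=None):
    """Return the processing/processed/archive directories for a job."""
    root = CLUSTER_CONFIG[detect_cluster(hostname)]["landing_root"]
    return {
        stage: "%s/%s/%s" % (root, job_name, stage)
        for stage in ("processing", "processed", "archive")
    }
```

A job would call `job_dirs("edi")` and never mention a cluster-specific path itself. The point is not this particular layout, just that the question gets answered in one place.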

There are many more common questions to ask and answer. You should plan on having a reset every 3 to 6 months.

If you don’t take the time to consolidate, you’ll end up supporting dozens of different solutions to the same problem. I don’t know about you, but I’d rather have 1 process to understand.

Sharpen the saw or spend your life supporting bash scripts created by Java devs! I should have saved that horror story for Halloween!

Posted in Administration, Deployment, Development, syndicated

Hadoop Hindsight #2 Keep it simple: more than likely someone else has encountered your problem.

An adventure is only an inconvenience rightly considered. An inconvenience is an adventure wrongly considered.
-G.K. Chesterton

Sometimes our ego gets the best of us.  This seems to occur more often in Hadoop than anywhere else I’ve worked.  I’m not sure if this relatively new world propels us into thinking we’re on an island, or if Java developers are inherently poor data analysts.  At any rate, we need to rein in our bloated self-image and realize that someone else likely encountered our issue and a seasoned committer carried it through the stack to resolution.  Let me give you an example:

Sqooping data with newlines

I wish I had caught this issue earlier. Some of our developers were pulling data from Teradata and DB2 and encountered embedded newline and Ctrl-A characters in a few columns.  Claiming the ‘bad’ data broke their process, they overreacted and jumped to using Avro files to resolve their problem.  While Avro is well and good for some issues, this was major overkill that ended up causing issues within Datameer and created additional complexity in HCat.  I took some time to ‘research’ (à la google-fu) to see what others had done to get around this.  I already had a few simple ideas, like using a regex in the source SQL to remove \n, \r and \01, but I was really looking for a more elegant solution.

It took me 30 minutes or so to work up an example, create a failure, and RTFM for a resolution.  Much like our developers, I was hitting walls everywhere; the Sqoop documentation isn’t bad, but there are some holes.  A little more searching and I found Cloudera Sqoop-129, “Newlines in RDBMS fields break hive,” created 11/2010 and resolved 5/2011.  Turns out it was fixed in Sqoop version 1.3.0 and we are on 1.4.2, so looking good so far.  The fix implemented these arguments, which handle elimination or replacement of these characters during the load.

--hive-drop-import-delims Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.

It turns out they fixed our problem from a Hive standpoint, but it’s actually valid for Pig, etc.  It’s much more elegant than a source-SQL/regex solution because I don’t need to specify fields; everything is covered.  Now in our case the business users didn’t even care about the newlines that were present in 3 of 2 million rows (ugh!), so I just used --hive-drop-import-delims in the sqoop command and everything was fine.
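
For anyone who wants to see the shape of it, here is a rough sketch rather than our actual job. The two option names are the ones quoted above; the JDBC URL, table name and target directory in the usage note are hypothetical placeholders, and `strip_delims` is only my plain-Python imitation of what the options do to one string field, not Sqoop’s own code.

```python
import shlex

# Characters the two options act on: newline, carriage return, Ctrl-A (\01).
HIVE_DELIMS = "\n\r\x01"


def strip_delims(value, replacement=""):
    """Imitate the effect on one string field: drop or replace each delimiter."""
    return "".join(replacement if ch in HIVE_DELIMS else ch for ch in value)


def sqoop_import_args(connect, table, target_dir, replacement=None):
    """Build a sqoop import argument list that cleans string fields on load.

    With replacement=None the delimiters are dropped; otherwise each one is
    replaced with the given string.
    """
    args = ["sqoop", "import",
            "--connect", connect,
            "--table", table,
            "--target-dir", target_dir]
    if replacement is None:
        args.append("--hive-drop-import-delims")
    else:
        args += ["--hive-delims-replacement", replacement]
    return args


def as_command_line(args):
    """Render the argument list as a shell-safe command string."""
    return shlex.join(args)
```

Calling `as_command_line(sqoop_import_args("jdbc:db2://db2host.example.com:50000/SALES", "ORDERS", "/data/dev/orders/processing"))` gives a command line you can paste into a job script. Connection credentials and the rest of a real import are deliberately left out.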

So by adding a single line to a Sqoop step, I eliminated the need to maintain an additional serialization framework and downstream processes will likely be easier to maintain.  When dealing with basic business data we need to realize it isn’t rocket science; someone else has probably already figured it out.




Posted in Development, Hindsight, Opinions

GlusterFS and Hadoop, not replacing HDFS

Enterprise Hadoop must cooperate with many other forms of data transmission and ingestion. Any form of MFT, Mqueue or file landing zone requires disk space. Not HDFS disk, just disk that we can mount and point MFT, SFTP, etc. at until we actually ingest the data into Hadoop (where life is beautiful all the time).

Traditional “Enterprise” disk space is provided by SAN or NAS mounts. There are reasons for this: snapshots, flashcopies, highly available nodes, re-redundant disks and de-duplication oh my! There are many valid reasons for using these technologies. Most – if not all – of those reasons do not apply to Hadoop landing zones.

Enter GlusterFS: a striped, redundant, multiple-access-point solution. My SPOF Hadoop v. 1.x NameNode can write to a GlusterFS mount, and I can boot my DataNodes to a GlusterFS mount that has a backup server baked right into the mount command. I can point MFT, SFTP, Mqueue, etc. to a mount that has redundancy baked right in. This is sounding redundant.

My point is that GlusterFS meets the multi-node, replicated storage requirements enterprises demand, while using local SATA disk at roughly 1/34 the cost of SAN. That SWAG is based on our internal cost of SAN @ $7.50/GB vs. local SATA @ $0.22/GB.
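
The arithmetic behind that SWAG is simple enough to write down. The two per-GB prices come from the paragraph above; the `replicas` parameter is my own addition and assumes the SATA price is per raw GB, so that a replicated volume pays for each copy. If the figure is already per usable GB, leave `replicas` at 1.

```python
SAN_COST_PER_GB = 7.50   # internal SAN cost quoted above
SATA_COST_PER_GB = 0.22  # local SATA cost quoted above


def cost_ratio(expensive_per_gb, cheap_per_gb, replicas=1):
    """How many times cheaper the second option is, per usable GB."""
    return expensive_per_gb / (cheap_per_gb * replicas)


def landing_zone_cost(usable_gb, cost_per_gb, replicas=1):
    """Total cost of a landing zone of the given usable size."""
    return usable_gb * cost_per_gb * replicas
```

`cost_ratio(SAN_COST_PER_GB, SATA_COST_PER_GB)` comes out at about 34, in line with the rough estimate above. Under the raw-GB assumption, a two-copy replicated volume halves that to about 17, which is still a large gap.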

Good, Fast & Cheap — It’s a brave new world.

Posted in Administration, Deployment, Tuning

Consuming JSON Strings in SQL Server

This article describes a TSQL JSON parser and its evil twin, a JSON outputter, and provides the source. It is also designed to illustrate a number of string manipulation techniques in TSQL. With it you can extract the data from a JSON document.

Read the full article here.

Posted in Uncategorized

You Paid for Support?! Bwah-ha-ha

We’re using Open Source Software extensively in our Big Enterprise. It really irritates me that we pay millions of dollars for “Support” from our vendors and we get endless circles of “try this,” “that should work” and “oh, that’s an upstream bug, we’ll file a bug report.” Seriously? For 10% of what we’re paying these guys, I’ll do it myself.

I currently have 3 bugs open with 3 vendors; 2 of those are open source. Let’s talk about them.

1) OS won’t PXE boot across an LACP bond. The documentation says it should. Everything “looks right” but after 3 business days of the vendor telling me to try things I’ve already tried, I finally solved this myself. I can boot my DataNode image to my servers, but I wanted to install an OS on some of the control nodes. As soon as the install agent starts up, it loses network connectivity. I told it how to configure the bond on the kernel boot line, but it fails to see it and use it. Trying to use a single interface doesn’t work because the switch is expecting to distribute the packets (per LACP 802.3ad spec) across 4 NICs. It turns out that I can tell the kernel to use eth0 and NOT probe other network devices, which solves 99% of my problem. It’s not perfect, but it’s a helluva lot better than trying to hand install. Here’s a hint if you have this problem: nonet.

2) Proprietary software vendor can’t pull the Avro schema from HDFS. This seems to be squarely in their court for resolution, however, they claim it’s a bug in Hive and opened a bug report. Come on kids, if you’re finding hdfs:// and expecting file:// something is wrong on your side.

3) Open source Hadoop vendor opened a bug report because Pig doesn’t correctly support Avro in our version. We supplied a bug report and a bug solution from Apache, but they made us chase our tails for 10 days before they agreed and opened a new bug report.

After losing some 600 blocks of data in our Dev cluster we found out there is a “fix” for under-replicated blocks coming in HDFS 0.20, but 0.1x doesn’t have this “feature.” Support DID help us find that issue, but ONLY after they ran us through hoops looking for non-existent configuration problems.

My advice: Eschew paid support and dig into the details on your own. You’ll learn more, be more valuable and solve your own problems faster.

Posted in Uncategorized

Replication FAIL

We’ve had our clusters running for a few months without significant issues. Or at least so we thought.
I’m not sure of the why and how yet, but it seems that even with rack topology scripts running, a replication factor of 3 and nightly rebalancing, we had some 600 blocks failing to be replicated across racks. After digging through documentation, consulting vendors and generally feeling frustrated, I discovered that this is a somewhat known issue. Apparently the “fix” is something I found back in December of last year. See here.

So my new nightly routine (automated, of course) is to fsck the cluster looking for “Replica placement policy” violations, raise the replication factor of the affected files by 1, then set it back once the extra replica exists. I am somewhat irritated by this need.
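
Here is a sketch of how that routine can be scripted; it is not my production job. It parses saved fsck output and builds the `hadoop fs -setrep` commands without running them. The regular expression assumes each violation is reported on one line as `<path>:  Replica placement policy is violated ...`, which is how I remember the output looking; check it against your own Hadoop version before trusting it. The function names are my own.

```python
import re

# Assumed shape of an fsck line that reports a rack-placement violation.
VIOLATION = re.compile(
    r"^(?P<path>/.*?):\s+Replica placement policy is violated"
)


def violating_paths(fsck_output):
    """Return the unique file paths flagged by fsck, in first-seen order."""
    seen = []
    for line in fsck_output.splitlines():
        match = VIOLATION.match(line.strip())
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


def setrep_commands(paths, normal_rep=3):
    """For each path: raise replication by one and wait, then restore it."""
    commands = []
    for path in paths:
        # -w waits until the extra replica has actually been created.
        commands.append(
            ["hadoop", "fs", "-setrep", "-w", str(normal_rep + 1), path]
        )
        commands.append(["hadoop", "fs", "-setrep", str(normal_rep), path])
    return commands
```

In a nightly job, each command list would be handed to something like `subprocess.run(cmd, check=True)`. Keeping the parsing separate from the execution makes it easy to dry-run against yesterday’s fsck report first.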

Posted in Administration, Deployment, Tuning

Intelligent Design – A hindsight lesson.

Our Boot from Network Datanode design was conceived in ignorance of real world application. Serial Number vs. MAC Address debates ensued in Ivory Tower minds and a schema was built. I’m currently in-between designs. I’ve consulted our resident data genius and he devised a superior schema that I have not yet been able to implement.

Today was spent trying to figure out why I built a view to contain my base information and how in the Hell to add new DataNodes so everything works as expected. Knowing the future and working with the past can be painful, especially when multiplied by the 0.0.2 changes to the existing schema that “seemed like a good idea” at the time. The bonus multiplier for getting this done is that I have 3 clusters to add in the next 60 days, a new admin to get trained and 6 developers to support. Did I mention that I get to figure out how to re-process failed files in production? The developer who created the process is no longer available and didn’t have time for a knowledge transfer. That’s why we pay $200/hour for Professional Services! So they can go away and leave us trying to understand their mistakes! (Just a little bitter)

What’s the lesson here? Just because you have incrementally better designs doesn’t mean you should half-ass implement them. Live with what you have until you can fully move to a better design. Full-on releases are much better than temporary fixes that get forgotten when you’re distracted for 3 days. I’m planning to create a new database to house all of the changes coming in release 1.1 of Indostan. Otherwise we’ll be stuck in hybrid upgrade Hell forever. Yes, lazy admin, that means you have to do something the “hard/wrong” way EVEN WHEN you know there is a better/easier one. Until the new stuff is fully baked, use the old stuff.

I know this flies in the face of “continuous deployment” models, but it works much better for foundation-level applications. It allows me to compartmentalize change, and when I have 3-5 hair-on-fire events per day, that is essential. Today seemed slow and I think there were 4 crises that had to be solved “right now.”

Being able to strike a balance between growth and stability is a valuable skill for any administrator. It’s sometimes harder for Hadoop admins because there are new and better options weekly, if not daily, which makes it even more of a requirement. Choose an intelligent design and run it for a while. 3 weeks, 3 months or maybe 6 months, but take the time to let it evolve in your mind and within your environment. Then re-evaluate and make changes. 6 months is a lifetime in Hadoop add-ons.

Posted in Deployment, Development, Hindsight, Opinions

Insights for Articles from the Hadoop Summit 2013

I just left the Hadoop Summit 2013, so my next series of articles is going to cover some insights I picked up.  For this post I’m just going to post a long list of future topics.  Let me know which ones are the most interesting and I’ll prioritize:

  1. The biggest complaint around Hadoop is it is pretty immature and needs more Enterprise capabilities
  2. Only 30% of companies are doing Hadoop today
  3. Definition of big data – high volume, velocity, and variety of data
  4. Definition of Hadoop – collection of things in a framework to process across a distributed network
  5. Amazon (AWS) started 5 ½ million Hadoop clusters last year
  6. Traditional IT vs Big Data Style
  7. Tractor vendors are becoming analytics vendors
  8. It takes a community to raise an elephant
  9. Yahoo is massive in the Hadoop space (365 PB and 40k nodes) – here is what I learned from them
  10. APM GPU, Atom – not ready for big data
  11. Solid State Storage, In Memory, SATA, SAS – when to use which in Hadoop clusters
  12. YARN – Yet Another Resource Negotiator or a necessary component for solidifying Hadoop and moving it to the next level
  13. Top 10 Things to Make Your Cluster Run Better (This one is my favorite)
  14. Why LinkedIn and Yahoo are going to kick butt with big data
  15. How to create good performance tests for a cluster
  16. Hadoop Cluster Security – authorization, authentication, encryption, etc
  17. Automating Hadoop Clusters
  18. Hadoop and OpenStack and Why You need to consider using them together
  19. Email Archiving with Hadoop – the perfect use case…maybe…maybe not
  20. File Ingesting
  21. Lustre
  22. Apache Falcon and data life cycle management
  23. Storm
  24. In Memory DB as a component of Hadoop
  25. Tez – game changer – I think so…
  26. Knox Security Gateway
  27. NFS into Hadoop directly
  28. HDFS Snapshots
  29. Why we don’t use Ambari and why we should use the metrics
  30. HBASE and all of the things you need to consider before deploying (it’s different)
  31. Excel Data Explorer and Geoflow and how they might displace more expensive data mining solutions
  32. Hadoop Scaling to Internal Cloud
  33. YARN – is this just virtualization on top of Hadoop
  34. Hadoop Infrastructure rethought
  35. Cluster Segmentation – when one big cluster just won’t do…

Let me know which topics are most intriguing and I’ll post those first.


Posted in Uncategorized

Hadoop and the honeycomb

I love the kind of honey where they leave a piece of the honeycomb in the jar.  It’s great to chew on when you’ve used up all the honey.  Reminds me of this big old oak tree in the woods we used to pull honey out of when I was a kid.  Where was I?

Oh yeah.  I was showing my daughter how the honeycomb is made up of perfectly shaped hexagons, then it hit me.  Honeycombs are great illustrations of the 3x block replication factor in Hadoop which inspired us to create this logo.

Posted in Uncategorized

Hadoop Hindsight #1 Start Small

I thought we would start a weekly series on some lessons we’ve learned.  Many of the topics we’ve learned the hard way so we thought it might be helpful for those a few steps behind us.  YMMV, but we wish this ideology was firmly ensconced when we started.

Identify a business problem that Hadoop is uniquely suited for.
Just because you found this cool new hammer doesn’t mean everything is a nail.  Find challenges that your existing tech can’t answer easily.  One of our first projects involved moving 300 gigs of EDI transaction files.  A business unit was having BAs grep for customer strings on 26,000 files to find 4 or 5 files, then FTP’ing those to their desktop for manual parsing and review.  They might spend a few HOURS doing this for each request.  It was a natural and simple use of Hadoop.  We learned a lot about design patterns, scheduling, and data cleanup.

Solve this one business challenge well.
Notice I didn’t say nail it perfectly.  There are many aspects of Big Data that will challenge the way you’ve looked at things the last 20 years.  The solution should be good, but not necessarily perfect.  Accepting this gives time to establish PM strategy and basic design patterns.

Put together a small team that has worked well together in the past.
This is critical to your success! Please, please, please take note!  Inter-team communication is the foundation upon which your Hadoop practice will grow.  In The Mythical Man-Month my man Frederick Brooks said:

To avoid disaster, all the teams working on a project should remain in contact with each other in as many ways as possible…

Ideally a team should consist of the following:

1 Salesman (aka VPs)
1 Agile-trained PM
1 Architect
2 Former DBAs
1-3 skilled Java developers
1 Cluster Admin

Obviously this is very simplified and some roles can overlap.  My point is you should have no more than 10 people max starting out!

Support your solution.
This very same team should also live through at least 3 months of support of the solution they’ve created.  Valuable insight is gained once you have to fix a few production problems.  Let the solution mature in production a bit to understand support considerations. This gives you time to adjust your design patterns. Trust me, you’ll want time to reflect on your work and correct flaws.

Smash your solution and rebuild (Optional – If time permits)
Good luck getting the time, but if you’re serious about a sustainable Enterprise Hadoop solution this should be rightly considered.

Go forth and multiply.
By this time your patterns and procedures should form the DNA of your new Hadoop cell. Your team should naturally develop into the evangelists and leaders upon which the mitosis of a new project occurs, carrying with it the new replicated chromosomes.  As your project cells divide and multiply, you’ll be able to take on more formidable challenges.

That’s all I have to say about that.

Posted in Administration, Development, Opinions