Cloud Hadoop? Buzzword Fiesta!

We haven’t quite jumped the shark yet, but this is going to be full of buzzwords.

Started a new gig where we’re building Dev, POC and possibly some prod clusters on AWS. Once again the first 80% of this was pretty easy. Using Cloudbreak, it’s fairly easy to create clusters. Developing new Ambari (gag) blueprints is pretty easy and they do a lot of the heavy lifting. It starts getting ugly after that.

Cloudbreak uses Consul to “discover” hosts after creating the AWS instances. There is a little black magic going on with Consul to use pseudo DNS. Add to this “Containers” and you have a pretty screwed up environment from a purist point of view. So add Kerberos to this mix and you might need some Xanax. Kerberos wants nice FQDNs for all of the hosts involved. That sorta goes against the idea of Elastic Hadoop, but we’ll burn that bridge later. Just getting Consul and Ambari to see each “node” (sometimes as a container) using consistent names is going to be interesting.

So, Kerberized, Elastic Hadoop in the Cloud with encrypted data in flight and at rest. That’s the buzzword goal. :-/

Posted in Deployment, Security

Hadoop 2.0 GA

I’ve been watching the Hadoop user mailing lists and JIRA counts. It sure seems like 2.0 GA is more like 2.0 Beta 1.

I’m looking forward to RC 1 before we move it into a serious cluster. Just my $0.02.

Posted in Uncategorized

Intuition about Chromecast

I bought a Google Chromecast device. It was really cheap ($35+tax) and I’m a whore for media casting devices.

When I “flick” tablet videos to Chromecast, I notice something interesting. I see the display of Chromecast ponder what I’ve sent it. Then I see my tablet seemingly release the content. Yes, significant controls still run on the tab, but I feel like the Chromecast device has taken over playback and is now merely listening for “remote control” codes from the tab.

This feeling may be 100% wrong. It is based on my experiences with using VLC Player on a TV display device along with VLC Remote on Android.

No opinions have been confirmed as fact. I’m only thinking this based on 30 years of IT experience. YMMV, IANAL. ;)

Internet memes will eventually become Skynet IMHO. The Rick-rolled Chuck Norris memes are our only defense.

Posted in Uncategorized

hello woRld!

R is the latest Hadoop darling. It is an open source language that “is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners are showing R’s popularity has increased substantially in recent years.” See the Wiki article for more details.

The good news is that R is being developed by PhD-level statisticians and academics across the globe. It contains many statistical functions, models and analytic constructs.

The bad news is that it’s being developed by PhDs and academics! Great for developing analytics, not great for developing stable solutions. :)

Enter the Hadoop Sysadmin! He can write bash scripts, he can redirect output and he can weaponize R! Man up Hadoop Admin, you’re going to understand R syntax; it’s not terribly difficult, but it is slightly weird. R can read ENV VARs and run from the CMD line and, given its collegiate beginnings, this counts as a 10x win.

Google is your friend and has most of your answers. I can say that R CMD BATCH hellowoRld.R MyProg.log is a good start. A nice bash script to handle errors, in/output dirs, etc. can have you running R in normal batch mode pretty quickly. You’ll probably have a more difficult time explaining proper development techniques to the Data Scientist that just wants to run things in RStudio. ;)
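For the curious, here is a rough sketch of that kind of wrapper, written in Python rather than bash purely to keep the example short. Only the R CMD BATCH invocation comes from above; the function names, the log-naming rule and the injectable runner are my own assumptions, not a standard tool.

```python
import subprocess
from pathlib import Path


def build_r_batch_command(script, log_file):
    """Return the argv for running an R script non-interactively."""
    return ["R", "CMD", "BATCH", str(script), str(log_file)]


def run_r_batch(script, log_dir, runner=subprocess.run):
    """Run one R script in batch mode and fail loudly on a non-zero exit.

    `runner` is injectable (hypothetical design choice) so the wrapper
    can be exercised without R installed.
    """
    script = Path(script)
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)  # make sure the output dir exists
    log_file = log_dir / (script.stem + ".log")
    result = runner(build_r_batch_command(script, log_file))
    if result.returncode != 0:
        raise RuntimeError(f"{script.name} failed, see {log_file}")
    return log_file
```

Something like run_r_batch("hellowoRld.R", "logs") would then leave logs/hellowoRld.log behind, and a scheduler only has to watch for the exception.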

Posted in Administration, Deployment, Development, Tuning

Weaponizing Hadoop

We are usually left to bash for scripting Hadoop functions. It’s the default in Linux and it’s usually good enough.

There are enough “bash-isms” that will cause your Java/pig/database people serious heartache. If you’re new to Hadoop, go ahead and let the developers develop. After a few months you will have solved some common problems and now is the time to regroup. Take a couple of weeks to “sharpen the saw” by finding the best of the good and standardizing on your solution. Life is so much better when every Hadoop developer does not have to solve common problems such as:
- How do I know which Cluster I’m in?
- How do I do config files so I’m not hard coding my paths, nodes, etc.?
- How do I notify on failure/success?
- When do I notify?
- How should I structure my processing, processed and archive directories?

There are many more common questions to ask and answer. You should plan on having a reset every 3 to 6 months.
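As one possible answer to the first two questions, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the HADOOP_CLUSTER environment variable, the INI layout, and the host names and paths are all made up, not something our shop or any Hadoop distribution defines.

```python
import configparser
import os

# Hypothetical per-cluster settings; in practice this would live in one shared file.
SAMPLE_CONFIG = """
[DEFAULT]
notify = hadoop-admins@example.com

[dev]
namenode = hdfs://dev-nn.example.com:8020
landing_dir = /data/dev/landing

[prod]
namenode = hdfs://prod-nn.example.com:8020
landing_dir = /data/prod/landing
"""


def current_cluster(environ=os.environ):
    """Answer "which cluster am I in?" from an (assumed) environment variable."""
    return environ.get("HADOOP_CLUSTER", "dev")


def load_settings(config_text, cluster):
    """Return the settings for one cluster, with [DEFAULT] values filled in."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(cluster):
        raise KeyError(f"no settings for cluster {cluster!r}")
    return dict(parser[cluster])
```

A job then asks load_settings(SAMPLE_CONFIG, current_cluster()) for its paths instead of hard coding them, and moving it between clusters is a one-variable change.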

If you don’t take the time to consolidate, you’ll end up supporting dozens of different solutions to the same problem. I don’t know about you, but I’d rather have 1 process to understand.

Sharpen the saw or spend your life supporting bash scripts created by Java devs! I should have saved that horror story for Halloween!

Posted in Administration, Deployment, Development, syndicated

Hadoop Hindsight #2 Keep it simple: more than likely someone else has encountered your problem.

An adventure is only an inconvenience rightly considered. An inconvenience is an adventure wrongly considered.
-G.K. Chesterton

Sometimes our ego gets the best of us. This seems to occur more often in Hadoop than anywhere else I’ve worked. I’m not sure if this relatively new world propels us into thinking we’re on an island, or if Java developers are inherently poor data analysts. At any rate, we need to rein in our bloated self-image and realize that someone else likely encountered our issue and a seasoned committer carried it through the stack to resolution. Let me give you an example:

Sqooping data with newlines

I wish I had caught this issue earlier. Some of our developers were pulling data from Teradata and DB2 and encountered embedded newline and ctrl-a data in a few columns. Claiming the ‘bad’ data broke their process, they overreacted and jumped to using Avro files to resolve their problem. While Avro is well and good for some issues, this was major overkill that ended up causing issues within Datameer and created additional complexity in HCat. I took some time to ‘research’ (à la google-fu) to see what others had done to get around this. I already had a few simple ideas, like using a regex in the SQL to remove \n, \r and \01, but I was really looking for a more elegant solution.

It took me 30 minutes or so to work up an example, create a failure, and RTFM for a resolution. Much like our developers, I was hitting walls everywhere; the Sqoop documentation isn’t bad, but there are some holes. A little more searching and I found Cloudera Sqoop-129, “Newlines in RDBMS fields break hive.” Created 11/2010 and resolved 5/2011. Turns out it was fixed in Sqoop version 1.3.0 and we are on 1.4.2, so looking good so far. The fix implemented these arguments, which handle elimination or replacement of these characters during the load.

--hive-drop-import-delims Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.

It turns out they fixed our problem from a Hive standpoint, but it’s actually valid for Pig, etc. It’s much more elegant than a source-SQL/regex solution because I don’t need to specify fields; everything is covered. Now in our case the business users didn’t even care about the newlines that were present in 3 of 2 million rows (ug!) so I just used --hive-drop-import-delims in the sqoop command and everything was fine.

So by adding a single line to a Sqoop step, I eliminated the need to maintain an additional serialization framework and downstream processes will likely be easier to maintain. When dealing with basic business data we need to realize it isn’t rocket science; someone else has probably already figured it out.
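To make the two options concrete, here is a small Python sketch. The clean_field function only imitates the effect described for the flags (it is not Sqoop’s code), and sqoop_import_argv just assembles a command line; the function names, the JDBC URL, table and target directory are placeholders I made up.

```python
# The three characters the Sqoop options act on: newline, carriage return, ctrl-a.
HIVE_DELIMS = ("\n", "\r", "\x01")


def clean_field(value, replacement=""):
    """Imitate the effect on one string field: drop the delimiters, or swap them out."""
    for ch in HIVE_DELIMS:
        value = value.replace(ch, replacement)
    return value


def sqoop_import_argv(jdbc_url, table, target_dir, replacement=None):
    """Build a sqoop import command line that uses one of the two delimiter options."""
    argv = ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir]
    if replacement is None:
        argv.append("--hive-drop-import-delims")
    else:
        argv += ["--hive-delims-replacement", replacement]
    return argv
```

Calling sqoop_import_argv with no replacement gives the variant I used; passing a space instead keeps word boundaries intact in free-text columns.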




Posted in Development, Hindsight, Opinions

GlusterFS and Hadoop, not replacing HDFS

Enterprise Hadoop must cooperate with many other forms of data transmission and ingestion. Any form of MFT, Mqueue or file landing zone requires disk space. Not HDFS disk, just disk that we can mount, MFT, SFTP, etc. to until we actually ingest the data into Hadoop (where life is beautiful all the time).

Traditional “Enterprise” disk space is provided by SAN or NAS mounts. There are reasons for this: snapshots, flashcopies, highly available nodes, re-redundant disks and de-duplication oh my! There are many valid reasons for using these technologies. Most – if not all – of those reasons do not apply to Hadoop landing zones.

Enter GlusterFS: a striped, redundant, multiple access point solution. My SPOF Hadoop v. 1.x NameNode can write to a GlusterFS mount, and I can boot my DataNodes to a GlusterFS mount that has a backup server baked right into the mount command. I can point MFT, SFTP, Mqueue, etc. to a mount that has redundancy baked right in. This is sounding redundant.

My point is that GlusterFS meets the multi-node, replicated storage requirements enterprises demand, but uses local SATA disk at roughly 1/34th of SAN cost. That SWAG is based on our internal cost of SAN @ $7.50/GB vs. $0.22/GB for local disk.
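The arithmetic behind that SWAG, as a tiny Python check. The two per-GB prices are the ones quoted above; the function name is just for illustration.

```python
SAN_COST_PER_GB = 7.50    # internal SAN cost quoted above, in dollars
LOCAL_COST_PER_GB = 0.22  # local SATA cost quoted above, in dollars


def cost_ratio(expensive_per_gb, cheap_per_gb):
    """How many times more one GB costs on the expensive tier."""
    return expensive_per_gb / cheap_per_gb


ratio = cost_ratio(SAN_COST_PER_GB, LOCAL_COST_PER_GB)
print(f"SAN costs about {ratio:.0f}x as much per GB as local SATA")
```

The ratio works out to a little over 34.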

Good, Fast & Cheap — It’s a brave new world.

Posted in Administration, Deployment, Tuning

Consuming JSON Strings in SQL Server

This article describes a TSQL JSON parser and its evil twin, a JSON outputter, and provides the source. It is also designed to illustrate a number of string manipulation techniques in TSQL. With it you can do things like this to extract the data from a JSON document:

Read the full article here.

Posted in Uncategorized

You Paid for Support?! Bwah-ha-ha

We’re using Open Source Software extensively in our Big Enterprise. It really irritates me that we pay millions of dollars for “Support” from our vendors and we get endless circles of “try this,” “that should work” and “oh, that’s an upstream bug, we’ll file a bug report.” Seriously? For 10% of what we’re paying these guys, I’ll do it myself.

I currently have 3 bugs open with 3 vendors; 2 of those are open source. Let’s talk about them.

1) OS won’t PXE boot across a LACP Bond. The documentation says it should. Everything “looks right” but after 3 business days of the vendor telling me to try things I’ve already tried, I finally solved this myself. I can boot my DataNode image to my servers, but I wanted to install an OS on some of the control nodes. As soon as the install agent starts up, it loses network connectivity. I told it how to configure the bond on the kernel boot line, but it fails to see it and use it. Trying to use a single interface doesn’t work because the switch is expecting to distribute the packets (per LACP 802.3ad spec) across 4 NICs. It turns out that I can tell the kernel to use eth0 and NOT probe other network devices, which solves 99% of my problem. It’s not perfect, but it’s a helluva lot better than trying to hand install. Here’s a hint if you have this problem: nonet.

2) Proprietary software vendor can’t pull the Avro schema from HDFS. This seems to be squarely in their court for resolution, however, they claim it’s a bug in Hive and opened a bug report. Come on kids, if you’re finding hdfs:// and expecting file:// something is wrong on your side.

3) Open source Hadoop vendor opened a bug report because Pig doesn’t correctly support Avro in our version. We supplied a bug report and a bug solution from Apache, but they made us chase our tails for 10 days before they agreed and opened a new bug report.

After losing some 600 blocks of data in our Dev cluster we found out there is a “fix” for under-replicated blocks coming in HDFS 0.20, but 0.1x doesn’t have this “feature.” Support DID help us find that issue, but ONLY after they ran us through hoops looking for non-existent configuration problems.

My advice: Eschew paid support and dig into the details on your own. You’ll learn more, be more valuable and solve your own problems faster.

Posted in Uncategorized

Replication FAIL

We’ve had our clusters running for a few months without significant issues. Or at least so we thought.
I’m not sure of the why and how yet, but it seems that even with rack topology scripts running, a replication factor of 3 and nightly rebalancing, we had some 600 blocks failing to be replicated across racks. After digging through documentation, consulting vendors and generally feeling frustrated, I discovered that this is somewhat known. Apparently the “fix” is something I found back in December of last year. See here.

So my new nightly routine (automated, of course) is to fsck the cluster looking for “Replica placement policy” violations, bump the replication factor by 1, then set it back after the blocks have replicated. I am somewhat irritated by this need.
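A sketch of that routine in Python, for anyone who wants a starting point. It only parses fsck text and produces the hadoop fs -setrep commands to run; it does not run anything. The line format it matches ("<path>: Replica placement policy is violated for ...") is my assumption about what fsck prints, so check it against your own output, and the function names are hypothetical.

```python
import re

# Assumed shape of a violation line in `hadoop fsck` output.
VIOLATION = re.compile(r"^(?P<path>/.*?):\s+Replica placement policy is violated")


def violating_paths(fsck_output):
    """Return each file path reported with a placement violation, once, in order."""
    paths = []
    for line in fsck_output.splitlines():
        match = VIOLATION.match(line.strip())
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths


def setrep_commands(paths, normal_replication=3):
    """Pair up the 'raise by one and wait' and 'set it back' commands per file."""
    commands = []
    for path in paths:
        # -w makes setrep wait until the extra replica has actually been written.
        commands.append(["hadoop", "fs", "-setrep", "-w",
                         str(normal_replication + 1), path])
        commands.append(["hadoop", "fs", "-setrep",
                         str(normal_replication), path])
    return commands
```

Feeding the saved output of a nightly fsck run through violating_paths and then setrep_commands gives a list a cron job can execute in order.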

Posted in Administration, Deployment, Tuning