Quick Split to Fix data silliness

We have a vendor sending us daily updates on shipping info. There is a well-known, well-defined structure for each type of data, and those types map neatly to tables in our database. About 9 tables need to be updated each day to give us the complete picture from this vendor’s point of view.

After months of trying to get them to send the data, it finally showed up… in 1 file. *sigh* They jammed all of the new tables’ records, in random order, into 1 unorganized file. The only saving grace is that the 2nd column defines the record — and table — type.

After I pondered this for a few moments, I started working on a quick and “simple” solution. I came up with this:

# Split INFILE.dat by record type: field 2, \001-delimited, one output file per type
for x in $(cut -f 2 -d $'\001' INFILE.dat | sort -u); do grep $'\001'"${x}"$'\001' INFILE.dat > "${x}.txt"; done

Grab all of the unique table types, then loop over the data, grepping out each type as needed. There are probably more efficient ways, but this works pretty fast on our smallish data set.
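One of those more efficient ways, for the record, would be a single pass with awk that writes each record straight into its type’s file as it reads. A quick sketch, assuming the same \001 delimiter and type-in-column-2 layout:

awk -F $'\001' '{ print > ($2 ".txt") }' INFILE.dat

That keeps every output file open for the whole pass, which is no problem for nine-ish table types.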

Posted in Administration, Data

MapR is a better base?

I’ve heard about MapR for a long time and haven’t given it much consideration vs. OSS stacks. I’m reconsidering my position and conducting some evaluations.

Why? MapR-FS is a real POSIX file system that runs on raw devices, not atop ext4, xfs, etc. It also runs natively, not in a JVM. It claims HDFS compatibility and so should work with other Hadoop tools. It also has an integrated HBase-API-compatible data store and a Kafka 0.9-API-compatible pub/sub service. These are part of the base FS, which should make things “better” than the rest. We’ll see.

If some of these things prove out, this could be a game changer for my perspective.

Posted in Administration, Market Segment/Growth, Opinions

Ubuntu sucks and so does Debian

Full disclosure: I cut my teeth on Slackware and Redhat in the mid-’90s.  I even tried Yggdrasil once.  That being said…

I fully fail to understand the allure of Ubuntu or its Mommy distro, Debian.  Yes, I know Ubuntu is supposed to be the “world peace” of OSes, but that lie has been exposed many times.  An arrogant .com millionaire started it and he directs the mostly moronic design decisions.  What’s to love again?  Their website barely mentions Linux.

Debian might be a good design, but RPM isn’t all that bad.  WTF (why) do we need an entirely different package management tool?  Can we really not learn to live with, or at least contribute changes to, the existing standards?  I’ll fight over that one… RPM was here WAY first.

Moving on: this applies to Hadoop in the realm of “what OS should I install to carry my Hadoop cluster?”  The obvious answer, IMHO, is CentOS.  It’s free, it’s driven by the designs of a commercial business, and it’s the development platform for most Hadoop contributors.  Even if that last part weren’t true, it’d still be the best choice.

CentOS is derived from the source code published (as the licenses require) by RedHat.  It claims binary compatibility and a $0 price tag.  Some of you are thinking that this is evil by association.  Allow me to correct your thinking.

RedHat must make money to survive.  In order to make money, they must offer something more than a “free” OS.  As it turns out, the best way to do that is to contribute their “cool new” features, fixes and patches back to the community.  It’s also required by the GPL. ;)  So now we have a for-profit company contributing “cool new” features to the kernel and distribution.  And we can reap those rewards for free via CentOS.  Why wouldn’t you use them?

Posted in Administration, Opinions

Cloud Hadoop? Buzzword Fiesta!

We haven’t quite jumped the shark yet, but this is going to be full of buzzwords.

Started a new gig where we’re building Dev, POC and possibly some prod clusters on AWS. Once again, the first 80% of this was pretty easy. Using Cloudbreak, it’s fairly easy to create clusters. Developing new Ambari (gag) blueprints is also straightforward, and they do a lot of the heavy lifting. It starts getting ugly after that.

Cloudbreak uses Consul to “discover” hosts after creating the AWS instances. There is a little black magic going on with Consul to use pseudo DNS. Add to this “Containers” and you have a pretty screwed up environment from a purist point of view. So add Kerberos to this mix and you might need some Xanax. Kerberos wants nice FQDNs for all of the hosts involved. That sorta goes against the idea of Elastic Hadoop, but we’ll burn that bridge later. Just getting Consul and Ambari to see each “node” (sometimes as a container) using consistent names is going to be interesting.
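For what it’s worth, the first sanity check I run on each node before kerberizing anything is whether the hostname, forward DNS and reverse DNS all agree. A rough sketch, nothing Cloudbreak- or Consul-specific about it:

# Quick per-node check: Kerberos (and Ambari) want hostname -f, forward and reverse DNS to agree
fqdn=$(hostname -f)
ip=$(hostname -i | awk '{ print $1 }')
echo "hostname -f : ${fqdn}"
echo "forward     : $(dig +short "${fqdn}")"
echo "reverse     : $(dig +short -x "${ip}")"

If those three don’t line up for every node, container or not, Kerberos is going to make you pay for it.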

So, Kerberized, Elastic Hadoop in the Cloud with encrypted data in flight and at rest. That’s the buzzword goal. :-/

Posted in Deployment, Security

Hadoop 2.0 GA

I’ve been watching the Hadoop user mailing lists and jira counts. It sure seems like 2.0 GA is more like 2.0 Beta 1.

I’m looking forward to RC 1 before we move it into a serious cluster. Just my $0.02.

Posted in Uncategorized

Intuition about Chromecast

I bought a Google Chromecast device. It was really cheap ($35+tax) and I’m a whore for media casting devices.

When I “flick” tablet videos to Chromecast, I notice something interesting. I see the Chromecast display ponder what I’ve sent it. Then I see my tablet seemingly release the content. Yes, significant controls still run on the tablet, but I “feel like” the Chromecast device has taken over playback and is now merely listening for “remote control” codes from the tablet.

This feeling may be 100% wrong. It is based on my experience using VLC Player on a TV display device along with VLC Remote on Android.

No opinions have been confirmed as fact. I’m only thinking this based on 30 years of IT experience. YMMV, IANAL. ;)

Internet memes will eventually become Skynet, IMHO. The Rick-rolled Chuck Norris memes are our only defense.

Posted in Uncategorized

hello woRld!

R is the latest Hadoop darling. It is an open source language that “is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners are showing R’s popularity has increased substantially in recent years.” See the Wikipedia article for more details: http://en.wikipedia.org/wiki/R_(programming_language)

The good news is that R is being developed by PhD-level statisticians and academics across the globe. It contains many statistical functions, models and analytic constructs.

The bad news is that it’s being developed by PhDs and academics! Great for developing analytics, not so great for developing stable solutions. :)

Enter the Hadoop Sysadmin! He can write bash scripts, he can redirect output and he can weaponize R! Man up, Hadoop Admin, you’re going to understand R syntax; it’s not terribly difficult, but it is slightly weird. R can read ENV VARs and run from the CMD line, and – given its collegiate beginnings – that counts as a 10x win.

Google is your friend and has most of your answers. I can say that R CMD BATCH hellowoRld.R MyProg.log is a good start. A nice bash script to handle errors, in/output dirs, etc. can have you running R in normal batch mode pretty quickly. You’ll probably have a more difficult time explaining proper development techniques to the Data Scientist who just wants to run things in RStudio. ;)
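For illustration only, here’s the shape of the wrapper I mean; the paths, script name and mail address below are made up, but the pattern holds: settings go in as ENV VARs, the log gets captured, and the exit code decides who gets emailed.

#!/bin/bash
# Batch wrapper for an R job: pass dirs via ENV VARs (read inside R with Sys.getenv),
# capture the log, and notify on failure.
set -euo pipefail

export INPUT_DIR=${INPUT_DIR:-/data/incoming}      # hypothetical defaults
export OUTPUT_DIR=${OUTPUT_DIR:-/data/outgoing}
LOG="/var/log/r_jobs/hellowoRld_$(date +%Y%m%d_%H%M%S).log"

if ! R CMD BATCH --no-save --no-restore hellowoRld.R "${LOG}"; then
    mail -s "hellowoRld.R FAILED" hadoop-ops@example.com < "${LOG}"
    exit 1
fi

Inside the .R file, Sys.getenv("INPUT_DIR") picks the paths up, so nothing gets hard coded in R.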

Posted in Administration, Deployment, Development, Tuning

Weaponizing Hadoop

We are usually left with bash for scripting Hadoop functions. It’s the default in Linux and it’s usually good enough.

There are enough “bash-isms” to cause your Java/Pig/database people serious heartache. If you’re new to Hadoop, go ahead and let the developers develop. After a few months you will have solved some common problems, and that is the time to regroup. Take a couple of weeks to “sharpen the saw” by finding the best of the good and standardizing on your solution. Life is so much better when every Hadoop developer does not have to solve common problems such as these (a sketch of one shared answer follows the list):
- How do I know which cluster I’m in?
- How do I handle config files so I’m not hard coding my paths, nodes, etc.?
- How do I notify on failure/success?
- When do I notify?
- How should I structure my processing, processed and archive directories?
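Here’s a minimal sketch of what I mean (every name below is invented): one include file, sourced at the top of every job script, that settles the cluster question, the standard directories and the notification plumbing exactly once.

# /opt/hadoop-jobs/etc/cluster_env.sh -- sourced by every job script
# Decide which cluster we're on from the local hostname, then derive everything else from it.
case "$(hostname -f)" in
    *.dev.example.com)  CLUSTER=dev  ;;
    *.prod.example.com) CLUSTER=prod ;;
    *) echo "Unknown cluster for host $(hostname -f)" >&2; exit 1 ;;
esac
export CLUSTER

export LANDING_DIR=/data/${CLUSTER}/landing
export PROCESSED_DIR=/data/${CLUSTER}/processed
export ARCHIVE_DIR=/data/${CLUSTER}/archive
export NOTIFY_LIST=hadoop-ops@example.com

# One notification helper instead of a dozen hand-rolled ones.
notify_failure() {
    echo "$(date): $*" | mail -s "[${CLUSTER}] job failed: $1" "${NOTIFY_LIST}"
}

Job scripts then start with ". /opt/hadoop-jobs/etc/cluster_env.sh" and never hard code a path or a mail address again.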

There are many more common questions to ask and answer. You should plan on having a reset every 3 to 6 months.

If you don’t take the time to consolidate, you’ll end up supporting dozens of different solutions to the same problem. I don’t know about you, but I’d rather have 1 process to understand.

Sharpen the saw or spend your life supporting bash scripts created by Java devs! I should have saved that horror story for Halloween!

Posted in Administration, Deployment, Development, syndicated

Hadoop Hindsight #2: Keep it simple, because more than likely someone else has already encountered your problem.

An adventure is only an inconvenience rightly considered. An inconvenience is an adventure wrongly considered.
-G.K. Chesterton

Sometimes our ego gets the best of us.  This seems to occur more often in Hadoop than anywhere else I’ve worked.  I’m not sure if this relatively new world propels us into thinking we’re on an island, or if Java developers are inherently poor data analysts.  At any rate, we need to rein in our bloated self-image and realize that someone else has likely encountered our issue and a seasoned committer carried it through the stack to resolution.  Let me give you an example:

Sqooping data with newlines

I wish I had caught this issue earlier. Some of our developers were pulling data from Teradata and DB2 and encountered embedded newline and Ctrl-A data in a few columns.  Claiming the ‘bad’ data broke their process, they overreacted and jumped to using Avro files to resolve their problem.  While Avro is well and good for some issues, this was major overkill that ended up causing issues within Datameer and created additional complexity in HCat.  I took some time to ‘research’ (a la google-fu) to see what others had done to get around this.  I already had a few simple ideas, like regexing your SQL to remove \n\r\01, but I was really looking for a more elegant solution.

It took me 30 minutes or so to work up an example, create a failure, and RTFM for a resolution.  I was hitting walls everywhere, much like our developers; the Sqoop documentation isn’t bad, but there are some holes.  A little more searching and I found Cloudera SQOOP-129, “Newlines in RDBMS fields break Hive.”  Created 11/2010 and resolved 5/2011.  Turns out it was fixed in Sqoop version 1.3.0 and we are on 1.4.2 – looking good so far.  The fix implemented these arguments, which handle elimination or replacement of these characters during the load:

--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement    Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.

It turns out they fixed our problem from a Hive standpoint, but it’s actually valid for Pig, etc.  It’s much more elegant than a source-SQL/regex solution because I don’t need to specify fields – everything is covered.  Now in our case the business users didn’t even care about the newlines that were present in 3 of 2 million rows (ugh!), so I just used --hive-drop-import-delims in the sqoop command and everything was fine.
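For reference, the change really is one extra argument. A trimmed-down version of the kind of sqoop command involved (the connection details are obviously made up):

sqoop import \
    --connect jdbc:db2://db2host.example.com:50000/SALES \
    --username etl_user -P \
    --table CUSTOMER_NOTES \
    --hive-import \
    --hive-drop-import-delims \
    -m 4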

So by adding a single line to a Sqoop step, I eliminated the need to maintain an additional serialization framework, and downstream processes will likely be easier to maintain.  When dealing with basic business data we need to realize it isn’t rocket science – someone else has probably already figured it out.

Posted in Development, Hindsight, Opinions

GlusterFS and Hadoop, not replacing HDFS

Enterprise Hadoop must cooperate with many other forms of data transmission and ingestion. Any form of MFT, Mqueue or file landing zone requires disk space. Not HDFS disk, just disk that we can mount and MFT, SFTP, etc. to until we actually ingest the data into Hadoop (where life is beautiful all the time).

Traditional “Enterprise” disk space is provided by SAN or NAS mounts. There are reasons for this: snapshots, flashcopies, highly available nodes, redundant disks and de-duplication, oh my! There are many valid reasons for using these technologies. Most – if not all – of those reasons do not apply to Hadoop landing zones.

Enter GlusterFS: a striped, redundant, multiple-access-point solution. My SPOF Hadoop 1.x NameNode can write to a GlusterFS mount. I can boot my DataNodes from a GlusterFS mount that has a backup server baked right into the mount command. I can point MFT, SFTP, Mqueue, etc. at a mount that has redundancy baked right in. This is sounding redundant.
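For example (the server and volume names are placeholders), the native fuse mount lets you name a fallback server to fetch the volume file from, while the volume’s own replication handles the data:

mount -t glusterfs -o backupvolfile-server=gluster02 gluster01:/landing /mnt/landing

# or the /etc/fstab version, so it survives a reboot
gluster01:/landing  /mnt/landing  glusterfs  defaults,_netdev,backupvolfile-server=gluster02  0 0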

My point is that GlusterFS meets the multi-node, replicated storage requirements enterprises demand, but using local SATA disk at roughly 1/34th of the SAN cost. That SWAG is based on our internal cost of SAN at $7.50/GB vs. $0.22/GB for local disk ($7.50 / $0.22 ≈ 34x).

Good, Fast & Cheap — It’s a brave new world.

Posted in Administration, Deployment, Tuning