Experimenting w/ Neo4j

Graph databases are a really neat concept. We’ve started playing with Neo here as we attempt to link customers with visits and actions based on those visits. It seems like a really good fit at first glance.
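
As a rough illustration only (the labels and relationship types below are hypothetical placeholders, not our actual model), the shape we’re experimenting with looks something like this when poked at through cypher-shell (or the older neo4j-shell):

# Hypothetical model sketch; labels and relationship names are placeholders.
cypher-shell -u neo4j -p YourPassword <<'EOF'
CREATE (c:Customer {id: 42})
CREATE (v:Visit {ts: '2016-06-01T10:00:00'})
CREATE (a:Action {type: 'add_to_cart'})
CREATE (c)-[:MADE]->(v)
CREATE (v)-[:INCLUDED]->(a)
RETURN c, v, a;
EOF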

Our challenge is moving from traditional RDBMS thinking to graph thinking. Lots of experimenting and model changes ahead.

If we decide to use Neo, I’ll post some thoughts on how we wrapped our heads around it.


Just give it a nudge.

The second definition of nudge, according to Webster, is to “prod lightly: urge into action.”
We use that concept in our data environments for various long-running processes: things that we want to happen frequently, but that have an unknown runtime.

There are many ways to modify or customize this concept for specific uses. The basic version looks like this:


#!/bin/bash
## nudge.sh -- a forever-running, self re-entrant script.

SLEEPTIME=2m

# You may have configuration variables you want to include. Create a file
# named after the script plus .env and uncomment this line to source it.
# . "${0}.env"

# Optional, but you might want to make sure you keep running by default.
if [ -e "${0}.stop" ]; then rm -f "${0}.stop"; fi

# Do something useful. (${0##*/} strips the path so the log lands in the current dir.)
python useful.py >> "log.${0##*/}" 2>&1

# Just so we know things are running, even without log entries.
touch log.file

## And now we sleep, then relaunch or exit as needed.
sleep "$SLEEPTIME"

# To stop the forever loop, just touch a file: touch ${0}.stop
if [ -e "${0}.stop" ]; then echo "Stopping"; exit; fi

# Call myself again as a background child.
"$0" &

# Remove myself as the parent of the just-launched process ($! is that child's PID).
disown $!


Run a command, take a nap, repeat.
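
Getting in and out of the loop is just as simple. A quick sketch, assuming the script above is saved as nudge.sh and is executable:

# Kick it off once; from then on it re-launches itself.
./nudge.sh &

# To stop the loop after the current cycle finishes ($0 is ./nudge.sh here):
touch ./nudge.sh.stop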


Redshift ups and downs

AWS Redshift has been popular lately around my current gig. We’ve got a couple of clusters in use and a few more in POC mode. For the in-use clusters it’s easy to justify pre-paid (reserved) instances: a few thousand dollars and you have a cluster; not bad.

The cost of the POCs is a little more difficult to justify, so we do what we can to keep them offline when not in use. Fortunately, the AWS CLI makes this fairly trivial. I have a couple of short scripts cron’d for daily (M-F) execution.

First we need to shut the cluster down, taking a final snapshot on the way out:


CLUSTERID=Your-Clustername
TS=$(date +"%Y%m%d-%H%M%S")
echo $TS > ${CLUSTERID}_ss.ts
SNAPID=autosnap-$CLUSTERID
aws redshift delete-cluster --cluster-identifier $CLUSTERID --final-cluster-snapshot-identifier ${SNAPID}-$TS


All we’re doing is creating a unique snapshot name that we can store and use to create a new cluster from. Timestamps are a favorite tool of mine for doing this.

And in the morning we create a new cluster based on the snapshot:


CLUSTERID=Your-Clustername
TS=$(cat ${CLUSTERID}_ss.ts)
SNAPID=autosnap-$CLUSTERID
aws redshift restore-from-cluster-snapshot --cluster-identifier $CLUSTERID --snapshot-identifier ${SNAPID}-$TS


In this script we read TS from the file we created at shutdown and use that (the latest) snapshot to start our cluster.
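
If you want the restore script to block until the cluster is actually usable (handy when something downstream depends on it), the CLI’s waiters can help. A sketch, assuming a reasonably recent awscli:

# Block until the restored cluster reports itself as available.
aws redshift wait cluster-available --cluster-identifier $CLUSTERID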

Cron these scripts for daily execution and Profit! 🙂
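
For completeness, the cron side might look something like this; the paths and times are placeholders, not our actual schedule:

# Stop the POC cluster every weekday evening at 7pm...
0 19 * * 1-5 /opt/scripts/redshift_stop.sh
# ...and restore it at 7am the next morning.
0 7 * * 1-5 /opt/scripts/redshift_start.sh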


Quick Split to Fix data silliness

We have a vendor sending us daily updates on shipping info. We have a well known and defined structure for each type of data, and those types map neatly to tables in our database. We have about 9 tables that need to be updated each day to give us the complete picture from this vendor’s point of view.

After months of trying to get them to send the data, it finally showed up; in 1 file. *sigh* They jammed all of the new tables’ records, in random order, into 1 unorganized file. The only saving grace is that the 2nd column defines the record (and therefore table) type.

After I pondered this for a few moments, I started working on a quick and “simple” solution. I came up with this:

for x in `cat INFILE.dat | cut -f 2 -d $'\x01' | sort | uniq`; do cat INFILE.dat | grep $'\001'${x}$'\001' > ${x}.txt; done

Grab all of the unique table types and loop over the data for each one grep’ing as needed. There are probably more efficient ways, but this works pretty fast on our smallish data set.
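
For the record, one of those more efficient ways is probably a single pass with awk. A sketch, assuming the same \x01-delimited layout with the type in column 2:

# One pass over the file; awk opens each <type>.txt once (truncating it), then appends.
awk -F $'\x01' '{ print > ($2 ".txt") }' INFILE.dat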


MapR is a better base?

I’ve heard about MapR for a long time and haven’t given it much consideration vs. OSS stacks. I’m reconsidering my position and conducting some evaluations.

Why? MapR-FS is a real POSIX file system that runs on raw devices, not atop ext4, xfs, etc. It also runs natively, not in a JVM. It claims HDFS compliance and so should work with other Hadoop tools. It also has an integrated HBase-API-compatible data store and a Kafka 0.9-API-compatible pub/sub service. These are part of the base FS, which should make things “better” than the rest. We’ll see.

If some of these things prove out, this could be a game changer for my perspective.


Ubuntu sucks and so does Debian

Full disclosure, I cut my teeth on Slackware and Redhat in the mid-’90s. I even tried Yggdrasil once. That being said…

I fully fail to understand the allure of Ubuntu or its mommy distro, Debian. Yes, I know Ubuntu is supposed to be the “world peace” of OSes, but that lie has been exposed many times. An arrogant .com millionaire started it and he directs the mostly moronic design decisions. What’s to love again? Their website barely mentions Linux.

Debian might be a good design, but rpm isn’t all that bad. WTF (why) do we need an entirely different package management tool? Can we really not learn to live with, or at least contribute changes to, the existing standards? I’ll fight over that one… RPM was here WAY first.

Moving on; this applies to Hadoop in the realm of “what OS should I install to carry my Hadoop cluster?” The obvious answer, IMHO, is CentOS. It’s free, it’s driven from the designs of a commercial business, and it’s the development platform for most Hadoop contributors. Even if that last part weren’t true, it’d still be the best choice.

CentOS is derived from the source code published (as required by license) by RedHat. It claims binary compatibility and a $0 price tag. Some of you are thinking that this is evil by association. Allow me to correct your thinking.

RedHat must make money to survive. In order to make money, they must offer something more than a “free” OS. As it turns out, the best way to do that is to contribute their “cool new” features, fixes and patches to the community. It’s also required by the license. 😉 So now we have a for-profit company contributing “cool new” features to the kernel and distribution. And we can reap those rewards for free via CentOS. Why wouldn’t you use them?


Cloud Hadoop? Buzzword Fiesta!

We haven’t quite jumped the shark yet, but this is going to be full of buzzwords.

Started a new gig where we’re building dev, POC and possibly some prod clusters on AWS. Once again, the first 80% of this was pretty easy. Using Cloudbreak, it’s fairly easy to create clusters. Developing new Ambari (gag) blueprints is pretty easy, and they do a lot of the heavy lifting. It starts getting ugly after that.

Cloudbreak uses Consul to “discover” hosts after creating the AWS instances. There is a little black magic going on with Consul to use pseudo DNS. Add to this “Containers” and you have a pretty screwed up environment from a purist point of view. So add Kerberos to this mix and you might need some Xanax. Kerberos wants nice FQDNs for all of the hosts involved. That sorta goes against the idea of Elastic Hadoop, but we’ll burn that bridge later. Just getting Consul and Ambari to see each “node” (sometimes as a container) using consistent names is going to be interesting.
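
For the curious, the “pseudo DNS” piece is Consul’s built-in DNS interface: the agent answers queries for names under the .consul domain, by default on port 8600. A sketch of what that looks like (the node name is a placeholder):

# Ask the local Consul agent to resolve a registered node name.
dig @127.0.0.1 -p 8600 worker-node-1.node.consul +short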

So, Kerberized, Elastic Hadoop in the Cloud with encrypted data in flight and at rest. That’s the buzzword goal. :-/


Hadoop 2.0 GA

I’ve been watching the Hadoop user mailing lists and JIRA counts. It sure seems like 2.0 GA is more like 2.0 Beta 1.

I’m looking forward to RC 1 before we move it into a serious cluster. Just my $0.02.


Intuition about Chromecast

I bought a Google Chromecast device. It was really cheap ($35+tax) and I’m a whore for media casting devices.

When I “flick” tablet videos to the Chromecast, I notice something interesting. I see the Chromecast’s display ponder what I’ve sent it. Then I see my tablet seemingly release the content. Yes, significant controls still run on the tablet, but I “feel like” the Chromecast device has taken over playback and is now merely listening for “remote control” codes from the tablet.

This feeling may be 100% wrong. It is based on my experience of using VLC player on a TV display device along with VLC Remote on Android.

No opinions have been confirmed as fact. I’m only thinking this based on 30 years of IT experience. YMMV, IANAL. 😉

Internet memes will eventually become Skynet, IMHO. The Rick-rolled Chuck Norris memes are our only defense.


hello woRld!

R is the latest Hadoop darling. It is an open source language that “is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners are showing R’s popularity has increased substantially in recent years.” See the Wikipedia article for more details: http://en.wikipedia.org/wiki/R_(programming_language)

The good news is that R is being developed by PhD-level statisticians and academics across the globe. It contains many statistical functions, models and analytic constructs.

The bad news is that it’s being developed by PhDs and academics! Great for developing analytics, not so great for developing stable solutions. 🙂

Enter the Hadoop sysadmin! He can write bash scripts, he can redirect output and he can weaponize R! Man up, Hadoop admin, you’re going to understand R syntax; it’s not terribly difficult, but it is slightly weird. R can read environment variables and run from the command line and – given its collegiate beginnings – that counts as a 10x win.

Google is your friend and has most of your answers. I can say that R CMD BATCH hellowoRld.R MyProg.log is a good start. A nice bash script to handle errors, input/output dirs, etc. can have you running R in normal batch mode pretty quickly. You’ll probably have a more difficult time explaining proper development techniques to the Data Scientist who just wants to run things in RStudio. 😉
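
To make that concrete, here’s a minimal sketch of such a wrapper. The script name, directories and variables are placeholders; the only firm bits are R CMD BATCH itself and the fact that R code can pick up environment variables with Sys.getenv():

#!/bin/bash
# Minimal batch wrapper sketch; names and paths are placeholders.
RSCRIPT=hellowoRld.R
LOGDIR=./logs
mkdir -p "$LOGDIR"

# Environment variables the R code can read via Sys.getenv().
export INPUT_DIR=./input
export OUTPUT_DIR=./output

# Run in batch mode; --no-save/--no-restore keep the workspace out of the picture.
R CMD BATCH --no-save --no-restore "$RSCRIPT" "$LOGDIR/${RSCRIPT%.R}.$(date +%Y%m%d-%H%M%S).log"
RC=$?
if [ $RC -ne 0 ]; then
    echo "$RSCRIPT failed with exit code $RC" >&2
    exit $RC
fi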
