A wonderful, ugly script that just keeps working

Today were going to look at parts of a complex “nudge” script as I’ve described previously. It has a few more bells and whistles and constantly amazes me how well it adapts.

I’ll show the good bits in sections so we can discuss.

First some cool date math
TDY=`date +%Y-%m-%d`
if [ -n "$1" ] ; then
date -d $1
if [ $? == 0 ]; then
TDY=`date -d $1 +%Y-%m-%d`
TMO=`date -d "$TDY + 1day" +%Y-%m-%d`
TS=`date -d $TDY +%Y%m%d`

TDY is today’s date. Unless you passed in a valid date that you want to use. This is useful for processing batches of data based on load date, landed date, etc.
TMO is tomorrow. That’s useful for finding files that landed today. You need TMO to do that with find. We’ll see more about that in a bit.
TS is a TimeStamp for logging purposes. Since TDY might be a passed value, we need to ensure that TS is used. We can also expand this for Hour Min Secs.

MY_PATH="`dirname \"$0\"`" # relative
MY_PATH="`( cd \"$MY_PATH\" && pwd )`"

This is a cool trick to always know exactly where you started, no matter who or where you are. Useful for self updates as seen below.

printf "%(%Y-%m-%d %H:%M:%S)T This log entry contains a date/time stamp.\n" "$(date +%s)" >> $LOG

git reset --hard HEAD; git pull
chmod +x $0

To ensure we end up where we started, we head back the the MY_PATH value we saved earlier.
Then we we ensure that we have the latest incarnation of ourself and ensure we’re executable.

Finally, the last 2 lines of the code are always spawn myself in the background and disown the child process, as described.

Leave a comment

The 3 Question Test

  1. A burger and fries costs $1.10; the burger costs $1 more than the fries. How much do the fries cost?
  2. 5 servers can sort 5 TB of data in 5 minutes; how long would 100 servers take to sort 100 TB of data?
  3. A patch of mold doubles in size every day. It takes 9 days to cover the sample dish; how long to cover 1/2 of the sample dish?

I’ll post a comment w/ the answers.

1 Comment

Experimenting w/ Neo4j

Graph databases are a really neat concept. We’ve started playing with Neo here as we attempt to link customers with visits and actions based on those visits. It seems like a really good fit at first glance.

Our challenge is moving from the traditional RDBMS thinking to Graph thinking. Lots of experimenting and changing models to be found.

If we decided to use Neo, I’ll post some thoughts our how we wrapped our heads around it.

Leave a comment

Just give it a nudge.

The second definition of nudge, according to Webster, is to “prod lightly: urge into action.”
We use that concept in our data environments for various long running processes; for things that we want to happen frequently, but with an unknown runtime.

There are many ways to modify/customize this concept for specific uses. This basic concept is here:

## Nudge.sh -- for forever running, self re-entrant script.


# You may have configuration variables you want to include. Just create a file with the script name plus .env and this will source that file.
# . ${0}.env

# Optional, but you might want to make sure you keep running by default
if [ -e ${0}.stop ]; then rm -f ${0}.stop; fi

# do something useful
python useful.py >> log.$0 2>&1

# Just so we know things are running, even w/o log entries.
touch log.file

## and now we sleep then relaunch or exit as needed.

# To stop the forever loop, just touch a file. (touch ${0}.stop)
if [ -e ${0}.stop ]; then echo "Stopping"; exit; fi

# Call myself again as a background child.
$0 &

# Remove myself as a parent to the just launched process. ($! is the PID for that child)
disown $!

Run a command, take a nap, repeat.

Leave a comment

Redshift ups and downs

AWS Redshift has been popular lately around my current gig. We’ve got a couple of clusters in use and a few more in POC mode. The in-use clusters are easy to justify pre-paid instances. A few thousand dollars and you have a cluster; not bad.

The cost of POC’s is a little more difficult, so we do what we can to keep them off-line when not in use. Fortunately, the AWS CLI makes this fairly trivial. I have a couple of short scripts that I have cron’d for daily (M-F) execution.

First we need to stop the cluster w/ a Snapshot:

TS=$(date +"%Y%m%d-%H%M%S")
echo $TS > ${CLUSTERID}_ss.ts
aws redshift delete-cluster --cluster-identifier $CLUSTERID --final-cluster-snapshot-identifier ${SNAPID}-$TS

All we’re doing is creating a unique snapshot name that we can store and use to create a new cluster from. Timestamps are a favorite tool of mine for doing this.

And in the morning we create a new cluster based on the snapshot:

TS=$(cat ${CLUSTERID}_ss.ts)
aws redshift restore-from-cluster-snapshot --cluster-identifier $CLUSTERID --snapshot-identifier ${SNAPID}-$TS

In this script we Read TS from the file we created on shutdown and use that (the latest) snapshot to start our cluster.

Cron these scripts for daily execution and Profit! ūüôā

1 Comment

Quick Split to Fix data silliness

We have a vendor sending us daily updates on shipping info. We have a well known and defined structure for each type of data and those types map neatly to tables in our database. We have about 9 tables that need updated each day to give us the complete picture from this vendor point of view.

After months of trying to get them to send the data, it finally showed up; in 1 file. *sigh* They jammed, in random order, all of the new tables records into 1 unorganized file. The only saving grace is that the 2nd column defines the record — and table — type.

After I pondered this for a few moments, I started working on a quick and “simple” solution. I came up with this:

for x in `cat INFILE.dat | cut -f 2 -d $'\x01' | sort | uniq`; do cat INFILE.dat | grep $'\001'${x}$'\001' > ${x}.txt; done

Grab all of the unique table types and loop over the data for each one grep’ing as needed. There are probably more efficient ways, but this works pretty fast on our smallish data set.

Leave a comment

MapR is a better base?

I’ve heard about MapR for a long time and haven’t given it much consideration vs. OSS stacks. I reconsidering my position and conduction some evaluations.

Why? MaprFS is a real POSIX File system that runs on Raw devices, not atop ext4, xfs, etc. It also runs natively, not in a JVM. It claims HDFS compliance and so should work with other Hadoop tools. It also has an integrated HBase API compatible data store and Kafka 0.9 API compatible pub/sub service. These are part of the base FS which should make things “better” than the rest. We’ll see.

If some of these things prove out, this could be a game changer for my perspective.

1 Comment

Ubuntu sucks and so does Debian

Full disclosure, I cut my teeth on Slackware and Redhat in the mid 90’s. ¬†I even tried Yggdrasil once. ¬†That being said…

I fully fail to understand the allure of Ubuntu or it’s Mommy distro, Debian. ¬†Yes I know Ubuntu is supposed to be the “world peace” of OS’s but that lie has been exposed many times. ¬†An arrogant .com millionaire started it and he directs the mostly moronic design decisions. ¬†What’s to love again? ¬†Their website barley mentions Linux.

Debian might be a good design but rpm isn’t all that bad. ¬†WTF (why) do we need an entirely different package mgt. tool? ¬†Can we really not learn to live with or at least contribute changes to the existing standards? ¬†I’ll fight over that one… RPM was WAY here first.

Moving on; this applies to Hadoop in the realm of “what os should I install to carry my Hadoop cluster?” ¬†The obvious answer, IMHO, is CentOS. ¬†It’s free, it’s driven from the designs of a¬†commercial¬†business and it’s the development platform for most Hadoop contributors. ¬†Even if that last part wasn’t true, it’d still be the best choice.

CentOS is derived from the source code published (as required by lic.) by RedHat.  It claims binary compatibility and a $0 price tag.  Some of you are thinking that this is evil by association.  Allow me to correct your thinking.

RedHat must make money to survive. ¬†In order to make money, they must offer something more than a “free” os. ¬†As it turns out, the best way to do that is contribute their “cool new” features, fixes and patches to the community. ¬†It’s also required by the EULA. ūüėČ ¬†So now we have a for-profit company contributing “cool new” features to the kernel and distribution. ¬†And we can reap those rewards for free via CentOS. ¬†Why wouldn’t you use them?

Leave a comment

Cloud Hadoop? Buzzword Fiesta!

We haven’t quite jumped the shark yet, but this is going to be full of buzzwords.

Started a new gig where we’re building Dev, POC and possibly some prod clusters on AWS. Once again the first 80% of this was pretty easy. Using Cloudbreak, it’s fairly easy to create clusters. Developing new Amabari (gag) blueprints is pretty easy and they do a lot of the heavy lifting. It starts getting ugly after that.

Cloudbreak uses Consul to “discover” hosts after creating the AWS instances. There is a little black magic going on with Consul to use pseudo DNS. Add to this “Containers” and you have a pretty screwed up environment from a purist point of view. So add Kerberos to this mix and you might need some Xanax. Kerberos wants nice FQDNs for all of the hosts involved. That sorta goes against the idea of Elastic Hadoop, but we’ll burn that bridge later. Just getting Consul and Ambari to see each “node” (sometimes as a container) using consistent names is going to be interesting.

So, Kerberized, Elastic Hadoop in the Cloud with encrypted data in flight and at rest. That’s the buzzword goal. :-/

Leave a comment

Hadoop 2.0 GA

I’ve been watching the Hadoop user mailing lists and jira counts. It sure seems like 2.0 GA is more like 2.0 Beta 1.

I’m looking forward to RC 1 before we move it into a serious cluster. Just my $0.02.

Leave a comment