T vs. V and W Shaped People

We talk a lot about hiring T shaped people at my current gig and I think it’s a misnomer for a couple of reasons.

First, it implies a ratio of depth to breadth that is askew. Developers and Admins in today’s world have to be familiar with a really wide range of technologies and tools. Hell, you can’t even just develop anymore; you also have to understand deploy tools, repository caches and Git, and Git is more than a traditional VCS, as most of us have learned. This fun and frightening article explains some of the challenges of JavaScript these days: How it feels to learn JavaScript in 2016

Second, it makes it look like we’re only deep in one area. Who spikes down into Linux without having some MySQL, networking, bash, Python, etc. chops? And more than just a little breadth across the top of the T. As you get deeper into a specific area, you find that you need to bring other skills up (down) to a similar level. Maybe not quite as deep, but close. As you group your various depths together, you start to look more V shaped. As you add deep skills in other areas, you might be a W shape. As a Linux Admin, I transitioned into Hadoop administration. I had to gain some understanding of Java, HDFS, YARN, etc. along the way. While Linux is a part of Hadoop, Linux Admin skills are still different from Hadoop Admin skills. So, I was a W. Working with Amazon Web Services is another skill area that is wide. Understanding their offerings has certainly been challenging, from VPCs and Redshift clusters to Route 53, IAM and more. Now I feel like a VW. :D

Maybe the whole idea of looking for people who have a specific skill is the wrong direction. Maybe we should look for people who can learn and figure out how to solve problems. Otherwise we have to start looking for VWWVW people, and that’s hard to pronounce.

Posted in Administration, Career, Development, Opinions | Leave a comment

System Administration Rules to Live By

I’ve had a variation of these running around for a while. Tweaks may come and go with trends, but the concepts are the same.

  1. When they say “Go Big!” they don’t mean it.
  2. Start with optimistic scripts. Finish them defensively.
  3. Assume your audience knows nothing.
  4. Suspend disbelief – It’ll work.
  5. Self documenting isn’t.
  6. Track your work – If it’s not in JIRA (or Trello) it didn’t happen.
  7. Celebrate “Big Wins” – They don’t last long and are soon forgotten.
  8. If everything is a top priority, nothing is. Don’t stress about it.
  9. Don’t Panic. This is probably fixable.
  10. Fail fast. If something doesn’t work, dump it or be stuck with it forever.
  11. Nuke it from orbit, it’s the only way to be sure.
  12. Productive Teams Eschew Enterprise Solutions – because they suck.
  13. Make it work, doesn’t mean fix it – it just means, MAKE IT WORK!

And a few truths about automated services (a rough sketch of 3 through 5 follows the list):

  1. If it runs, it must log status
  2. If it logs, it must be rotated
  3. Logs must be reviewed by something
  4. If it runs, it must have an SLA
  5. If it runs, it must be alerted when it misses SLA
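A minimal sketch of what truths 3 through 5 can look like in practice; the path, the SLA window, and the alert address are placeholders for whatever your shop uses:

LOG=/var/log/myservice/myservice.log   # placeholder path
SLA_MIN=30                             # placeholder SLA: at least one log write every 30 minutes

# If nothing has touched the log inside the SLA window, yell at someone.
if [ -z "$(find "$LOG" -mmin -${SLA_MIN} 2>/dev/null)" ]; then
  echo "myservice missed its ${SLA_MIN} minute SLA" | mail -s "SLA MISS: myservice" oncall@example.com
fi

Cron that check itself, and you’ve covered the review, SLA, and alerting truths in one shot.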
Posted in Administration, Opinions | Leave a comment

A wonderful, ugly script that just keeps working

Today we’re going to look at parts of a complex “nudge” script, like the one I’ve described previously. It has a few more bells and whistles, and it constantly amazes me how well it adapts.

I’ll show the good bits in sections so we can discuss.

First, some cool date math:
TDY=`date +%Y-%m-%d`                      # default to today
if [ -n "$1" ] ; then
  date -d "$1" > /dev/null 2>&1           # did we get a valid date argument?
  if [ $? == 0 ]; then
    TDY=`date -d "$1" +%Y-%m-%d`          # use the supplied date instead
  fi
fi
TMO=`date -d "$TDY + 1 day" +%Y-%m-%d`    # tomorrow
TS=`date -d "$TDY" +%Y%m%d`               # timestamp for logging

TDY is today’s date. Unless you passed in a valid date that you want to use. This is useful for processing batches of data based on load date, landed date, etc.
TMO is tomorrow. That’s useful for finding files that landed today. You need TMO to do that with find. We’ll see more about that in a bit.
TS is a timestamp for logging purposes. Since TDY might be a passed-in value, we derive TS from it rather than from the clock, so our log names match the date being processed. We can also expand this to include hours, minutes, and seconds.
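For example, here’s a sketch of that find usage (the /data/incoming path is just a placeholder): anything that landed on TDY was modified on or after midnight of TDY but before midnight of TMO.

find /data/incoming -type f -newermt "$TDY" ! -newermt "$TMO"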

MY_PATH="`dirname \"$0\"`" # relative
MY_PATH="`( cd \"$MY_PATH\" && pwd )`"

This is a cool trick to always know exactly where you started, no matter who or where you are. Useful for self updates as seen below.

printf "%(%Y-%m-%d %H:%M:%S)T This log entry contains a date/time stamp.\n" "$(date +%s)" >> $LOG

cd "$MY_PATH"               # head back to where we started
git reset --hard HEAD; git pull
chmod +x $0

To ensure we end up where we started, we head back to the MY_PATH value we saved earlier.
Then we ensure that we have the latest incarnation of ourself and that we’re executable.

Finally, the last two lines of the code always spawn myself in the background and disown the child process, as described.
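Those two lines are the same trick shown in the nudge post:

$0 &        # spawn a fresh copy of myself in the background
disown $!   # detach the child so it survives after this instance exits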

Posted in Uncategorized | Leave a comment

The 3 Question Test

  1. A burger and fries costs $1.10; the burger costs $1 more than the fries. How much do the fries cost?
  2. 5 servers can sort 5 TB of data in 5 minutes; how long would 100 servers take to sort 100 TB of data?
  3. A patch of mold doubles in size every day. It takes 9 days to cover the sample dish; how long to cover 1/2 of the sample dish?

I’ll post a comment w/ the answers.

Posted in Uncategorized | 1 Comment

Experimenting w/ Neo4j

Graph databases are a really neat concept. We’ve started playing with Neo here as we attempt to link customers with visits and actions based on those visits. It seems like a really good fit at first glance.

Our challenge is moving from the traditional RDBMS thinking to Graph thinking. Lots of experimenting and changing models to be found.
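As a rough illustration of the difference (the labels, relationships, and credentials here are hypothetical, not our actual model), graph thinking means walking relationships instead of joining tables, e.g. from cypher-shell:

# Hypothetical model: (Customer)-[:MADE]->(Visit)-[:TOOK]->(Action)
cypher-shell -u neo4j -p secret \
  "MATCH (c:Customer {id: 42})-[:MADE]->(v:Visit)-[:TOOK]->(a:Action)
   RETURN v.date, a.name"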

If we decide to use Neo, I’ll post some thoughts on how we wrapped our heads around it.

Posted in Data, Development, Tuning | Leave a comment

Just give it a nudge.

The second definition of nudge, according to Webster, is to “prod lightly: urge into action.”
We use that concept in our data environments for various long running processes; for things that we want to happen frequently, but with an unknown runtime.

There are many ways to modify/customize this concept for specific uses. The basic version looks like this:

## Nudge.sh -- a forever-running, self re-entrant script.


# You may have configuration variables you want to include. Create a file named after the script plus .env and uncomment the next line to source it.
# . ${0}.env

# Optional, but you might want to make sure you keep running by default
if [ -e ${0}.stop ]; then rm -f ${0}.stop; fi

# do something useful
python useful.py >> log.$0 2>&1

# Just so we know things are running, even w/o log entries.
touch log.file

## and now we sleep, then relaunch or exit as needed.
sleep 300   # nap time between runs; tune to taste

# To stop the forever loop, just touch a file. (touch ${0}.stop)
if [ -e ${0}.stop ]; then echo "Stopping"; exit; fi

# Call myself again as a background child.
$0 &

# Remove myself as a parent to the just launched process. ($! is the PID for that child)
disown $!

Run a command, take a nap, repeat.
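To kick the whole thing off the first time, launch it by hand the same way it relaunches itself (assuming the script is saved as nudge.sh and is executable):

./nudge.sh &
disown $!

# and when you want the loop to end, let it finish its current pass and exit:
touch ./nudge.sh.stop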

Posted in Administration, Deployment, Development | Leave a comment

Redshift ups and downs

AWS Redshift has been popular lately around my current gig. We’ve got a couple of clusters in use and a few more in POC mode. The in-use clusters are easy to justify as pre-paid (reserved) instances. A few thousand dollars and you have a cluster; not bad.

The cost of POCs is a little more difficult to justify, so we do what we can to keep them offline when not in use. Fortunately, the AWS CLI makes this fairly trivial. I have a couple of short scripts that I have cron’d for daily (M-F) execution.

First we need to stop the cluster w/ a Snapshot:

TS=$(date +"%Y%m%d-%H%M%S")
echo $TS > ${CLUSTERID}_ss.ts
aws redshift delete-cluster --cluster-identifier $CLUSTERID --final-cluster-snapshot-identifier ${SNAPID}-$TS

All we’re doing is creating a unique snapshot name that we can store and use to create a new cluster from. Timestamps are a favorite tool of mine for doing this.

And in the morning we create a new cluster based on the snapshot:

TS=$(cat ${CLUSTERID}_ss.ts)
aws redshift restore-from-cluster-snapshot --cluster-identifier $CLUSTERID --snapshot-identifier ${SNAPID}-$TS

In this script we read TS from the file we created on shutdown and use that (the latest) snapshot to start our cluster.

Cron these scripts for daily execution and Profit! :)
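The cron entries themselves are the boring part; something along these lines, where the times and script paths are placeholders:

# Snapshot and delete the POC cluster in the evening, Monday through Friday
0 19 * * 1-5 /opt/scripts/redshift_stop.sh >> /var/log/redshift_stop.log 2>&1
# Restore it from the latest snapshot before the workday starts
0 6 * * 1-5 /opt/scripts/redshift_start.sh >> /var/log/redshift_start.log 2>&1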

Posted in Administration, Development | 1 Comment

Quick Split to Fix data silliness

We have a vendor sending us daily updates on shipping info. We have a well known and defined structure for each type of data, and those types map neatly to tables in our database. We have about 9 tables that need to be updated each day to give us the complete picture from this vendor’s point of view.

After months of trying to get them to send the data, it finally showed up; in 1 file. *sigh* They jammed, in random order, all of the new table records into 1 unorganized file. The only saving grace is that the 2nd column defines the record — and table — type.

After I pondered this for a few moments, I started working on a quick and “simple” solution. I came up with this:

for x in `cat INFILE.dat | cut -f 2 -d $'\x01' | sort | uniq`; do cat INFILE.dat | grep $'\001'${x}$'\001' > ${x}.txt; done

Grab all of the unique table types and loop over the data for each one grep’ing as needed. There are probably more efficient ways, but this works pretty fast on our smallish data set.
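If the file ever grows enough to care about the repeated passes, a single pass with awk should do the same split (an untested sketch, assuming the same \x01 delimiter and the type in field 2):

awk -F $'\x01' '{ print > ($2 ".txt") }' INFILE.dat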

Posted in Administration, Data | Tagged , , | Leave a comment

MapR is a better base?

I’ve heard about MapR for a long time and haven’t given it much consideration vs. OSS stacks. I’m reconsidering my position and conducting some evaluations.

Why? MapR-FS is a real POSIX file system that runs on raw devices, not atop ext4, xfs, etc. It also runs natively, not in a JVM. It claims HDFS compatibility and so should work with other Hadoop tools. It also has an integrated HBase-API-compatible data store and a Kafka 0.9-API-compatible pub/sub service. These are part of the base FS, which should make things “better” than the rest. We’ll see.

If some of these things prove out, this could be a game changer for my perspective.

Posted in Administration, Market Segment/Growth, Opinions | Tagged , , , , | Leave a comment

Ubuntu sucks and so does Debian

Full disclosure, I cut my teeth on Slackware and Redhat in the mid-’90s. I even tried Yggdrasil once. That being said…

I fully fail to understand the allure of Ubuntu or its Mommy distro, Debian. Yes, I know Ubuntu is supposed to be the “world peace” of OS’s, but that lie has been exposed many times. An arrogant .com millionaire started it and he directs the mostly moronic design decisions. What’s to love again? Their website barely mentions Linux.

Debian might be a good design, but rpm isn’t all that bad. WTF (why) do we need an entirely different package mgt. tool? Can we really not learn to live with, or at least contribute changes to, the existing standards? I’ll fight over that one… RPM was here WAY first.

Moving on; this applies to Hadoop in the realm of “what OS should I install to carry my Hadoop cluster?” The obvious answer, IMHO, is CentOS. It’s free, it’s driven from the designs of a commercial business, and it’s the development platform for most Hadoop contributors. Even if that last part weren’t true, it’d still be the best choice.

CentOS is derived from the source code published (as required by license) by RedHat. It claims binary compatibility and a $0 price tag. Some of you are thinking that this is evil by association. Allow me to correct your thinking.

RedHat must make money to survive. In order to make money, they must offer something more than a “free” OS. As it turns out, the best way to do that is to contribute their “cool new” features, fixes and patches to the community. It’s also required by the GPL. ;) So now we have a for-profit company contributing “cool new” features to the kernel and distribution. And we can reap those rewards for free via CentOS. Why wouldn’t you use them?

Posted in Administration, Opinions | Leave a comment