Kafka on AWS EC2 w/ SSL and External Visibility

I’m truly shocked by how difficult this information is to gather in one place. Maybe that’s because AWS has its own take on Kafka-style functionality.

At any rate, after much reading and irritation I have it working. There is still some work to do securing Zookeeper and adding ACLs to Kafka, but we’ll get there later.

Tip 1: Use an Elastic IP per Kafka broker and give it a DNS entry. We’re using Route 53, so that’s pretty easy.
Tip 2: Put a complete list of your Kafka brokers and their INTERNAL IP addresses in /etc/hosts on each broker, matching their DNS hostnames (see the sketch just below this list).
Tip 3: Edit the network settings on your brokers so the hostname matches the DNS entry.
Tip 4: Bounce the box after this to ensure it all sticks.
Tip 5: Do NOT use underscores in your hostnames.
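
For example, the hosts file on each broker might end up looking something like this (the IPs here are made up; use your brokers' internal VPC addresses):

# /etc/hosts -- hypothetical internal addresses for a 3 broker cluster
10.0.1.11   kb1.mydomain.com   kb1
10.0.2.12   kb2.mydomain.com   kb2
10.0.3.13   kb3.mydomain.com   kb3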

All of this is to ensure Zookeeper can figure out WTF is going on. It won’t let you tell it directly… :(

We have a 3 node cluster running across 3 AZs in our VPC. Single-node Kafka is a lot easier. The following server.properties settings are what make this work for us:

Make sure broker.id is set to a unique number on each broker
auto.leader.rebalance.enable=true — enables better shutdown/restart experiences
controlled.shutdown.enable=true — helps with this as well
listeners=SSL://__HOST__:9093 — This is for your specific broker. We disabled PLAINTEXT as an option.
advertised.listeners=SSL://__HOST__:9093
host.name=__HOST__
advertised.host.name=__HOST__
(host.name and advertised.host.name are the older, deprecated settings, superseded by listeners/advertised.listeners, but we set them all anyway.)

We wanted to keep everything forever, which means setting a really high retention value like log.retention.hours=2147483647. For the SSL stuff:

ssl.keystore.location=/root/jks/kafka.server.keystore.jks
ssl.keystore.password=
ssl.key.password=
ssl.truststore.location=/root/jks/kafka.server.truststore.jks
ssl.truststore.password=

security.inter.broker.protocol=SSL
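
On the client side, anything connecting from outside needs a matching truststore. A minimal client properties sketch (the path here is just an example) looks like:

security.protocol=SSL
ssl.truststore.location=/path/to/kafka.client.truststore.jks
ssl.truststore.password=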

There is a whole lot to learn about generating CA-signed keys and keystores for Java. I don’t have the energy, so I’ll give you the link:
http://kafka.apache.org/documentation/#security
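
If you want the condensed version of that flow, it goes roughly like this (aliases, validity, and filenames are just examples; see the docs for the full story):

# Broker key pair, a CA, a truststore, then sign and import -- condensed sketch
keytool -keystore kafka.server.keystore.jks -alias localhost -validity 365 -genkey -keyalg RSA
openssl req -new -x509 -keyout ca-key -out ca-cert -days 365
keytool -keystore kafka.server.truststore.jks -alias CARoot -import -file ca-cert
keytool -keystore kafka.server.keystore.jks -alias localhost -certreq -file cert-file
openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days 365 -CAcreateserial
keytool -keystore kafka.server.keystore.jks -alias CARoot -import -file ca-cert
keytool -keystore kafka.server.keystore.jks -alias localhost -import -file cert-signed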

Kafka has “Rack Awareness” built in, which is kinda cool. I use a bash script to start my brokers, which lets me fill in a template with the rack location using metadata that EC2 instances know about themselves.

In the template set broker.rack=__AZ__
and in your bash script do something like this:
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
sed -e "s/__ID__/${ID}/g;s/__HOST__/${HOST}/g;s/__AZ__/${AZ}/g" /root/kafka_current/config/server.properties.tmp > /root/kafka_current/config/server.properties

I embed the Broker ID in the hostname so I can do stuff like:
ID=$(hostname | cut -f 1 -d. | cut -f 2 -d "b")
HOST=$(hostname)

So my broker names are kb1.mydomain.com, kb2.mydomain.com, etc.
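
Stitched together, the whole start script is just those pieces plus a launch at the end (the kafka-server-start.sh -daemon line is one way to do the launch step; adjust the paths to your layout):

#!/bin/bash
# Fill the server.properties template from instance metadata, then launch the broker
HOST=$(hostname)
ID=$(hostname | cut -f 1 -d. | cut -f 2 -d "b")
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
sed -e "s/__ID__/${ID}/g;s/__HOST__/${HOST}/g;s/__AZ__/${AZ}/g" /root/kafka_current/config/server.properties.tmp > /root/kafka_current/config/server.properties
/root/kafka_current/bin/kafka-server-start.sh -daemon /root/kafka_current/config/server.properties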

Most of the Kafka Monitoring and Management tools completely fail w/ SSL. :(

Posted in Administration, Deployment | Leave a comment

Drilling thru Multiple Clusters

…or Using Apache Drill to join data across discrete domains.

We’ve been doing some work with Redshift lately. While it’s an effective tool for storing and crunching thru large amounts of structured data, it’s limited by a few “-isms” that keep it from being more useful.

The first is just annoying: It’s an identity island. It doesn’t attach to anything for UAA; not LDAP, not even IAM! This is a damn shame.

The second restriction is around selecting across databases. Redshift allows you to create multiple databases in a single cluster, and multiple schemas within each database. The good news is that you can cross the schema boundaries to join tables, etc. The bad news is that you can’t select across multiple databases in the same cluster. :-/ In theory this is good for data separation, but in practice it means I must load multiple copies of my Enterprise Lookup Tables. I can’t have just one copy of my Master Customer ID to Address table; I have to have a copy in each database.

Enter Apache Drill. Drill allows me to configure multiple connections and connection types for use in Drill queries. So I can configure a PostgreSQL JDBC connection for each of the two clusters and query them as one, like this:

select s.units_sold, l.customer_name from redshift_A.billing.sales s join redshift_b.lookups.customers l on s.cust_id = l.cust_id

In this case my two distinct Redshift clusters have Drill storage configurations* named redshift_A and redshift_b. Each definition is tied to a specific Redshift database on its cluster, so including the database in the name might be a better naming standard. In the redshift_A cluster there is a schema named “billing” and a table named “sales”, so the table reference in our SQL select statements is storage_name.schema_name.table. Again, the storage name is a local alias for a specific Redshift cluster and database combination.

*Here is a sample configuration:
{
  "type": "jdbc",
  "driver": "org.postgresql.Driver",
  "url": "jdbc:postgresql://my-rs-cluster.awsGibberish.us-east-1.redshift.amazonaws.com:5439/mydb?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory",
  "username": "admin",
  "password": "--secretpw--",
  "enabled": true
}

As you can see, we are using the PostgreSQL JDBC driver w/ SSL to connect to “my-rs-cluster” and the specific database “mydb” in that cluster.

Pretty cool stuff.

But wait! There’s more! Drill isn’t limited to Redshift or even JDBC. It can work directly with S3 and/or local files in various formats (Parquet, CSV, TSV, etc.), as well as Hive, HBase, Mongo, and others.

I’ve just begun to explore the abilities and quirks here, but I’m liking the start.

Posted in Uncategorized | Leave a comment

T vs. V and W Shaped People

We talk a lot about hiring T shaped people at my current gig and I think it’s a misnomer for a couple of reasons.

First, it implies a ratio of depth to width that is askew. Developers and admins in today’s world have to be familiar with a really wide range of technologies and tools. Hell, you can’t even just develop anymore; you also have to understand deploy tools, repository caches, and git. Git is more than a traditional VCS, as most of us have learned. This fun and frightening article explains some of the challenges of JavaScript these days: How it feels to learn JavaScript in 2016

Second, it makes it look like we’re only deep in one area. Who spikes down into Linux without having some MySQL, network, bash, python, etc. chops? And more than just a little across the top of the T. As you get deeper into a specific area, you find that you need to bring other skills up (down) to a similar level. Maybe not quite as deep, but close. As you group your various depths together, you start to look more V shaped. As you add deep skills in other areas, you might be a W shape. As a Linux Admin, I transitioned into Hadoop Administration. I had to gain some skills around Java, HDFS, YARN, etc. along the way. While Linux is a part of Hadoop, Linux Admin skills are still different from Hadoop Admin skills. So, I was a W. Working with Amazon Web Services is another skill area that is wide. Understanding their offerings has certainly been challenging, from VPCs and Redshift clusters to Route 53, IAM, and more. Now I feel like a VW. :D

Maybe the whole idea of looking for people who have a skill is the wrong direction. Maybe we should look for people who can learn and figure out how to solve problems. Otherwise we have to start looking for VWWVW people, and that’s hard to pronounce.

Posted in Administration, Career, Development, Opinions | Leave a comment

System Administration Rules to Live By

I’ve had a variation of these running around for a while. Tweaks may come and go with trends, but the concepts are the same.

  1. When they say “Go Big!” they don’t mean it.
  2. Start with optimistic scripts. Finish them defensively.
  3. Assume your audience knows nothing.
  4. Suspend disbelief – It’ll work.
  5. Self documenting isn’t.
  6. Track your work – If it’s not in JIRA (or Trello) it didn’t happen.
  7. Celebrate “Big Wins” – They don’t last long and are soon forgotten.
  8. If everything is a top priority, nothing is. Don’t stress about it.
  9. Don’t Panic. This is probably fixable.
  10. Fail fast. If something doesn’t work, dump it or be stuck with it forever.
  11. Nuke it from orbit, it’s the only way to be sure.
  12. Productive Teams Eschew Enterprise Solutions – because they suck.
  13. Make it work, doesn’t mean fix it – it just means, MAKE IT WORK!

And a few truths about automated services

  1. If it runs, it must log status
  2. If it logs, it must be rotated
  3. Logs must be reviewed by something
  4. If it runs, it must have an SLA
  5. If it runs, it must be alerted when it misses SLA
Posted in Administration, Opinions | Leave a comment

A wonderful, ugly script that just keeps working

Today we’re going to look at parts of a complex “nudge” script, as I’ve described previously. It has a few more bells and whistles, and it constantly amazes me how well it adapts.

I’ll show the good bits in sections so we can discuss.

First, some cool date math:
TDY=`date +%Y-%m-%d`
if [ -n "$1" ] ; then
  # Only accept the argument if it parses as a valid date
  date -d "$1" > /dev/null 2>&1
  if [ $? -eq 0 ]; then
    TDY=`date -d "$1" +%Y-%m-%d`
  fi
fi
TMO=`date -d "$TDY + 1 day" +%Y-%m-%d`
TS=`date -d "$TDY" +%Y%m%d`


TDY is today’s date, unless you passed in a valid date that you want to use instead. This is useful for processing batches of data based on load date, landed date, etc.
TMO is tomorrow. That’s useful for finding files that landed today; find needs both TDY and TMO as bounds to do that, as shown below.
TS is a TimeStamp for logging purposes. Since TDY might be a passed-in value, TS is derived from it so the two always agree. We can also expand this for hours, minutes, and seconds.
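
For example, a find bounded by TDY and TMO picks up everything that landed on the TDY date (the /data/landing path is hypothetical):

# Files modified on $TDY: newer than midnight $TDY, but not newer than midnight $TMO
find /data/landing -type f -newermt "$TDY" ! -newermt "$TMO"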

MY_PATH="`dirname \"$0\"`" # relative
MY_PATH="`( cd \"$MY_PATH\" && pwd )`"


This is a cool trick to always know exactly where you started, no matter who or where you are. Useful for self updates as seen below.

Each pass also drops a timestamped entry in the log (the %(...)T format needs bash 4.2 or newer):

printf "%(%Y-%m-%d %H:%M:%S)T This log entry contains a date/time stamp.\n" "$(date +%s)" >> $LOG


cd "$MY_PATH"
git reset --hard HEAD; git pull
chmod +x "$0"


To ensure we end up where we started, we head back to the MY_PATH value we saved earlier.
Then we ensure that we have the latest incarnation of ourselves and that we’re still executable.

Finally, the last 2 lines of the script always spawn a fresh copy of itself in the background and disown the child process, as described previously.
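
For reference, that tail is the same pair of lines as in the nudge script:

$0 &
disown $!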

Posted in Uncategorized | Leave a comment

The 3 Question Test

  1. A burger and fries costs $1.10; the burger costs $1 more than the fries. How much do the fries cost?
  2. 5 servers can sort 5 TB of data in 5 minutes; how long would 100 servers take to sort 100 TB of data?
  3. A patch of mold doubles in size every day. It takes 9 days to cover the sample dish; how long to cover 1/2 of the sample dish?

I’ll post a comment w/ the answers.

Posted in Uncategorized | 1 Comment

Experimenting w/ Neo4j

Graph databases are a really neat concept. We’ve started playing with Neo here as we attempt to link customers with visits and actions based on those visits. It seems like a really good fit at first glance.

Our challenge is moving from the traditional RDBMS thinking to Graph thinking. Lots of experimenting and changing models to be found.

If we decide to use Neo, I’ll post some thoughts on how we wrapped our heads around it.

Posted in Data, Development, Tuning | Leave a comment

Just give it a nudge.

The second definition of nudge, according to Webster, is to “prod lightly: urge into action.”
We use that concept in our data environments for various long running processes; for things that we want to happen frequently, but with an unknown runtime.

There are many ways to modify or customize this concept for specific uses. The basic concept is here:


#!/bin/bash
## Nudge.sh -- for forever running, self re-entrant script.

SLEEPTIME=2m

# You may have configuration variables you want to include. Just create a file with the script name plus .env and this will source that file.
# . ${0}.env

# Optional, but you might want to make sure you keep running by default
if [ -e "${0}.stop" ]; then rm -f "${0}.stop"; fi

# do something useful
python useful.py >> log.$0 2>&1

# Just so we know things are running, even w/o log entries.
touch log.file

## and now we sleep then relaunch or exit as needed.
sleep $SLEEPTIME

# To stop the forever loop, just touch a file. (touch ${0}.stop)
if [ -e "${0}.stop" ]; then echo "Stopping"; exit; fi

# Call myself again as a background child.
$0 &

# Remove myself as a parent to the just launched process. ($! is the PID for that child)
disown $!



Run a command, take a nap, repeat.
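
Kicking it off and shutting it down looks something like this (assuming you saved the script as nudge.sh):

./nudge.sh &             # launch the first generation by hand
disown                   # detach it from your shell so it survives logout
touch ./nudge.sh.stop    # later: ask it to exit after the current pass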

Posted in Administration, Deployment, Development | Leave a comment

Redshift ups and downs

AWS Redshift has been popular lately around my current gig. We’ve got a couple of clusters in use and a few more in POC mode. The in-use clusters are easy to justify as reserved (pre-paid) instances. A few thousand dollars and you have a cluster; not bad.

The cost of POCs is a little harder to justify, so we do what we can to keep them offline when not in use. Fortunately, the AWS CLI makes this fairly trivial. I have a couple of short scripts cron’d for daily (M-F) execution.

First we need to stop the cluster, taking a final snapshot on the way down:


CLUSTERID=Your-Clustername
TS=$(date +"%Y%m%d-%H%M%S")
echo $TS > ${CLUSTERID}_ss.ts
SNAPID=autosnap-$CLUSTERID
aws redshift delete-cluster --cluster-identifier $CLUSTERID --final-cluster-snapshot-identifier ${SNAPID}-$TS


All we’re doing is creating a unique snapshot name that we can store and use to create a new cluster from. Timestamps are a favorite tool of mine for doing this.

And in the morning we create a new cluster based on the snapshot:


CLUSTERID=Your-Clustername
TS=$(cat ${CLUSTERID}_ss.ts)
SNAPID=autosnap-$CLUSTERID
aws redshift restore-from-cluster-snapshot --cluster-identifier $CLUSTERID --snapshot-identifier ${SNAPID}-$TS


In this script we read TS from the file we created on shutdown and use that (the latest) snapshot to recreate our cluster.

Cron these scripts for daily execution and Profit! :)
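
The crontab entries end up looking something like this (the times and script paths are hypothetical):

# Hypothetical M-F schedule: snapshot and delete at 8pm, restore at 6am
0 20 * * 1-5 /root/scripts/redshift_stop.sh
0 6 * * 1-5 /root/scripts/redshift_start.sh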

Posted in Administration, Development | 1 Comment

Quick Split to Fix data silliness

We have a vendor sending us daily updates on shipping info. We have a well known and defined structure for each type of data, and those types map neatly to tables in our database. We have about 9 tables that need to be updated each day to give us the complete picture from this vendor’s point of view.

After months of trying to get them to send the data, it finally showed up; in one file. *sigh* They jammed all of the new table records, in random order, into one unorganized file. The only saving grace is that the 2nd column defines the record — and table — type.

After I pondered this for a few moments, I started working on a quick and “simple” solution. I came up with this:

for x in `cat INFILE.dat | cut -f 2 -d $'\x01' | sort | uniq`; do cat INFILE.dat | grep $'\001'${x}$'\001' > ${x}.txt; done

Grab all of the unique table types and loop over the data for each one grep’ing as needed. There are probably more efficient ways, but this works pretty fast on our smallish data set.
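
One of those more efficient ways is a single pass with awk; this sketch assumes the same \x01 delimiter and writes each record to a file named after its 2nd column:

# Split INFILE.dat into <type>.txt files keyed on the 2nd column
awk -F $'\x01' '{ print > ($2 ".txt") }' INFILE.dat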

Posted in Administration, Data | Tagged , , | Leave a comment