Progress and Setbacks — Hadoop 3.0 breaks things

241 properties have been deprecated in Hadoop 3.0. Is that enough change for you? Dozens of sites have dedicated space to discovering and explaining all the new goodness in Hadoop 3.0. I haven’t read any that discuss the problems this massive overhaul brings to Hadoop’s stability. Maybe I just don’t read enough?

Way back in Dec. 2017, The Apache Software Foundation announced the GA of 3.0. Hortonworks waited until v3.1 before releasing HDP 3.0 in the summer of 2018. Now, vendors are trying to decide when the massive changes are worth implementing. Hadoop may be on 3.1, but HDP is 3.0, and I know very few enterprise-class users who will be using this in production. This is part of the problem.

I’m personally excited about advancements in software and especially Hadoop. YARN now supports Docker containers, the NameNode can now be split (is that different from federation?), and other good things that you can read about in other blogs. I am not, however, an enterprise. I also don’t have to develop software that needs to contend with all of these changes.

I have friends who develop software for Hadoop and I do not envy them. They must now incorporate and test 241 property name changes in their code and maintain backward compatibility. This does not include the myriad changes required to resolve other deprecated systems. For example, the MapReduce engine for Hive is no longer an option. That flat out breaks a piece of software I use. I’m sure the Hadoop devs had good reason for removing this, but it does cause issues, and we’ve seen this kind of thing before.

Surely Hadoop could not fall into this trap. The Python guys (gender neutral) have been battling with broken upgrade paths for years. Lots of years. I just hope that Hadoop can overcome this divide; otherwise I feel its market share will continue to erode and deepen the trough of disillusionment.

You may ask, “What can I do to help prevent this erosive divide?” I guess the only thing you can do is push for 3.0 migrations. Hadoop definitely needs to grow or it will fade away. Application developers and integrators are going to need to bite the bullet and spend time and money on refactoring their code. The only way that will happen is by applying the pressure of customer demands. Customers need to communicate to their vendors that they have a timeline for migrating to Hadoop 3.x and they need to know their vendors will have a functional version by then or be replaced.

I’m sure your mileage will vary, but my first experience connecting software to a HDP 3.0 stack did not go well. Nothing exploded, but nothing worked either.

Posted in Administration, Deployment, Development, Opinions | Leave a comment

One year later

I took an unplanned vacation from work, big data and updating DFP. It’s been fun, but now I’m back and have a new job. Many things are going on and I hope to share some interesting items in the near future.

Mostly this is for my own understanding; secondarily, it may help someone. Upcoming topics should include Data Science, data cleaning, and Hadoop connectivity with security and proxy users. This should be fun. 🙂


Posted in Career, Market Segment/Growth, Opinions | Leave a comment

Kafka on AWS EC2 w/ SSL and External Visibility

I’m truly shocked by how difficult this information is to gather in one place. Maybe that’s because AWS has its own version of Kafka functionality.

At any rate, after much reading and irritation I have it working. There is still some work to do securing Zookeeper and adding ACLs to Kafka, but we’ll get there later.

Tip 1: Use an Elastic IP per Kafka broker and give it a DNS entry. We’re using Route 53, so that’s pretty easy.
Tip 2: Put a complete list of your Kafka brokers and their INTERNAL IP addresses in /etc/hosts on each broker, matching their DNS hostnames.
Tip 3: Edit the network settings on your brokers so the hostname matches the DNS entry.
Tip 4: Bounce the box after this to ensure it works properly.
Tip 5: Do NOT use underscores in your hostname.

All of this is to ensure Zookeeper can figure out WTF is going on. It won’t let you tell it directly… 🙁

We have a 3 node cluster running across 3 AZs in our VPC. Single node Kafka is a lot easier. The following settings for are what make this work for us:

broker.id — make sure this is set to a unique number for each broker
auto.leader.rebalance.enable=true — enables better shutdown/restart experiences
controlled.shutdown.enable=true — helps with this as well
listeners=SSL://__HOST__:9093 — This is for your specific broker. We disabled PLAINTEXT as an option.
advertised.listeners=SSL://__HOST__:9093 — this is what the broker advertises to clients (the older advertised.host.name and advertised.port properties are deprecated).

We wanted to keep everything forever, so you have to set a really high value like log.retention.hours=2147483647 to make that work. For the SSL stuff:


There is a whole lot to learn about generating signed CA keys and keystores for Java. I don’t have the energy, so I’ll give you the link:
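For reference, the SSL-related broker properties generally look something like the following. The paths and passwords here are placeholders, but the property names are the standard Kafka SSL configuration keys:

```
listeners=SSL://__HOST__:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/kafka/ssl/kafka.server.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/kafka/ssl/kafka.server.truststore.jks
ssl.truststore.password=changeit
```

The keystore holds the broker’s own signed cert; the truststore holds the CA you signed everything with, so brokers and clients can verify each other.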

Kafka has “Rack Awareness” built in, so this is kinda cool. I use a bash script to start my brokers, which allows me to use a template and fill in the rack location using stuff that EC2 instances know about themselves.

In the template, set broker.rack=__AZ__
and in your bash script do something like this:
AZ=$(curl -s
cat /root/kafka_current/config/ | sed -e “s/__ID__/${ID}/g;s/__HOST__/${HOST}/g;s/__AZ__/${AZ}/g” > /root/kafka_current/config/

I embed the Broker ID in the hostname so I can do stuff like:
ID=$(hostname | cut -f 1 -d. | cut -f 2 -d “b”)

So my broker names embed the broker ID.

Most of the Kafka Monitoring and Management tools completely fail w/ SSL. 🙁

Posted in Administration, Deployment | Leave a comment

Drilling thru Multiple Clusters

…or Using Apache Drill to join data across discrete domains.

We’ve been doing some work with Redshift lately. While it’s an effective tool for storing and crunching thru large amounts of structured data, it’s limited by a few “-isms” that keep it from being more useful.

The first is just annoying: it’s an identity island. It doesn’t attach to anything for UAA; not LDAP, not even IAM! This is a damn shame.

The second restriction is around selecting across databases. Redshift allows you to create multiple databases in a single cluster, and multiple schemas within each database. The good news is that you can cross the schema boundaries to join tables, etc. The bad news is that you can’t select across multiple databases in the same cluster. :-/ In theory, this is good for data separation, etc., but in practice it means I must load multiple copies of my Enterprise Lookup Tables. I can’t have just one copy of my Master Customer ID to Address table; I have to have one in each database.

Enter Apache Drill. Drill allows me to configure multiple connections and connection types for use in Drill Queries. So I can configure a psql connection via jdbc for two clusters, and query them as one like this:

select s.units_sold, l.customer_name from redshift_A.billing.sales s join redshift_b.lookups.customers l on s.cust_id = l.cust_id

In this case my 2 distinct Redshift clusters have Drill Storage configurations* named redshift_A and redshift_b. These definitions are tied to a specific Redshift database on each cluster, so including that in the name might be a better standard. In the redshift_A cluster there is a schema named “billing” and a table named sales. So the table reference in our SQL select statements is Storage_Name.Schema_Name.Table. Again, the storage name is a local alias for a specific Redshift cluster and database combination.

*Here is a sample configuration:
{
  "type": "jdbc",
  "driver": "org.postgresql.Driver",
  "url": "jdbc:postgresql://my-rs-cluster:5439/mydb?ssl=true",
  "username": "admin",
  "password": "--secretpw--",
  "enabled": true
}

As you can see, we are using the Postgresql JDBC Driver w/ SSL to connect to “my-rs-cluster” and the specific database “mydb” in that cluster.

Pretty cool stuff.

But wait! There’s more! Drill isn’t limited to Redshift or even JDBC. It can work directly with S3 and/or local files in various formats (parquet, csv, tsv, etc.), as well as Hive, HBase, Mongo and others.

I’ve just begun to explore the abilities and quirks here, but I’m liking the start.

Posted in Uncategorized | Leave a comment

T vs. V and W Shaped People

We talk a lot about hiring T shaped people at my current gig and I think it’s a misnomer for a couple of reasons.

First, it implies a ratio of depth to width that is askew. Developers and admins in today’s world have to be familiar with a really wide range of technologies and tools. Hell, you can’t even just develop anymore; you also have to understand deploy tools, repository caches, and git. Git is more than a traditional VCS, as most of us have learned. This fun and frightening article explains some of the challenges of JavaScript these days: How it feels to learn JavaScript in 2016

Second, it makes it look like we’re only deep in one area. Who spikes down into Linux without having some MySQL, network, bash, Python, etc. chops? And more than just a little across the top of the T. As you get deeper into a specific area, you find that you need to bring other skills up (down) to a similar level. Maybe not quite as deep, but close. As you group your various depths together, you start to look more V shaped. As you add deep skills in other areas, you might be a W shape. As a Linux admin, I transitioned into Hadoop administration. I had to gain some skills around Java, HDFS, YARN, etc. along the way. While Linux is a part of Hadoop, Linux admin skills are still different from Hadoop admin skills. So, I was a W. Working with Amazon Web Services is another skill area that is wide. Understanding their offerings has certainly been challenging, from VPCs and Redshift clusters to Route 53, IAM and more. Now I feel like a VW. 😀

Maybe the whole idea of looking for people who have a skill is the wrong direction. Maybe we should look for people who can learn and figure out how to solve problems. Otherwise we have to start looking for VWWVW people, and that’s hard to pronounce.

Posted in Administration, Career, Development, Opinions | Leave a comment

System Administration Rules to Live By

I’ve had a variation of these running around for a while. Tweaks may come and go with trends, but the concepts are the same.

  1. When they say “Go Big!” they don’t mean it.
  2. Start with optimistic scripts. Finish them defensively.
  3. Assume your audience knows nothing.
  4. Suspend disbelief – It’ll work.
  5. Self documenting isn’t.
  6. Track your work – If it’s not in JIRA (or Trello) it didn’t happen.
  7. Celebrate “Big Wins” – They don’t last long and are soon forgotten.
  8. If everything is a top priority, nothing is. Don’t stress about it.
  9. Don’t Panic. This is probably fixable.
  10. Fail fast. If something doesn’t work, dump it or be stuck with it forever.
  11. Nuke it from orbit, it’s the only way to be sure.
  12. Productive Teams Eschew Enterprise Solutions – because they suck.
  13. Make it work, doesn’t mean fix it – it just means, MAKE IT WORK!

And a few truths about automated services

  1. If it runs, it must log status
  2. If it logs, it must be rotated
  3. Logs must be reviewed by something
  4. If it runs, it must have an SLA
  5. If it runs, it must be alerted when it misses SLA
Posted in Administration, Opinions | Leave a comment

A wonderful, ugly script that just keeps working

Today we’re going to look at parts of a complex “nudge” script, as I’ve described previously. It has a few more bells and whistles, and it constantly amazes me how well it adapts.

I’ll show the good bits in sections so we can discuss.

First some cool date math
TDY=`date +%Y-%m-%d`
if [ -n "$1" ]; then
  if date -d "$1" > /dev/null 2>&1; then
    TDY=`date -d "$1" +%Y-%m-%d`
  fi
fi
TMO=`date -d "$TDY + 1 day" +%Y-%m-%d`
TS=`date -d "$TDY" +%Y%m%d`

TDY is today’s date. Unless you passed in a valid date that you want to use. This is useful for processing batches of data based on load date, landed date, etc.
TMO is tomorrow. That’s useful for finding files that landed today. You need TMO to do that with find. We’ll see more about that in a bit.
TS is a TimeStamp for logging purposes. Since TDY might be a passed value, we need to ensure that TS is used. We can also expand this for Hour Min Secs.
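As an example of the TDY/TMO pairing, here’s how find can pick out just the files that landed today. /tmp/landing_demo is a stand-in for the real landing directory, and -newermt requires GNU find:

```shell
#!/bin/bash
# Sketch: use TDY and TMO as a half-open date window to find files that
# "landed" today. /tmp/landing_demo stands in for the real landing directory.
TDY=$(date +%Y-%m-%d)
TMO=$(date -d "$TDY + 1 day" +%Y-%m-%d)

rm -rf /tmp/landing_demo && mkdir -p /tmp/landing_demo
touch -d "today 01:00" /tmp/landing_demo/fresh.csv      # landed today
touch -d "yesterday 01:00" /tmp/landing_demo/stale.csv  # landed yesterday

# GNU find: modified after midnight today, but not after midnight tomorrow
find /tmp/landing_demo -type f -newermt "$TDY" ! -newermt "$TMO"
```

Only fresh.csv comes back; without TMO as the upper bound, a date passed in as $1 would match everything newer than that day, not just that day.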

MY_PATH="`dirname \"$0\"`" # relative
MY_PATH="`( cd \"$MY_PATH\" && pwd )`"

This is a cool trick to always know exactly where you started, no matter who or where you are. Useful for self updates as seen below.

printf "%(%Y-%m-%d %H:%M:%S)T This log entry contains a date/time stamp.\n" "$(date +%s)" >> $LOG

cd "$MY_PATH"
git reset --hard HEAD; git pull
chmod +x $0

To ensure we end up where we started, we head back to the MY_PATH value we saved earlier.
Then we ensure that we have the latest incarnation of ourself and that we’re executable.

Finally, the last 2 lines of the code are always the same: spawn myself in the background and disown the child process, as previously described.

Posted in Uncategorized | Leave a comment

The 3 Question Test

  1. A burger and fries costs $1.10; the burger costs $1 more than the fries. How much do the fries cost?
  2. 5 servers can sort 5 TB of data in 5 minutes; how long would 100 servers take to sort 100 TB of data?
  3. A patch of mold doubles in size every day. It takes 9 days to cover the sample dish; how long to cover 1/2 of the sample dish?

I’ll post a comment w/ the answers.

Posted in Uncategorized | 1 Comment

Experimenting w/ Neo4j

Graph databases are a really neat concept. We’ve started playing with Neo here as we attempt to link customers with visits and actions based on those visits. It seems like a really good fit at first glance.

Our challenge is moving from traditional RDBMS thinking to graph thinking. Lots of experimenting and model changes lie ahead.

If we decide to use Neo, I’ll post some thoughts on how we wrapped our heads around it.

Posted in Data, Development, Tuning | Leave a comment

Just give it a nudge.

The second definition of nudge, according to Webster, is to “prod lightly: urge into action.”
We use that concept in our data environments for various long running processes; for things that we want to happen frequently, but with an unknown runtime.

There are many ways to modify/customize this concept for specific uses. This basic concept is here:

#!/bin/bash
## -- for a forever-running, self re-entrant script.


# You may have configuration variables you want to include. Just create a file with the script name plus .env and this will source that file.
# . ${0}.env

# Optional, but you might want to make sure you keep running by default
if [ -e ${0}.stop ]; then rm -f ${0}.stop; fi

# do something useful
python >> log.$0 2>&1

# Just so we know things are running, even w/o log entries.
touch log.file

## and now we sleep, then relaunch or exit as needed.
sleep 300  # the nap; 300 seconds is arbitrary, tune it to your workload

# To stop the forever loop, just touch a file. (touch ${0}.stop)
if [ -e ${0}.stop ]; then echo "Stopping"; exit; fi

# Call myself again as a background child.
$0 &

# Remove myself as a parent to the just launched process. ($! is the PID for that child)
disown $!

Run a command, take a nap, repeat.

Posted in Administration, Deployment, Development | Leave a comment