…or Using Apache Drill to join data across discreet domains.
We’ve been doing some work with Redshift lately. While it’s an effective tool for storing and crunching thru large amounts of structured data, it’s limited by a few “-ism’s” that keep it from being more useful.
The first is just annoying: It’s an identity island. It doesn’t attache to anything for UAA; not LDAP, not even IAM! This is a damn shame.
The second restriction is around selecting across databases. Redshift allows you to create multiple databases in a single cluster. And multiple schemas within each database. The good news is that you can cross the Schema boundaries to join tables, etc. The bad news is that you can’t select across multiple databases in the same cluster. :-/ In theory, this is good for data separation, etc. but in practice is means I must load multiple copies of my Enterprise Lookup Tables. I can’t have just 1 copy of my Master Custom ID to Address table, I have to have one in each schema.
Enter Apache Drill. Drill allows me to configure multiple connections and connection types for use in Drill Queries. So I can configure a psql connection via jdbc for two clusters, and query them as one like this:
select s.units_sold, l.customer_name from redshift_A.billing.sales s join redshift_b.lookups.customers l on s.cust_id = b.cust_id
In this case my 2 distinct Redshift clusters have Drill Storage configurations* named redshift_A and redshift_b. These definitions are tied to a specific Redshift Database on each cluster, so including that might be a better naming standard. In the Redshift_A cluster there is a schema named “billing” and a table named sales. So the table definition in our SQL Select statements is Storage_Name.Schema_name.table. Again is a local alias for a specific Redshift Cluster and Database combination.
*Here is a sample configuration:
As you can see, we are using the Postgresql JDBC Driver w/ SSL to connect to “my-rs-cluster” and the specific database “mydb” in that cluster.
Pretty cool stuff.
But wait! There’s more! Drill isn’t limited to Redshift or even JDBC. It can work directly with S3 and/or local files various format types: parquet, csv, tsv, etc., Hive, HBase, Mongo and others.
I’ve just begun to explore the abilities and qwerks here, but I’m liking the start.