My name is DataG and I’m a data modeler. It’s been 6 weeks since my last star-schema.
Let’s face it. Codd, Imhoff, Inmon, and Kimball paved the way for almost every data analyst and app-dev professional since the relational model worked its way into corporate data centers. We cried, learned, and laughed as crazy ideas like data warehousing and dimensional modeling became part of our lexicon. My kids are well fed and have the latest shoes thanks to these data architects (actually, I owe more to the foolish and lazy who didn’t employ quality in their database design; thx for the consulting $s!).
As a 14-year veteran of RDBMS performance tuning, I was a staunch defender of the relational model. A few years ago, I found myself running into hurdles as the 3 V’s (Volume, Velocity, Variety) started to shake my belief system. At first the problem would manifest itself in small ways, and we found ways to overcome it with partitioning, faster hardware, and the like. I would start to question my skills as ever-growing informational pressure caused designs to fail. Based on historical results, the problems I faced had to be the direct result of a poor data model. It took a lot of honest questioning and the death of many sacred cows to accept a new way of thinking. Any data architect worth their salt should set aside the past and look forward.
But we can’t forget ALL of the lessons learned. While the way we store and manipulate data may change, the ability to control and regulate it in a multitenant datastore will be just as important. These concepts must persist if Hadoop is to be accepted the way the relational databases of the past were. Therefore I propose the following qualities that must continue in our brave new world.
Organization
Organization is highly subjective and use-case specific. In Hadoop, we can use directory structures to organize data by business unit, by stage in its lifecycle (new versus old, hot versus cold, raw versus derived), or by other concerns. I’ll be honest, it took me a little while to absorb this because somehow it seems completely natural to experienced Hadoop developers (aka app-dev people), but it leaves a big hole in the heart of a DBA. A good DBA hearts metadata – that’s how they carve up access and do stuff. An old DBA hearts metadata about metadata – and perhaps a few backups of the metadata.
Now all I have is HDFS, which is really just a filesystem with basic Unix-style security. So directories ARE my metadata. Hmmm. I see a metastore in our future that will allow us to map a directory structure to more metadata. DBA_TABLES, anyone?
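For example, a first pass at carving things up with nothing but the filesystem might look like the sketch below. This is purely illustrative (the business-unit and lifecycle names are made up, and it isn’t the layout I propose later); the point is that directories plus Unix-style ownership end up doing the job the catalog used to do.

hadoop fs -mkdir -p /data/marketing/clickstream/raw /data/marketing/clickstream/derived
hadoop fs -chown -R etl:marketing /data/marketing   # the ETL identity writes, the marketing group reads
hadoop fs -chmod -R 750 /data/marketing             # nobody else sees it
# ...and the closest thing to a data dictionary is the listing itself
hadoop fs -ls -R /data
hadoop fs -du -s /data/marketing/clickstream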
Space Quotas
Storage control, especially in multitenant environments, will be a huge concern. Just like organization, having data bucketed by directory is also what gives you control over quotas (note to self: write another post about the small-filesize quota snafu).
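Here’s a minimal sketch of what that control looks like with stock HDFS tooling (the numbers are placeholders, and the path assumes the clickstream example further down):

# Cap the raw bytes (replicas included) the dataset may consume
hdfs dfsadmin -setSpaceQuota 10t /data/clickstream
# Cap the number of files and directories (the name quota) – this is where
# the small-filesize snafu bites, since the quota counts objects, not bytes
hdfs dfsadmin -setQuota 1000000 /data/clickstream
# Inspect usage against both quotas
hadoop fs -count -q /data/clickstream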
Partitioning
Finally, tools like Hive understand partition pruning during query execution. Each partition is simply a directory with a special naming convention that indicates the range of the table to which the contained data belongs. Tools other than Hive can do similar partition pruning simply by including only the directories that are known to contain data of interest.
Partitioning will also facilitate data removal. Although the mantra of Hadoop is to store everything forever because there might be value in it later, there is the occasional contractual or legal requirement to purge data.
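Using the day-partitioned clickstream layout from the straw dog below, both halves of that argument reduce to plain directory operations (a sketch, not gospel):

# Pruning outside of Hive: only hand a job the directories of interest
hadoop fs -ls '/data/clickstream/date=201201*'
# Retention or legal purge: dropping a day of data is dropping a directory
# (if Hive knows about the directory, drop the partition too – see the example below)
hadoop fs -rm -r /data/clickstream/date=20120101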
OK. So what does this all wind up looking like? Here is a straw dog (just for you, Jeff B):
/data : Contains raw data sets ingested from other systems. Read only to users.
/user/<username> : Home directories / scratch pads for users.
/Dbay : Contains ETL process queue directories.
/tmp : Sticky-bit set scratch for tools and users (no guarantee on longevity).
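One way to bootstrap that skeleton (run as the HDFS superuser; the etl and alice identities are placeholders):

hadoop fs -mkdir -p /data /user /Dbay /tmp
hadoop fs -chown etl:etl /data            # only the ingest identity writes here
hadoop fs -chmod 755 /data                # read-only to everyone else
hadoop fs -chmod 1777 /tmp                # world-writable scratch with the sticky bit set
hadoop fs -mkdir -p /user/alice           # one home/scratch directory per user
hadoop fs -chown alice:alice /user/alice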
/data and /Dbay are the interesting ones.
/data/<dataset name>/<optional partitions>
Where <dataset name> is the equivalent of a table name in an RDBMS. Data sets may be partitioned by N columns, but that’s optional and use case dependent.
Ex: Clickstream data partitioned by day.
/data/clickstream/date=20120101/{x.dat,y.dat,z.dat}
/data/clickstream/date=20120102/{x.dat,y.dat,z.dat}
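And here’s a minimal sketch of layering Hive’s metastore over exactly those directories – the metastore I was pining for earlier. The table and column definitions are made up for illustration, and the explicit LOCATIONs let the directory names stay as shown above:

hive -e "
  CREATE EXTERNAL TABLE clickstream (user_id STRING, url STRING, ts BIGINT)
  PARTITIONED BY (dt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/clickstream';

  ALTER TABLE clickstream ADD PARTITION (dt='20120101')
    LOCATION '/data/clickstream/date=20120101';
  ALTER TABLE clickstream ADD PARTITION (dt='20120102')
    LOCATION '/data/clickstream/date=20120102';

  -- only the 20120102 directory is scanned: partition pruning in action
  SELECT COUNT(*) FROM clickstream WHERE dt = '20120102';
"

Because the table is EXTERNAL, dropping it later leaves the files alone; dropping a partition is the metastore-side half of the directory delete shown earlier.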
/Dbay/<group>/<application>/<process>/{incoming,working,complete,failed}
Where <group> is the line of business / group (research, search quality, fraud analysis), <application> is the name of the application the process supports, and <process> is for applications that have multiple processing stages. Each process “queue” has four state directories (a shell sketch of the full lifecycle follows the list):
incoming: newly arriving files drop off here. A process atomically renames them into a temp directory under working to indicate they’re in progress (and so overlapping processes don’t steal them).
working: Contains a timestamped directory for each attempt at processing the files. Files in these directories older than X require human intervention. Monitor for this.
complete: After a Dbay process finishes processing a file in working, this is where it lands.
failed: If a Dbay process decides to permanently reject a file (and ask for a human to look at it), it moves it here. If the directory contains > 0 files, it requires human intervention.
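Here’s a rough sketch of one file’s trip through those states, assuming a made-up fraud-analysis application. HDFS renames are atomic, which is exactly what keeps overlapping pollers from stealing each other’s work:

Q=/Dbay/fraud/txn-scoring/ingest
# 1. Upstream drops a file into incoming
hadoop fs -put txns_20120101.dat $Q/incoming/
# 2. A worker claims it by renaming it into a timestamped attempt directory
TS=$(date +%Y%m%d%H%M%S)
hadoop fs -mkdir -p $Q/working/$TS
hadoop fs -mv $Q/incoming/txns_20120101.dat $Q/working/$TS/
# 3a. Success: park it in complete...
hadoop fs -mv $Q/working/$TS/txns_20120101.dat $Q/complete/
# 3b. ...or, on permanent rejection, park it in failed and flag a human
# hadoop fs -mv $Q/working/$TS/txns_20120101.dat $Q/failed/
# Monitoring hook: a non-zero file count in failed means human intervention
hadoop fs -count $Q/failed     # columns: dirs, files, bytes, path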
This is a complex issue that often gets overlooked, and it creates a bit of angst for a recovering DBA who is used to a mature datastore. It’s effectively the shared-filesystem version of data modeling.