As we look to bring private data into Hadoop, I find myself imagining thousands of separate doors, one for each data element. With regulated data, that means someone has to go around continuously checking that every door is locked, that no room holds the wrong type of data, and that nobody is weaseling their way into a room where they don’t belong. If this sounds like a huge hassle that is fraught with challenges, you are right! So how do you secure Hadoop without creating a nightmare to manage? First, you limit access to only those people who need it via an access, control, and data layer. Second, you provide a level of access control through a gateway such as Platfora or Datameer, at least from a self-service perspective. Third, you classify your data in a meaningful way as it enters. Fourth, you attempt to set the appropriate permissions on your data and assign Kerberos keys to lock down security. This last one is pretty clunky. It looks like it will get better with the coming Knox Gateway, but I still think that is a bit of a stretch.
The area I’m looking at right now is simply masking the data. The idea is to tokenize sensitive fields, field by field, as the data enters. So a social security number might move from 520-76-9933 to 123-32-1242. The token still preserves the format, it can still be unique, you avoid the performance impact of encrypting, and most people probably don’t care about the real values in individual fields. Oh, and it’s reversible on a field-by-field basis. To me this is a lot easier than trying to lock every door and making sure the wrong people aren’t seeing the wrong data. And the most important part: this keeps the platform off the list of controlled, regulated platforms. Two vendors that provide this type of solution are Dataguise and Voltage, with Voltage having been in this business for a long time.
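
To make the idea concrete, here is a minimal Python sketch of vault-style tokenization. It is a toy built on my own assumptions and is not meant to reflect how either vendor's product works: the `TokenVault` class and its methods are hypothetical names, the mapping lives in memory, and a real system would have to persist and protect that mapping.

```python
import secrets

DIGITS = "0123456789"


class TokenVault:
    """Toy in-memory vault (hypothetical): swaps digits for random digits, keeps the format."""

    def __init__(self):
        self._to_token = {}  # real value -> token
        self._to_value = {}  # token -> real value

    def tokenize(self, value: str) -> str:
        # The same input always maps to the same token, so joins and counts still work.
        if value in self._to_token:
            return self._to_token[value]
        while True:
            # Replace each digit with a random digit; dashes and other characters stay put.
            token = "".join(
                secrets.choice(DIGITS) if ch in DIGITS else ch for ch in value
            )
            # Retry on a collision so every token stays unique.
            # (A toy: this loops forever if the token space is exhausted.)
            if token not in self._to_value:
                break
        self._to_token[value] = token
        self._to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Reversal is a lookup; raises KeyError for a token the vault never issued.
        return self._to_value[token]


vault = TokenVault()
token = vault.tokenize("520-76-9933")
print(token)                    # a random value that still looks like an SSN
print(vault.detokenize(token))  # the original value
```

Because the token keeps the same shape as the original, downstream jobs that expect an SSN-formatted string should keep working without ever seeing the real value.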
At the end of the day, instead of managing a bunch of doors to the elephant pens, I’d rather just turn all of the pink (pink = regulated) elephants yellow (yellow = nothing special), and if someone needs a pink elephant, we’ll just reverse the process for that user only.
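
Reversing the process "for that user only" could look something like the sketch below. Everything here is an assumption for illustration: the `reveal` function, the `AUTHORIZED_USERS` set, and the sample token mapping are hypothetical, and a real deployment would check entitlements against whatever identity system is already in place.

```python
# Hypothetical set of users who are allowed to see real (pink) values.
AUTHORIZED_USERS = {"compliance_analyst"}


def reveal(token, user, token_to_value, authorized=AUTHORIZED_USERS):
    """Return the real value for authorized users; everyone else keeps the token."""
    if user in authorized:
        # Unknown tokens are passed through unchanged rather than raising.
        return token_to_value.get(token, token)
    return token


# Sample reverse mapping (token -> real value), using the example values from above.
mapping = {"123-32-1242": "520-76-9933"}

print(reveal("123-32-1242", "compliance_analyst", mapping))  # prints 520-76-9933
print(reveal("123-32-1242", "intern", mapping))              # prints 123-32-1242
```

Everyone works with yellow elephants by default, and the pink one only shows up for the person who is entitled to it.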