An adventure is only an inconvenience rightly considered. An inconvenience is an adventure wrongly considered.
Sometimes our ego gets the best of us. This seems to happen more often in Hadoop than anywhere else I’ve worked. I’m not sure if this relatively new world propels us into thinking we’re on an island, or if Java developers are inherently poor data analysts. At any rate, we need to rein in our bloated self-image and realize that someone else has likely encountered our issue, and a seasoned committer has carried it through the stack to resolution. Let me give you an example:
Sqooping data with newlines
I wish I had caught this issue earlier. Some of our developers were pulling data from Teradata and DB2 and encountered embedded newline and Ctrl-A characters in a few columns. Claiming the ‘bad’ data broke their process, they overreacted and jumped to Avro files to resolve the problem. While Avro is well and good for some issues, this was major overkill that ended up causing issues within Datameer and added complexity in HCatalog. I took some time to ‘research’ (a.k.a. google-fu) what others had done to get around this. I already had a few simple ideas, like a regex in the source SQL to remove \n, \r, and \01, but I was really looking for a more elegant solution.
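To see why those embedded characters break a load in the first place, here is a minimal sketch (the sample rows are mine, not real data): Hive’s default text storage uses \n as the row delimiter and \01 (Ctrl-A) as the field delimiter, so a field value containing either character silently corrupts the row structure.

```python
# Hive's default text layout: rows split on \n, fields on \x01 (Ctrl-A).
rows = [("1", "clean value"), ("2", "value with\nembedded newline")]

# Serialize the way a plain-text Hive table would store it.
serialized = "\n".join("\x01".join(fields) for fields in rows)

# Reading it back, the embedded newline splits one row into two.
parsed = [line.split("\x01") for line in serialized.split("\n")]
print(len(rows), len(parsed))  # 2 rows in, 3 "rows" out
```

Two input rows come back as three malformed ones, which is exactly the kind of breakage our developers hit.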
It took me 30 minutes or so to work up an example, create a failure, and RTFM for a resolution. I was hitting walls everywhere, much like our developers; the Sqoop documentation isn’t bad, but there are some holes. A little more searching and I found Cloudera SQOOP-129, “Newlines in RDBMS fields break hive.” Created 11/2010 and resolved 5/2011. It turns out it was fixed in Sqoop 1.3.0, and we are on 1.4.2 – looking good so far. The fix added these arguments, which handle elimination or replacement of those characters during the load:
|`--hive-drop-import-delims`|Drops \n, \r, and \01 from string fields when importing to Hive.|
|`--hive-delims-replacement`|Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.|
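For clarity, here is a rough Python sketch of the transformation these flags apply to string fields (the function names and `HIVE_DELIMS` pattern are mine, not Sqoop’s internals):

```python
import re

# The three characters Sqoop sanitizes: newline, carriage return, Ctrl-A.
HIVE_DELIMS = re.compile(r"[\n\r\x01]")

def drop_import_delims(value: str) -> str:
    """Mimics --hive-drop-import-delims: remove the offending characters."""
    return HIVE_DELIMS.sub("", value)

def delims_replacement(value: str, replacement: str) -> str:
    """Mimics --hive-delims-replacement: substitute a user-defined string."""
    return HIVE_DELIMS.sub(replacement, value)

print(drop_import_delims("line1\nline2\r\x01"))  # line1line2
print(delims_replacement("line1\nline2", " "))   # line1 line2
```

Because the substitution applies to every string field, you don’t have to know in advance which columns carry the bad characters.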
It turns out they fixed our problem from a Hive standpoint, but it’s actually valid for Pig, etc. It’s much more elegant than a source-SQL/regex solution because I don’t need to specify fields – everything is covered. Now in our case the business users didn’t even care about the newlines, which were present in 3 of 2 million rows (ugh!), so I just used --hive-drop-import-delims in the sqoop command and everything was fine.
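For reference, the fix amounts to one extra flag on the import. The JDBC URL, credentials, and table name below are hypothetical placeholders, not our actual job:

```shell
# Hypothetical connection details -- substitute your own JDBC URL and table.
sqoop import \
  --connect jdbc:teradata://tdhost/DATABASE=mydb \
  --username etl_user -P \
  --table CUSTOMER_NOTES \
  --hive-import \
  --hive-drop-import-delims   # strip \n, \r, \01 from string columns
```

Swap the last flag for `--hive-delims-replacement ' '` if you need to preserve word boundaries instead of dropping the characters outright.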
So by adding a single line to a Sqoop step, I eliminated the need to maintain an additional serialization framework, and the downstream processes will likely be easier to maintain. When dealing with basic business data we need to realize it isn’t rocket science – someone else has probably already figured it out.