hello woRld!

R is the latest Hadoop darling. It is an open source language that “is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners are showing R’s popularity has increased substantially in recent years.” See the Wikipedia article for more details: http://en.wikipedia.org/wiki/R_(programming_language)

The good news is that R is being developed by PhD-level statisticians and academics across the globe. It contains many statistical functions, models and analytic constructs.

The bad news is that it’s being developed by PhDs and academics! Great for developing analytics, not great for developing stable solutions. :)

Enter the Hadoop sysadmin! He can write bash scripts, he can redirect output, and he can weaponize R! Man up, Hadoop admin: you’re going to have to understand R syntax. It’s not terribly difficult, but it is slightly weird. R can read environment variables and run from the command line, and (given its collegiate beginnings) this counts as a 10x win.
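
To make the environment variable point concrete, here is a minimal sketch. It writes a tiny R script that reads one setting with Sys.getenv() and then runs it from the command line if Rscript is available. The variable name INPUT_DIR, the /tmp fallback and the /data/incoming path are made up for illustration; substitute whatever your cluster actually uses.

```shell
#!/usr/bin/env bash
# Sketch only: INPUT_DIR, /tmp and /data/incoming are hypothetical names.

# Write a small R script that takes its settings from the environment.
write_hello_script() {
    local target="$1"
    cat > "$target" <<'EOF'
# Sys.getenv() returns the value of an environment variable;
# "unset" supplies a fallback when the variable is not defined.
input_dir <- Sys.getenv("INPUT_DIR", unset = "/tmp")
cat("hello woRld! Reading from", input_dir, "\n")
EOF
}

write_hello_script hellowoRld.R

# Run it from the command line, but only if R is installed on this node.
if command -v Rscript >/dev/null 2>&1; then
    INPUT_DIR=/data/incoming Rscript hellowoRld.R
fi
```

Setting the variable on the same line as the command keeps it scoped to that one run, which is handy when the same script has to serve several jobs.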

Google is your friend and has most of your answers. I can say that R CMD BATCH hellowoRld.R MyProg.log is a good start: it runs hellowoRld.R non-interactively and writes its output to MyProg.log. A nice bash script to handle errors, input/output dirs, etc. can have you running R in normal batch mode pretty quickly. You’ll probably have a more difficult time explaining proper development techniques to the Data Scientist who just wants to run things in RStudio. ;)
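
As a rough starting point for that bash script, something like the following could work. It is a sketch under assumptions, not a finished tool: the function name run_r_batch, the R_BIN override and the INPUT_DIR/OUTPUT_DIR variable names are all invented here, and the return codes are arbitrary choices.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around R CMD BATCH; every name below is illustrative.

run_r_batch() {
    local script="$1" input_dir="$2" output_dir="$3" log_dir="$4"
    local r_bin="${R_BIN:-R}"   # override to point at a specific R install
    local log_file

    # Fail early with a readable message instead of a cryptic R error.
    if [ ! -f "$script" ]; then
        echo "ERROR: R script not found: $script" >&2
        return 2
    fi
    if [ ! -d "$input_dir" ]; then
        echo "ERROR: input dir not found: $input_dir" >&2
        return 2
    fi

    mkdir -p "$output_dir" "$log_dir" || return 3
    log_file="$log_dir/$(basename "$script" .R).log"

    # The R script can pick these up with Sys.getenv().
    export INPUT_DIR="$input_dir" OUTPUT_DIR="$output_dir"

    # --no-save/--no-restore keep stray workspace files out of batch runs.
    if ! "$r_bin" CMD BATCH --no-save --no-restore "$script" "$log_file"; then
        echo "ERROR: $script failed, see $log_file" >&2
        return 1
    fi
    echo "OK: $script finished, log in $log_file"
}
```

A call might look like run_r_batch hellowoRld.R /data/in /data/out /var/log/r-jobs (paths made up). Because the function returns non-zero on failure, a scheduler or a calling script can react to a broken R job instead of silently moving on.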

About Grease Monkey

30+ Years of IT Geekiness, Linux Fanboy and Open Source patriot.
This entry was posted in Administration, Deployment, Development, Tuning.