Last week we finally finished the research paper we've been working on throughout the Spring semester. We dug deep into R and Hadoop land to compare a couple of different ways of integrating the two. Along the way I wrote a couple of Python scripts that executed the jobs we needed to run and downloaded all of the Google N-Grams. The links are at the bottom of the page.
The number of big data applications in the scientific domain is continuously increasing. R is a popular language in the scientific community, with extensive support for mathematical and statistical analysis, and Hadoop has grown into a universally used tool for data manipulation. There are several ways of integrating these two tools: Hadoop Streaming, RHIPE (R and Hadoop Integrated Programming Environment), and RHadoop. We tested these methods using a basic wordcount application. Wordcount may not be an all-encompassing benchmark, but it provides a stable baseline with very few opportunities for human error. We measured the performance of the three methods on small files, checked their scalability, and tested their effectiveness in handling large files. Of the three options, we found RHIPE to be generally effective, scalable, and consistent, but less customizable. RHadoop tended to be slightly more efficient than R Streaming, yet failed very frequently in our scalability testing.
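To make the comparison concrete, here is a minimal sketch of what the R side of the Hadoop Streaming variant looks like. The file name and the plain-space tokenization are illustrative, not the exact script from our runs:

```r
#!/usr/bin/env Rscript
# mapper.R -- wordcount mapper for Hadoop Streaming: read lines on stdin,
# emit tab-separated "word<TAB>1" pairs on stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  # fixed = TRUE splits on a literal space, avoiding regular expressions
  words <- unlist(strsplit(line, " ", fixed = TRUE))
  words <- words[nzchar(words)]
  if (length(words) > 0) {
    cat(paste(words, 1L, sep = "\t"), sep = "\n")
  }
}
close(con)
```

The reducer mirrors this, accumulating counts over the sorted key stream, and both scripts are handed to the standard hadoop-streaming jar via its `-mapper` and `-reducer` flags.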
The Sparkie git repository we used can be found on GitHub. It holds the code for the different varieties of Wordcount jobs, and I've also added the newest K-means code for RHIPE and RHadoop. The repository is a little messy because of the rush we were in during the last week, but I'll definitely clean it up over the weekend, so by the time anybody reads this it should be impeccable.
The data we have collected can be found in Resources, more precisely in Week 10 Data.xlsx. The first two spreadsheets were test runs where we optimized the code and searched for efficiencies. On Spreadsheet 1 we can see that the R times for most files were astronomical, around an hour for the smallest of jobs, but this was quickly reduced by including a reducer and removing the regular expressions used for splitting sentences. Initially our times for RHadoop were very disappointing too, but after a couple of similar changes, cleaning up regular expressions and enabling vectorization, we were able to bring it near RHIPE's performance level.
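For illustration, this is roughly what the vectorized RHadoop wordcount looks like after that clean-up. A minimal sketch assuming the rmr2 backend is configured; the input and output paths are placeholders:

```r
library(rmr2)

# With input.format = "text", rmr2 hands the map function a whole chunk of
# lines as one character vector, so everything is split in a single
# vectorized call instead of looping line by line.
wc.map <- function(., lines) {
  words <- unlist(strsplit(lines, " ", fixed = TRUE))
  words <- words[nzchar(words)]
  keyval(words, 1L)  # the single 1 is recycled across the word vector
}

wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

mapreduce(
  input        = "/book.txt",    # placeholder HDFS paths
  output       = "/output/book",
  input.format = "text",
  map          = wc.map,
  reduce       = wc.reduce
)
```

The timings we ended up with after these changes are summarized below.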
| #mappers | #reducers | RHadoop | RHIPE | Python Streaming | R Streaming |
|----------|-----------|-------------|------------|------------|------------|
| 75  | 5  | 11min 24sec | 4min 18sec | 1min 17sec | 6min 39sec |
| 75  | 10 | 4min 47sec  | 3min 31sec | 1min 18sec | 6min 58sec |
| 75  | 15 | 5min 22sec  | 3min 1sec  | 1min 21sec | 6min 14sec |
| 75  | 20 | 5min 1sec   | 2min 37sec | 1min 27sec | 6min 34sec |
| 86  | 20 | 4min 54sec  | 2min 15sec | 1min 22sec | 7min 1sec  |
| 100 | 20 | 4min 4sec   | 2min 2sec  | 1min 10sec | 5min 39sec |
| 120 | 20 | 3min 43sec  | 1min 47sec | 1min 3sec  | 4min 22sec |
| 100 | 25 | 4min 25sec  | 2min 3sec  | 1min 11sec | 5min 10sec |
| 100 | 30 | 4min 17sec  | 2min 2sec  | 1min 11sec | 5min 12sec |
| 100 | 35 | 4min 9sec   | 2min 6sec  | 1min 15sec | 5min 24sec |
| 86  | 35 | 4min 21sec  | 2min 19sec | 1min 31sec | 6min 26sec |
| 75  | 35 | 4min 34sec  | 2min 35sec | 1min 35sec | 6min 53sec |

*The 120-mapper run used a chunk size of 80 MB.*
In the paper I tried to explain the different configurations and options available when using these packages. As an example, here is how the RHIPE wordcount job is configured and submitted:
```r
mapred = list(
  mapred.max.split.size = as.integer(1024 * 1024 * block_size),
  mapreduce.job.reduces = 10
)
rhipe.results = rhwatch(
  map     = mapper,
  reduce  = reducer,
  input   = rhfmt("hdfs:///book.txt", type = "text"),
  output  = "hdfs:///output/book",
  jobname = paste("rhipe_wordcount_", 1, sep = "-"),
  mapred  = mapred
)
```

We create a named list of parameters: `mapred.max.split.size` provides the chunk size in bytes, and `mapreduce.job.reduces` sets the number of reducers. The job execution, `rhwatch`, then receives the map and reduce functions. RHIPE won't read text documents by default, so the input is wrapped in `rhfmt` for conversion; for `output` we can either specify a path or pass NULL, and finally we supply the options through `mapred`.
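For comparison, RHadoop exposes the same Hadoop knobs through rmr2's `backend.parameters` argument. A minimal sketch, reusing the `wc.map`/`wc.reduce` functions from earlier; the exact `-D` property names depend on the Hadoop version:

```r
library(rmr2)

# Pass Hadoop configuration through the streaming backend as -D properties.
# 83886080 bytes = 80 MB, matching the chunk size used in our runs.
out <- mapreduce(
  input        = "/book.txt",
  output       = "/output/book",
  input.format = "text",
  map          = wc.map,
  reduce       = wc.reduce,
  backend.parameters = list(hadoop = list(
    D = "mapred.max.split.size=83886080",
    D = "mapreduce.job.reduces=10"
  ))
)
```

This is where the two packages differ most visibly: RHIPE forwards the whole `mapred` list to the job, while rmr2 expects the backend-specific list above.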
Please feel free to comment on the paper and point out any mistakes you come across.