Performance Analysis for Scaling up R Computations Using Hadoop

a.k.a. Sparkie

Last week we finally finished the research paper we’ve been working on during the Spring semester. We dug deep into R and Hadoop land to compare a couple of different ways of integrating R and Hadoop. For this I wrote a couple of Python scripts that executed the jobs we needed to run and downloaded all of the Google N-Grams. The links are at the bottom of the page.

Abstract

The number of big data applications in the scientific domain is continuously increasing. R is a popular language among the scientific community and has extensive support for mathematical and statistical analysis. Hadoop has grown into a universally used tool for data manipulation. There are a number of ways of integrating these two tools: Hadoop streaming, RHIPE (R and Hadoop Integrated Programming Environment), and RHadoop. We tested these methods using a basic wordcount application. Wordcount might not be an all-encompassing benchmark, but it provides us with a stable base with very few opportunities for human error. We measured the performance of the three methods on small files, checked their scalability, and tested their effectiveness in handling large files. Of the three options, we found RHIPE to be generally effective, scalable, and consistent, but less customizable. RHadoop tended to perform slightly better than R streaming, yet failed very frequently in our scalability testing.

Code

The Sparkie git repository we used can be found on GitHub. We have code for the different varieties of Wordcount jobs, and I’ve also added the newest K-means code for RHIPE and RHadoop. The repository is a little messy because of the rush we were in during the last week, but I’ll definitely clean it up over the weekend. So by the time anybody reads this it should be impeccable.

Data

The data we collected can be found under Resources, more precisely in Week 10 Data.xlsx. The first two spreadsheets were test runs where we optimized the code and searched for efficiencies. On Spreadsheet 1 we can see the R times for most files were astronomical, around 1 hour for even the smallest of jobs, but this was quickly reduced by including a reducer and removing regular expressions for splitting sentences. Initially our times for RHadoop were very disappointing too, but after a couple of similar changes, cleaning up regular expressions and enabling vectorization, we were able to bring it close to RHIPE’s performance level.
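To give a concrete idea of what those changes look like, below is a rough sketch of an R streaming mapper and reducer. It is a minimal illustration rather than the exact scripts from the repository: it reads stdin in chunks, splits on a plain space with fixed = TRUE instead of a regular expression, and writes tab-separated word/count pairs for Hadoop Streaming to shuffle to the reducer.

#!/usr/bin/env Rscript
# mapper.R -- read lines from stdin, emit "word<TAB>1" for each word
con <- file("stdin", open = "r")
while (length(lines <- readLines(con, n = 1000, warn = FALSE)) > 0) {
  words <- unlist(strsplit(lines, " ", fixed = TRUE))   # fixed-string split, no regex
  words <- words[nchar(words) > 0]
  if (length(words) > 0) cat(paste(words, 1L, sep = "\t"), sep = "\n")
}
close(con)

#!/usr/bin/env Rscript
# reducer.R -- stdin arrives sorted by word as "word<TAB>count"; sum the counts per word
con <- file("stdin", open = "r")
current <- NULL; total <- 0L
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
  if (!is.null(current) && parts[1] != current) {
    cat(current, total, sep = "\t"); cat("\n")          # emit the finished word
    total <- 0L
  }
  current <- parts[1]
  total <- total + as.integer(parts[2])
}
if (!is.null(current)) { cat(current, total, sep = "\t"); cat("\n") }
close(con)

Both scripts get passed to the hadoop-streaming jar via -mapper and -reducer (and shipped with -file), which is essentially all the “integration” that R streaming involves.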

Small tests: 10GB googlebooks-eng-all-5gram-20120701-un

#mappers  #reducers  RHadoop       RHIPE        Python Streaming  R Streaming  Notes
112       10         3min 12sec    -            -                 -
120       5          4min 45sec    -            -                 -
75        5          11min 24sec   4min 18sec   1min 17sec        6min 39sec
75        10         4min 47sec    3min 31sec   1min 18sec        6min 58sec
75        15         5min 22sec    3min 1sec    1min 21sec        6min 14sec
75        20         5min 1sec     2min 37sec   1min 27sec        6min 34sec
86        20         4min 54sec    2min 15sec   1min 22sec        7min 1sec
100       20         4min 4sec     2min 2sec    1min 10sec        5min 39sec
120       20         3min 43sec    1min 47sec   1min 3sec         4min 22sec   Chunk size = 80MB
100       25         4min 25sec    2min 3sec    1min 11sec        5min 10sec
100       30         4min 17sec    2min 2sec    1min 11sec        5min 12sec
100       35         4min 9sec     2min 6sec    1min 15sec        5min 24sec
86        35         4min 21sec    2min 19sec   1min 31sec        6min 26sec
75        35         4min 34sec    2min 35sec   1min 35sec        6min 53sec

Explanations

Below I try to explain the different configuration options available when using these packages, starting with RHIPE.

library(Rhipe)
rhinit()                       # initialize RHIPE before submitting jobs
block_size <- 80               # chunk size in MB (e.g. 80 for the 80MB chunks in our runs)

# Hadoop job options: input split size in bytes and the number of reducers
mapred <- list(
    mapred.max.split.size = as.integer(1024 * 1024 * block_size),
    mapreduce.job.reduces = 10
)

# Submit the wordcount job; mapper and reducer are the map/reduce functions defined elsewhere
rhipe.results <- rhwatch(
    map = mapper, reduce = reducer,
    input = rhfmt("hdfs:///book.txt", type = "text"),   # rhfmt converts the plain text input
    output = "hdfs:///output/book",                      # HDFS output path, or NULL
    jobname = paste("rhipe_wordcount", 1, sep = "_"),
    mapred = mapred
)

We create a named list of job parameters: mapred.max.split.size provides the chunk size in bytes, and mapreduce.job.reduces sets the number of reducers. The rhwatch() call then receives the map and reduce functions. RHIPE won’t read text documents by default, so the input has to be wrapped with rhfmt() for conversion; for the output we can either specify an HDFS path or NULL, and finally we supply the options through the mapred argument.
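For comparison, the same wordcount job in RHadoop goes through the mapreduce() function of the rmr2 package. The sketch below is a minimal illustration rather than our exact code (the input path and the space-only tokenization are assumptions); the point to notice is that the map function receives a whole vector of input lines per call, so the splitting and counting are naturally vectorized, which is the change that brought RHadoop close to RHIPE in our runs.

library(rmr2)

# Vectorized mapper: `lines` is a character vector holding a whole chunk of input at once
wc.map <- function(., lines) {
  words <- unlist(strsplit(lines, " ", fixed = TRUE))   # fixed-string split, no regex
  words <- words[nchar(words) > 0]
  keyval(words, 1L)
}

# Reducer: sum the counts collected for each word
wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

out <- mapreduce(
    input        = "/book.txt",          # assumed HDFS input path
    input.format = "text",
    map          = wc.map,
    reduce       = wc.reduce,
    combine      = TRUE                  # reuse the reducer as a combiner to shrink the shuffle
)

counts <- from.dfs(out)                  # pull the (word, count) pairs back into R

Hadoop-level options such as the number of reducers can be passed through the backend.parameters argument of mapreduce(), in the same spirit as the mapred list above.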

Please feel free to comment about the paper and point out any mistakes you come across.

https://github.com/marek5050/Sparkie/blob/master/Paper/TACC_PROJECT.pdf
