Hadoop Streaming with Node JS / Python / R on Yosemite

Introduction
Configuring HADOOP_HOME
Locating Hadoop Streaming JAR
Configure a Mapper
Configure a Reducer
Download and upload Datasources
Additional configuration (Setting the number of Mappers/Reducers)
Run the Map/Reduce job
Download the results
Download all files on Github

Introduction

For the most part I followed the Writing a MapReduce Program in Python tutorial. However, there are a couple of configuration differences due to Brew, and I also wanted to test it out with Node.js. So if you followed the Installing Hadoop on Mavericks tutorial, this is how you would do Hadoop streaming.


Configuring HADOOP_HOME

Open up the terminal and check whether you have HADOOP_HOME configured.

 % echo $HADOOP_HOME

If it's empty, we need to find the installation directory of Hadoop and configure the HADOOP_HOME variable.

a. If installed using Brew, it will be under /usr/local/Cellar/hadoop/
b. Or find it from the terminal by executing:

% find /usr/local/Cellar -name "*hadoop*" -print
/usr/local/Cellar/hadoop/2.6.0...

Using your favorite text editor, add the path to ~/.profile:

% vim ~/.profile

Add the line:

export HADOOP_HOME=/usr/local/Cellar/hadoop/

Save and close the file, then execute:

% source ~/.profile
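Re-running the check from the start of this section should now print the Brew path (assuming the install location from step a):

% echo $HADOOP_HOME
/usr/local/Cellar/hadoop/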

Locating Hadoop Streaming JAR

Inside the Hadoop home directory we’ll also need to locate the Hadoop streaming JAR, i.e. hadoop-streaming-2.6.0.jar.

% find $HADOOP_HOME -name "*streaming*" -print
/usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar  

You’ll need this file to run the streaming job a couple of sections down.
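If you don’t want to type the full path out each time, one option is to capture it in a shell variable (a small convenience sketch; STREAMING_JAR is just a name I picked, the path depends on your Hadoop version, and the run command later on spells the full path out anyway):

% export STREAMING_JAR=$(find $HADOOP_HOME -name "hadoop-streaming*.jar" | head -n 1)
% echo $STREAMING_JAR
/usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar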

Configure a Mapper

#!/opt/local/bin/node
process.stdin.setEncoding('utf8');

/* http://stackoverflow.com/questions/1144783/replacing-all-occurrences-of-a-string-in-javascript */
function replaceAll(find, replace, str) {
  return str.replace(new RegExp(find, 'g'), replace);
}

// Read text from stdin and emit one "<word>\t1" line per word.
process.stdin.on('readable', function() {
  var chunk = process.stdin.read();
  if (chunk !== null) {
    // Normalize tabs and newlines to spaces before splitting into words.
    chunk = replaceAll('\t', ' ', chunk);
    chunk = replaceAll('\n', ' ', chunk);
    chunk = chunk.trim();
    var words = chunk.split(' ');
    for (var i in words) {
      console.log(words[i] + '\t' + 1);
    }
  }
});

In the above code, just replace

/opt/local/bin/node

with the path to your own Node.js binary, which can be found by running:

% which node
/opt/local/bin/node

To make it executable, run:

% chmod +x ./mapper.js

To test out the script, run:

% echo "The big brown fox ran up the stairs, the big brown bear walked down." | ./mapper.js
The 1
big 1
brown   1
fox 1
ran 1
up  1
the 1
stairs, 1
the 1
big 1
brown   1
bear    1
walked  1
down.   1

Every word should be printed with a 1 next to it.

Configure a Reducer

#!/opt/local/bin/node

process.stdin.setEncoding('utf8');
var current_word = '';
var current_count = 0;

// Expects sorted "<word>\t<count>" lines from the mapper and sums
// the counts for each run of identical words.
process.stdin.on('readable', function() {
  var chunk = process.stdin.read();
  if (chunk !== null) {
    chunk = chunk.trim();
    var lines = chunk.split('\n');
    var word = '';
    for (var i in lines) {
      var tuple = lines[i].split('\t');
      word = tuple[0];
      var count = parseInt(tuple[1], 10);

      if (current_word == word) {
        current_count += count;
      } else {
        // A new word starts: flush the total for the previous one.
        if (current_word)
          console.log(current_word + '\t' + current_count);
        current_word = word;
        current_count = count;
      }
    }
    // Flush the final word.
    if (current_word == word)
      console.log(current_word + '\t' + current_count);
  }
});

Again, to make it executable, run:

% chmod +x ./reducer.js

To test the reducer we have to sort the mapper output first and then pipe it in:

echo "The big brown fox ran up the stairs, the big brown bear walked down." | ./mapper.js | sort -k1,1 | ./reducer.js
The 1
bear    1
big 2
brown   2
down.   1
fox 1
ran 1
stairs, 1
the 2
up  1
walked  1

With a working mapper and reducer, we can now fetch some real data.

Download and upload Datasources

Project Gutenberg has a bunch of free online literature, so download a book from there, for example Historical Tours in and about Boston by American Oil Corporation. Use curl to download the book:

% curl http://www.gutenberg.org/cache/epub/48054/pg48054.txt > historical_tours.txt

or

% wget http://www.gutenberg.org/cache/epub/48054/pg48054.txt
% mv pg48054.txt historical_tours.txt
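Before involving Hadoop at all, it can be worth sanity-checking the whole pipeline locally against the real file (entirely optional, and only a rough check since the scripts read stdin in chunks):

% cat historical_tours.txt | ./mapper.js | sort -k1,1 | ./reducer.js | tail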

Now we need to upload it to HDFS:

% hdfs dfs -put ./historical_tours.txt .
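A quick hdfs dfs -ls confirms that the file landed in your HDFS home directory:

% hdfs dfs -ls .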

Additional configuration (Setting the number of Mappers/Reducers)

A great amount of information is hiding away behind a simple but very poorly documented flag, -info.

% hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -info
>
> ...
> -numReduceTasks  Optional. Number of reduce tasks.
> ....
> To speed up the last maps:
> -Dmapreduce.map.speculative=true
> To speed up the last reduces:
> -Dmapreduce.reduce.speculative=true
> To name the job (appears in the JobTracker Web UI):
> -Dmapreduce.job.name='My Job'

Some additional options that are extremely useful yet missing from that output are:

> -Dmapreduce.job.maps=10 
> -Dmapreduce.job.reduces=10  
> -Dmapreduce.map.java.opts=-Xmx12000M 
> -Dmapreduce.reduce.java.opts=-Xmx12000M 

Run the Map/Reduce job

Now, with everything ready, we can combine the three pieces into a single command:

% hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapreduce.job.maps=10 \
-Dmapreduce.job.reduces=10 \
-files ./mapper.js,./reducer.js \
-mapper ./mapper.js  \
-reducer ./reducer.js \
-input ./historical_tours.txt -output ./historical-out

Successful job
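Once the job finishes, you can list the output directory straight from the terminal before opening the web UI; with -Dmapreduce.job.reduces=10 there should typically be one part-* file per reducer, plus a _SUCCESS marker:

% hdfs dfs -ls ./historical-out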

Download the results

Head over to the HDFS manager at http://127.0.0.1:50070/explorer.html and navigate to the /user/<your username>/historical-out/ directory to find the part-00000 file.
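If you’d rather stay in the terminal, the same file can be pulled down or printed with hdfs dfs (adjust the paths if your output directory is named differently):

% hdfs dfs -get ./historical-out/part-00000 .
% hdfs dfs -cat ./historical-out/part-00000 | head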

The output should look something like this:

1   NaN
"America"   1
"Boston 1
"Brimstone  1
"Bulfinch"  1
"Captain    1
"Common 1
"Constitution"  1
"Constitution," 1
"Defects,"  1
"Do 1
"Duxbury-Marshfield."   1
"Five   1
"Fort   1
"Harvard    1
"Here   1
"I  1

5 thoughts on “Hadoop Streaming with Node JS / Python / R on Yosemite”

  1. I installed Hadoop on Yosemite. No problem. Then I am trying to use Hadoop Streaming with Python and I keep getting a “Permission Denied” error like “WARN mapred.LocalJobRunner: job_local1582841369_0001 org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode=”/”:robinet:supergroup:drwxr-xr-x”. I have tried everything, but no luck….

  2. Hi, really great tutorial. Thank you so much for putting the effort into making this.

    One possible typo:

    The first section appears to be dealing with the HADOOP_HOME variable. However, in the initial path you reference the “HADOOP_PATH” variable, which is then not mentioned for the rest of the article. I’m assuming you meant HADOOP_HOME there as well?
