Hadoop on Mac OSX Yosemite part 2

This is a continuation of Installing Hadoop on Mac, where we installed Hadoop, YARN, and HDFS and ran our first Hadoop WordCount job. In this part we will actually write our own WordCount.java program, compile it, and run it on the standalone Hadoop instance we configured.

Creating Hadoop’s Wordcount Program
– Main Class
– Mapper Class
– Reducer Class
Compiling the Hadoop Project
– using the terminal
– using Maven

Managing the filesystem HDFS
Uploading Data Files
Running a Hadoop Project

Working Github Repo configured with Maven


Hadoop and Hive: Running a Hadoop Program
UT CS378 Big Data Programming Lecture Slides

Creating Hadoop’s Wordcount Program


** The syntax below is outdated and won’t work with Hadoop 2.7.2. I’d recommend visiting the official hello world at

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html **

Main and General layout

The main class and code layout will generally be identical: a public WordCount class encapsulating the Mapper, Reducer, and Combiner classes. I wrote the Mapper and Reducer classes in separate sections of the page to make it clearer what is what, but in the end you’ll insert their code into the WordCount class. Start off by creating a file called WordCount.java.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

public class WordCount extends Configured implements Tool {

  static int printUsage() {
    System.out.println("wordcount [-m #mappers] [-r #reducers] input_file output_file");
    return -1;
  }

  public int run(String[] args) throws Exception {

    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    // Here we set the combiner!!!!
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    List<String> other_args = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-m".equals(args[i])) {
          conf.setNumMapTasks(Integer.parseInt(args[++i]));
        } else if ("-r".equals(args[i])) {
          conf.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " + args[i - 1]);
        return printUsage();
      }
    }
    // Make sure there are exactly 2 parameters left.
    if (other_args.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: " +
          other_args.size() + " instead of 2.");
      return printUsage();
    }
    FileInputFormat.setInputPaths(conf, other_args.get(0));
    FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }

  // the MapClass and Reduce classes from the next two sections go here
}
Notice the JobConf object is responsible for most of the configuration that will happen: the number of mappers and reducers, the input and output types, the job name, and much much more.

Building the Mapper class

The idea behind the mapper class is that it takes in a row of input and emits a key-value pair. The mapper is where the parsing will usually happen. That key-value pair is then caught by the reducer and acted upon.

/**
 * Counts the words in each line.
 * For each line of input, break the line into words and emit them as
 * (word, 1).
 */
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
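If you’re curious how StringTokenizer splits a line into the words the mapper emits, here’s a tiny standalone sketch (my own example, not part of the WordCount code) you can run without Hadoop:

```java
import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // StringTokenizer splits on whitespace by default, which is exactly
        // how the mapper above turns one input line into individual words.
        StringTokenizer itr = new StringTokenizer("the quick brown fox");
        while (itr.hasMoreTokens()) {
            // each token would be emitted as (word, 1) by the mapper
            System.out.println(itr.nextToken());
        }
        // prints: the, quick, brown, fox (one word per line)
    }
}
```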

Building the Reduce class

The reducer receives a key and a list of values. In the shuffle and sort phase of Hadoop, all values belonging to a particular key are put together into a list. In the reduce phase we receive that list and the key it belongs to. We usually loop through the list and perform some operation on the individual values. Whatever we finally emit from the class is what actually gets written to the part-* output files (e.g. part-00000).

/**
 * A reducer class that just emits the sum of the input values.
 */
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
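To see the whole map → shuffle → reduce flow without a cluster, here’s a plain-Java sketch of mine (not part of the project’s code) that mimics it in memory, so you can check expected counts before running the real job:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    public static Map<String, Integer> count(String[] lines) {
        // TreeMap keeps keys sorted, mimicking Hadoop's sort phase.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            // "map": break each line into words, conceptually emitting (word, 1)
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken();
                // "shuffle + reduce": values for the same key are summed
                Integer prev = counts.get(word);
                counts.put(word, prev == null ? 1 : prev + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(new String[]{"to be or not to be"}));
        // prints: {be=2, not=1, or=1, to=2}
    }
}
```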

Compiling the Hadoop Project

Compiling using the terminal

Compilation of a Hadoop Java project is pretty straightforward once you find out the magic command, which is hadoop classpath. If you created the above WordCount.java file, open up the terminal and cd into the folder, then execute the Java compiler with the proper classpath.

$ javac WordCount.java -cp $(hadoop classpath)
The hadoop classpath command provides the compiler with all the paths it needs to compile correctly, and you should see a resulting WordCount.class appear in the directory.

Compiling using Maven

When using Maven there’s a very specific directory structure required and a little configuration within the pom.xml. I’ve created a sample Github repository that has a completely working version. The fastest way to go about it is to clone the repo and just explore it. The source WordCount file is located under the src/main/java/com/qfa path.

$ git clone https://github.com/marek5050/Hadoop_Examples
$ cd Hadoop_Examples
$ mvn install

Running a Hadoop Project

Once we’ve created a Hadoop Java file and packaged it using Maven, we can test out the jar file using:

% hadoop jar ./target/bdp-1.3.jar dataSet3.txt  dataOutput1
  • bdp-1.3.jar is the name of the jar file generated by Maven.
  • dataSet3.txt is the data file we uploaded using put.
  • dataOutput1 is the folder where results will be written.

An easier way of running the project is by creating a script; let’s call it run.
Create the file in the Maven project directory with the following contents:

#!/bin/sh
hadoop jar ./target/bdp-1.3.jar \
dataSet3.txt $(date +%s)

Save and close the file.

% chmod +x ./run     # to make it executable
% ./run              # to execute

Now after we package the new jar file using Maven, we just run the Hadoop job using ./run, and the job outputs its results into a folder named something like 10012313131. That number is the number of seconds since the Unix epoch (January 1, 1970), and the great side effect is that the newest folder will always sort last and the names will always be unique. So there’s no need to track output folder names.
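For reference, `date +%s` prints seconds since the Unix epoch, and the same value is easy to compute in Java. Here’s a throwaway sketch of mine (not part of the project) that generates the same kind of unique, chronologically sorting folder name:

```java
public class EpochName {
    public static void main(String[] args) {
        // System.currentTimeMillis() is milliseconds since the Unix epoch
        // (Jan 1, 1970 UTC); dividing by 1000 gives the same number as `date +%s`.
        long epochSeconds = System.currentTimeMillis() / 1000L;
        // Each run prints a larger number, so output folders never collide
        // (as long as runs are at least one second apart) and sort in order.
        System.out.println(epochSeconds);
    }
}
```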

After the job runs we just open up the Web GUI and download the resulting file.
Download the Result file using GUI

Managing the filesystem HDFS

The old “hadoop dfs” command was deprecated, and filesystem operations are now done purely with “hdfs dfs”. Some of the basic HDFS commands are:

% hdfs dfs
> Usage: hdfs dfs
> -put    copy a local file into HDFS
> -cp     copy files from src to dest
> -cat    print a file's contents
> -ls     list the files in a directory
> -mkdir  create a directory
> -mv     move (rename) a file
> -rm     remove a file
> -rmdir  remove a directory

Uploading Data Files

To transfer data files into HDFS, use either put or copyFromLocal. If the dst parameter is missing, the default will be the user’s home directory, /user/name/.

hdfs dfs -put <localsrc> <dst>
hdfs dfs -copyFromLocal <localsrc> <dst>
hdfs dfs -put book.txt

Verify the file was added using

hdfs dfs -ls

21 thoughts on “Hadoop on Mac OSX Yosemite part 2”

  1. Hi Marek,
    I followed your steps to build up the hadoop. But whenever I input the hdfs command, I got a warning: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    ls: `.’: No such file or directory

    And it seemed the command didn’t succeed. Do you know how to deal with this problem? Thanks.

    1. Hello!
      The WARN Unable to load native-hadoop library for your platform will always show up; I haven’t looked into how to get rid of it. It’ll still work without any problems.
      The HDFS issue you are having is because it’s trying to upload the file into a directory that doesn’t exist on HDFS. Usually the directory is /user/yourusername, so try creating that path on the HDFS.

  2. i am not able to get this working … please can you help…
    it gives me this error

    WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    copyFromLocal: Call From localhost/ to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

  3. Hi Marek, I did the command to put the file but when i run the project, it gives the below output:

    SilverBook:hadoop FarahNuzaily$ hadoop jar ./target/bdp-1.3.jar TRANSACTION_APR2014_1.txt dataOutput1
    Not a valid JAR: /usr/local/Cellar/hadoop/2.6.0/libexec/etc/hadoop/target/bdp-1.3.jar

    Can you help out?

    1. Farah, I apologize for that, I should have made it more clear: the bdp jar is a Hadoop jar I already created. So it’s the Hadoop job you are trying to run; it’s not going to exist on its own. I think it would be a better example if I actually supplied the code and compilation steps. I’ll add that this weekend.

  4. Hello Marek,

    Great post. Works perfectly for me. I am using Streaming with Python Mapper/Reducer. Is there a way to use Combiners with Hadoop on Yosemite?

  5. Yes, I have an example on Github

    But here is the code:
    REDUCER=./reducer.py
    MAPPER=./mapper.py

    $ hadoop jar ${HADOOP_STREAMING_JAR} \
    -Dmapreduce.job.name="$TEST-$MAPPER_COUNT-$REDUCER_COUNT-C-$6-$INPUT" \
    -Dmapreduce.job.maps=$MAPPER_COUNT \
    -Dmapreduce.job.reduces=$REDUCER_COUNT \
    -files ./$MAPPER,./$REDUCER -combiner ./$REDUCER -mapper ./$MAPPER -reducer ./$REDUCER -input ./data/googletxt/$INPUT -output ./$OUTPUT

    If your combiner is not the same file as your reducer, you’ll have to include it with the -files …so -files ./mapper.py,./combiner.py,./reducer.py and then add the -combiner ./combiner.py

  6. Hi Marek,
    I am new to this stuff. Upon running the code I have following errors found in the example (perhaps it’s not reading hadoop source files, although it starts fine)

    User:hadoopp nqamar$ javac WordCount.java -cp $(hadoop classpath)
    WordCount.java:32: error: > expected
    OutputCollector output,
    WordCount.java:32: error: ')' expected
    OutputCollector output,
    WordCount.java:53: error: ';' expected
    Reporter reporter) throws IOException {
    16 errors


    I’d really appreciate your help.
