Category Archives: Hadoop

Performance Analysis for Scaling up R Computations Using Hadoop

Aka. Sparkie

Last week we finally finished the research paper we’ve been working on during the spring semester. We dug deep into R and Hadoop land to compare several different ways of integrating R and Hadoop. For this I wrote a couple of Python scripts that executed the jobs we needed to run and downloaded all of the Google N-Grams. The links are at the bottom of the page.

Abstract

The number of big data applications in the scientific domain is continuously increasing. R is a popular language among the scientific community and has extensive support for mathematical and statistical analysis. Hadoop has grown into a universally used tool for data manipulation. There are a number of ways of integrating these two tools: Hadoop streaming, RHIPE (R and Hadoop Integrated Programming Environment), and RHadoop. We tested these methods using a basic wordcount application. Wordcount might not be an all-encompassing benchmark, but it provides us with a stable base with very few opportunities for human error. We measured the performance of the three methods on small files, checked their scalability, and tested their effectiveness in handling large files. Of the three options, we found RHIPE to be generally effective, scalable, and consistent, but less customizable. RHadoop tended to be slightly more efficient than R Streaming in performance, yet failed very frequently in our scalability testing.
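For a sense of what the Hadoop streaming variant looks like in practice, here is a minimal sketch (not taken from the paper) of launching an R word count through Hadoop streaming. The mapper.R and reducer.R scripts, the input and output paths, and the $HADOOP_HOME-based jar location are all placeholder assumptions for a standard Hadoop 2.x layout:

# Hypothetical example: word count over an HDFS input directory using R scripts
# that read lines from stdin and emit tab-separated key/value pairs on stdout.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.R,reducer.R \
    -mapper "Rscript mapper.R" \
    -reducer "Rscript reducer.R" \
    -input /ngrams \
    -output /ngrams-wordcount

RHIPE and RHadoop wrap this same map/reduce contract in R functions rather than a shell command.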

Continue reading Performance Analysis for Scaling up R Computations Using Hadoop


Installing Hadoop on Mac part 1

A step by step guide to get you up and running with Hadoop today! In Hadoop on Mac part 2 we walk through creating and compiling a Java Hadoop WordCount program from beginning to end and automating the build with pom.xml files.

Install HomeBrew
Install Hadoop
Configuring Hadoop
SSH Localhost
Starting and Stopping Hadoop
Good to know

Install HomeBrew

Download it from http://brew.sh/ or simply paste the following script into the terminal:

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install Hadoop

$ brew install hadoop

Hadoop will be installed in the following directory
/usr/local/Cellar/hadoop

Configuring Hadoop

Edit hadoop-env.sh

The file can be located at /usr/local/Cellar/hadoop/2.6.0/libexec/etc/hadoop/hadoop-env.sh
where 2.6.0 is the Hadoop version.

Find the line with

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

and change it to

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Edit core-site.xml

The file can be located at /usr/local/Cellar/hadoop/2.6.0/libexec/etc/hadoop/core-site.xml .

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Edit mapred-site.xml

The file can be located at /usr/local/Cellar/hadoop/2.6.0/libexec/etc/hadoop/mapred-site.xml and by default will be blank (you may only find mapred-site.xml.template; copy it to mapred-site.xml).

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>

Edit hdfs-site.xml

The file can be located at /usr/local/Cellar/hadoop/2.6.0/libexec/etc/hadoop/hdfs-site.xml .

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

To simplify life, edit your ~/.profile using vim or your favorite editor and add the following two aliases (by default ~/.profile might not exist; just create it):

alias hstart="/usr/local/Cellar/hadoop/2.6.0/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.6.0/sbin/start-yarn.sh"
alias hstop="/usr/local/Cellar/hadoop/2.6.0/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.6.0/sbin/stop-dfs.sh"

and execute

$ source ~/.profile

in the terminal to update.

Before we can run Hadoop we first need to format the HDFS using

$ hdfs namenode -format

SSH Localhost

Nothing needs to be done here if you have already generated SSH keys. To verify, just check for the existence of the ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub files. If they don’t exist, the keys can be generated using

$ ssh-keygen -t rsa

Enable Remote Login
Open “System Preferences” -> “Sharing” and check “Remote Login”.
Authorize SSH Keys
To allow your system to accept the login, we have to make it aware of the keys that will be used:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Let’s try to log in.

$ ssh localhost
Last login: Fri Mar  6 20:30:53 2015
$ exit

Running Hadoop

Now we can run Hadoop just by typing

$ hstart

and stopping using

$ hstop

Download Examples

To run examples, Hadoop needs to be started.

Hadoop Examples 1.2.1 (Old)
Hadoop Examples 2.6.0 (Current)

Test them out using (substitute the path to the examples jar you downloaded):

$ hadoop jar <path to examples jar> pi 10 100

Good to know

We can access the Hadoop web interfaces by connecting to:

NameNode (HDFS overview): http://localhost:50070
ResourceManager: http://localhost:8088
Specific node information (NodeManager): http://localhost:8042

Useful commands:
$ jps 
7379 DataNode
7459 SecondaryNameNode
7316 NameNode
7636 NodeManager
7562 ResourceManager
7676 Jps

$ yarn  // Resource management; gives more detail than the web interface
$ mapred  // Detailed information about MapReduce jobs

The NameNode web interface also includes an HDFS file browser, which we can use to poke around the filesystem and grab any resulting output files.

Errors


To resolve ‘WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable’, see the related thread on Stackoverflow.com.

Connection Refused after installing Hadoop

$ hdfs dfs -ls
15/03/06 20:13:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: Call From spaceship.local/192.168.1.65 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:   http://wiki.apache.org/hadoop/ConnectionRefused

The start-up scripts such as start-all.sh do not provide specifics about why a startup failed; some of the time they won’t even tell you that something failed at all. To troubleshoot the service that isn’t functioning, execute it manually:

$ hdfs namenode
15/03/06 20:18:31 WARN namenode.FSNamesystem: Encountered exception loading fsimage
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
15/03/06 20:18:31 FATAL namenode.NameNode: Failed to start namenode.

and the problem is a missing storage directory, so reformat the namenode:

$ hadoop namenode -format

To verify the problem is fixed run

$ hstart
$ hdfs dfs -ls /

If ‘hdfs dfs -ls’ gives you an error like

ls: `.': No such file or directory

then we need to create the default home directory Hadoop expects (i.e. /user/$(whoami)/):

$ whoami
 spaceship
$ hdfs dfs -mkdir -p /user/spaceship 
 15/03/06 20:31:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hdfs dfs -ls
 15/03/06 20:31:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hdfs dfs -put book.txt
 15/03/06 20:32:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hdfs dfs -ls 
 15/03/06 20:32:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Found 1 items
 -rw-r--r--   1 marekbejda supergroup      29578 2015-03-06 20:32 book.txt

JPS and Nothing Works…

It seems that certain builds of Java 1.8 (e.g. 1.8.0_40) are missing an internal class that the Hadoop daemons depend on, which breaks YARN. Check your logs:

$ jps
 5935 Jps
$ vim /usr/local/Cellar/hadoop/2.6.0/libexec/logs/yarn-*
 2015-03-07 16:21:32,934 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain java.lang.NoClassDefFoundError: sun/management/ExtendedPlatformComponent
..
 2015-03-07 16:21:32,937 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
 2015-03-07 16:21:32,939 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-November/029818.html

Either downgrade to Java 1.7 or use an unaffected 1.8 build; I’m currently running 1.8.0_20:

$ java -version
 java version "1.8.0_20"
 Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
 Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

I’ve Done Everything!! SSH still asks for a password!!!! OMGG!!!!

So I ran across this problem today: all of a sudden, ssh localhost started requesting a password. I generated new keys and searched all day for an answer; thanks to this Apple support thread, the fix turned out to be home directory permissions.

$ chmod go-w ~/        
$ chmod 700 ~/.ssh       
$ chmod 600 ~/.ssh/authorized_keys       

Hadoop on Mac OSX Yosemite part 2

This is a continuation of Installing Hadoop on Mac, where we installed Hadoop, YARN, and HDFS and ran our first Hadoop WordCount job. In this part we will write our own WordCount.java program, compile it, and then run it on the standalone Hadoop instance we configured.

Creating Hadoop’s Wordcount Program
– Main Class
– Mapper Class
– Reducer Class
Compiling the Hadoop Project
– using the terminal
– using Maven

Extra:
Managing the filesystem HDFS
Uploading Data Files
Running a Hadoop Project

Working Github Repo configured with Maven

External:

Hadoop and Hive Running a Hadoop Program
UT CS378 Big Data Programming Lecture Slides

Creating Hadoop’s Wordcount Program

 

** The syntax below seems to be outdated and won’t work with Hadoop 2.7.2. I’d recommend also visiting the official WordCount tutorial at https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html **

Main and General layout

The main class and overall code layout will generally look the same from one MapReduce program to the next: a public WordCount class encapsulating the Mapper, Reducer, and Combiner classes. I wrote the Mapper and Reducer classes in separate sections of the page to make it clearer what is what, but in the end you’ll insert their code in place of the bracketed placeholders. Start off by creating a file called WordCount.java.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;


public class WordCount extends Configured implements Tool {
   private final static LongWritable ONE = new LongWritable(1L);

[INSERT MAPPER CLASS]
[INSERT REDUCER CLASS]

static int printUsage() {
System.out.println("wordcount [-m #mappers ] [-r #reducers] input_file output_file");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

public int run(String[] args) throws Exception {

    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    // Here we set the combiner!!!!
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    List<String> other_args = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-m".equals(args[i])) {
          conf.setNumMapTasks(Integer.parseInt(args[++i]));
        } else if ("-r".equals(args[i])) {
          conf.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " +
            args[i - 1]);
        return printUsage();
      }
    }
    // Make sure there are exactly 2 parameters left.
    if (other_args.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: " +
          other_args.size() + " instead of 2.");
      return printUsage();
    }
    FileInputFormat.setInputPaths(conf, other_args.get(0));
    FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

    JobClient.runJob(conf);
    return 0;
  }


public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}

Notice the JobConf object is responsible for most of the configuration that will happen: the number of mappers and reducers, the input and output types, the job name, and much more.
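The printUsage() string above documents the command-line contract: once the class is compiled and packaged into a jar (covered below), a run with explicit task counts would look roughly like this sketch, where the jar name, input file, and output folder are just placeholders:

$ hadoop jar wordcount.jar WordCount -m 4 -r 2 input.txt wordcount_output

With the old mapred API, the -m value is only a hint to the framework, while -r sets the actual number of reduce tasks.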

Building the Mapper class

The idea behind the mapper class is that it takes in one line of input at a time and emits key-value pairs; the mapper is usually where the parsing happens. Each key-value pair is then caught by the reducer and acted upon.

/**
 * Counts the words in each line.
 * For each line of input, break the line into words and emit them as
 * (word, 1).
 */
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {

    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Building the Reduce class

The reducer receives a key and a list of values. In Hadoop’s shuffle-and-sort phase, all values belonging to a particular key are grouped together into a list. In the reduce phase we receive that list along with the key it belongs to, loop through the list, and perform some operation on the individual values. Whatever we finally emit from the class is what actually gets written to one of the part-* output files.


/**
 * A reducer class that just emits the sum of the input values.
 */
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Compiling the Hadoop Project

Compiling using the terminal

Compilation of a Hadoop Java project is pretty straightforward once you find out the magic command, which is hadoop classpath. If you created the above WordCount.java file, just open up the terminal and cd into its folder, then execute the Java compiler with the proper classpath.

$ javac WordCount.java -cp $(hadoop classpath)
The hadoop classpath command provides the compiler with all the paths it needs to compile correctly, and you should see the resulting WordCount*.class files appear in the directory.
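To actually run the compiled classes on the standalone cluster they still need to be bundled into a jar. A minimal sketch, assuming book.txt has already been uploaded to HDFS as in part 1, and using wordcount.jar and wc_output as placeholder names:

$ jar cf wordcount.jar WordCount*.class     # package the main, mapper, and reducer classes
$ hadoop jar wordcount.jar WordCount book.txt wc_output
$ hdfs dfs -cat wc_output/part-* | head     # peek at the word counts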

Compiling using Maven

When using Maven there’s a very specific directory structure required and a bit of build configuration in the pom.xml. I’ve created a sample GitHub repository that has a completely working version. The fastest way to get going is to clone the repo and explore it; the WordCount source file is located under the src/main/java/com/qfa path.

$ git clone https://github.com/marek5050/Hadoop_Examples
$ cd Hadoop_Examples
$ mvn install

Running a Hadoop Project

Once we’ve finished creating the Hadoop Java file and packaging it with Maven, we can test out the jar file using

% hadoop jar ./target/bdp-1.3.jar dataSet3.txt  dataOutput1
  • bdp-1.3.jar is the name of the jar file generated by Maven
  • dataSet3.txt is the data file we uploaded earlier using put
  • dataOutput1 will be the folder where the results are written

An easier way of running the project is by creating a script; let’s call it run.
Create the file in the Maven project directory and add:

% hadoop jar ./target/bdp-1.3.jar \
dataSet3.txt $(date +%s)

Close and save the file.

% chmod +x ./run     // To make it executable.
% ./run              // To execute it.

Now, after we package a new jar file using Maven, we just run the Hadoop job with ./run; the job executes and writes its results into a folder with a name like 10012313131. That number is the Unix timestamp produced by date +%s (seconds since January 1, 1970), and the nice side effect is that the newest folder always sorts last and the names are always unique, so there’s no need to keep track of output folder names.

After the job runs we just open up the Web GUI and download the resulting file.
Download the Result file using GUI
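If you’d rather skip the GUI, the same results can be pulled down from the terminal; a quick sketch, where 1425693170 stands in for whatever timestamped folder your run produced:

% hdfs dfs -ls                              # the newest timestamped folder sorts last
% hdfs dfs -cat 1425693170/part-* | head    # peek at the results
% hdfs dfs -get 1425693170 .                # or copy the whole folder locally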

Managing the filesystem HDFS

The old “hadoop dfs” command was deprecated, and the filesystem is now managed purely with “hdfs dfs”. Some of the basic HDFS commands are:

% hdfs dfs
> Usage: hdfs dfs <command> [<args>]
> -put    copy files from the local filesystem into HDFS
> -cp     copy files from src to dest
> -cat    print the contents of a file
> -ls     list the files in a directory
> -mkdir  create a directory
> -mv     move files
> -rm     remove a file
> -rmdir  remove a directory

Uploading Data Files

To transfer data files into HDFS, use either put or copyFromLocal. If the dst parameter is missing, the default is the user’s HDFS home directory, /user/<username>/.

hdfs dfs -put <localsrc> <dst>
hdfs dfs -copyFromLocal <localsrc> <dst>
hdfs dfs -put book.txt

Verify the file was added using

hdfs dfs -ls <dst>
hdfs dfs -ls