Spark with Python Notebook on Mac

First things first…

To use Spark we need the Hadoop ecosystem (YARN and HDFS) configured. This can be done by following my previous tutorial, Installing Hadoop on Yosemite.

From the tutorial, run hstart, or just make sure Hadoop/YARN is already running.

$ hstart

Install Homebrew

Found here: or simply paste this into the terminal

ruby -e "$(curl -fsSL"

To install Spark

brew install apache-spark

This will install Spark to the directory /usr/local/Cellar/apache-spark/1.4.0/ (the version number in the path will match whatever version Homebrew installs).

Create a Python HDFS directory and dataset

This is the directory we will use for input and output.

$ hdfs dfs -mkdir /Python

Download a book for the word count

$ wget
$ mv 30760-0.txt book.txt
$ hdfs dfs -put book.txt /Python/
$ hdfs dfs -ls /Python/

Install Anaconda Python

We’ll install the Anaconda distribution because it also contains IPython and other tools that make working with Python easy and enjoyable.

Download and install Anaconda Python 3.4 from

Running the IPython notebook

In the terminal, execute:

$ IPYTHON_OPTS="notebook" pyspark

This starts the IPython kernel with Spark support (the SparkContext is available as sc) and automatically opens a new browser window with the Python Notebook. In the top right corner, click New Notebook.

# Each element of this RDD is a line of the book, not an individual word
words = sc.textFile("hdfs://localhost:9000/Python/book.txt")

# Show the first five lines that begin with a space
words.filter(lambda w: w.startswith(" ")).take(5)
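Since sc.textFile yields an RDD of lines, this filter selects lines that begin with a space. The same predicate in plain Python, applied to a few invented sample lines:

```python
# Invented sample lines standing in for lines of book.txt
lines = ["The Project Gutenberg EBook", "  indented line", "plain line", " another"]

# Same predicate as the RDD filter above: keep lines that start with a space
selected = [l for l in lines if l.startswith(" ")]
print(selected)  # ['  indented line', ' another']
```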

# Split each line into words, pair each word with 1, then sum the counts per word
counts = words.flatMap(lambda line: line.split(" ")) \
 .map(lambda word: (word, 1)) \
 .reduceByKey(lambda a, b: a + b)
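To make the pipeline concrete, here is the same word-count logic expressed in plain Python on a couple of invented sample lines; Spark simply distributes these same steps across the cluster:

```python
from collections import defaultdict

# Invented sample lines standing in for lines of book.txt
lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into words, flattening into one list
words_list = [w for line in lines for w in line.split(" ")]

# map + reduceByKey: pair each word with a 1, then sum the 1s per word
counts_dict = defaultdict(int)
for w in words_list:
    counts_dict[w] += 1

print(counts_dict["to"])  # 2
print(counts_dict["be"])  # 2
```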



Python Notebook

You can view the Python Notebook here…


Additional links



7 thoughts on “Spark with Python Notebook on Mac”

  1. Hi Marek,
    I was trying to create a new directory Python:

    hdfs dfs -mkdir /Python

    Then I got this error:

    15/08/01 13:51:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    mkdir: Call From MacBook-Pro-2.local/ to localhost:9000 failed on connection exception: Connection refused; For more details see:

    Can you help me out? Thanks.

    1. Hello,
      I’d assume that not all of your Hadoop services are running, probably the NameNode itself. Try running the command “jps”; you should see output like this:
      “1048 NodeManager
      729 NameNode
      1309 Jps
      797 DataNode
      879 SecondaryNameNode
      975 ResourceManager”

      If not, check out my Installing Hadoop tutorial, specifically “hstart”, to start all the necessary services.


  2. Hi Marek,
    I have installed Hadoop using your guide, and now I am trying to get Spark running. However, when I run the command:
    hdfs dfs -mkdir /Python

    Where is the directory created? I can’t find it even though the terminal says it was created. It confused me a little.
    Maybe you can help me out?



    1. Hello! Yes, the directory was created on the HDFS “filesystem”, which you can browse using hdfs dfs -ls. There’s also a data directory set up in the Hadoop configuration files that you can access directly if you need to.
