Wednesday, May 13, 2015

pyspark + ipython = tab-completing python shell for spark

Update (2015-06-21): The new version of Spark (1.4.0) resolved the environment-variable naming issue, so the variables now have the same names on Windows and Linux. Big thank-you to Aviad "Jo" Cohen for pointing this out to me!

<tldr>
How to run pyspark via IPython/IPython Notebook in Windows/Mac/Linux, and how to turn off its log messages.
</tldr>

Big data is a lie, and we all know it. To make it more believable, a shift was made from disk-based Map-Reduce (a.k.a. Map-Reduce) to RAM-based Map-Reduce (a.k.a. Spark). The immediate benefit is that Pig, which runs on disk-based Map-Reduce, now has a fork called Spork (Pig over Spark). Just say it: spork!!! The joy is boundless.

Spark is fun if you like writing in Scala. (I am of course ignoring Java completely, just like everyone else should.) If, like me, you are trapped in the belly of Python with no intention of ever leaving these cozy intestines, you'll be working with the Python port of Spark, dubbed PySpark. I call it a 'port' rather than a version or a driver since some parts of the Scala/Java versions of Spark are not available in Python (for instance, the graph toolkit, GraphX). However, it's still useful for most tasks.

If you download a pre-built binary version of Spark (from here), versions > 1.3 support integration with IPython. All it takes is having ipython on your PATH and setting a couple of environment variables. Here's how:

0) Make sure Python is installed: from the Windows cmd or the Linux shell, type:
$ python
(you should get a Python shell; exit() it.)
$ pip
(you should see pip's usage message, confirming it is installed.)

Windows users: to make the best of pyspark you should probably have numpy installed (it is used by MLlib). It's a pain to install on vanilla Python, so my advice is to download Anaconda, a Python distribution that comes with tons of packages already installed (numpy, scipy, pandas, networkx and much more). It's also an easy way to get 64-bit Python (rather than the usual 32-bit version). Get Anaconda from here:
http://continuum.io/downloads
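
If you want a quick, entirely optional sanity check that numpy is importable from the interpreter pyspark will pick up, something like this, run in a plain Python shell, does the trick:

# Run in the Python interpreter on your PATH (the same one pyspark will use).
import numpy
print(numpy.__version__)   # if a version prints, MLlib's numpy dependency is in place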


1) Install ipython and pyreadline:

$ pip install ipython
$ pip install pyreadline
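
To confirm the install landed in the Python you expect, an optional check from that Python shell:

# Run in the same Python you installed the packages into.
import IPython
print(IPython.__version__)   # any version printed here means the install worked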


2) Make sure the ipython executable is on your PATH (if you can run pip with no problems, ipython should be reachable as well).


3) Set the proper environment variables:

On Windows:
$ set PYSPARK_DRIVER_PYTHON=ipython
(or set it permanently in the control panel)

On Mac/Linux:
$ declare -x PYSPARK_DRIVER_PYTHON="ipython"
(or add it to your ~/.bash_profile or ~/.bashrc file)

On Windows + Spark < 1.4.0:
$ set IPYTHON=1
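
Before launching pyspark, an optional check from a plain Python shell (opened from the same terminal) confirms the variable is actually set:

# Optional: confirm the variable is visible to processes you launch from this shell.
import os
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))   # expect: ipython (None means it is not set)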


4) Run pyspark, and you'll get it in IPython with tab completion working! Now you'll never misspell sc.paralelize sc.parrallelize sc.parallelize again!
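
For example, a minimal sanity check in the new shell (assuming the stock pyspark shell, where sc is created for you) could look like this:

# Inside the pyspark (IPython) shell; `sc` is the SparkContext created for you.
rdd = sc.parallelize(range(100))          # distribute a local collection
print(rdd.map(lambda x: x * x).sum())     # sum of squares 0..99 => 328350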


Bonus rounds:

5) If you want to use pyspark from the IPython Notebook, install the IPython Notebook package and then make pyspark launch it:
On Windows:
$ set PYSPARK_DRIVER_PYTHON_OPTS=notebook
(or set it permanently in the control panel)

On Mac/Linux:
$ declare -x PYSPARK_DRIVER_PYTHON_OPTS="notebook"
(or add it to your ~/.bash_profile or ~/.bashrc file)

On Windows + Spark < 1.4.0:
$ set IPYTHON_OPTS=notebook

In any notebook you open this way, the sc object (the SparkContext) will be available right from the start (no need to define it), as will the pyspark module.
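
For example, a first notebook cell could be the classic word count, using nothing but the injected sc:

# A first notebook cell: word count over a small in-memory RDD.
lines = sc.parallelize(["spark is fun", "pyspark is spark in python"])
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair every word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.collect())                              # e.g. [('spark', 2), ('is', 2), ...]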


6) Last but not least: like any good Java program, Spark makes a big drama out of its execution by printing lots and lots of lengthy log messages. You can turn them down by following this StackOverflow answer:
http://stackoverflow.com/questions/25193488/how-to-turn-off-info-logging-in-pyspark
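
If you're on Spark 1.4.0 or newer, one programmatic alternative (in addition to the log4j.properties route described in that answer) is to raise the log level from inside the shell or notebook:

# Spark 1.4.0+ only: raise the log level at runtime instead of editing
# conf/log4j.properties. "WARN" (or "ERROR") hides the chatty INFO lines.
sc.setLogLevel("WARN")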


Enjoy Sparking!