Monday, May 18, 2015

Learning Torch7 - Part 2: The Basic Tutorial

From this part of the page:
https://github.com/torch/torch7/wiki/Cheatsheet#newbies

I got to the tutorials part:
https://github.com/torch/torch7/wiki/Cheatsheet#tutorials-demos-by-category

And to the first tutorial:
http://code.madbits.com/wiki/doku.php

And to its first part, about the basics:
http://code.madbits.com/wiki/doku.php?id=tutorial_basics

The tutorial is straightforward, but it ends with a request to normalize the MNIST training data.

This is the code I came up with after a very painful realization.

The realization is what you see on line 4:

train.data = train.data:type(torch.getdefaulttensortype())


It appears that the data is loaded by Torch not into tables or torch tensors but into something called "userdata". This is the type Lua assigns to things coming from the dark depths of C / C++ interop. This userdata had a particularly interesting feature: its contents were forced (through wrapping) into the range [0, 256). So storing -26, for instance, would make the value 256 - 26 = 230 appear in the data. Casting was therefore the first step to regaining my sanity in this case.

After casting back to a torch tensor you can use :mean(), :std() and the other tensor methods, which keep the code short and quick.
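
For reference, here is a minimal sketch of the whole normalization as I ended up understanding it (assuming the tutorial's train table with its data field; everything else here is just my own naming):

-- cast away the wrapped userdata / ByteTensor first (the "line 4" realization)
train.data = train.data:type(torch.getdefaulttensortype())
-- z-score: subtract the mean and divide by the standard deviation
local mean = train.data:mean()
local std = train.data:std()
train.data:add(-mean)
train.data:div(std)
print(train.data:mean(), train.data:std())  -- should now print roughly 0 and 1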

By the way - this normalization is called a "z-score": http://en.wikipedia.org/wiki/Standard_score
I didn't quite catch the right way to describe the whole act in proper English (apparently "standardizing" the data), but that's what we are doing here: subtracting the mean and dividing by the standard deviation, z = (x - mean) / std.

More useful things learned along the way:

for k,v in pairs(o) do -- or pairs(getmetatable(o))
   print(k, v)
end

is roughly the equivalent of Python's dir(o).

I also started working with this IDE: http://studio.zerobrane.com/

Moving on with the tutorials: supervised learning will be the subject of my next post.

Learning Torch7. Part 1

(part 1, aka: "The First Circle")

My entry point is this:
https://github.com/torch/torch7/wiki/Cheatsheet#newbies

I've read this page cover-to-cover and obtained a machine on which Torch was already installed. Installing it sounds like a nightmare, and I wonder whether performance inside a Docker container would be the same, which would turn the installation into merely a bad dream. So: reading is step 1, installation is step 2.

Step 3, according to the Newbies training course, is to Learn Lua in 15 Minutes:
http://tylerneylon.com/a/learn-lua/

Notable comments:
  • nil is the local None/null/void/undefined.
  • do/end wrap blocks (just like { and } would in other languages)
  • There are no ++ or += operators, so  n = n+1  is the way to go.
  • == , ~= are the equality / inequality tests.
  • .. is string concatenation.
  • Anything undefined evaluates to nil. (So you can type 'lkgjadsgjdas' into the interpreter without getting an error).
  • Only nil and false are considered false. 0 is true! (see the quick check right after this list)
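
Since this tripped me up, here is a tiny check of Lua's truthiness rules (my own snippet, not from the tutorial):

if 0 then print('0 is truthy') end        -- prints: 0 is truthy
if '' then print('"" is truthy too') end  -- prints: "" is truthy too
if nil then print('never printed') end    -- nil and false are the only falsy values
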
How to create tables:
-- Literal notation for any (non-nil) value as key:
t = {key1 = 'value1', key2 = false}
u = {['@!#'] = 'qbert', [{}] = 1729, [6.28] = 'tau'}
print(u[6.28])  -- prints "tau"

Matching keys within tables:
-- Key matching is basically by value for numbers
-- and strings, but by identity for tables.
a = u['@!#']  -- Now a = 'qbert'.
b = u[{}]     -- We might expect 1729, but it's nil:
-- b = nil since the lookup fails. It fails
-- because the key we used is not the same object
-- as the one used to store the original value. So
-- strings & numbers are more portable keys.

Python dir() equivalent
print(_G) ~~ dir()

Using tables as lists/arrays:
-- List literals implicitly set up int keys:
v = {'value1', 'value2', 1.21, 'gigawatts'}
for i = 1, #v do  -- #v is the size of v for lists.
  print(v[i])     -- Indices start at 1 !! SO CRAZY!
end
-- A 'list' is not a real type. v is just a table
-- with consecutive integer keys, treated as a list.

The other meaningful part of the tutorial, worth reading thoroughly, is the section about metatables.
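
As a teaser, here is a tiny example of my own (not from the tutorial) of what metatables make possible - a fallback lookup via __index:

defaults = {color = 'red'}
point = setmetatable({x = 1, y = 2}, {__index = defaults})
print(point.x, point.color)  -- prints: 1  red  (color is found through the metatable)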

Wednesday, May 13, 2015

pyspark + ipython = tab-completing python shell for spark

Update (2015-06-21): The new version of Spark (1.4.0) resolved the naming issue for environment variables, so that environment variables in Windows and Linux have the same names. Big thank-you to Aviad "Jo" Cohen for pointing this out to me!

<tldr>
How to run pyspark via IPython/IPython Notebook in Windows/Mac/Linux, and how to turn off its log messages.
</tldr>

Big data is a lie - we all know that. To make it more believable, a shift was made from disk-based Map-Reduce (aka Map-Reduce) to RAM-based Map-Reduce (aka Spark). The immediate benefit is that Pig, which uses disk-based Map-Reduce, now has a fork called Spork (Pig over Spark). Just say it out loud: spork!!! The joy is boundless.

Spark is fun if you like writing in Scala. (I am of course ignoring Java completely, just like everyone else should.) In case you are like me - trapped in the belly of Python, with no intention of ever leaving these cozy intestines - you'll be working with the Python port of Spark, dubbed PySpark. I call this a 'port' rather than a version or a driver since some parts available in the Scala/Java versions of Spark are not available in Python (for instance, the graph toolkit GraphX). However, it's still useful for many tasks.

If you downloaded a pre-built binary version of Spark (from here), versions > 1.3 support integration with IPython out of the box. The integration is based on having ipython in your PATH and setting a few environment variables. Here's how to achieve it:

0) Make sure python is installed: from the Windows cmd or Linux shell, type in:
$ python
(you should get a Python shell, exit() the shell.)
$ pip
(you should see that pip is installed.)

Windows users: to make the best of pyspark you should probably have numpy installed (since it is used by MLlib). It's a pain to install on vanilla Python, so my advice is to download Anaconda, a Python distribution - which means Python + tons of packages already installed (numpy, scipy, pandas, networkx and much more). This is also a good way to get 64-bit Python (rather than the usual 32-bit version). Get Anaconda from here:
http://continuum.io/downloads


1) Install ipython and pyreadline:

$ pip install ipython
$ pip install pyreadline


2) Make sure that the ipython executable is in your path (if you can run pip with no problems, that should be ok as well).


3) Set the proper environment variables:
On Windows:


$ set PYSPARK_DRIVER_PYTHON=ipython
(or set it permanently in the control panel)

On Mac/Linux:
$ declare -x PYSPARK_DRIVER_PYTHON="ipython"
(or add it to your ~/.bash_profile or ~/.bashrc file)

On Windows + Spark < 1.4.0:


$ set IPYTHON=1


4) Run pyspark, and you'll get it in IPython with tab completion working! Now you'll never misspell sc.parallelize as sc.paralelize or sc.parrallelize again!
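
Once the shell is up (the sc SparkContext is already created for you), a quick sanity check of my own looks like this:

>>> rdd = sc.parallelize(range(10))        # distribute a small list
>>> rdd.map(lambda x: x * x).sum()         # 285 = sum of the squares of 0..9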


Bonus rounds:

5) If you want to use pyspark inside IPython Notebook, install IPython Notebook and then make pyspark use it:
On Windows:
$ set PYSPARK_DRIVER_PYTHON_OPTS=notebook
(or set it permanently in the control panel)

On Mac/Linux:
$ declare -x PYSPARK_DRIVER_PYTHON_OPTS="notebook"
(or add it to your ~/.bash_profile or ~/.bashrc file)

On Windows + Spark < 1.4.0:


$ set IPYTHON_OPTS=notebook

In the notebook that opens, the sc object will be available right from the start (no need to define it), as will the pyspark module.
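
So the first cell of a new notebook can already run something like this (a tiny check of my own):

import pyspark                             # already importable, no path fiddling
print(sc.version)                          # e.g. 1.4.0
print(sc.parallelize([1, 2, 3]).count())   # 3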


6) Last but not least: like any good Java program, Spark makes a big drama out of its execution by printing lots and lots of lengthy log messages. You can turn them off by following this StackOverflow answer:
http://stackoverflow.com/questions/25193488/how-to-turn-off-info-logging-in-pyspark
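
If I remember the gist of that answer correctly, it boils down to editing Spark's log4j configuration (paths assume a standard pre-built Spark directory):

$ cd $SPARK_HOME/conf
$ cp log4j.properties.template log4j.properties
Then, inside log4j.properties, lower the root logging level, e.g. change
log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console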


Enjoy Sparking!

Sunday, May 3, 2015

How to install IPython Notebook + matplotlib + GraphLab Create

tl;dr: these instructions *work*. I'm publishing them here mainly to make them quickly accessible for my own use.

Install the following to get a fully working IPython Notebook:

$ sudo apt-get install python-dev
$ sudo apt-get install libzmq3-dev
$ sudo pip install pyzmq
$ sudo pip install jinja2
$ sudo pip install pygments
$ sudo pip install tornado
$ sudo pip install jsonschema
$ sudo pip install ipython
$ sudo pip install "ipython[notebook]"


If you want to use this machine as a server, follow these instructions:

These instructions include creating an SSL key pair so that you can access the IPython server via https from a remote machine.


Additionally, install this to get a fully working matplotlib:

$ sudo apt-get install libpng-dev
$ sudo apt-get install libfreetype6
$ sudo apt-get install libfreetype6-dev
$ sudo apt-get install g++
$ sudo pip install matplotlib
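
To quickly verify the matplotlib install, a minimal check like this should work (using the headless Agg backend so it also runs on a display-less server; the file name is just an example):

$ python
>>> import matplotlib
>>> matplotlib.use('Agg')            # headless backend, no display needed
>>> import matplotlib.pyplot as plt
>>> plt.plot([1, 2, 3], [1, 4, 9])
>>> plt.savefig('mpl_test.png')      # if a PNG appears, the install is fine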


And finally, install GraphLab Create:

$ pip install graphlab-create

Note: for GraphLab Create to work, your Python instance should be built with the UCS4 flag. To check if that was the case:

(Taken from: http://stackoverflow.com/questions/1446347/how-to-find-out-if-python-is-compiled-with-ucs-2-or-ucs-4 )

When built with --enable-unicode=ucs4:
>>> import sys
>>> print sys.maxunicode
1114111
When built with --enable-unicode=ucs2:
>>> import sys
>>> print sys.maxunicode
65535



GraphLab Create Product Key Setup

After GraphLab is installed, you need to get a product key from dato.com (for free, after a simple and short registration). When you register, you receive the product key inside a script that installs it into the appropriate directory. If you've lost the script and only have the product key itself, you can also set it via GraphLab.

$ python
>>> import graphlab as gl
>>> gl.product_key.set_product_key("PRODUCT KEY OBTAINED FROM DATO AFTER REGISTRATION")
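
After setting the key, a quick sanity check of my own confirms that everything works:

>>> gl.SArray([1, 2, 3]).sum()   # returns 6 once the product key is accepted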

Tuesday, March 10, 2015

Resizing a VirtualBox vm's disk using gparted - by Derek Molloy

Had to place a link somewhere to this awesome post by Derek Molloy about how to resize the disk of a VirtualBox VM:
http://derekmolloy.ie/resize-a-virtualbox-disk/

If you install Ubuntu Desktop (like I did), reserve more than 8GB of disk space... This process is a bit lengthy and annoying, but Derek's post makes it a bit better with its clear explanations.

Tuesday, January 27, 2015

Compiling GCC 4.9.2 from scratch on CentOS 6.5

I required GCC 4.8+ to compile the GraphLab Create SDK:
https://github.com/graphlab-code/GraphLab-Create-SDK

The machine runs CentOS 6.5 (for which yum does not yet offer the latest version of GCC), so I had to compile it on my own. Luckily, I have root access on this machine.

To compile it from scratch, GMP, MPFR and MPC should be installed:

yum install gmp-devel mpfr-devel libmpc-devel

Afterwards, the official installation instructions may be followed:
https://gcc.gnu.org/wiki/InstallingGCC
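
For my own future reference, the gist of those instructions is an out-of-tree build, roughly along these lines (a sketch only - the version, prefix and -j value are placeholders; the wiki is the authoritative source):

$ tar xzf gcc-4.9.2.tar.gz
$ mkdir gcc-build && cd gcc-build
$ ../gcc-4.9.2/configure --prefix=/usr/local/gcc-4.9.2 --enable-languages=c,c++ --disable-multilib
$ make -j4
$ make install   # as root, or via sudo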