Monday, April 4, 2016

GraphLab Create via PyCharm

People who want to start using GraphLab Create usually ask two questions:

1) Where do I get GraphLab from?
2) What IDE should I use in my Python projects?

The first answer is to use the Dato Launcher. This is a bundle of Python, GraphLab Create and IPython Notebook. IPython Notebook is the best IDE for data scientists, and is also what we use in the Coursera Machine Learning Foundations course.

The second answer is JetBrains' PyCharm, which is becoming the de facto standard Python IDE. This is what we use in big projects that also require lots of debugging.

To make PyCharm use the Python bundled in the Dato Launcher, follow these instructions, which are detailed and include lots of screenshots.

You can download the instructions either as a single-file PDF or as a zipped HTML version; unzip all the files to see the pictures as well.

This guide was originally written in MarkDown using the Mou editor.


Saturday, October 3, 2015

7 Links To Convince You That Big Data Isn't Your Problem

tl;dr - scroll to the bottom for a list of 7 links you can read instead of this entire post.

I'll spare you the big bulk of my words and get to the point:

From my personal experience with big data projects, I conclude that it's a big (data) fat lie (or a big fat [data] lie, if you'd like). Most people don't have big data to begin with; they use big data technologies so they can say 'we use big data in our product' and win more sales. Below is the story in a set of links: read through these links while bearing in mind the key points I attached to each of them. I hope that by the time you finish reading, you'll be more convinced of this point.

And yes, big data technologies still have a place in today's world: buzzwords have always been a part of business, and somebody has to purchase all of these disks and RAM and CPUs (packed nicely into "commodity servers"). But if you are a penniless start-up worrying about how you will close the gap and get in the game, rest assured: the game is mostly imaginary - as always!

I dedicate this post to my friend Rony, an economist who asked me for tips on how to incorporate "big data" into her research. A short set of questions proved to me that she was not aware of what big data really is, and that it wasn't relevant for her case. She would use computers, programming and databases - but she should not worry that she is missing any knowledge. This post is for people like her, whose fears are fed by the very legends big data salespeople spread to get more cash.

The Beginning: Google Gets There First (as always!)

I mark the beginning of the era of 'big data' with two research papers published by Google.

The first paper described the Google File System (GFS) and you can (and should!) read it here:
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
This is a file system that specializes in scalability, in appending incoming data in massive volumes to existing files, and in high availability (through redundancy of data). These were all valid concerns for Google's business - but not for just any company.

The second paper described the MapReduce paradigm:
Basically, it means that if I have a book and I want to count how many times each word appears in it, I can split it into several chunks, count the words separately in each chunk, and then combine my results. The split+count part is called MAP, the combine part is called REDUCE. This is the nuttiest of all nutshell explanations, but it's enough to get a few points across. First, this paradigm fits batch processing of rather homogeneous data. Second, it doesn't fit non-aggregative processing tasks. An example is graph algorithms: looking at a node in a graph and retrieving its friends, its friends' friends, etc. is something you don't normally do using MapReduce.
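The book example can be sketched in a few lines of plain Python (the function names here are illustrative - this shows the paradigm, not any real MapReduce API):

```python
from collections import Counter
from functools import reduce

def map_count(chunk):
    # MAP: count words in one chunk, independently of the others.
    return Counter(chunk.split())

def reduce_merge(a, b):
    # REDUCE: merge two partial word counts.
    return a + b

book = "the cat sat on the mat and the dog sat too"
words = book.split()

# Split the book into two chunks (a real system would use many).
chunks = [" ".join(words[:6]), " ".join(words[6:])]

partial_counts = [map_count(c) for c in chunks]  # could run in parallel
total = reduce(reduce_merge, partial_counts, Counter())
print(total["the"])  # 3
print(total["sat"])  # 2
```

Since each MAP call only sees its own chunk, the chunks can be processed on different machines; the REDUCE step only merges the small partial results.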

These two concepts are combined into a single system like this: the data is stored using a GFS-like filesystem, then processed using the MapReduce paradigm (the more general term is 'lambda architectures'). Google published these papers but did not release any code. But Doug Cutting and Mike Cafarella decided to implement such a system as part of their work at Yahoo. This became an open-source Apache project named Hadoop. So: Hadoop = open-source GFS + MapReduce. Hadoop became the generic name for a big data processing system and is used widely in academia and industry. But enough about Hadoop at this point.

You might notice that these papers are from 2003 and 2004. Google itself has already moved past that point and is using other technologies which don't have open-source implementations out yet. You can read more about how Hadoop is no longer the cutting-edge big data framework here:


The Disillusionment: Where Big Data Fails in General, and Hadoop fails Specifically

Instead of presenting my own opinion, please look at this KDnuggets poll:
When asked about the largest dataset they have analyzed, data scientists (the people who actually analyze the data) respond that a laptop is usually enough for their needs.

(I used to link to the previous year's poll, until I noticed a new version had come out. Still, here it is - you can compare each year's results and see how stable they are:
http://www.kdnuggets.com/polls/2014/largest-dataset-analyzed-data-mined-2014.html
)

Why is that? For two reasons. The first is that big data is not always necessary to develop a good data-driven application. Let's say we are trying to create personalized messages for e-commerce websites. Is the entire history of all users necessary? Maybe just the last 3 months? How much do they weigh? Is a sample of the data enough for building a good model (rather than the entire corpus)? Is the entire corpus tagged (labelled) so that we can use it for building our model? This is not always the situation. Quantity cannot always make up for quality.

The second reason is that most big data systems are designed to be scalable. That means that such a system deployed on two machines (working together on the computations) will perform better than the same system deployed on a single machine (and so on: 3 machines are better than 2, 4 are better than 3, etc.). More machines also allow for robustness - if one of the machines in a Hadoop cluster dies, the others can pick up where it left off and finish the job. So big data systems are usually scalability-oriented. Nice!

But do they perform better than a non-scalability-aimed, single-thread based system?

Frank McSherry et al. examined this question and came back with surprising results. Read their paper here:
Frank McSherry, Michael Isard, Derek G. Murray
Scalability! But at what COST?
http://www.frankmcsherry.org/assets/COST.pdf

A shorter read is this blog post covering the paper:
http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/

And here is McSherry's follow-up post, testing the single laptop on even bigger data:
http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html

For me, this set of publications is what finally took the hot air out of the big data balloon. Most people don't have big data and won't benefit (performance-wise, not business-sexiness-wise) from using those systems. That's what I see when I read the links posted above.

The Truth: Where is Big Data Used Successfully

So is it completely useless to analyze large amounts of data? As I mentioned above - that depends on your goals. Here is an opposite opinion - a positive use-case for using big data:

http://www.ft.com/cms/s/0/304b983e-5a44-11e5-a28b-50226830d644.html
Charles Clover, China’s Baidu searches for AI edge, Financial Times (September 14, 2015 4:24 am)

Here is the (probably copyrighted - sorry about that!) most important part in my opinion:
The company has an advantage in deep-learning algorithms for speech recognition in that most video and audio in China is accompanied by text — nearly all news clips, television shows and films are close-captioned and almost all are available to Baidu and Iqiyi, its video affiliate. 
While a typical academic project uses 2,000 hours of audio data to train voice recognition, says Mr Ng, the troves of data available to China’s version of Google mean he is able to use 100,000 hours. 
He declines to specify just how much the extra 98,000 hours improves the accuracy of his project, but insists it is vital. 
“A lot of people underestimate the difference between 95 per cent and 99 per cent accuracy. It’s not an ‘incremental’ improvement of 4 per cent; it’s the difference between using it occasionally versus using it all the time,” he says.
So in this domain (speech recognition), with this algorithm (deep learning), big data of this high quality (closed captions) is used successfully. The same goes for image recognition with neural networks over large corpora of tagged images. Google and Facebook have such datasets and use them successfully. But not every company is Google or Facebook. Are you?

Conclusion

Before you get confused by big-data, please focus on the following points:

Big data systems are usually a combination of scalable file systems and scalable processing architectures. The household name in this field is Hadoop, which is based on the two Google papers above (although Google itself no longer uses these technologies).

Most real data scientists admit they don't really use big data...

... and new evidence shows that these systems are also not very efficient:
The paper by McSherry et al.: http://www.frankmcsherry.org/assets/COST.pdf
Two blog posts about this paper:

Big data is sometimes useful - here's an example. Understand whether you are in such a situation:
If you are not - don't feel bad. You're in good company :)

The Fine Print

I currently work for Dato, the creator of GraphLab Create. Our product is famous for analyzing massive datasets on a single laptop or a single machine. Nevertheless, these opinions do not represent Dato. I formed them by experimenting with GraphLab Create and seeing for myself that good data science can be done on far less than petabytes, and on a single machine.

Monday, July 27, 2015

Neural Networks and Deep Learning by Michael Nielsen

1) The best deep learning tutorial I've come across in the last few months is here:

Neural Networks and Deep Learning by Michael Nielsen
http://neuralnetworksanddeeplearning.com/index.html
Read it now! This post is just an attempt to add one more link on the web to this great book.


2) The best deep learning Python (what else??) library I came across is Keras.

http://keras.io/


3) Everyone's favourite deep learning post of the moment is Karpathy's RNN:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

But imho you should read Yoav Goldberg's response:
http://nbviewer.ipython.org/gist/yoavg/d76121dfde2618422139


And that's it for today.

Tuesday, June 2, 2015

python2.7 + graphlab create on centos

CentOS is notorious for using yum as its package manager (the equivalent of Ubuntu's 'apt-get'). Yum depends on the Python version that ships with CentOS, so upgrading to Python 2.7 simply kills yum and breaks system updates. Here's how to solve this.


Long story short: You can follow the instructions here:

https://github.com/h2oai/h2o-2/wiki/Installing-python-2.7-on-centos-6.3.-Follow-this-sequence-exactly-for-centos-machine-only

Or simply use my gist based on this post and my previous post:

https://gist.githubusercontent.com/guy4261/0e9f4081f1c6b078b436/raw/bfded5dd1c53284a83872bc61444511e02d99212/python2.7.9_centos_installation.sh

Monday, May 18, 2015

Learning Torch7 - Part 2: The Basic Tutorial

From this part of the page:
https://github.com/torch/torch7/wiki/Cheatsheet#newbies

I got to the tutorials part:
https://github.com/torch/torch7/wiki/Cheatsheet#tutorials-demos-by-category

And to the first tutorial:
http://code.madbits.com/wiki/doku.php

And to its first part, about the basics:
http://code.madbits.com/wiki/doku.php?id=tutorial_basics

The tutorial is straightforward, but ends with a request to normalize the MNIST training data.

This is the code I came up with after a very painful realization.

The realization is what you see on line 4:

train.data = train.data:type(torch.getdefaulttensortype())


It appears that the data is loaded by torch not into tables or torch tensors, but into something called "userdata". This is the type assigned to things coming from the dark depths of C/C++ interop with Lua. This userdata had a particularly interesting feature: its contents were forced (through wrapping) to be in the range [0, 256). So storing -26, for instance, would result in the value 256 - 26 = 230 appearing in the data. So casting was the first step to regain my sanity in this case.
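That wrap-around is just unsigned-byte arithmetic modulo 256, which is easy to check in Python:

```python
# Storing a negative value in an unsigned byte wraps it modulo 256,
# just like the torch "userdata" buffer described above:
value = -26
wrapped = value % 256
print(wrapped)  # 230
```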

Only after casting back to torch tensors can you use :mean(), :std() and other tensor methods, which make this code short and quick.

By the way - this normalization is called "z-score": http://en.wikipedia.org/wiki/Standard_score
I didn't quite catch the right way to describe the whole act in proper English, but that's what we are doing here (subtracting the mean and dividing by the standard deviation).
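In Python terms, the z-score boils down to the following sketch (toy numbers standing in for the MNIST pixels; note I'm using the population standard deviation here, which is an assumption - torch's :std() may use the sample version):

```python
from statistics import fmean, pstdev

# Toy values standing in for the MNIST training pixels (hypothetical).
data = [0.0, 2.0, 4.0, 10.0, 12.0, 14.0]

mean = fmean(data)
std = pstdev(data)  # population standard deviation

# z-score: subtract the mean, divide by the standard deviation.
zscored = [(x - mean) / std for x in data]

# After z-scoring, the data has mean ~0 and standard deviation ~1.
print(round(fmean(zscored), 6), round(pstdev(zscored), 6))
```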

More useful things learned along the way:

for k,v in pairs(o) do -- or pairs(getmetatable(o))
   print(k, v)
end

is the equivalent of Python's dir(o).

I also started working with this IDE: http://studio.zerobrane.com/

Moving on with the tutorials, the supervised learning will be my next post's subject.

Learning Torch7. Part 1

(part 1, aka: "The First Circle")

My entry point is this:
https://github.com/torch/torch7/wiki/Cheatsheet#newbies

I've read this page cover to cover and obtained a machine on which torch was installed. Installing it sounds like a nightmare, and I wonder whether performance in a docker container would be the same, making the installation seem like a mere bad dream. So: reading is step 1, installation is step 2.

Step 3, according to the Newbies training course, is to Learn Lua in 15 Minutes:
http://tylerneylon.com/a/learn-lua/

Notable comments:
  • nil is the local None/null/void/undefined.
  • do/end wrap blocks (just like { and } would in other languages)
  • There are no ++ or += operators, so  n = n + 1  is the way to go.
  • == , ~= for equal/non-equal tests.
  • .. is string concatenation.
  • Anything undefined evaluates to nil. (So you can type 'lkgjadsgjdas' into the interpreter without getting an error).
  • Only nil and false are considered false. 0 is true!
How to create tables:
-- Literal notation for any (non-nil) value as key:
t = {key1 = 'value1', key2 = false}
u = {['@!#'] = 'qbert', [{}] = 1729, [6.28] = 'tau'}
print(u[6.28])  -- prints "tau"

Matching keys within tables:
-- Key matching is basically by value for numbers
-- and strings, but by identity for tables.
a = u['@!#']  -- Now a = 'qbert'.
b = u[{}]  -- We might expect 1729, but it's nil:
-- b = nil since the lookup fails. It fails
-- because the key we used is not the same object
-- as the one used to store the original value. So
-- strings & numbers are more portable keys.

Python dir() equivalent
print(_G) ~~ dir()

Using tables as lists/arrays:
-- List literals implicitly set up int keys:
v = {'value1', 'value2', 1.21, 'gigawatts'}
for i = 1, #v do  -- #v is the size of v for lists.
  print(v[i])  -- Indices start at 1 !! SO CRAZY!
end
-- A 'list' is not a real type. v is just a table
-- with consecutive integer keys, treated as a list.

The other meaningful part of the tutorial, to be read thoroughly, is the section on metatables.

Wednesday, May 13, 2015

pyspark + ipython = tab-completing python shell for spark

Update (2015-06-21): The new version of Spark (1.4.0) resolved the naming issue for environment variables, so that environment variables in Windows and Linux have the same names. Big thank-you to Aviad "Jo" Cohen for pointing this out to me!

<tldr>
How to run pyspark via IPython/IPython Notebook in Windows/Mac/Linux, and how to turn off its log messages.
</tldr>

Big data is a lie - that much we all know. To make it more believable, a shift was made from disk-based Map-Reduce (aka: Map-Reduce) to RAM-based Map-Reduce (aka: Spark). An immediate benefit is that Pig, which uses disk-based Map-Reduce, now has a fork called Spork (Pig over Spark). Just say it: spork!!! The joy is boundless.

Spark is fun if you like writing Scala. (I am of course ignoring Java completely, just like everyone else should.) If you are like me - trapped in the belly of Python, with no intention of ever leaving these cozy intestines - you'll be working with the Python port of Spark, dubbed PySpark. I call this a 'port' rather than a version or a driver, since some parts available in the Scala/Java versions of Spark are not available in Python (for instance, the graph toolkit, GraphX). However, it's still useful for many tasks.

If you download a pre-built binary version of Spark (from here), versions above 1.3 support integration with IPython. This relies on having ipython in your PATH and setting some environment variables. Here's how to achieve it:

0) Make sure python is installed: from the Windows cmd or Linux shell, type in:
$ python
(you should get a Python shell, exit() the shell.)
$ pip
(you should see that pip is installed.)

Windows users: to make the best of pyspark you should probably have numpy installed (since it is used by MLlib). It is a pain to install on vanilla Python, so my advice is to download Anaconda, a distribution of Python - which means Python plus tons of packages already installed (numpy, scipy, pandas, networkx and much more). This is also a good way to get 64-bit Python (rather than the usual 32-bit version). Get Anaconda from here:
http://continuum.io/downloads


1) Install ipython and pyreadline:

$ pip install ipython
$ pip install pyreadline


2) Make sure that the ipython executable is in your PATH (if you can run pip with no problems, this should be OK as well).


3) Set the proper environment variables:
On Windows:


$ set PYSPARK_DRIVER_PYTHON="ipython"
(or set it permanently in the control panel)

On Mac/Linux:
$ declare -x PYSPARK_DRIVER_PYTHON="ipython"
(or add it to your ~/.bash_profile or ~/.bashrc file)

On Windows + Spark < 1.4.0:


$ set IPYTHON=1


4) Run pyspark, and you'll get it in IPython with auto-completion working! Now you'll never misspell sc.parallelize as sc.paralelize or sc.parrallelize again!


Bonus rounds:

5) If you want to use pyspark and IPython Notebook, you can install IPython Notebook, and then make pyspark use it:
On Windows:
$ set PYSPARK_DRIVER_PYTHON_OPTS=notebook
(or set it permanently in the control panel)

On Mac/Linux:
$ declare -x PYSPARK_DRIVER_PYTHON_OPTS="notebook"
(or add it to your ~/.bash_profile or ~/.bashrc file)

On Windows + Spark < 1.4.0:


$ set IPYTHON_OPTS=notebook

In the resulting IPython Notebook, the sc object will be available right from the start (no need to define it), as well as the pyspark module.


6) Last but not least: like any good Java program, Spark makes a big drama out of its execution by printing lots and lots of lengthy log messages. You can turn them off by following this StackOverflow answer:
http://stackoverflow.com/questions/25193488/how-to-turn-off-info-logging-in-pyspark
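The gist of that answer (assuming a default Spark layout - check your own conf/ directory) is to copy conf/log4j.properties.template to conf/log4j.properties and lower the root logger's level:

```properties
# conf/log4j.properties
# Change the root logger from INFO to WARN (or ERROR) to quiet Spark:
log4j.rootCategory=WARN, console
```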


Enjoy Sparking!