Sunday, May 3, 2015

How to install IPython Notebook + matplotlib + GraphLab Create

tl;dr: these instructions *work*. I'm publishing them here mainly to make them quickly accessible for my own use.

Install the following to get a fully working IPython Notebook:

$ sudo apt-get install python-dev
$ sudo apt-get install libzmq3-dev
$ sudo pip install pyzmq
$ sudo pip install jinja2
$ sudo pip install pygments
$ sudo pip install tornado
$ sudo pip install jsonschema
$ sudo pip install ipython
$ sudo pip install "ipython[notebook]"
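To verify the pieces landed, a quick import check helps. This is my own sketch, not part of the original instructions: the list below is my mapping from the pip packages above to their import names (pyzmq installs as zmq), and importlib.util.find_spec needs a modern Python (on the Python 2 of the era, pkgutil.find_loader plays the same role):

```python
# sanity-check that the notebook's Python dependencies are importable;
# the names below are the *import* names of the pip packages installed above
import importlib.util

deps = ["zmq", "jinja2", "pygments", "tornado", "jsonschema", "IPython"]
missing = [name for name in deps if importlib.util.find_spec(name) is None]
print("missing: %s" % (missing or "none"))
```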


If you want to use this machine as a server, follow these instructions:

These instructions include creating an SSL key pair so that you can access the IPython server via https from a remote machine.


Additionally, install this to get a fully working matplotlib:

$ sudo apt-get install libpng-dev
$ sudo apt-get install libfreetype6
$ sudo apt-get install libfreetype6-dev
$ sudo apt-get install g++
$ sudo pip install matplotlib
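A quick headless smoke test can confirm matplotlib actually renders (this is my own sketch, not from the original install notes; the Agg backend writes files without needing a display, which matters on a server):

```python
import os
import tempfile


def smoke_test():
    """Render a tiny figure to a PNG without needing a display."""
    try:
        import matplotlib
    except ImportError:
        return "matplotlib not installed"
    matplotlib.use("Agg")  # select the file-only backend before importing pyplot
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    path = os.path.join(tempfile.mkdtemp(), "smoke.png")
    fig.savefig(path)
    return "ok" if os.path.getsize(path) > 0 else "empty file"


print(smoke_test())
```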


And finally, install GraphLab Create:

$ pip install graphlab-create

Note: for GraphLab Create to work, your Python instance should be built with the UCS4 flag. To check if that was the case:

(Taken from: http://stackoverflow.com/questions/1446347/how-to-find-out-if-python-is-compiled-with-ucs-2-or-ucs-4 )

When built with --enable-unicode=ucs4:
>>> import sys
>>> print sys.maxunicode
1114111
When built with --enable-unicode=ucs2:
>>> import sys
>>> print sys.maxunicode
65535
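The same check can be wrapped in a small helper (my own sketch; on UCS-4 "wide" builds sys.maxunicode is 1114111 = 0x10FFFF, on UCS-2 "narrow" builds it is 65535 = 0xFFFF):

```python
import sys


def unicode_build():
    """Return 'ucs4' for wide unicode builds, 'ucs2' for narrow ones."""
    return "ucs4" if sys.maxunicode == 0x10FFFF else "ucs2"


print(unicode_build())
```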



GraphLab Create Product Key Setup

After GraphLab is installed, you need to get a product key from dato.com (for free, after a simple and short registration). When you register, you receive the product key within a script for installing it into the appropriate directory. In case you lost the script and only have the product key, you can also set it via GraphLab itself.

$ python
>>> import graphlab as gl
>>> gl.product_key.set_product_key("PRODUCT KEY OBTAINED FROM DATO AFTER REGISTRATION")

Tuesday, March 10, 2015

Resizing a VirtualBox vm's disk using gparted - by Derek Molloy

Had to place a link somewhere to this awesome post by Derek Molloy about how to resize the disk of a VirtualBox VM:
http://derekmolloy.ie/resize-a-virtualbox-disk/

If you install Ubuntu Desktop (like I did), reserve more than 8GB of disk space... This process is a bit lengthy and annoying, but Derek's post makes it a bit better with its clear explanations.

Tuesday, January 27, 2015

Compiling GCC 4.9.2 from scratch on CentOS 6.5

I required GCC 4.8+ to compile the GraphLab Create SDK:
https://github.com/graphlab-code/GraphLab-Create-SDK

The machine runs CentOS 6.5 (for which yum does not yet offer the latest version of GCC), so I had to compile it on my own. Luckily, I have root access on this machine.

To compile it from scratch, GMP, MPFR and MPC should be installed:

yum install gmp-devel mpfr-devel libmpc-devel

Afterwards, the official installation instructions may be followed:
https://gcc.gnu.org/wiki/InstallingGCC

Monday, December 29, 2014

Installing python-igraph 0.7 (when cannot compile against libxml2.a)

I am running Scientific Linux 6.5 (SL6.5) and trying to install python-igraph, but I hit the error documented here:
https://github.com/igraph/igraph/issues/640. When trying to pip install python-igraph 0.7, the build fails with the following error:

/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/libxml2.a(entities.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC

/usr/lib64/libxml2.a: could not read symbols: Bad value

collect2: ld returned 1 exit status
The server is not mine, so I am not a sudoer or anything.

The solution is mentioned in the github issue:


I modified the setup.py as follows:
I changed the line
variants = ["lib{0}.a", "{0}.a", "{0}.lib", "lib{0}.lib"]
to
variants = ["lib{0}.so", "{0}.so"]
I downloaded Python 2.7.9's source code and compiled it in a local directory, then added it to my PATH so that the newly compiled version would be the one I use by default. After logging out and back in, I downloaded the python-igraph source code, modified the variants line in setup.py as above, and used Python to install igraph:

python setup.py install

I then tried to import igraph from Python, but got an ImportError:

ImportError: libigraph.so.0: cannot open shared object file: No such file or directory

So I looked around the igraph source code directory (where it was built) and found the libigraph.so.0 file. On the GitHub issue they say you should copy it to the /lib folder under your Python installation. That did not work for me (maybe because I don't use Anaconda like they do, but who knows). So I added this to LD_LIBRARY_PATH (tcsh syntax; bash users would use export):

setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/path/to/build/source/of/python-igraph-0.7/igraphcore/lib

and now igraph could be imported from my own Python installation.


I am writing this post because:

  1. I almost resorted to using the crappy C API of igraph; such desperation is surely worthy of a post.
  2. The answer supplied by the igraph team here: http://lists.nongnu.org/archive/html/igraph-help/2014-09/msg00041.html
    is misleading and did not help me, so I thought it'd be worthwhile to boost the GitHub issue answer with a link.

That's it.

Thursday, February 27, 2014

Getting logs for jobs / tasks in Hadoop 2

tl;dr

mapred job -list all | grep <username> #get your job-id
mapred job -events <job-id> # print attempts
mapred job -logs <job-id> [<attempt-id>] #all/just for attempt

That's it. I'm not going to spend another word discussing this. Grab the shell terminal and try it yourself, and everything will be clear.

tl;dr+

#here's the same example and some outputs with:

  • <username> = reuts
  • <job-id> = job_1392731301770_0015
  • <attempt-id> = attempt_1392731301770_0015_m_000001_0

[root@hadoopb home]# mapred job -list all | grep reuts #all the jobs owned by reuts
14/02/27 17:20:00 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/02/27 17:20:00 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
 job_1392731301770_0022     FAILED       1393505972362         reuts         default        NORMAL                    0               0       0M              0M                0M      hadoopb-w02:8088/proxy/application_1392731301770_0022/jobhistory/job/job_1392731301770_0022
 job_1392731301770_0021     FAILED       1393504180159         reuts         default        NORMAL                    0               0       0M              0M                0M      hadoopb-w02:8088/proxy/application_1392731301770_0021/jobhistory/job/job_1392731301770_0021
 job_1392731301770_0016     FAILED       1393455977860         reuts         default        NORMAL                    0               0       0M              0M                0M      hadoopb-w02:8088/proxy/application_1392731301770_0016/jobhistory/job/job_1392731301770_0016

#... many more failures after that! I picked one of the last which is not shown here - job_1392731301770_0015

[root@hadoopb home]# mapred job -events job_1392731301770_0015 #the high-level events.
Usage: CLI [-events <job-id> <from-event-#> <#-of-events>]. Event #s start from 1.
[root@hadoopb home]# mapred job -events job_1392731301770_0015 1 100
14/02/27 17:19:44 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/02/27 17:19:44 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/02/27 17:19:45 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
Task completion events for job_1392731301770_0015
Number of events (from 1) are: 6
FAILED attempt_1392731301770_0015_m_000001_0 hadoopb-w07:59031/tasklog?plaintext=true&attemptid=attempt_1392731301770_0015_m_000001_0
FAILED attempt_1392731301770_0015_m_000001_1 hadoopb-w06:45315/tasklog?plaintext=true&attemptid=attempt_1392731301770_0015_m_000001_1
FAILED attempt_1392731301770_0015_m_000000_1 hadoopb-w05:48979/tasklog?plaintext=true&attemptid=attempt_1392731301770_0015_m_000000_1
FAILED attempt_1392731301770_0015_m_000001_2 hadoopb-w07:59031/tasklog?plaintext=true&attemptid=attempt_1392731301770_0015_m_000001_2
FAILED attempt_1392731301770_0015_m_000000_2 hadoopb-w06:45315/tasklog?plaintext=true&attemptid=attempt_1392731301770_0015_m_000000_2
FAILED attempt_1392731301770_0015_m_000001_3 hadoopb-w06:45315/tasklog?plaintext=true&attemptid=attempt_1392731301770_0015_m_000001_3

[root@hadoopb home]# mapred job -logs job_1392731301770_0015 #that'd spit out all the logs, so let's specify an attempt ID.
[root@hadoopb home]# mapred job -logs job_1392731301770_0015 attempt_1392731301770_0015_m_000001_0
14/02/27 17:21:18 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/02/27 17:21:18 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/02/27 17:21:19 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server

LogType:stderr

LogLength:250
Log Contents:
Traceback (most recent call last):
  File "/mnt/disk2/hadoop/yarn/usercache/reuts/appcache/application_1392731301770_0015/container_1392731301770_0015_01_000003/./mapper.py", line 11, in <module>
    import config
ImportError: No module named config

LogType:stdout

LogLength:0
Log Contents:

LogType:syslog

LogLength:11461
Log Contents:
2014-02-27 01:01:50,943 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2014-02-27 01:01:50,944 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2014-02-27 01:01:51,062 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties

Friday, February 14, 2014

To Unicode and Back in Python 2.xx

(Note: Python syntax highlighting for this post was done using lusever's online syntax highlighter, to which I arrived after testing many other online highlighters and concluding that it's the best!)

tl;dr:

Read these:
http://nedbatchelder.com/text/unipain.html
http://farmdev.com/talks/unicode/

Always turn incoming (read from file, user or web) strings into unicode objects, and outgoing unicode objects (written to file, printed to screen, used for opening a url) back into strings:
# -*- coding: utf-8 -*-
"""Unicode sandwiching methods."""
def welcome_string(obj, encoding='utf-8'):
    """For objects entering the system. Returns utf-8 unicode objects."""
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
    
def goodbye_string(obj, encoding = 'utf-8'):
    """For objects leaving the system. Returns an str instance."""
    if isinstance(obj, basestring):
        if isinstance(obj, unicode):
            obj = obj.encode(encoding, 'ignore')
    return obj
    
def printable_string(obj):
    """For unicode objects which should be printed to the screen."""
    if isinstance(obj,unicode):
        obj = obj.encode('ascii','ignore')
    return obj

def get_html(url):
    """Url is assumed to be a string in utf-8 encoding."""
    import urllib2
    u = urllib2.urlopen(goodbye_string(url), timeout = 60)
    html = u.read()
    html = welcome_string(html)
    return html

url = "http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
print "Note how Python stores the bytes for the ae symbol in the url:"
print url
print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
try:
    get_html(url) # will fail: Python will break the 'ä' and try to open a bad url.
except:
    print "urllib2.open requires a string - a sequence of bytes!"
    print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/".encode('utf-8')
html = get_html(url)
print printable_string(html)[:100],"..."

Unicode Hell: An Introduction

I'm currently working on a project involving texts in German. I tested my code on small inputs, was satisfied with the results, then left it to work on a large dataset for the night. I woke up in the morning and logged in to find this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 13: ordinal not in range(128)

The traceback gave out the line numbers, so I wrapped it with try-except and ignored the records with these errors. Ran my code again, and got this error in a different line.

I tried horsing around:

"string".encode('utf-8')
"string".decode('utf-8')
"string".encode('ascii', 'ignore')
... and all sorts of other games. I even wrote a function replacing each umlauted character (there's a finite number of these in German) with some sort of English equivalent. That implied data loss (as "ä" is not exactly the same as "ae"), but I didn't care. Until that un-umlauting function started throwing UnicodeEncodeErrors as well!

So I did some digging, and here are my results. I've also listed a few recommended readings; what I'll try to do in this post is give the tl;dr version of their words and supply some code examples. Along the way I'll point out some facts about Python 2.xx that surprised me.

I used to think that strings are ascii-strings, and unicode objects are some utf-8 versions of strings, or something like that. That's nonsense, and that has to go out first.

This is how your application should look (according to Ned Batchelder's "Pragmatic Unicode", which I'll go on quoting again and again in this post):

Read Ned's slides about unicode and save yourself hours of suffering!!!

First steps towards understanding unicode

Python string objects really are sequences of bytes. When you check the length of "string", you get 6 (naturally). But check this out:

>>> "ö"
'\xc3\xb6'
>>> len("ö")
2

Python has to represent "ö" using 2 bytes. That's very nice of it. Unfortunately, that can cause all sorts of errors, even in a line as simple as 'print "ö"' on the Windows cmd!

So let's do what I used to think is the solution, and turn this weird umlauted o to unicode:

>>> unicode("ö")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Look, ma! Decode!!!

What's ascii got to do with that? Isn't unicode utf-8? Apparently not!

Here's the right solution. Note the length of the generated unicode object:

>>> unicode("ö",'utf-8')
u'\xf6'
>>> len(unicode("ö",'utf-8'))
1

Hey, utf-8 can be seen again! Guess that solved it all, right?

Not exactly.

Unicode is a concept, meaning there's a unified numeric code for each symbol. "ö" is represented by the (hex) number f6 (246). Note that's larger than 127 - with [0,127] being the numeric values of ascii symbols. Each symbol has a numeric value. Nice!

Files, on the other hand, are sequences of bytes. To store unicode objects (text in unicode; "strings" as we used to call them) we need some way to represent them as byte sequences. How do we do that?

There are several such encodings, with utf-8 being the king. Note, though, that when Python 2 implicitly mixes unicode objects with strings (u'a' + 'b', u'c'.join(['def']), etc.) it falls back on the default ascii codec - which is exactly why such mixing blows up on non-ascii data. Let's turn our unicode object back into a string (a sequence of bytes) and see that we get the same bytes we originally had:

>>> str(unicode("ö",'utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0: ordinal not in range(128)
>>> unicode("ö",'utf-8').encode('utf-8')
'\xc3\xb6'

Look, ma! Encode!

To make a long story short: when dealing with strings which might not only contain ascii characters, you'd better use unicode objects instead of strings. And when you deal with data from the internet - that becomes not a suggestion but a requirement.

So what's demanded of you? Again:

"Unicode Sandwich - Bytes on the outside, Unicode on the inside."

- Ned Batchelder

What's the outside? Files you read, html you read via urllib2.urlopen(...).read(), user input, etc. And just as well - anything you're intending to write to a file or pickle, and the URLs you're trying to open.
What's the inside? Anything you store. Dictionary keys (and values!), assignments to your objects' properties, etc.

And how do you make the switch between outside and inside and outside again?

To quote again:
You can't decode a unicode, and you can't encode a str. Try doing it the other way around.
Or in my words (written on a sticky note right above my screen until I know it by heart):

Decode - str to unicode
Encode - unicode to str

You de-string a string object into a unicode object, and en-string a unicode object into a sequence of bytes (string). Or whatever other mnemonic that works for you.
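The mnemonic above can be checked in a minimal round trip (my own sketch, written with bytes literals so it runs on both Python 2 and 3 - in Python 2 a bytes literal is just a str):

```python
# -*- coding: utf-8 -*-
# Decode: bytes (str in Python 2) -> unicode; Encode: unicode -> bytes.
raw = b'\xc3\xb6'            # the utf-8 bytes for the "ö" character
text = raw.decode('utf-8')   # de-string: now a 1-character unicode object
assert len(text) == 1
back = text.encode('utf-8')  # en-string: back to the original 2 bytes
assert back == raw
```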

Solutions

An incoming string s has to be treated as soon as possible, like this:
s.decode('utf-8','ignore')

An outgoing unicode object u has to be treated as late as possible, like this:
u.encode('utf-8','ignore')

Unless you're printing it - the Windows cmd works with the cp-666 (or something) encoding, and when Python tries to encode (to str!) your text for printing, it'll fail with a UnicodeEncodeError - so you should do something like this:
u.encode('ascii','ignore')

Just as well, files you are writing should be told to expect utf-8 byte streams, and so:

import codecs #the Python standard library codecs module
with codecs.open("output_filename", 'w', 'utf-8', 'ignore') as f:
    f.write(unicode_object_decoded_in_utf8)
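Here's a self-contained sketch of that write-then-read round trip (my own example, using a temporary file so it can run anywhere; the file object encodes on write and decodes on read for us):

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output_filename")
text = u"Gesetzes\u00e4nderung"  # a unicode object with a non-ascii character

# codecs.open encodes our unicode object to utf-8 bytes on write...
with codecs.open(path, 'w', 'utf-8', 'ignore') as f:
    f.write(text)

# ...and decodes the utf-8 bytes back into a unicode object on read
with codecs.open(path, 'r', 'utf-8', 'ignore') as f:
    assert f.read() == text
```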

Now I know that sounds like a lot of responsibility, but it really isn't. When I made my code unicode-safe, I thought I'd have tons of modifications to make. I was wrong. I didn't read files in as many places as I'd thought (which is where I placed my str → unicode conversions), and I didn't write them in many places either (which is where I placed my unicode → str conversions, or used the codecs shortcut for opening a file for writing out utf-8). And after I did this, my code stopped crashing due to UnicodeWhateverErrors.

Here's the code I used, based on Kumar McMillan's "to_unicode_or_bust" function in "Unicode in Python, Completely Demystified":

"""Unicode sandwiching methods."""
def welcome_string(obj, encoding='utf-8'):
    """For objects entering the system. Returns utf-8 unicode objects."""
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
    
def goodbye_string(obj, encoding = 'utf-8'):
    """For objects leaving the system. Returns an str instance."""
    if isinstance(obj, basestring):
        if isinstance(obj, unicode):
            obj = obj.encode(encoding, 'ignore')
    return obj
    
def printable_string(obj):
    """For unicode objects which should be printed to the screen."""
    if isinstance(obj,unicode):
        obj = obj.encode('ascii','ignore')
    return obj

Here is a usage example based on importing the functions above:

# -*- coding: utf-8 -*-
from unicode_sandwiching_methods import welcome_string, goodbye_string, printable_string
def get_html(url):
    """Url is assumed to be a string in utf-8 encoding."""
    import urllib2
    u = urllib2.urlopen(goodbye_string(url), timeout = 60)
    html = u.read()
    html = welcome_string(html)
    return html

url = "http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
print "Note how Python stores the bytes for the ae symbol in the url:"
print url
print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
try:
    get_html(url) # will fail: Python will break the 'ä' and try to open a bad url.
except:
    print "urllib2.open requires a string - a sequence of bytes!"
    print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/".encode('utf-8')
html = get_html(url)
print printable_string(html)[:100],"..."

Note that if you're going to construct some toy code examples including unicode texts, include the "magic comment" stating your source code file's encoding at the top of your script (after the #! line if you have one):
# -*- coding: utf-8 -*-

Further reading:

Ned Batchelder: Pragmatic Unicode (2012)
http://nedbatchelder.com/text/unipain.html

(The "Unicode Sandwich" slide:
http://nedbatchelder.com/text/unipain/unipain.html#35 )

Kumar McMillan, Unicode in Python, Completely Demystified (2008)
http://farmdev.com/talks/unicode/

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know
About Unicode and Character Sets (No Excuses!) (2003!)
http://www.joelonsoftware.com/articles/Unicode.html

encool - quickly generate crazy unicode strings for your tests:
http://fsymbols.com/generators/encool/
Ian Albert - a dude who printed out all the unicode symbols of his time, and, well, that's about that.

Notes:

German texts usually contain umlauts - which can be transliterated into English using the following dictionary (kinda like Russian and volapuk):

german_letters = {"Ä" : "AE",
"ä" : "ae",
"æ" : "ae",
"Ö" : "O",
"ö" : "o",
"Ü" : "U",
"ü" : "u",
"ß" : "ss"}
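To actually apply the table, note that the keys must be unicode objects - a byte-string key like "ä" is two utf-8 bytes and will never match a single character of a unicode text. A sketch that runs on both Python 2 and 3 (unumlaut is my own helper name):

```python
# -*- coding: utf-8 -*-
# note the u-prefixes: each key must be a single unicode character
german_letters = {u"Ä": u"AE", u"ä": u"ae", u"æ": u"ae",
                  u"Ö": u"O",  u"ö": u"o",
                  u"Ü": u"U",  u"ü": u"u",
                  u"ß": u"ss"}


def unumlaut(text):
    """Transliterate German special characters in a unicode object."""
    return u"".join(german_letters.get(ch, ch) for ch in text)


print(unumlaut(u"Gesetzes\u00e4nderung"))  # Gesetzesaenderung
```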

Note 2: People get confused due to the C char / string stuff that is used to teach C programming - uppercasing/lowercasing etc... More on that later.

Tuesday, January 14, 2014

Raising the Number of Incoming Links

Shameless self-promotion: this is a (Hebrew) article about the new Hadoop system we built in BGU.

http://in.bgu.ac.il/Pages/news/Hadoop.aspx

Couldn't do it without reading posts by Michael G. Noll. Start with his posts about installing Hadoop if you ever want to get it done right.
http://www.michael-noll.com/

Yay!