Friday, February 14, 2014

To Unicode and Back in Python 2.xx

(Note: Python syntax highlighting for this post was done using lusever's online syntax highlighter, which I settled on after testing many other online highlighters and concluding that it's the best!)

tl;dr:

Read these:
http://nedbatchelder.com/text/unipain.html
http://farmdev.com/talks/unicode/

Always turn incoming (read from file, user or web) strings into unicode objects, and outgoing unicode objects (written to file, printed to screen, used for opening a url) back into strings:
# -*- coding: utf-8 -*-
"""Unicode sandwiching methods."""
def welcome_string(obj, encoding='utf-8'):
    """For objects entering the system. Returns utf-8 unicode objects."""
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
    
def goodbye_string(obj, encoding = 'utf-8'):
    """For objects leaving the system. Returns an str instance."""
    if isinstance(obj, basestring):
        if isinstance(obj, unicode):
            obj = obj.encode(encoding, 'ignore')
    return obj
    
def printable_string(obj):
    """For unicode objects which should be printed to the screen."""
    if isinstance(obj,unicode):
        obj = obj.encode('ascii','ignore')
    return obj

def get_html(url):
    """Url is assumed to be a string in utf-8 encoding."""
    import urllib2
    u = urllib2.urlopen(goodbye_string(url), timeout = 60)
    html = u.read()
    html = welcome_string(html)
    return html

url = "http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
print "Note how Python stores the bytes for the ae symbol in the url:"
print url
print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
try:
    get_html(url)  # will fail: urllib2 chokes on the non-ascii 'ä' in a unicode url
except Exception:
    print "urllib2.urlopen requires a string - a sequence of bytes!"
    print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/".encode('utf-8')
html = get_html(url)
print printable_string(html)[:100],"..."

Unicode Hell: An Introduction

I'm currently working on a project involving texts in German. I tested my code on small inputs, was satisfied with the results, then left it to work on a large dataset for the night. I woke up in the morning and logged in to find this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 13: ordinal not in range(128)

The traceback gave the line numbers, so I wrapped the offending code with try-except and ignored the records with these errors. I ran my code again, and got the same error on a different line.

I tried horsing around:

"string".encode('utf-8')
"string".decode('utf-8')
"string".encode('ascii', 'ignore')
... and all sorts of other games. I even wrote a function replacing each umlauted character (there's a finite number of these in German) with some sort of English equivalent. That implied data loss (as "ä" is not exactly the same as "ae"), but I didn't care. Until that un-umlauting function started throwing UnicodeEncodeErrors as well!
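
For the curious, here's a hypothetical sketch of that un-umlauting approach (not my exact function; the replacement table is like the one in the Notes at the end). The moment you feed such a function a byte string with non-ascii bytes, Python 2's implicit ascii conversion raises exactly the kind of errors I was trying to escape:

# -*- coding: utf-8 -*-
def un_umlaut(text):
    """Replace German umlauted characters with rough ascii equivalents (lossy!)."""
    # text must already be a unicode object; hand it a non-ascii str and Python 2's
    # implicit ascii conversion kicks in and raises a Unicode*Error.
    replacements = {u"ä": u"ae", u"ö": u"oe", u"ü": u"ue", u"ß": u"ss"}
    for umlauted, plain in replacements.items():
        text = text.replace(umlauted, plain)
    return text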

So I did some digging and here are my results. I've also listed a few recommended readings, and what I'll try to do in this post is to give the tl;dr versions of their words and also supply some code examples. On the way I'll point out some surprising facts (at least to me) I've found about Python 2.xx.

I used to think that strings are ascii-strings, and unicode objects are some utf-8 versions of strings, or something like that. That's nonsense, and that has to go out first.

This is how your application should look (according to Ned Batchelder's "Pragmatic Unicode", which I'll go on quoting again and again in this post): bytes on the outside, unicode objects on the inside, with the decoding and encoding happening right at the edges.

Read Ned's slides about unicode and save yourself hours of suffering!!!

First steps towards understanding unicode

Python string objects really are sequences of bytes. When you check the length of "string", you get 6 (naturally). But check this out:

>>> "ö"
'\xc3\xb6'
>>> len("ö")
2

Python has to represent "ö" using 2 bytes. That's very nice of it. Unfortunately, that can cause all sorts of errors, even in a line as simple as 'print "ö"' on the Windows cmd!

So let's do what I used to think is the solution, and turn this weird umlauted o to unicode:

>>> unicode("ö")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Look, ma! Decode!!!

What's ascii got to do with that? Isn't unicode utf-8? Apparently not!

Here's the right solution. Note the length of the generated unicode object:

>>> unicode("ö",'utf-8')
u'\xf6'
>>> len(unicode("ö",'utf-8'))
1

Hey, utf-8 can be seen again! Guess that solved it all, right?

Not exactly.

Unicode is a concept, meaning there's a unified numeric code for each symbol. "ö" is represented by the (hex) number f6 (246). Note that's larger than 127 - with [0,127] being the numeric values of ascii symbols. Each symbol has a numeric value. Nice!
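
A quick way to see the code point for yourself (ord and unichr are Python 2 built-ins):

>>> ord(u'\xf6')
246
>>> hex(ord(u'\xf6'))
'0xf6'
>>> unichr(0xf6)
u'\xf6'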

Files, on the other hand, are sequences of bytes. To store unicode objects (text in unicode; "strings" as we used to call them) we need some way to represent these as byte sequences. How do we do that?

There are several encodings, with utf-8 being the king. Note, though, that Python 2 does not use utf-8 on its own: whenever it has to make unicode objects play with strings implicitly (u'a' + 'b', u'c'.join(['def']), etc.) it falls back on the ascii codec - which is exactly where all those 'ascii' errors come from. Let's turn our unicode object back into a string (a sequence of bytes) and check that we get the same bytes we originally had:

>>> str(unicode("ö",'utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0: ordinal not in range(128)
>>> unicode("ö",'utf-8').encode('utf-8')
'\xc3\xb6'

Look, ma! Encode!
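
By the way, utf-8 is just one of those several possible byte representations; other encodings turn the very same code point into completely different byte sequences (shown here just for contrast):

>>> u'\xf6'.encode('utf-8')
'\xc3\xb6'
>>> u'\xf6'.encode('latin-1')
'\xf6'
>>> u'\xf6'.encode('utf-16')
'\xff\xfe\xf6\x00'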

To make a long story short: when dealing with text that might contain more than just ascii characters, you'd better use unicode objects instead of strings. And when you deal with data from the internet - that becomes not a suggestion but a requirement.

So what's demanded of you? Again:

"Unicode Sandwich - Bytes on the outside, Unicode on the inside."

- Ned Batchelder

What's the outside? Files you read, html you read via urllib2.urlopen(...).read(), user input, etc. And just as well - anything you're intending to write to a file or pickle, and the URLs you're trying to open.
What's the inside? Anything you store. Dictionary keys (and values!), assignments to your objects' properties, etc.
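
Why does the inside matter so much? Here's one quick illustration with dictionary keys - in Python 2, a non-ascii byte string and its unicode counterpart are two different keys:

>>> d = {'\xc3\xb6': 1}   # "ö" stored as a byte string key
>>> u'\xf6' in d          # the very same "ö", as a unicode object - not found!
False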

And how do you make the switch between outside and inside and outside again?

To quote again:
You can't decode a unicode, and you can't encode a str. Try doing it the other way around.
Or in my words (which are written on a sticky note placed right above my screen until I know it by heart):

Decode - str to unicode
Encode - unicode to str

You de-string a string object into a unicode object, and en-string a unicode object into a sequence of bytes (string). Or whatever other mnemonic that works for you.
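
Side by side, with our umlauted friend from before:

>>> '\xc3\xb6'.decode('utf-8')   # str -> unicode
u'\xf6'
>>> u'\xf6'.encode('utf-8')      # unicode -> str
'\xc3\xb6'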

Solutions

An incoming string s has to be treated as soon as possible, like this:
s.decode('utf-8','ignore')

An outgoing unicode object u has to be treated as late as possible, like this:
u.encode('utf-8','ignore')

Unless you're printing it - the Windows cmd works with some cp-something encoding (cp437, cp850, or whatever your locale dictates), and when Python tries to encode (to str!) your text for printing it'll fail with a UnicodeEncodeError - so you should do something like this:
u.encode('ascii','ignore')
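
If you're curious what encoding your console actually uses, Python will tell you (the value below is just an example - it depends on your system; cp437 and cp850 are common on Windows consoles):

>>> import sys
>>> sys.stdout.encoding
'cp850'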

Just as well, files you are writing should be told to expect utf-8 byte streams, and so:

import codecs  # the codecs module from the Python standard library
with codecs.open("output_filename", 'w', 'utf-8', 'ignore') as f:
    f.write(unicode_object_decoded_in_utf8)
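
The same codecs module handles the incoming side of the sandwich too - reading a utf-8 file straight into unicode objects (a sketch, assuming the file really is utf-8 encoded):

import codecs
with codecs.open("input_filename", 'r', 'utf-8', 'ignore') as f:
    unicode_text = f.read()  # already a unicode object - no .decode() needed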

Now I know that sounds like a lot of responsibility, but it really isn't. When I made my code unicode-safe, I thought I'd have tons of modifications to do. I was wrong. I didn't read files from as many places as I'd thought (which is where I placed my str → unicode conversions), and I didn't write them in so many places either (which is where I placed my unicode → str conversions, or used the codecs shortcut for opening a file for writing out utf-8). And after I did this, my code stopped crashing due to UnicodeWhateverErrors.

Here's the code I used, based on Kumar McMillan's "to_unicode_or_bust" function in "Unicode in Python, Completely Demystified":

"""Unicode sandwiching methods."""
def welcome_string(obj, encoding='utf-8'):
    """For objects entering the system. Returns utf-8 unicode objects."""
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
    
def goodbye_string(obj, encoding = 'utf-8'):
    """For objects leaving the system. Returns an str instance."""
    if isinstance(obj, basestring):
        if isinstance(obj, unicode):
            obj = obj.encode(encoding, 'ignore')
    return obj
    
def printable_string(obj):
    """For unicode objects which should be printed to the screen."""
    if isinstance(obj,unicode):
        obj = obj.encode('ascii','ignore')
    return obj

Here is a usage example based on importing the functions above:

# -*- coding: utf-8 -*-
from unicode_sandwiching_methods import welcome_string, goodbye_string, printable_string
def get_html(url):
    """Url is assumed to be a string in utf-8 encoding."""
    import urllib2
    u = urllib2.urlopen(goodbye_string(url), timeout = 60)
    html = u.read()
    html = welcome_string(html)
    return html

url = "http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
print "Note how Python stores the bytes for the ae symbol in the url:"
print url
print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/"
try:
    get_html(url)  # will fail: urllib2 chokes on the non-ascii 'ä' in a unicode url
except Exception:
    print "urllib2.urlopen requires a string - a sequence of bytes!"
    print

url = u"http://jltfpw.jimdo.com/alle-informationen-zur-gesetzesänderung-der-petition-sternenkinder/faq/".encode('utf-8')
html = get_html(url)
print printable_string(html)[:100],"..."

Note that if you're going to construct some toy code examples including unicode texts, include the "magic comment" stating your source code file's encoding at the top of your script (after the #! line, if you have one):
# -*- coding: utf-8 -*-

Further reading:

Ned Batchelder: Pragmatic Unicode (2012)
http://nedbatchelder.com/text/unipain.html

(The "Unicode Sandwich" slide:
http://nedbatchelder.com/text/unipain/unipain.html#35 )

Kumar McMillan, Unicode in Python, Completely Demystified (2008)
http://farmdev.com/talks/unicode/

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know
About Unicode and Character Sets (No Excuses!) (2003!)
http://www.joelonsoftware.com/articles/Unicode.html

encool - quickly generate crazy unicode strings for your tests:
http://fsymbols.com/generators/encool/
Ian Albert - a dude who printed out all the unicode symbols of his time, and, well, that's about that.

Notes:

German texts usually contain umläuts - which can be transliterated back to plain English letters using the following dictionary (kinda like Russian and volapuk):

german_letters = {"Ä" : "AE",
"ä" : "ae",
"æ" : "ae",
"Ö" : "O",
"ö" : "o",
"Ü" : "U",
"ü" : "u",
"ß" : "ss"}

Note 2: People get confused due to the C char / string stuff that is used to teach C programming - uppercasing/lowercasing etc... More on that later.