Friday, January 1, 2010

Python: cStringIO vs. StringIO

If a function expects a file handle as input, but the data you want to process is held in a string variable, you could write the string to a temporary file on disk, open that file, and pass its handle to the function.

A nicer way to do this is the StringIO class in Python's StringIO module. It takes the string as an argument and returns a file-like object that the function can read() from. Everything happens in memory, so there is no tedious creation of temporary files.
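
As a minimal sketch of the idea: count_lines below is a hypothetical function that only accepts a file handle, and the fallback import of io.StringIO is an assumption added so the snippet also runs on Python 3, where the StringIO module no longer exists.

```python
try:
    from StringIO import StringIO  # Python 2: the module discussed here
except ImportError:
    from io import StringIO  # Python 3: the class lives in io (assumption for portability)


def count_lines(handle):
    """Hypothetical consumer that only knows how to read from a file handle."""
    return sum(1 for _ in handle)


data = "first\nsecond\nthird\n"

# Wrap the string in an in-memory file object; no temporary file is needed.
handle = StringIO(data)
line_count = count_lines(handle)
print(line_count)
```

The function never notices that it is reading from memory rather than from disk.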

Depending on how often you need this functionality and how large the data is, StringIO.StringIO, which is written purely in Python, may be too slow.

The cStringIO module offers a function of the same name with the same functionality, backed by a faster C implementation. It is meant to be a drop-in replacement for the Python version.
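
Since both share a name and an interface, one way to prefer the faster version when it is available is an import fallback. This is only a sketch of that pattern, not something taken from the library documentation; the last fallback to io.StringIO is an assumption added so the snippet also runs on Python 3, where neither module exists.

```python
try:
    from cStringIO import StringIO  # fast C implementation (Python 2 only)
except ImportError:
    try:
        from StringIO import StringIO  # pure-Python version (Python 2 only)
    except ImportError:
        from io import StringIO  # Python 3 stand-in (assumption for portability)

# The calling code is identical whichever import succeeded.
buf = StringIO("payload")
content = buf.read()
print(content)
```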

However, now that more and more strings are stored as Unicode, this function has an unexpected side effect. It is described in the module documentation, if you look closely enough.

When handed a Unicode string, cStringIO.StringIO does not return the original text, but the "representation of the Unicode string", i.e. the raw bytes of the interpreter's internal buffer, which may differ from machine to machine:

>>> import StringIO
>>> a = StringIO.StringIO(u"test")
>>> a.read()
u'test'

>>> import cStringIO
>>> a = cStringIO.StringIO(u"test")
>>> a.read()
't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'
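
The bytes in that output have the same layout as the UTF-32 little-endian encoding of u"test": four bytes per character, lowest byte first. That suggests a wide Python build on a little-endian machine, although this reading is an inference and not something the library description spells out. The sketch below uses only explicit str.encode calls to show how the layout would change with the character width:

```python
text = u"test"

# Explicit encodings, chosen by us instead of by the interpreter build.
utf32_le = text.encode("utf-32-le")  # 4 bytes per character, little-endian
utf16_le = text.encode("utf-16-le")  # 2 bytes per character, little-endian

print(repr(utf32_le))
print(repr(utf16_le))
```

Code that reads the raw buffer therefore cannot rely on one fixed byte layout across machines.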


A patch to fix this was proposed but rejected for reasons of backwards compatibility.
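
One way to sidestep the problem is to encode the Unicode string yourself before wrapping it, so that the bytes in the buffer are well defined. The helper to_byte_handle below is hypothetical, and io.BytesIO is used as a stand-in that exists on both Python 2.6+ and Python 3; on Python 2 the same encoded bytes could be passed to cStringIO.StringIO instead.

```python
import io


def to_byte_handle(text, encoding="utf-8"):
    """Hypothetical helper: encode Unicode text explicitly, then wrap the bytes."""
    return io.BytesIO(text.encode(encoding))


handle = to_byte_handle(u"t\u00e4st")

# The reader gets bytes in a known encoding and can decode them reliably.
raw = handle.read()
decoded = raw.decode("utf-8")
print(repr(raw))
```

The cost is that the consumer has to know which encoding was used, but that is at least the same on every machine.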

So be aware...