Python pickles

The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure.

According to the pickle documentation:

The pickle serialization format is guaranteed to be backwards compatible across Python releases.

Pickles created by a specific Python version can be unpickled by all later Python versions and can also be unpickled by all earlier Python versions that implement the pickle protocol specified during the pickle.dump operation. In particular, a pickle using protocol 2 should be able to be unpickled by Python 2.3 and all later versions.

In general, this portability of pickles across Python versions works like a charm. Unfortunately, there is one big exception when trying to unpickle a Python 2 pickle in Python 3:

if the Python 2 pickle contains any non-ASCII (8-bit) str instance, Python 3 will raise an exception like:

UnicodeDecodeError:
  'ascii' codec can't decode byte 0xdf in position 1:
  ordinal not in range(128)

Even worse, this is triggered by types like datetime.time, where one would not expect 8-bit strings to be involved at all. This is a long-standing problem that is easy to trigger and hard to diagnose, yet for some reason it has never been fixed.

The UnicodeDecodeError prevents unpickling because Python 3, by default, insists that all pickled 8-bit strings be ASCII.
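The failure can be reproduced in Python 3 without a Python 2 interpreter. The following is a minimal sketch, assuming that a hand-assembled protocol 2 pickle (the PY2_PICKLE constant is made up for illustration) is an acceptable stand-in for real Python 2 output. It also shows the documented encoding parameter of pickle.loads, which changes how these 8-bit strings are decoded.

```python
import pickle

# Hand-assembled protocol 2 pickle (an assumption standing in for Python 2
# output): PROTO 2, SHORT_BINSTRING holding the bytes 0x07 0xdf, STOP.
PY2_PICKLE = b"\x80\x02U\x02\x07\xdf."

# With the default encoding ("ASCII") the non-ASCII byte cannot be decoded.
try:
    pickle.loads(PY2_PICKLE)
except UnicodeDecodeError as exc:
    print("unpickling failed:", exc)

# encoding="bytes" keeps every 8-bit string as a bytes object.
as_bytes = pickle.loads(PY2_PICKLE, encoding="bytes")

# encoding="latin1" maps each byte to the code point with the same value.
as_latin1 = pickle.loads(PY2_PICKLE, encoding="latin1")

print(repr(as_bytes), repr(as_latin1))
```

Neither setting distinguishes text from binary data: the chosen encoding applies to every 8-bit string in the pickle.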

I think this is utterly wrong. Let me explain:

  1. In Python 2, 8-bit strings are used for two incompatible purposes:

    • Some 8-bit strings are used for text. In some instances, Python 2 enforces 8-bit strings, e.g., for names of classes and names of attributes.

      All such uses should be restricted to ASCII.

    • Other 8-bit strings are binary, e.g., file names, SHA digests, or datetime pickles.

    A well-designed Python 2 application will use unicode strings for textual data wherever possible, i.e., in all places where Python itself doesn't insist on 8-bit strings. In other words,

    any 8-bit strings explicitly created by such an application will contain binary data.

  2. In Python 2, pickle will use BINSTRING (and SHORT_BINSTRING) to dump 8-bit strings.

    pickle does not know if any particular 8-bit string contains text or binary data.

  3. In Python 3, pickle does not use BINSTRING and SHORT_BINSTRING to dump any data type.

    Instead, it uses BINBYTES and SHORT_BINBYTES for bytes instances.

  4. Both Python 2 and Python 3 use BINUNICODE to dump unicode strings.

  5. For BINSTRING, Python 3 doesn't know the semantics of the pickled value. There isn't any single type that is the right choice for all cases.

    What Python 3 should do, according to the reasoning in 1):

    • convert ASCII values to (unicode) str
    • convert non-ASCII values to bytes.

    If the BINSTRING value was pickled by a well-designed Python 2 application, that follows exactly the intended semantics.

    If the Python 2 application was not well designed, i.e., it used 8-bit strings for non-ASCII text, bytes is still the right choice for BINSTRING values. Hopefully, the application knows how to deal with the resulting bytes.

    In fact, I would argue that in the latter case the use of str for ASCII values is problematic!

    On the other hand, ASCII values must be converted to str to avoid breaking names that Python 2 forced to be 8-bit strings.
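The opcodes named in points 2 to 4 and the conversion rule proposed in point 5 can be tried out in Python 3. This is only a sketch under loud assumptions: Py2StrUnpickler, loads_py2 and opcode_names are hypothetical names, the pickle is hand-assembled rather than produced by Python 2, and the hook relies on pickle._Unpickler and its _decode_string method, which are private details of CPython's pure-Python unpickler and may change.

```python
import io
import pickle
import pickletools


def opcode_names(data):
    """Return the names of the opcodes that make up a pickle."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)]


class Py2StrUnpickler(pickle._Unpickler):
    """Hypothetical unpickler: ASCII 8-bit strings become str, others bytes.

    Assumption: pickle._Unpickler routes STRING, BINSTRING and
    SHORT_BINSTRING values through the private _decode_string method.
    """

    def _decode_string(self, value):
        try:
            return value.decode("ascii")
        except UnicodeDecodeError:
            # Non-ASCII content is treated as binary data and left untouched.
            return value


def loads_py2(data):
    """Unpickle data using the ASCII-to-str, non-ASCII-to-bytes rule."""
    return Py2StrUnpickler(io.BytesIO(data)).load()


# Hand-assembled protocol 2 pickle of a list with two 8-bit strings,
# 'name' (ASCII) and '\x07\xdf' (binary), both stored as SHORT_BINSTRING.
PY2_PICKLE = b"\x80\x02](U\x04nameU\x02\x07\xdfe."

print(opcode_names(PY2_PICKLE))
# Python 3 writes bytes and str with different opcodes (protocol 3 shown).
print(opcode_names(pickle.dumps(b"\xdf", protocol=3)))
print(opcode_names(pickle.dumps("name", protocol=3)))

print(loads_py2(PY2_PICKLE))
```

With this rule, the ASCII value comes back as str, so names keep working, while the non-ASCII value comes back as bytes instead of raising UnicodeDecodeError.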

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.