Python 3.3 and Windows API: Understanding Code Point vs. UTF-16 Code Units

This tutorial looks at Python 3.3 and the Windows API, and specifically at the difference between Unicode code points and UTF-16 code units.

To start, let’s clear up some confusion. When you receive binary data (bytes) from a third-party source, whether from a file or over a network, the best practice is to check that the data specifies an encoding. If it doesn’t, then it’s on you to ask.

All I/O happens in bytes, not text, and bytes are just ones and zeros to a computer until you tell it otherwise by informing it of an encoding. Here’s an example of where things can go wrong: You’re subscribed to an API that sends you a recipe of the day, which you receive in bytes and have always decoded using .decode("utf-8") with no problem. On this particular day, part of the recipe contains a byte that is not valid UTF-8, and decoding fails.

It looks as if the recipe calls for some flour, but we don’t know how much. Uh oh. There’s that UnicodeDecodeError that can bite you when you make assumptions about encoding. You check with the API host. Lo and behold, the data is actually sent over encoded in Latin-1, and decoding it with that encoding recovers the missing quantity.

There we go.
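A minimal sketch of the scenario above. The payload is hypothetical, since the actual bytes sent by the recipe API are not shown here; any byte sequence that is valid Latin-1 but invalid UTF-8 behaves the same way.

```python
# Hypothetical payload: 0xBC is the fraction "one quarter" in Latin-1,
# but it is not a valid start byte in UTF-8.
data = b"\xbc cup of flour"

try:
    data.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    # The exception records which encoding was tried and where it failed.
    utf8_failed = True
    print("Could not decode as", exc.encoding, "at byte offset", exc.start)

# Decoding with the encoding the sender actually used works.
text = data.decode("latin-1")
print(text)
```

Note that Latin-1 assigns a character to every possible byte value, so decoding as Latin-1 never raises. That makes it a poor way to guess an unknown encoding: wrong guesses produce garbage silently instead of an error.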

Before getting to code points vs. UTF-16 code units, it helps to know a few Python built-in functions that relate to numbering systems and character encoding. They can be grouped by purpose:

– ord() takes a one-character string and returns its Unicode code point as an integer (for ASCII characters, this is the same as the ASCII value)
– chr() is the inverse: it returns the one-character string whose Unicode code point is the given integer
– hex() and oct() return strings containing the hexadecimal or octal representation of an integer, prefixed with 0x or 0o respectively. They convert an integer to its textual representation in a different numbering system; int() with a base argument converts back.
– bin() returns a string containing the binary representation of an integer, prefixed with 0b and without leading zeros (use format() if you need zero padding)
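The built-ins listed above can be tried out with a short sketch like the following. The variable names are only illustrative.

```python
# "\u00e9" is the letter e with an acute accent.
code_point = ord("\u00e9")
print(code_point)                 # 233

# chr() goes back from the integer to the one-character string.
char = chr(code_point)
print(char == "\u00e9")           # True

# Textual representations of the same integer in other bases.
as_hex = hex(code_point)          # '0xe9'
as_oct = oct(code_point)          # '0o351'
as_bin = bin(code_point)          # '0b11101001'
print(as_hex, as_oct, as_bin)

# bin() does not pad; format() does when given a width.
padded = format(code_point, "016b")
print(padded)                     # 0000000011101001

# int() with an explicit base converts the text back to an integer.
print(int(as_hex, 16) == code_point)  # True
```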

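Now, code points vs. UTF-16 code units. A code point is the abstract number that Unicode assigns to a character, written like U+00E9; it is independent of any encoding. An encoding such as UTF-8 or UTF-16 turns code points into code units: 8-bit units for UTF-8, 16-bit units for UTF-16. In UTF-16, code points up to U+FFFF take one code unit, and code points above that take two (a surrogate pair). The Windows API uses UTF-16 for its wide-character strings, and before Python 3.3, Windows builds of Python stored str as UTF-16 code units too, so len() and indexing counted code units. Since Python 3.3 (PEP 393), a str is always a sequence of code points, stored internally with 1, 2, or 4 bytes per code point depending on the widest character in the string.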

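Here’s an example: The letter ‘é’ has the code point U+00E9 (233 in decimal), regardless of encoding. Encoded as Latin-1 it is the single byte 0xE9; encoded as UTF-8 it is the two bytes 0xC3 0xA9; encoded as UTF-16 it is one 16-bit code unit, 0x00E9. When you write the string “é” in Python 3.3, the result is a str of length 1 holding that code point; the bytes 0xC3 and 0xA9 only appear when you encode it to UTF-8. A character outside the Basic Multilingual Plane, such as U+1F600, is still one code point in Python 3.3 but needs two UTF-16 code units (0xD83D 0xDE00).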
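To make the difference between a code point and its encoded forms concrete, here is a small sketch for Python 3.3 or later. The helper name hexbytes is made up for this example, and U+1F600 was picked only as a convenient character outside the Basic Multilingual Plane.

```python
import binascii

def hexbytes(raw):
    """Illustrative helper: show bytes as a plain hex string."""
    return binascii.hexlify(raw).decode("ascii")

e_acute = "\u00e9"       # one code point, U+00E9
astral = "\U0001F600"    # one code point, U+1F600 (outside the BMP)

# The code point is the same number whatever encoding is used later.
print(hex(ord(e_acute)))                      # 0xe9

# The encoded bytes differ from encoding to encoding.
print(hexbytes(e_acute.encode("latin-1")))    # e9
print(hexbytes(e_acute.encode("utf-8")))      # c3a9
print(hexbytes(e_acute.encode("utf-16-be")))  # 00e9

# On Python 3.3+, len() counts code points, so both strings have length 1.
print(len(e_acute), len(astral))              # 1 1

# In UTF-16 the astral character becomes a surrogate pair: two code units.
print(hexbytes(astral.encode("utf-16-be")))   # d83dde00
```

The "utf-16-be" codec is used here only so that the hex output reads in the same order as the code unit values. Windows itself uses little-endian UTF-16.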

So why do we care about code points vs. UTF-16 code units? The distinction shows up as soon as strings cross a boundary between Python and something else. For example:

– If you want to iterate over a string character by character (as opposed to byte by byte), iterate over the str itself: in Python 3.3 and later, each step yields one code point, never half of a surrogate pair. Iterating over bytes, or over raw UTF-16 code units, can split a character, because a single code point may occupy several bytes in UTF-8 or two code units in UTF-16.
– When working with external data sources that provide binary data in a specific encoding (such as Latin-1 or UTF-8), decode those bytes into a str (that is, into code points) before processing them further, using the encoding the source actually uses. Going the other way, some encodings cannot represent every Unicode character: Latin-1 covers only U+0000 to U+00FF, so encoding anything beyond that raises UnicodeEncodeError unless you choose an error handler.
– When working with external APIs or libraries that take Unicode strings as input parameters, it helps to know which representation they expect. For example: the wide-character functions of the Windows API expect UTF-16 code units, so if your data arrives as UTF-8 bytes, you need to decode it to a str and then re-encode it as UTF-16 (or let your binding layer do that for you) before passing it along. Lengths and offsets used by such APIs typically count code units, not code points.
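The points above can be sketched in a few lines. This is an illustration under stated assumptions, not a recipe for any particular library: the function name utf16_code_units and the sample string are made up, and the UTF-8 payload stands in for whatever bytes an external source might hand you.

```python
import struct

def utf16_code_units(text):
    """Illustrative helper: return the UTF-16 code units of text as integers."""
    raw = text.encode("utf-16-le")
    # Every UTF-16 code unit is 2 bytes; unpack them as unsigned 16-bit ints.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# Sample text: "café", a space, and one character outside the BMP.
text = "caf\u00e9 \U0001F600"

# Iterating over a str yields code points (Python 3.3+).
code_points = [ord(ch) for ch in text]
units = utf16_code_units(text)
print(len(code_points))   # 6
print(len(units))         # 7, because the last character is a surrogate pair

# Hypothetical input from an external source, encoded as UTF-8.
utf8_payload = text.encode("utf-8")

# Convert for a consumer that expects UTF-16: decode first, then re-encode.
utf16_payload = utf8_payload.decode("utf-8").encode("utf-16-le")
print(len(utf8_payload), len(utf16_payload))   # 10 14

# An encoding that cannot represent a character raises on encode.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)          # False
```

The little-endian codec "utf-16-le" is used because it writes no byte order mark, which is usually what a consumer of raw UTF-16 buffers wants; the plain "utf-16" codec prepends one.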

That’s Python 3.3 and the Windows API: understanding code points vs. UTF-16 code units. It’s not exactly rocket science (or maybe it is), but hopefully this tutorial has helped clarify some of the more confusing aspects of working with Unicode strings in Python.

SICORPS