Unicode - mkleehammer/pyodbc GitHub Wiki

TL;DR

By default, pyodbc uses UTF-16 assuming native byte-order (i.e., UTF-16LE on little-endian machines like x86 and arm) and SQL_C_WCHAR for reading and writing all Unicode as recommended in the ODBC specification. Unfortunately many drivers behave differently so connections may need to be configured. I recommend creating a global connection factory where you can consolidate your connection string and configuration:

def connect():
    cnxn = pyodbc.connect(_connection_string)
    cnxn.setencoding('utf-8')  # (Python 3.x syntax)
    return cnxn

Configuring Specific Databases

Microsoft SQL Server

SQL Server's recent drivers match the specification, so no configuration is necessary. Using the pyodbc defaults is recommended.

However, if you want Python 2.7 str results instead of unicode results you may be able to improve performance by setting the encoding to match the database's collation. In particular, it is common to use a latin1 character set and Python has a built-in latin1 codec.

Check your SQL Server collation using:

select serverproperty('collation')

If it is something like "SQL_Latin1_General_CP1_CI_AS" and you want str results, you may try:

cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1', to=str)
cnxn.setencoding(str, encoding='latin1')

It is not recommended, but you can also set the encoding to "raw" which will pass bytes directly between Python str objects and database SQL_C_CHAR buffers. This should only be used if you are certain you know what you are doing as it may not be clear when it doesn't work. It will only work if the database bytes are in the same format is Python's internal format. This is compatible with pyodbc 3.x.

cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='raw')
cnxn.setencoding(str, encoding='raw')

MySQL

Note: For proper support of "supplementary" Unicode characters (those outside of the BMP, e.g., emoji) with MySQL (i.e., MySQL's "utf8mb4" character set) you must use the "ANSI" — not the "Unicode" — version of the MySQL Connector/ODBC driver.

MySQL Connector/ODBC tends to use a single encoding and does not differentiate between "SQL_CHAR" and "SQL_WCHAR". Therefore you need to configure it to encode Unicode data as UTF-8 and to decode both C buffer types using UTF-8.

# Python 2.7
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8')

# Python 3.x
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setencoding(encoding='utf-8')

You will also need to add charset=utf8mb4 to the connection string if you will be working with supplementary Unicode characters (e.g., emoji):

# MySQL
cstring = 'DSN=mydsn;charset=utf8mb4'  # or perhaps just utf8 for Unicode characters in the BMP
cnxn = pyodbc.connect(cstring)

PostgreSQL

Note: For proper support of Unicode characters use the "Unicode" — not the "ANSI" — version of the driver.

When working with the current version of the "Unicode" ODBC driver just use the default encoding/decoding settings. See issue #1004 for details.

Teradata

First, if you are using Teradata on macOS, you are going to have to build pyodbc yourself using iODBC. Hopefully they will compile their driver against unixODBC sometime.

If you are using UTF-8, most of the lines will be similar to other databases, but unfortunately the metadata on macOS (this may not be necessary on Linux) will return metadata (column names) in UTF-32LE.

# Python 2.7
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8')

# Python 3.x
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')
cnxn.setencoding(encoding='utf-8')

Exasol

When using Exasol you need to set the encoding for all the SQL_CHAR, SQL_WCHAR, SQL_WMETADATA and Python's strings as UTF-8:

# Python 2.7
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8')
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8')

# Python 3.x
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8')
cnxn.setencoding(encoding='utf-8')

Netezza

See Connecting to Netezza

Details

Encodings

The Unicode standard specifies numeric "codes" for each character, such as 65 for 'A' and 229 for 'å' (latin small letter a with ring above). However, it does not specify how these numbers should be represented in a computer's memory. The same characters can be represented in multiple ways which are called encodings. For example, here are the example characters above in some different encodings:

Character Encoding Bytes
A latin1 0x41
A utf-8 0x41
A utf-16-le 0x4100
A utf-16-be 0x0041
å latin1 0xe5
å utf-8 0xc3a5
å utf-16-le 0xe500
å utf-16-be 0x00e5

ASCII characters, such as "A", have values less than 128 and are easy to store in a single byte. Values greater than 127 sometimes are encoded with multiple bytes even if the value itself could fit in a single byte. Notice the UTF-8 encoding of "å" is 0xc3a5 even though its value fits in the single latin1 byte 0xe5.

IMPORTANT: The thing to note here is that when converting text to bytes, some encoding must be chosen. Even a string as simple as "A" has more than one binary format, as the table above makes clear.

ODBC Conversions

The ODBC specification defines two C data types for character data:

  • SQL_CHAR: A single-byte type like the C char data type, though the sign may differ.
  • SQL_WCHAR: A two-byte data type like a 2-byte wchar_t

Originally ODBC specified Unicode data to be encoded using UCS-2 and transferred in SQL_WCHAR buffers. Later this was changed to allow UTF-8 in SQL_CHAR buffers and UTF-16LE in SQL_WCHAR buffers.

By default pyodbc reads all text columns as SQL_C_WCHAR buffers and decodes them using UTF-16LE. When writing, it always encodes using UTF-16LE into SQL_C_WCHAR buffers.

Configuring

If the defaults above do not match your database driver, you can use the Connection.setencoding and Connection.setdecoding functions.

Python 3

cnxn.setencoding(encoding=None, ctype=None)

This sets the encoding used when writing an str object to the database and the C data type. (The data is always written to an array of bytes, but we must tell the database if it should treat the buffer as an array of SQL_CHARs or SQL_WCHARs.)

The encoding must be a valid Python encoding that converts text to bytes. Optimized code is used for "utf-8", "utf-16", "utf-16le", and "utf-16be".

The result is always an array of bytes but we must also tell the ODBC driver if it should treat the buffer as an array of SQL_CHAR or SQL_WCHAR elements. If not provided, pyodbc.SQL_WCHAR is used for the UTF-16 variants and pyodbc.SQL_CHAR is used for everything else.

cnxn.setdecoding(sqltype, encoding=None, ctype=None)

This sets the encoding used when reading a buffer from the database and converting it to an str object.

sqltype controls which SQL type being configured: pyodbc.SQL_CHAR or pyodbc.SQL_WCHAR. Use SQL_CHAR to configure the encoding when reading a SQL_CHAR buffer, etc.

There is also a special constant, pyodbc.SQL_WMETADATA, for configuring how column names are read. You would expect drivers to return them the same way they return data, but that's rarely the case.

When a buffer is being read, the driver will tell pyodbc whether it is a SQL_CHAR or SQL_WCHAR buffer. However, pyodbc can request the data be converted to either of these formats. Most of the time this parameter should not be supplied and the same type will be used, but this can be useful in the case where the driver reports SQL_WCHAR data but the data is actually in UTF-8.

Python 2

The Python 2 version of setencoding starts with a type parameter so that str and unicode can be configured independantly.

# cnxn.setencoding(type, encoding=None, ctype=None)

cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8', ctype=pyodbc.SQL_CHAR)

Reference

How Python Does Unicode (blog post)