Improve support for UTF 8 - datacratic/rtbkit GitHub Wiki

Support for utf8 in RTBkit is lacking and needs to be improved. This document proposes a solution and aims to open up the discussion.

Current Status

Currently, plain ascii strings are stored in std::string, while utf8 strings are stored in the Utf8String class. So far, the strategy has been to maximize the use of std::string and to use Utf8String only when needed, i.e. as little as possible. The rationale is that using the Utf8String class carries some overhead.

The overhead is mostly in:

constructors & assignment (where the input string is validated)
iterator (where the bidirectional iterator moves to the next or the previous utf8 character)
encoding & decoding
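
To make the first two costs concrete, here is a minimal sketch of the work a validating constructor and a character iterator have to do. It is not the actual Utf8String code: the names isValidUtf8 and countCodePoints are hypothetical and only illustrate that every byte must be inspected, whereas a plain ascii std::string needs no such pass.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not RTBkit's implementation): returns true when `s`
// is well-formed UTF-8. This is the kind of full scan a validating
// constructor or assignment has to pay for.
bool isValidUtf8(const std::string& s)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned int minForLen[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned int cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;               // stray continuation or invalid lead byte

        if (len > n - i) return false;   // sequence truncated at end of input

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (len > 1 && cp < minForLen[len]) return false;  // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate range
        if (cp > 0x10FFFF) return false;                   // beyond Unicode
        i += len;
    }
    return true;
}

// Hypothetical helper: counts characters (code points) in a string that is
// assumed to be valid UTF-8 by skipping continuation bytes. Stepping an
// iterator to the next or previous character does the same kind of test.
std::size_t countCodePoints(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++count;
    return count;
}
```

For example, the string "caf\xC3\xA9" occupies 5 bytes but holds 4 characters, so its byte size and its character count differ, while for a pure ascii string the two are always equal.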

Of course, indexing by character (or by byte) is error-prone on utf8 data and is therefore not available in the class's interface.

As the user base of RTBkit expands, we are seeing more cases where utf8 is needed, and adding Utf8String everywhere degrades performance to some extent. Ideally, we would like to use std::string everywhere and assume it could contain utf8. However, processing utf8 is currently much more costly performance-wise, and in many cases the strings are internal and will never contain utf8.

Deducing the encoding of a std::string from the context is generally bad design, as the type semantics are completely lost.

Requirements

It should prevent utf8 strings from being mishandled by mistake, i.e. it should not deduce the encoding from the context
It should handle utf8 and ascii efficiently

Proposal

The proposed solution is simple but adds a level of abstraction: keep the Utf8String class, and have it detect the encoding and store it as part of its structure, so that each operation can select the optimal implementation for the actual internal encoding.

It should also be used for every field that can possibly contain utf8 e.g. bid request fields and ad server event fields.

By convention, std::string will be used exclusively for ascii.
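
As a rough illustration of the idea, the sketch below tags a string with its encoding once, at construction, and then picks a cheaper code path when the content turns out to be pure ascii. The names EncodedString, Encoding, encoding(), rawBytes() and length() are assumptions made for this example and are not the existing Utf8String interface. Validation of the utf8 case is left out to keep the sketch short.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tag describing what the stored bytes actually contain.
enum class Encoding { Ascii, Utf8 };

// Hypothetical sketch of a string that remembers its own encoding.
// A real class would also validate the bytes when they are not ascii.
class EncodedString {
public:
    explicit EncodedString(std::string bytes)
        : bytes_(std::move(bytes)), encoding_(detect(bytes_)) {}

    Encoding encoding() const { return encoding_; }
    const std::string& rawBytes() const { return bytes_; }

    // Number of characters (code points), chosen by encoding.
    std::size_t length() const
    {
        if (encoding_ == Encoding::Ascii)
            return bytes_.size();          // fast path: one byte per character

        std::size_t count = 0;             // slow path: skip continuation bytes
        for (unsigned char c : bytes_)
            if ((c & 0xC0) != 0x80) ++count;
        return count;
    }

private:
    // One pass at construction: any byte >= 0x80 means the data is not ascii.
    static Encoding detect(const std::string& s)
    {
        for (unsigned char c : s)
            if (c >= 0x80) return Encoding::Utf8;
        return Encoding::Ascii;
    }

    std::string bytes_;    // declared first so it is initialized first
    Encoding encoding_;
};
```

With this layout, a field holding "bid-123" is tagged Ascii and answers length() from the byte size, while a field holding "caf\xC3\xA9" is tagged Utf8 and takes the decoding path. The detection cost is paid once per string rather than on every operation.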

Discussion

Why not use utf8 everywhere and thus have only one concept for std::string? There is little measurable performance cost.

Maybe, but there are still cases where assuming the utf8 encoding will degrade performance, e.g. regular expression filters.

Even if there is a performance cost, we're ready to take it to make the code simpler and easier to use.

A lot of work has been, is being and will be put into RTBkit to make it fast and efficient. It adds up to many little things that do not look very costly individually but end up making a big difference. Since this has a direct impact on the cost of operation, great care should be taken not to take shortcuts.