6.12 Working with UTF8, UTF16 and UTF32 characters in C++, and by the way with emojis as well...

A code to identify them all

French version

When you analyze a text that contains all sorts of strange and bizarre characters, you might expect each character to correspond to a single Unicode code... At least, this is the hope that lay users place in this encoding.

It's a beautiful dream that quickly comes crashing down on the painful wall of reality...

Especially, if you work in C++... Because there, the dream quickly turns into a nightmare...

std::wstring is not portable

If we read the documentation, we quickly discover that there is a std::wstring which seems to be very suitable for handling Unicode characters.

A wstring is implemented as a sequence of wchar_t.

Already at this stage of your exploration of Unicode, you run into a disturbing fact: while on most machines a wchar_t is encoded on 32 bits, on Windows it is 16 bits...

  • Windows uses UTF-16
  • The rest of the world uses UTF-32

UTF-16 on Windows

For the majority of characters on Windows, you will have the equivalence:

1 wchar_t is worth 1 Unicode character

For other characters, you will have to parse two wchar_t to get the right Unicode code...

And most emojis fall into this category. It usually takes two wchar_t to encode them...

So

  • 👨 in UTF-32 is: 128104
  • 👨 in UTF-16 is: 55357 56424

The value 128104 is re-encoded as two wchar_t: 55357 56424.

This has an impact on your program: you will have to take the OS into account to handle your strings.
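
To make the relation between 128104 and the pair 55357 56424 concrete, here is a minimal sketch that applies the standard surrogate-pair formula; the function name utf32_to_utf16 is purely illustrative, this is not the LispE code:

#include <iostream>

//Standard surrogate-pair encoding for code points above 0xFFFF
//(illustrative sketch, not the LispE implementation)
void utf32_to_utf16(char32_t code, char16_t& high, char16_t& low) {
    code -= 0x10000;                //128104 - 65536 = 62568
    high = 0xD800 | (code >> 10);   //55296 + 61 = 55357
    low = 0xDC00 | (code & 0x3FF);  //56320 + 104 = 56424
}

int main() {
    char16_t high, low;
    utf32_to_utf16(128104, high, low);
    std::cout << (int)high << " " << (int)low << std::endl; //displays: 55357 56424
}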

UTF-8

So why not use UTF-8? In this case, strings are simple sequences of 8-bit bytes...

It is UNIVERSAL...

That's right... That's right... Except that... There is always an except that.

The handling of such a string is not always optimal. Let's take the following example:

l'été arrive à Cannes, sur la Côte d'Azur. (Summer is coming to Cannes, on the French Riviera)

How big is this string?

Let's imagine that it is stored in the following form:

std::string s = "l'été arrive à Cannes, sur la Côte d'Azur.";

If we call s.size(), we get 46...

But, you will tell me, there are only 42 characters in this string! I counted three times...

The underlying bytes

If you look at what s actually contains, you get the following:

[108,39,195,169,116,195,169,32,97,114,114,105,118,101,32,195,160,32,67,97,110,110,101,115,44,32,115,117,114,32,108,97,32,67,195,180,116,101,32,100,39,65,122,117,114,46]

More exactly, for each accented character, you get:

  • é : 195,169
  • à : 195,160
  • ô : 195,180

And for those who think that 195 encodes the accent itself: it has nothing to do with that...
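
For the record, this dump can be reproduced with a few lines (a minimal sketch, assuming the source file itself is saved in UTF-8):

#include <string>
#include <iostream>

int main() {
    std::string s = "l'été arrive à Cannes, sur la Côte d'Azur.";
    //each underlying byte is displayed as an unsigned 8-bit value
    for (size_t i = 0; i < s.size(); i++)
        std::cout << (int)(unsigned char)s[i] << ",";
    std::cout << std::endl;
}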

  • All the characters whose code is lower than 128 are ASCII characters encoded on one byte.
  • For the higher Unicode characters, the Unicode code is spread over several bytes, up to a maximum of 4. The first byte has a binary encoding that makes it possible to know the number of bytes needed to encode this character.

Here is a C++ function that reconstructs the underlying Unicode code:

unsigned char c_utf8_to_unicode(unsigned char* utf, char32_t& code) {
    code = utf[0];

    //We examine the four highest bits (XXXX....)
    unsigned char check = utf[0] & 0xF0;

    switch (check) {
        case 0xD0: //2 bytes, lead byte of the form 1101....
        case 0xC0: //2 bytes, lead byte of the form 1100....
            if ((utf[1] & 0x80) == 0x80) {
                code = (utf[0] & 0x1F) << 6;
                code |= (utf[1] & 0x3F);
                return 1;
            }
            break;
        case 0xE0: //3 bytes
            if ((utf[1] & 0x80) == 0x80 && (utf[2] & 0x80) == 0x80) {
                code = (utf[0] & 0xF) << 12;
                code |= (utf[1] & 0x3F) << 6;
                code |= (utf[2] & 0x3F);
                return 2;
            }
            break;
        case 0xF0: //4 bytes
            if ((utf[1] & 0x80) == 0x80 && (utf[2] & 0x80) == 0x80 && (utf[3] & 0x80) == 0x80) {
                code = (utf[0] & 0x7) << 18;
                code |= (utf[1] & 0x3F) << 12;
                code |= (utf[2] & 0x3F) << 6;
                code |= (utf[3] & 0x3F);
                return 3;
            }
            break;
    }

    //1 byte
    return 0;
}

This function returns the offset of the last byte used in the encoding, i.e. the number of additional bytes consumed after the first one (0 for a plain ASCII character).

The main drawback of this approach is that you can only detect non-ASCII UTF-8 characters by scanning the whole string from the beginning... Indeed, you cannot know in advance where the multi-byte characters will appear...

In particular, the only way to know the size of the string in characters is to traverse it completely.
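
As an illustration, here is a minimal sketch that uses c_utf8_to_unicode (defined above) to count the characters of a UTF-8 string; count_characters is our own illustrative helper, not part of LispE:

#include <string>

//Counting the characters of a UTF-8 string requires a full scan
long count_characters(const std::string& s) {
    long nb = 0;
    char32_t code;
    for (size_t i = 0; i < s.size(); i++) {
        //c_utf8_to_unicode returns the number of extra bytes consumed
        i += c_utf8_to_unicode((unsigned char*)(s.data() + i), code);
        nb++;
    }
    return nb;
}

With the string above, count_characters returns 42 while s.size() returns 46.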

What about the string representation in LispE?

The choice in LispE was to use strings of type: std::u32string, which are available since C++11.

These strings have the advantage of being composed of characters on 32 bits whatever the platform.

In fact, on Unix (or macOS) platforms, std::wstring generally uses 32-bit characters, just like std::u32string.
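
As a small sketch of what this buys us (reusing the earlier example string), the number of characters is simply the size of the string, as long as no emoji sequences are involved:

#include <string>
#include <iostream>

int main() {
    //each Unicode code point occupies exactly one char32_t
    std::u32string u = U"l'été arrive à Cannes, sur la Côte d'Azur.";
    std::cout << u.size() << std::endl; //42, the number of characters
}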

Converting a UTF-16 string to a UTF-32 string

For Windows (but also to exchange strings with the Mac OS GUI), we have built some methods (see str_conv.h) to convert UTF-16 strings to std::u32string, which then makes it possible to manipulate strings in the same way everywhere.

//Only valid if your string is a UTF-16 string
wstring s = L"l'été vient à Cannes, sur la Côte d'Azur";
u32string u;

s_utf16_to_unicode(u, s);

I must admit that writing the function c_utf16_to_unicode, which is the basis of the conversion between the two encodings, still gives me some nightmares. In particular this line:

r = ((((code & 0x03C0) >> 6) + 1) << 16) | ((code & 0x3F) << 10);

This line extracts, from the bit representation of the first UTF-16 value (the high surrogate), the high-order bits of the final Unicode character... We won't dwell on the insane complexity of this encoding...
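
For comparison, here is the textbook formula that rebuilds a Unicode code point from a surrogate pair (an illustrative sketch, not the LispE code); the line quoted above essentially computes the contribution of the first (high) surrogate in this formula:

//Standard decoding of a UTF-16 surrogate pair (illustrative sketch)
//'high' must lie in [0xD800, 0xDBFF] and 'low' in [0xDC00, 0xDFFF]
char32_t surrogates_to_unicode(char16_t high, char16_t low) {
    return 0x10000 + ((char32_t)(high - 0xD800) << 10) + (low - 0xDC00);
}

//surrogates_to_unicode(55357, 56424) yields 128104, i.e. 👨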

Converting a UTF-8 string to a UTF-32 string

As strings usually come to us in UTF-8 encoding, we have built a simple method that performs this conversion from UTF-8 to UTF-32: s_utf8_to_unicode

It is called as follows:

string s = "l'été vient à Cannes, sur la Côte d'Azur";
u32string u;

s_utf8_to_unicode(u, s);

And the emojis

As if encoding wasn't complicated enough, here come the emojis to add their dose of difficulty.

Because an emoji is rarely a single Unicode code... In fact, it's a combination of codes.

First of all, the list of Unicode characters can be found here: unicode table.

Let's take a simple example:

  • 🖐 is represented by the code: 128400
  • 🖐🏽 is represented by the codes: 128400, 127997

The second character is a combination of the first one and a skin-tone modifier: 🏽 (code 127997)

And some characters can be even richer:

  • 👩🏾‍🚀: 128105, 127998, 8205, 128640
  • 🧑🏿‍❤️‍💋‍🧑🏻: 129489, 127999, 8205, 10084, 65039, 8205, 128139, 8205, 129489, 127995

Very rich... The code 8205 that appears between the elements is the Zero Width Joiner (U+200D), which glues them together into a single glyph. It is possible that your browser cannot display all of them...

How do we do it then?

We got all the codes here, and we built the big table emoji_sequences in the file emojis_alone.h

Then we created a class: Emojis which provides the following set of methods:

void store()

The store method makes it possible to translate this table into three automata:

  • a UTF-32 automaton
  • a UTF-16 automaton
  • a UTF-8 automaton

The first element of each sequence is stored in a corresponding dictionary.

A path consists of a sequence of objects: Emoji_arc.

A sequence is valid if, when traversing the string, it leads to an arc whose end field is true.
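
To give an intuition of how these automata work, here is a deliberately simplified sketch; Arc_sketch and scan_sketch are illustrative names and do not reflect the actual Emoji_arc implementation in emojis_alone.h:

#include <map>
#include <string>

//Illustrative trie-like automaton node (not the real Emoji_arc class)
struct Arc_sketch {
    bool end = false;                     //a valid emoji sequence may stop here
    std::map<char32_t, Arc_sketch*> arcs; //transitions on the next code point
};

//Follows the arcs from position i and returns the position of the last
//character of the longest valid emoji sequence, or i if none is found
long scan_sketch(Arc_sketch* root, const std::u32string& u, long i) {
    long last = i;
    Arc_sketch* current = root;
    for (long p = i; p < (long)u.size(); p++) {
        auto it = current->arcs.find(u[p]);
        if (it == current->arcs.end())
            break;
        current = it->second;
        if (current->end)
            last = p;
    }
    return last;
}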

bool scan(std::u32string& u, long& i)

This method identifies a sequence of Unicode characters composing an emoji. It returns true if such a sequence was found, and i is updated to the position of the last character composing it.

bool get(std::u32string& u, std::u32string& res, long& i)

This method copies the complete emoji sequence into res and updates i to the position of the last character in the sequence.

bool store(std::u32string& u, std::u32string& res, long& i)

This method contains almost the same code as get, except that the characters are appended to res instead of replacing its content.

Usage

To browse a string, you just have to do the following loop:

//e is an object of type Emojis
Emojis e;

//Strings of type UTF-32 are preceded by a 'U'.
u32string strvalue = U"éèà123👨‍⚕️👩🏾‍🚀🐕‍🦺";
//a working variable
u32string localvalue;

//our characters 
vector<u32string> result;

for (long i = 0; i < strvalue.size(); i++) {
    //if the character at the current position is an emoji
    //then localvalue contains it.
    //i then points to the last character of the sequence
    if (!e.get(strvalue, localvalue, i))
        localvalue = strvalue[i];

    result.push_back(localvalue);
}

At the end of this loop, our string will have been split into characters, some of which will be a long sequence of char32_t.

But what if I want UTF-8?

In fact, the code is only slightly different: we will use the method get_one_char.

This method retrieves a complete UTF-8 character, depending on whether it is one, two, three or four bytes long.

//e is an object of type Emojis
Emojis e;

//The UTF-8 strings are of type: std::string
string strvalue = "éèà123👨‍⚕️👩🏾‍🚀🐕‍🦺";

//a working variable
string localvalue;

//our characters 
vector<string> result;

for (long i = 0; i < strvalue.size(); i++) {
    //if the character at the current position is an emoji
    //then localvalue contains it.
    //i then points to the last character of the sequence
    if (!e.get(strvalue, localvalue, i))
        get_one_char(strvalue, localvalue, i); // this is the main difference

    result.push_back(localvalue);
}

Now you have everything you need to parse texts and get the corresponding characters.

Experimenting

We have created a standalone example in the LispE makefile that makes it possible for you to test this class directly.

Just do:

make testemoji

This example is implemented in the file: testemoji.cxx. It uses a version of the Emojis class whose implementation is in the following file: emojis_alone.h. This file also calls: std_conv.h.

This example parses a string encoded in UTF-8 and UTF-32, with some examples of conversion.

Note: the minimum version of C++ required here is C++11.
