6.18 Handling strings in WASM without burning yourself - naver/lispe GitHub Wiki

Introduction

Version française

Let's be honest, the arrival of ChatGPT has somewhat changed the environment in which we were quietly working. However, we have to come back to reality, and a little reminder will do us all good. (By the way, I'm quite a fan of ChatGPT.)

Computer science as a profession still exists...

For those who don't know, WebAssembly is this new W3C standard that consists of turning our favorite browsers into virtual machines. As it is said in the Bible: "nihil novi sub sole", "there's nothing new under the sun" or something similar.

Frankly, "docker" wasn't enough for you as an infinite source of bugs?

You had to put a VM in the browser.

Well... The idea in itself, in -7A.G. (2015), was good: we'll extend JS capabilities with code in C++, C# or Rust(re), which we'll compile with LLVM to generate libs that we'll be able to execute in the browser.

Honestly, the first part, the compilation, is super simple. You install Emscripten and paf!!! You just have to replace gcc with em++ and go ahead.

Really it's not more complicated than that... Just take a look at the following Makefile to see for yourself.

Options

Let's have a look at the C++ compilation options that you have to master a little bit.

I have compiled a language of my own, called lispe which is written in C++. I have written a post about it.

 -o lispe.html -O3 -sEXPORT_ALL=1 -sWASM=1 -fexceptions -sINITIAL_MEMORY=47972352 -sSTACK_SIZE=20971520
  • -o lispe.html: in this case, it also generates a test HTML file plus a loading JS file.
  • -O3: The level of optimization applied at compile time, which affects the speed and size of the library
  • -sWASM=1: You have to tell it that the target of compilation is WebAssembly
  • -fexceptions: This is for handling C++ exceptions. It also allows you to export malloc to JS.
  • The rest is memory initialization to handle the library.

As you can see, to compile C++, it doesn't burn that many neurons...

Just a word about -o lispe.html: if you replace it with -o lispe.wasm, it only compiles the WASM library.

Don't believe in Santa Claus

It's a principle of life.

Especially since it stings a bit.

Because the road from compiling to executing is rarely a clear path in the sunshine on a mild spring afternoon. Usually, that's when the nettles and brambles start to invade the muddy path, with steel spikes in the potholes, and a rain that could cut through wrought iron.

You sigh with relief at having compiled your biniou (Breton bagpipe) and then you discover the first time you blow into it that there are holes everywhere.

For example, WebAssembly doesn't know what a string is.

Unchained

We say to ourselves, okay, let's take a better look at what a string is in JS. This is the normal, reassuring, mature approach.

JS handles strings encoded in... UTF-16. Was... Warum so viel Hass?

Yes, not UTF-8 or clean Unicode in UTF-32, no... UTF-16.
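To make the pain concrete, here is a small sketch that lists the 16-bit units JS actually stores for a string. The helper name `utf16Units` is mine and is not part of LispE. Anything outside the Basic Multilingual Plane takes two units, a so-called surrogate pair:

```javascript
// Hypothetical helper (not from lispe_functions.js): returns the UTF-16 code units of a JS string.
function utf16Units(str) {
    const units = [];
    for (let i = 0; i < str.length; i++) {
        units.push(str.charCodeAt(i));
    }
    return units;
}

// "é" (U+00E9) fits in one unit, the emoji U+1F600 needs a surrogate pair.
const accent = utf16Units("\u00E9");
const emoji = utf16Units("\u{1F600}");

console.log(accent.map(u => u.toString(16)));          // one unit
console.log(emoji.map(u => u.toString(16)));           // two units: high then low surrogate
console.log("\u{1F600}".codePointAt(0).toString(16));  // the code point behind the pair
```

This is why `length` and `charCodeAt` count units rather than characters, and why the C++ side has to reassemble pairs.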

Just for fun I'll put a little routine I wrote to transform UTF-16 into UTF-32...

I give it to you for free:

bool c_utf16_to_unicode(char32_t& r, char32_t code, bool second) {
    //The previous unit was a high surrogate: we add the low 10 bits carried by the second unit
    if (second) {
        r |= code & 0x3FF;
        return false;
    }
    
    //If the unit is a high surrogate (0xD800 to 0xDBFF), the character is encoded on four bytes
    //Note the mask 0xFC00: with 0xFF00, the surrogates from 0xD900 to 0xDBFF would be missed
    if ((code & 0xFC00) == 0xD800) {
        //You like it, isn't it beautiful?
        r = ((((code & 0x03C0) >> 6) + 1) << 16) | ((code & 0x3F) << 10);
        return true;
    }
    
    //Otherwise the code is the same as in UTF-32
    r = code;
    return false;
}

In fact, I've already forgotten how I ever wrote this code... Sport is a bitch... Don't start...
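Since the routine only handles one unit at a time, it may help to see how it is driven over a whole sequence. Below is a minimal JS transcription, offered as a sketch rather than as what LispE does: the name `utf16ToCodePoints` is hypothetical, and the high surrogate test uses the mask 0xFC00 so that the whole 0xD800 to 0xDBFF range is recognized.

```javascript
// Hypothetical JS transcription of the unit-by-unit decoding, applied to an array of UTF-16 units.
function utf16ToCodePoints(units) {
    const codePoints = [];
    let r = 0;
    let second = false;
    for (const code of units) {
        if (second) {
            // Low surrogate: its low 10 bits complete the code point.
            r |= code & 0x3FF;
            second = false;
        } else if ((code & 0xFC00) === 0xD800) {
            // High surrogate: build the upper bits, then wait for the next unit.
            r = ((((code & 0x03C0) >> 6) + 1) << 16) | ((code & 0x3F) << 10);
            second = true;
            continue;
        } else {
            // Plain BMP unit: identical in UTF-16 and UTF-32.
            r = code;
        }
        codePoints.push(r);
    }
    return codePoints;
}

// "a" followed by the surrogate pair of U+1F600
const decoded = utf16ToCodePoints([0x61, 0xD83D, 0xDE00]);
console.log(decoded.map(cp => cp.toString(16)));
```

The sketch does not validate its input: a high surrogate followed by something other than a low surrogate yields garbage, exactly as in the one-unit routine.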

Table of numbers

We search, we wonder, we despair of ever understanding what is on StackOverflow, and sometimes we discover bits of explanation. (Anyway, StackOverflow is a divine punishment for those who still believe that computer science is learned from the demons of the 9th circle).

For example, to pass a string to WebAssembly, you have to pass it as an integer array.

But beware (see the remark about Santa Claus), the array of numbers must be allocated in the memory shared with the WASM library, that is, on its heap.

Ok... What's the result?

I also give it to you for free (these functions are present in lispe_functions.js):

function provideStringAsInt32(code) {
    //We give ourselves some extra space
    const nb = Math.max(20, code.length + 1);

    //first we create our integer array
    const arr = new Int32Array(nb);
    //in which we arrange our string, one UTF-16 code unit at a time
    for (let i = 0; i < code.length; i++) {
        arr[i] = code.charCodeAt(i);
    }
    arr[code.length] = 0;
    //Then we allocate an array of nb*4 bytes on the WASM heap
    //An Int32 is stored on 4 bytes
    const a_buffer = Module._malloc(nb * 4);
    //We store the values in our array
    //Note the division by 4 of a_buffer (>> 2) in order to get the correct index in the HEAP32 array.
    //Once again an Int32 is on 4 bytes.
    Module.HEAP32.set(arr, a_buffer >> 2);
    return a_buffer;
}

And some people say that JS is an easy language.
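The `>> 2` deserves a second look. `_malloc` returns a byte address, while `HEAP32` is an Int32 view over the same memory and is therefore indexed in 4-byte elements. The sketch below reproduces this with a plain `ArrayBuffer` standing in for the WASM memory; the names `memory`, `heapU8` and `heap32` are illustrative only, and the address 16 is made up.

```javascript
// A plain ArrayBuffer stands in for the WASM linear memory (assumption: 64 bytes is plenty here).
const memory = new ArrayBuffer(64);
const heapU8 = new Uint8Array(memory);  // indexed in bytes
const heap32 = new Int32Array(memory);  // indexed in 4-byte elements, like HEAP32

// Pretend malloc returned this byte address (a multiple of 4).
const address = 16;

// Store the code units of "Hi" followed by a terminating 0.
heap32.set([0x48, 0x69, 0], address >> 2);

console.log(address >> 2);          // index of the first element in the Int32 view
console.log(heap32[address >> 2]);  // first code unit, read through the Int32 view
console.log(heapU8[address]);       // same location read as a byte (low byte on a little-endian machine)
```

Forget the shift and you write at byte address `address * 4`, which is a good way to corrupt something you will only notice much later.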

But, where it's even more fun is to do the opposite operation:

//sz is the number of elements in the array: array
function arrayToString(array, sz) {
    let str = "";
    //Each element is stored on 4 bytes
    sz *= 4;
    //Same thing, we divide the byte address by 4 to get the index in HEAP32
    //And we walk 4 bytes by 4 bytes, transforming each element into a character
    for (let pointer = 0; pointer < sz; pointer += 4) {
        str += String.fromCharCode(Module.HEAP32[(pointer + array) >> 2]);
    }
    //And we don't forget to free it
    Module._free(array);
    return str;
}
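If you want to convince yourself that the two directions are inverses without building anything, here is a round trip against a fake module. Everything in it is an assumption for illustration: `FakeModule` is a fixed buffer with a bump allocator that never reuses memory, and `stringToHeap` and `heapToString` are simplified stand-ins for the two functions above.

```javascript
// Hypothetical stand-in for the Emscripten Module: a fixed buffer and a bump allocator.
// A real build provides _malloc, _free and HEAP32 itself.
const FakeModule = (() => {
    const HEAP32 = new Int32Array(1024);
    let next = 8; // next free byte address, kept 4-byte aligned
    return {
        HEAP32,
        _malloc(bytes) {
            const address = next;
            next += (bytes + 3) & ~3;
            return address;
        },
        _free(address) { /* the bump allocator never reuses memory */ }
    };
})();

// Copies the UTF-16 units of str to the heap, one Int32 per unit, plus a terminating 0.
function stringToHeap(module, str) {
    const address = module._malloc((str.length + 1) * 4);
    for (let i = 0; i < str.length; i++) {
        module.HEAP32[(address >> 2) + i] = str.charCodeAt(i);
    }
    module.HEAP32[(address >> 2) + str.length] = 0;
    return address;
}

// Reads size Int32 elements back into a string, then releases the block.
function heapToString(module, address, size) {
    let str = "";
    for (let i = 0; i < size; i++) {
        str += String.fromCharCode(module.HEAP32[(address >> 2) + i]);
    }
    module._free(address);
    return str;
}

const original = "(+ 10 20) \u{1F600}";
const pointer = stringToHeap(FakeModule, original);
const roundTrip = heapToString(FakeModule, pointer, original.length);
console.log(roundTrip === original);
```

Note that the emoji survives because both directions work on UTF-16 units: the surrogate pair is copied as two integers and glued back by `String.fromCharCode`.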

And on the C++ side

There the landscape becomes clearer, everything becomes simple, everything is calm and quiet... Finally, a language where the most difficult concepts of computer science find their simplest, clearest, most limpid illustration. (There is a serious lack of C++ programmers, young people should be encouraged)

So here is what the code looks like (see mainwasm.cxx)

//First of all, EMSCRIPTEN_KEEPALIVE indicates that this function is exported by lispe.wasm
//and is therefore accessible from JS
//(it also needs C linkage, i.e. extern "C", so that JS can find it under the name _eval_lispe)
//Our function returns a string as an array of int32_t elements
//The "size" array will receive the size of the final result, another array of integers encoding a string
EMSCRIPTEN_KEEPALIVE int32_t* eval_lispe(int32_t* str_as_int, int32_t sz, int32_t* size) {
    string cde;
    //s_utf16_to_utf8 is a function (see https://github.com/naver/lispe/blob/master/src/tools.cxx)
    //that converts our integer array into a string encoded in UTF-8
    s_utf16_to_utf8(cde, str_as_int, sz);

    //We execute our code
    Element* executed_code= global_lispe()->execute(cde, ".");
    //We get the response as a string in UTF-32
    u32string response = executed_code->asUString(global_lispe());
    //We clean up our result pointer
    executed_code->release();
    
    //response is encoded in UTF-32 (it's like that in LispE)
    //We need to convert it to UTF-16
    wstring result;
    //There are also such methods (see tools.cxx)
    s_unicode_to_utf16(result, response);
    sz = result.size();
    //we keep the size of our string in size
    size[0] = sz;
    
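    //Then we allocate an array whose size is the size of our string
    //We use malloc (from <cstdlib>) rather than new[], because the JS side releases it with Module._free
    int32_t* value_as_int = (int32_t*)malloc(sz * sizeof(int32_t));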
    //We store our characters as 32 bits integers
    for (long i = 0; i < sz; i++) {
        value_as_int[i] = result[i];
    }
    
    //And we return the array in question
    return value_as_int;
}

I said it stings...

Roughly speaking, the complicated part of this code is the conversion of the array into a string, then the conversion of the answer into a UTF-16 array, whose size is returned in the size variable.

Now let's see what happens on the other side in JS in the file lispe_functions.js:

function callEval(code) {
    //First we get the pointer to our function in lispe.wasm
    const entryFunction = Module['_eval_lispe'];

    //code is a string that we convert into an array of numbers
    const string_to_array = provideStringAsInt32(code);
    
    //This is an array of two elements which will be used to get the size of the return array
    const the_size = provideIntegers(2);

    //We execute our function, with the arguments expected by C++
    //C++: int32_t* eval_lispe(int32_t* str_as_int, int32_t sz, int32_t* size)
    const result = entryFunction(string_to_array, code.length, the_size);
    //Free our array containing the initial string
    Module._free(string_to_array);
    //decode_size is a small routine in lispe_functions.js that retrieves the size placed there by the C++ function
    const sz = decode_size(the_size);
    //arrayToString will transform our array of numbers into a string
    //Note that result is also freed in arrayToString...
    return arrayToString(result, sz);
}

I admit it's a bit heavy to digest. Here we call our C++ function eval_lispe, which takes three arguments: the input string as an array of numbers, the size of that array, and a small array in which C++ stores the size of the returned array.

The C++ function in turn returns an array of numbers, whose size is in size, this way we can retrieve the string calculated by LispE.
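The "small array for the size" is the classic out-parameter trick: the caller allocates the slot, the callee writes into it, the caller reads it back. The real `provideIntegers` and `decode_size` live in lispe_functions.js and are not shown here, so the versions below are only guesses at their shape, under different names. The heap, `malloc` and `fakeEval` (which simply reverses its input) are stand-ins as well.

```javascript
// Hypothetical stand-in for the WASM heap; a real build exposes HEAP32, _malloc and _free.
const heap = new Int32Array(256);
let nextFree = 4; // byte address of the next free slot
const malloc = (bytes) => { const a = nextFree; nextFree += (bytes + 3) & ~3; return a; };

// A possible shape for provideIntegers: room for n Int32 values, zero-filled.
function provideIntegersSketch(n) {
    const address = malloc(n * 4);
    heap.fill(0, address >> 2, (address >> 2) + n);
    return address;
}

// A possible shape for decode_size: read the first Int32 written by the callee.
function decodeSizeSketch(address) {
    return heap[address >> 2];
}

// Stand-in for eval_lispe: returns a reversed copy and stores its length in size[0].
function fakeEval(strAddress, sz, sizeAddress) {
    const result = malloc(sz * 4);
    for (let i = 0; i < sz; i++) {
        heap[(result >> 2) + i] = heap[(strAddress >> 2) + (sz - 1 - i)];
    }
    heap[sizeAddress >> 2] = sz;
    return result;
}

// Caller side, in the same order as callEval.
const input = "abc";
const inputAddress = malloc(input.length * 4);
for (let i = 0; i < input.length; i++) heap[(inputAddress >> 2) + i] = input.charCodeAt(i);

const sizeSlot = provideIntegersSketch(2);
const resultAddress = fakeEval(inputAddress, input.length, sizeSlot);
const resultSize = decodeSizeSketch(sizeSlot);

let output = "";
for (let i = 0; i < resultSize; i++) output += String.fromCharCode(heap[(resultAddress >> 2) + i]);
console.log(output);
```

In a real build the size slot is heap memory like any other, so it has to be released at some point too.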

Conclusion

String manipulation in lispe.wasm is not trivial. It requires understanding a little of what you are doing. But with the examples provided here, you should be able to handle these exchanges without difficulty.

The value of WebAssembly has been proven. I suggest you try the example I've put in example for fun.