How Lute stores data - jzohrab/lute GitHub Wiki


This documentation is deprecated/obsolete. Lute v2 has been replaced by Lute v3, a full Python rewrite. Please see the Lute v3 manual which includes notes on installation. If you have Lute v2, you can easily migrate to v3. Thank you!


Lute stores Terms (in the words table) and sentences (in sentences) as zero-width-space-joined strings, where the zero-width-space character joins text "tokens".

For example, with [ZWS] denoting the zero-width space character, the Term "hello there" is stored as hello[ZWS] [ZWS]there, and the sentence "she said hello there." is stored as [ZWS]she[ZWS] [ZWS]said[ZWS] [ZWS]hello[ZWS] [ZWS]there[ZWS].[ZWS]. In these examples the sentence also starts and ends with a zero-width space, while the Term does not. Using the zero-width space to mark the word borders (token borders) vastly simplifies data processing.
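The storage format described above can be sketched in a few lines of Python. This is only an illustration and not Lute's actual parser: the `tokenize` function is a hypothetical stand-in that treats each run of word characters as one token and every other character as its own token, and the helper names `to_term`, `to_sentence`, and `show` are made up for this example.

```python
import re

ZWS = "\u200b"  # zero-width space, U+200B


def tokenize(text):
    # Hypothetical tokenizer (an assumption, not Lute's real parsing rules):
    # each run of word characters is one token, and every other character
    # (space, punctuation) is a token of its own.
    return re.findall(r"\w+|\W", text)


def to_term(text):
    # Terms: tokens joined by ZWS, with no leading or trailing ZWS.
    return ZWS.join(tokenize(text))


def to_sentence(text):
    # Sentences: tokens joined by ZWS, plus a ZWS at the start and the end.
    return ZWS + ZWS.join(tokenize(text)) + ZWS


def show(stored):
    # Make the invisible ZWS characters visible for inspection.
    return stored.replace(ZWS, "[ZWS]")


print(show(to_term("hello there")))
print(show(to_sentence("she said hello there.")))
```

Under these assumptions, the two printed strings reproduce the Term and sentence representations given above.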

If you want to query the SQLite database directly, you'll need to bear this in mind and use the SQLite representation of ZWS, char(0x200B), in your queries. For example, using my database as illustration:

sqlite> select wotext from words where wotokencount = 3 and wotext like '%cardinal%';
punto cardinal

That string actually contains zero-width spaces, which are invisible in the output. Replacing them with a visible marker reveals them:

sqlite> select replace(wotext, char(0x200B), '[ZWS]') from words where wotokencount = 3 and wotext like '%cardinal%';
punto[ZWS] [ZWS]cardinal

If I query without taking that into account, nothing matches:

sqlite> select wotext from words where wotext = 'punto cardinal';
sqlite> 

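So do either of these instead: strip the zero-width spaces before comparing, or build the stored string explicitly with char(0x200B):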

sqlite> select wotext from words where replace(wotext, char(0x200B), '') = 'punto cardinal';
punto cardinal
sqlite> select wotext from words where wotext = 'punto' || char(0x200B) || ' ' || char(0x200B) || 'cardinal';
punto cardinal
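
If you query the database from a script rather than the sqlite shell, the same approach applies. The following is a minimal Python sketch using the standard `sqlite3` module. It assumes a stripped-down `words` table with only the `wotext` and `wotokencount` columns seen in the queries above (the real table has more columns), uses an in-memory database in place of a real Lute database file, and the helper name `find_term` is hypothetical.

```python
import sqlite3

ZWS = "\u200b"  # zero-width space, U+200B

# In-memory stand-in for a Lute database. The two-column schema is an
# assumption made for this example only.
conn = sqlite3.connect(":memory:")
conn.execute("create table words (wotext text, wotokencount integer)")
stored = "punto" + ZWS + " " + ZWS + "cardinal"
conn.execute("insert into words values (?, ?)", (stored, 3))


def find_term(conn, plain_text):
    # Hypothetical helper: strip the ZWS characters on the SQL side,
    # then compare against the plain text the user typed.
    sql = "select wotext from words where replace(wotext, char(0x200B), '') = ?"
    return [row[0] for row in conn.execute(sql, (plain_text,))]


# A plain equality comparison ignores the ZWS characters and finds nothing.
naive = conn.execute(
    "select wotext from words where wotext = ?", ("punto cardinal",)
).fetchall()

matches = find_term(conn, "punto cardinal")
print(len(naive), len(matches))
```

Binding the search text as a parameter keeps the SQL identical to the shell version while avoiding manual string quoting.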