How to change linguistic resources - microsoft/BlingFire GitHub Wiki

Make the tools ready (needs to be done only once)

Build the library and tools

  • git clone Bling-Fire-Git-Path
  • cd BlingFire
  • mkdir Release
  • cd Release
  • cmake -DCMAKE_BUILD_TYPE=Release ..
  • make

This will take a few minutes.

Alternatively, you can use Visual Studio Code with the CMake, CMake Tools, and C/C++ extensions installed. Select the Release build mode; the output files will then be in the build folder.

Make sure the tools are in the path

Now you need to either install the tools into a location already known to PATH, or set PATH to include the BlingFire directory containing the tools. For the latter, run this command from the BlingFire directory:

  • . ./scripts/set_env
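If you prefer not to source the script, a minimal equivalent is to prepend the build output directory yourself (this sketch assumes the tools were built into the Release directory created above; the actual set_env script may do more):

```shell
# Prepend the Release build directory to PATH for the current shell session.
# Run this from the BlingFire root directory.
export PATH="$PWD/Release:$PATH"
```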

Let's make sure that the tools are actually in the PATH; type:

fa_nfa2dfa --help

All tools respond to --help, so you should see something like:

Usage: fa_nfa2dfa [OPTION] [< input.txt] [> output.txt]

This program converts non-deterministic finite-state machine into
deterministic one.

  --in=input-file  - reads input from the input-file,
    if omited stdin is used

  --out=output-file - writes output to the output-file,
    if omited stdout is used

  --out2=output-file - writes output to the output-file,
    if omited stdout is used

  --pos-nfa=input-file - reads reversed position NFA from input-file,
    needed for --fsm=pos-rs-nfa to store only ambiguous positions, if omited
    stores all positions

  --fsm=rs-nfa - makes convertion from Rabin-Scott NFA (is used by default)
  --fsm=pos-rs-nfa - makes convertion from Rabin-Scott position NFA,
    builds Moore Multi Dfa
  --fsm=mealy-nfa - makes convertion from Mealy NFA into a cascade of
    two Mealy Dfa (general case) or a single Mealy DFA (trivial case)

  --spec-any=N - treats input weight N as a special any symbol,
    if specified produces Dfa with the same symbol on arcs,
    which must be interpreted as any other

  --bi-machine - uses bi-machine for Mealy NFA determinization

  --no-output - does not do any output

  --verbose - prints out debug information, if supported
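Since all the tools are plain executables found via PATH, a small guard like the following generic shell sketch (the require_tool helper is not part of BlingFire) can fail fast when one is missing:

```shell
# Print an error and return non-zero if the given tool cannot be found in PATH.
require_tool() {
    command -v "$1" >/dev/null 2>&1 || { echo "missing tool: $1" >&2; return 1; }
}

# Example: only invoke the tool if it is actually available.
if require_tool fa_nfa2dfa; then
    fa_nfa2dfa --help
fi
```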

Edit linguistic sources and compile them into automata

Let's change the working directory into the root for linguistic sources:

cd ldbsrc

Note: we will add separate documentation on the different formats of the linguistic resources; for the moment we will only modify the tokenization logic, like this:

touch wbd/wbd.lex.utf8

Now we need to recompile the wbd directory (word-boundary disambiguation; the word-breaking, or tokenization, logic is defined in this directory). Simply type:

make -f Makefile.gnu lang=wbd all

You should see something like this on the screen:

fa_build_conf \
  --in=wbd/ldb.conf.small \
  --out=wbd/tmp/ldb.mmap.small.txt
fa_fsm2fsm_pack --type=mmap \
  --in=wbd/tmp/ldb.mmap.small.txt \
  --out=wbd/tmp/ldb.conf.small.dump \
  --auto-test
fa_build_lex --dict-root=. --full-unicode --in=wbd/wbd.lex.utf8 \
  --tagset=wbd/wbd.tagset.txt --out-fsa=wbd/tmp/wbd.rules.fsa.txt \
  --out-fsa-iwmap=wbd/tmp/wbd.rules.fsa.iwmap.txt \
  --out-map=wbd/tmp/wbd.rules.map.txt
fa_fsm2fsm_pack --alg=triv --type=moore-dfa --remap-iws --use-iwia --in=wbd/tmp/wbd.rules.fsa.txt --iw-map=wbd/tmp/wbd.rules.fsa.iwmap.txt --out=wbd/tmp/wbd.fsa.small.dump
fa_fsm2fsm_pack --alg=triv --type=mmap --in=wbd/tmp/wbd.rules.map.txt --out=wbd/tmp/wbd.mmap.small.dump --auto-test
fa_merge_dumps --out=ldb/wbd.bin wbd/tmp/ldb.conf.small.dump wbd/tmp/wbd.fsa.small.dump wbd/tmp/wbd.mmap.small.dump

This means that make is doing its job and remaking all the dependent targets.

If you see an "ERROR: XYZ" message on the screen, find the one that appeared first and try to understand which tool it came from, what the input to that tool was, and what the command-line parameters were. Double-check with --help that these parameters make sense. Let us know if you are stuck; we'll be happy to help.
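When a build fails, later errors are often just consequences of the first one. A generic helper like this (a sketch, not a BlingFire tool) pulls the first ERROR line out of a saved log:

```shell
# Print only the first "ERROR:" line from a build log, if any.
first_error() {
    grep -m 1 'ERROR:' "$1"
}
```

For example: `make -f Makefile.gnu lang=wbd all 2>&1 | tee build.log` followed by `first_error build.log`.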

How to verify the compiled file is working correctly

  • For the tokenizer you can use the fa_lex tool. See fa_lex --help for more details.

printf "Hi There! This is a simple test." | fa_lex --ldb=ldb/wbd.bin --tagset=wbd/wbd.tagset.txt

The output should be something like:

Hi/WORD There/WORD !/WORD This/WORD is/WORD a/WORD simple/WORD test/WORD ./WORD
  • For single-token transformations you can use the test_ldb tool. See test_ldb --help for more details.
  • See tools.txt for details on the other tools.

What is the structure of linguistic sources

The Linguistic Data Base (LDB) files are simply containers that combine address-independent memory dumps of different structures: maps, multi-maps, finite-state automata, and arrays.

Usage mistakes, such as a dictionary compiled in a case-sensitive way but looked up case-insensitively, are difficult to find. To avoid them, the runtime options are also compiled into one of those maps (the configuration map) and are part of the final LDB file. The compiled configuration map defines which functions the LDB has resources for and what parameters should be used for each function at runtime.

ldbsrc                            -- main LDB root
    Name_1                        -- name of project #1
        ldb.conf.small            -- runtime configuration parameters for project #1, required file
        options.small             -- LDB compilation options for project #1, required file
        [other resources]
    Name_2                        -- name of project #2
    ...
    ldb                           -- root for all the compiled binary files
        name1.bin                 -- compiled binary for project #1
        name2.bin                 -- compiled binary for project #2
        ...
    Makefile.gnu                  -- makefile for compilation
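Following the layout above, a new project directory can be scaffolded like this (a sketch: "myproj" is a hypothetical name, and the two required files still need real content in the formats described in the wiki):

```shell
# Create the directory for a hypothetical new project "myproj"
# with the two required (still empty) configuration files.
mkdir -p myproj
touch myproj/ldb.conf.small myproj/options.small
```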