Installation of TreeTagger with German utf 8 parameters - presemt-ntnu/transglobal GitHub Wiki

Installation of the German utf 8 parameters does not work out of the box. Follow the instructions below to get it working.

  1. Download German-utf8 param file (ftp://ftp.ims.uni-stuttgart.de/pub/corpora/german-par-linux-3.2-utf8.bin.gz) from treetagger website to your local treetagger dir

  2. Next to install-treetagger.sh create a new file install-german-utf8 and run it:

    #!/bin/sh

    if [ -r german-par-linux-3.2-utf8.bin.gz ]
    then
        gzip -cd german-par-linux-3.2-utf8.bin.gz > lib/german-utf8.par
        echo 'German parameter file (LINX, UTF8) installed.'
    fi
  1. Create utf8 version of abbreviations:
    erwin@erwins-MacBook-Pro:~/local/src/treetagger/lib
    iconv -f latin1 -t utf-8 german-abbreviations >german-abbreviations-utf8 
  1. Create utf8 version of lexicon:
    erwin@erwins-MacBook-Pro:~/local/src/treetagger/lib
    iconv -f latin1 -t utf-8 german-lexicon.txt >german-lexicon-utf8.txt 
  1. Create a new command file tree-tagger-german-utf8 that uses utf8 tokenizer and german-utf8.par and grmane-abbreviations-utf8. Note that I commented out lexicon lookup, because it messes up utf-8 encoding (something wrong with lookup.perl):
    erwin@erwins-MacBook-Pro:~/local/src/treetagger/cmd
    cat tree-tagger-german-utf8
    #!/bin/sh

    # Set these paths appropriately

    BIN=/Users/erwin/local/src/treetagger/bin
    CMD=/Users/erwin/local/src/treetagger/cmd
    LIB=/Users/erwin/local/src/treetagger/lib

    OPTIONS="-token -lemma -sgml -pt-with-lemma"

    TOKENIZER=${CMD}/utf8-tokenize.perl
    TAGGER=${BIN}/tree-tagger
    ABBR_LIST=${LIB}/german-abbreviations-utf8
    PARFILE=${LIB}/german-utf8.par
    LEXFILE=${LIB}/german-lexicon-utf8.txt
    FILTER=${CMD}/filter-german-tags

    $TOKENIZER -a $ABBR_LIST $* |
    # external lexicon lookup
    # EM: commented out lexicon lookup, because it messes up utf-8 encoding 
    # perl $CMD/lookup.perl $LEXFILE |
    # tagging
    $TAGGER $OPTIONS $PARFILE  | 
    # error correction
    $FILTER