Installation of TreeTagger with German UTF-8 parameters - presemt-ntnu/transglobal GitHub Wiki
Installing the German UTF-8 parameters does not work out of the box. Follow the steps below to get it working.
- Download the German UTF-8 parameter file (ftp://ftp.ims.uni-stuttgart.de/pub/corpora/german-par-linux-3.2-utf8.bin.gz) from the TreeTagger website to your local TreeTagger directory.
- Next to install-treetagger.sh, create a new file install-german-utf8 with the following contents and run it:
#!/bin/sh
if [ -r german-par-linux-3.2-utf8.bin.gz ]
then
gzip -cd german-par-linux-3.2-utf8.bin.gz > lib/german-utf8.par
echo 'German parameter file (Linux, UTF8) installed.'
fi
- Create a UTF-8 version of the abbreviations:
erwin@erwins-MacBook-Pro:~/local/src/treetagger/lib
iconv -f latin1 -t utf-8 german-abbreviations >german-abbreviations-utf8
- Create a UTF-8 version of the lexicon:
erwin@erwins-MacBook-Pro:~/local/src/treetagger/lib
iconv -f latin1 -t utf-8 german-lexicon.txt >german-lexicon-utf8.txt
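Before moving on, it may be worth checking that the two converted files really decode as UTF-8. The following is a minimal sketch, not part of the original setup: the helper name is_valid_utf8 is made up for illustration, and it assumes your iconv exits with a non-zero status on invalid input (GNU iconv and the macOS iconv both do). Run it from the lib directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of TreeTagger): succeeds only if the
# file can be decoded as UTF-8 without errors.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Check the two converted files, if they are present in the current directory.
for f in german-abbreviations-utf8 german-lexicon-utf8.txt
do
    if [ -r "$f" ]
    then
        if is_valid_utf8 "$f"
        then
            echo "$f: valid UTF-8"
        else
            echo "$f: NOT valid UTF-8" >&2
        fi
    fi
done
```

If a file is reported as invalid, the most likely cause is that the source file was not Latin-1 to begin with, or that the conversion was run twice.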
- Create a new command file tree-tagger-german-utf8 that uses the UTF-8 tokenizer, german-utf8.par and german-abbreviations-utf8. Note that I commented out the lexicon lookup, because it messes up the UTF-8 encoding (something is wrong with lookup.perl):
erwin@erwins-MacBook-Pro:~/local/src/treetagger/cmd
cat tree-tagger-german-utf8
#!/bin/sh
# Set these paths appropriately
BIN=/Users/erwin/local/src/treetagger/bin
CMD=/Users/erwin/local/src/treetagger/cmd
LIB=/Users/erwin/local/src/treetagger/lib
OPTIONS="-token -lemma -sgml -pt-with-lemma"
TOKENIZER=${CMD}/utf8-tokenize.perl
TAGGER=${BIN}/tree-tagger
ABBR_LIST=${LIB}/german-abbreviations-utf8
PARFILE=${LIB}/german-utf8.par
LEXFILE=${LIB}/german-lexicon-utf8.txt
FILTER=${CMD}/filter-german-tags
$TOKENIZER -a $ABBR_LIST "$@" |
# external lexicon lookup
# EM: commented out lexicon lookup, because it messes up utf-8 encoding
# perl $CMD/lookup.perl $LEXFILE |
# tagging
$TAGGER $OPTIONS $PARFILE |
# error correction
$FILTER
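Once the command file is in place, a quick smoke test with a sentence containing umlauts and ß can show whether the encoding survives the pipeline. This is only a sketch under several assumptions: the TT_HOME variable and the run_smoke_test function are hypothetical names introduced here, the default path mirrors the one used above, the command file has been made executable (chmod +x), and the tokenizer reads standard input when no file argument is given.

```shell
#!/bin/sh
# Assumed install location; override TT_HOME to match your setup.
TT_HOME=${TT_HOME:-$HOME/local/src/treetagger}
WRAPPER=$TT_HOME/cmd/tree-tagger-german-utf8

# Hypothetical helper: pipe one UTF-8 test sentence through the given
# command file and print whatever it produces.
run_smoke_test() {
    if [ ! -x "$1" ]
    then
        echo "command file not found or not executable: $1" >&2
        return 1
    fi
    printf 'Die Straße ist schön.\n' | "$1"
}

run_smoke_test "$WRAPPER" || echo "smoke test skipped or failed" >&2
```

If the tagged output shows "Straße" and "schön" intact, the UTF-8 tokenizer, parameter file and abbreviation list are working together. Garbled characters would point to a file that is still in Latin-1.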