treealign - Helsinki-NLP/Lingua-Align GitHub Wiki

Table of Contents

 * NAME
 * SYNOPSIS
 * DESCRIPTION
 * OPTIONS
    * Input options
    * Training options
    * Alignment options
    * Runtime and other options
 * SEE ALSO
 * AUTHOR
 * COPYRIGHT AND LICENSE

NAME

treealign - training tree alignment classifiers and aligning syntactic trees

SYNOPSIS

DESCRIPTION

This script allows you to train a tree alignment model and to apply it to parallel treebanks. Tree alignment is based on local binary classification and rich feature sets.
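
To make "local binary classification" concrete, the following minimal sketch scores every candidate pair of source and target nodes independently and keeps the pairs whose score exceeds a threshold. It is an illustration only, not the Lingua-Align implementation: the weights, node IDs, feature values, and the 0.5 threshold are all hypothetical, and a plain logistic function stands in for the trained classifier.

```python
import math

# Hypothetical weights; a real model would learn these from aligned trees.
WEIGHTS = {"insideST2": 2.0, "insideTS2": 2.0, "bias": -2.5}


def link_probability(features):
    """Logistic score for one (source node, target node) pair."""
    z = WEIGHTS["bias"] + sum(
        WEIGHTS.get(name, 0.0) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def align(candidates, threshold=0.5):
    """Classify each candidate pair on its own and keep the likely links."""
    return [
        pair for pair, feats in candidates.items()
        if link_probability(feats) > threshold
    ]


# Toy candidates: node IDs and feature values are made up for illustration.
candidates = {
    ("s1_501", "t1_501"): {"insideST2": 0.9, "insideTS2": 0.8},
    ("s1_501", "t1_502"): {"insideST2": 0.1, "insideTS2": 0.2},
}
print(align(candidates))
```

Each decision here depends only on the features of the pair being scored, which is what makes the classification "local".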

Currently, training data has to be in Stockholm Tree Aligner format, and the output is written in the same format. Here is a short example of this format (taken from the output of the TreeAligner):

OPTIONS

A number of options can be specified on the command line.

Input options

 * -a parallel-treebank-file Name of the file that contains the parallel treebank. Default format is Stockholm Tree Aligner format (where the sentence alignment is implicitly given by tree node alignments). To use a different format, use the option -A.
 * -A format Format of the parallel treebank/corpus. Default is sta (Stockholm Tree Aligner format). Other options are, for example, 'opus' (CES XML format as it is used in the OPUS corpus).
 * -s source-treebank-file Name of the file that contains the source language treebank. This is useful to specify a file that is different from the one referenced in the 'parallel-treebank-file'. For example, sentence alignment files from OPUS usually refer to non-parsed XML files. With -s we can override this and refer to the parsed corpus instead. However, be aware that the same sentences have to be covered in the same order, and the appropriate IDs of these sentences have to be found when reading through the treebank files.
 * -S format Format of the source language treebank. Default is TigerXML (which is used in the Stockholm Tree Aligner)
 * -t target-treebank-file Name of the target language treebank file (similar to -s but for the target language)
 * -T format Format of the target language treebank (similar to -S)
 * -w Swap alignment direction when reading through the parallel treebank
 * -i Try to align index nodes as well (used in AlpinoXML)
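
As a rough illustration of how the input options above combine, the sketch below assembles a treealign argument list for an OPUS sentence alignment file whose treebank references are overridden with -s and -t. The helper name and all file names are hypothetical. Only flags documented in this section are used, and the command is printed rather than executed.

```python
import shlex


def build_input_args(parallel_file, fmt=None, source_file=None,
                     target_file=None, swap=False, index_nodes=False):
    """Assemble the input-related part of a treealign command line."""
    args = ["treealign", "-a", parallel_file]
    if fmt:
        args += ["-A", fmt]          # format of the parallel corpus
    if source_file:
        args += ["-s", source_file]  # override the source treebank file
    if target_file:
        args += ["-t", target_file]  # override the target treebank file
    if swap:
        args.append("-w")            # swap alignment direction
    if index_nodes:
        args.append("-i")            # also try to align index nodes
    return args


# Hypothetical file names, for illustration only.
args = build_input_args("en-sv.xml", fmt="opus",
                        source_file="en.parsed.xml",
                        target_file="sv.parsed.xml")
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run` once the file names are replaced with real ones.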

Training options

Training is enabled if a positive number of training sentences is specified with the -n option OR the model file does not exist.
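
The rule above can be written as a one-line check. This is a sketch of the stated condition only, not the script's actual code; the function and parameter names are hypothetical.

```python
import os


def training_enabled(nr_sent, model_file):
    """True if -n gives a positive count OR the model file is missing."""
    return nr_sent > 0 or not os.path.exists(model_file)
```

Under this reading, running with an existing model file and without -n skips training and goes straight to alignment, while giving a positive -n retrains even if the model file already exists.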

 * -n nr_sent Specify how many sentence (tree) pairs will be used for training a new tree-aligner model.
 * -f features Define features to be used in training. (For alignment, features are read from the modelfile.feat file instead.) 'features' is a string with feature types separated by ':'. There are various features that can be used and combined. For more details look at Lingua::Align::TreesFeatures. The default is 'insideST2:insideTS2:outsideST2:outsideTS2'.
 * -m model-file Name of the file to store model parameters in / read model parameters from
 * -c classifier Classifier to be used. Default is 'megam'. Another possibility is 'clue', which refers to a noisy-or like classifier with independent precision-weighted features (requires probabilistic values for each feature and supports only positive features). Other classifiers may be supported in future releases of LinguaAlign.
 * -M moses-dir Directory with the GIZA++ and Moses word alignment files that will be used for extracting certain features. Default is 'moses' and the treealigner expects to find files with the following names
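
Returning to the -f and -c options above, the sketch below splits a feature string on ':' and combines per-feature probabilities in a noisy-or fashion, which is one way to read the description of the 'clue' classifier. The formula used, 1 - prod(1 - w_i * p_i), is a common form of weighted noisy-or and is an assumption here, not a confirmed description of the Lingua-Align code. The precision weights and feature values are made up.

```python
def parse_features(spec):
    """Split a feature string such as 'insideST2:insideTS2' on ':'."""
    return [name for name in spec.split(":") if name]


def noisy_or(values, precision):
    """Combine independent feature probabilities, weighted by precision."""
    p_no_link = 1.0
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"feature {name!r} must be a probability")
        # Each feature independently 'fails' to support the link.
        p_no_link *= 1.0 - precision.get(name, 0.0) * value
    return 1.0 - p_no_link


features = parse_features("insideST2:insideTS2:outsideST2:outsideTS2")

# Hypothetical precision weights and feature values for one node pair.
precision = {"insideST2": 0.8, "insideTS2": 0.6}
values = {"insideST2": 0.5, "insideTS2": 0.5}
score = noisy_or(values, precision)
print(features, round(score, 2))
```

Because every factor lies between 0 and 1, additional features can only raise the combined score, which matches the restriction to positive features with probabilistic values.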

SEE ALSO

Align::Trees, Lingua::Align::Features, Lingua::Align::Corpus

AUTHOR

Joerg Tiedemann

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

Copyright for MegaM by Hal Daume III; see http://www.cs.utah.edu/~hal/megam/ for more information.

Paper: Notes on CG and LM-BFGS Optimization of Logistic Regression, 2004, http://www.cs.utah.edu/~hal/docs/daume04cg-bfgs.pdf