
Lingua::Align - a toolbox for Tree Alignment

Lingua::Align is a collection of command-line tools for automatic tree and word alignment of parallel corpora. Its main purpose is to provide an experimental toolbox for exploring various feature sets and alignment strategies. Alignment is based on local classification and alignment inference. The local classifier is typically trained on available aligned training data. We use a log-linear model for discriminative binary classification, implemented with the maximum entropy learning package megam (by Hal Daumé III).

Download

Lingua::Align is available from here: https://bitbucket.org/tiedemann/lingua-align

Installation

You can either install the Perl modules and binaries in the usual way:
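
For a Perl distribution this is usually the standard `ExtUtils::MakeMaker` sequence (a sketch; check the README in the package for specifics):

```
perl Makefile.PL    # generate the Makefile
make                # build the package
make test           # optional: run the test suite
make install        # install (may require root privileges)
```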

Or you can simply run the `treealign` script (and the other tools) in the `bin/` directory without installing anything. The only requirement is a recent version of Perl with `XML::Parser` (the Perl wrapper for the Expat XML parser) installed on your system.

The tree aligner calls an external tool (megam), which is provided as a pre-compiled binary in the `bin/` directory. The default version is an i686 binary for Linux-based systems. The package also includes a binary for Intel-based Mac OS X; if you want to use this version, change the link `bin/megam` to point to `bin/megam.osx`. For all other platforms, download the source from http://www.cs.utah.edu/~hal/megam/ and compile it on your platform. Make sure that the binary works and link it to `bin/megam`.

Some features require word alignment information. To produce it you need to run tools such as Giza++ (http://code.google.com/p/giza-pp/) and Moses (http://statmt.org/moses/).

Quickstart Tutorial

The easiest way to use the tree aligner is to run the frontend script `treealign` in the `bin/` directory. There are many options and command-line arguments that can be used to adjust the behaviour of the alignment tools. Have a look at Lingua::treealign for more information.

Run tests on existing data sets

For a simple test: go to the directory `europarl` and run `make test`.

This will run a simple test with only a few training sentences from the Europarl corpus and simple features for classification. The test consists of two calls to the tree aligner scripts: `treealign` is used to train a classifier and to align unseen sentences from the given data set, and `treealigneval` is used to score the alignment produced with that model and the chosen alignment strategy.

The example training data is stored in `europarl/nl-en-weak_125.xml`, which has been produced by manual alignment (thanks to Gideon Kotzé) using the Stockholm Tree Aligner (http://kitt.cl.uzh.ch/kitt/treealigner). The format looks like this:
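
Roughly like this (the IDs and header details below are made up; see the file itself for the exact layout):

```
<?xml version="1.0" encoding="UTF-8"?>
<treealign>
  <head>
    <treebanks>
      <treebank id="en" filename="ep-00-12-15.125.en.tiger"/>
      <treebank id="nl" filename="ep-00-12-15.125.nl.tiger"/>
    </treebanks>
  </head>
  <alignments>
    <!-- one <align> element per link; type is "good" or "fuzzy" -->
    <align type="good">
      <node node_id="s1_504" treebank_id="en"/>
      <node node_id="s1_507" treebank_id="nl"/>
    </align>
  </alignments>
</treealign>
```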

The actual treebank data is stored in TigerXML (in this case), and the links are pointers into these documents using unique node IDs. This should be quite straightforward (see the example above). Other formats are also supported, for example Penn Treebank format and AlpinoXML for storing treebanks. There is also support for other alignment formats, such as the tree alignment format used by the Dublin Subtree Aligner and the word alignment formats used by Giza++, Moses and the shared tasks on word alignment (WPT2003).

Run with your own settings

Basically you can call the tree-aligner frontend with your own data and settings like this:
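
A sketch of such a call (the `-a`, `-n` and `-e` option letters are assumptions here; only `-f` is documented in this tutorial, so check Lingua::treealign for the authoritative names):

```
# -a = training alignment file, -n/-e = nr of training/test sentences (assumed)
treealign -a ALIGNFILE -f FEATURES -n NR_TRAIN_SENT -e NR_TEST_SENT > aligned.xml
```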

The alignment file ALIGNFILE has to contain the tree alignments that will be used for training the classifier. The default format is the one explained above (similar to the one used by the Stockholm Tree Aligner). FEATURES is a string specifying the features to be used in classification. NR_TRAIN_SENT is the number of sentences to be used for training and NR_TEST_SENT is the number of test sentences. There are many more options that can be set on the command line. Please look at Lingua::treealign for more information.

Of course, it is also possible to align treebanks using an existing alignment model. The only things you need are the treebank files for both languages, which have to be sentence-aligned. Assuming that the alignment model is stored in the default file `treealign.megam` and the two treebanks (`ep-00-12-15.125.en.penn`, `ep-00-12-15.125.nl.penn` from the sample files in `europarl/`) are stored in bracketed Penn Treebank format, you can call the aligner like this:
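
Roughly as follows; all of the option letters below (`-m` for the model file, `-s`/`-t` for the two treebanks, `-S`/`-T` for their formats) are assumptions, so verify them against Lingua::treealign:

```
# option letters are assumptions -- see Lingua::treealign
treealign -m treealign.megam \
          -s ep-00-12-15.125.en.penn -S penn \
          -t ep-00-12-15.125.nl.penn -T penn > alignments
```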

This assumes that the trees from the two treebanks are aligned with each other in the same order in which they appear in the given files (corresponding lines in this case). The features to be used for classification have to be stored in `treealign.megam.feat` (they should be if `treealign.megam` has been produced by Lingua-Align). The tree alignments will be stored in `alignments` in STA (Stockholm Tree Aligner) format.

Here are some more details about the things you need for running your own experiments:

Training data

You need your own tree-aligned training data. The easiest way to create it is to use the Stockholm Tree Aligner; the format produced by this tool can be used directly by Lingua::Align. You need at least 100 pairs of parse trees to obtain reasonable results; more is better, of course. The corpus has to be parsed on both sides. You need to use TigerXML for the Stockholm Tree Aligner, and this is also the most convenient format for the tree aligner later on (also for visualizing automatic alignments).

There is a tool to convert treebanks between the formats supported by Lingua::Align: `bin/convert_treebank`. For example, if your parse trees are stored in Penn Treebank format (in a file called `treebank`), you might try the following command:
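
Something along these lines (the `-i`/`-o` format options are guesses; see the convert_treebank manpage for the actual interface):

```
# -i/-o = input/output format (assumed option letters)
convert_treebank -i penn -o tiger < treebank > treebank.tiger.xml
```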

Hopefully this will create a corpus that can be loaded into the Stockholm Tree Aligner. You might have to check the specifications in the XML header and adjust some (meta) information. You can, for example, validate your TigerXML against the schema:
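
For instance with the standard `xmllint` tool (the schema file name below is a placeholder; use the TigerXML schema that matches your data):

```
xmllint --noout --schema TigerXML.xsd treebank.tiger.xml
```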

You can also use another format, similar to the one used by the Dublin Tree Aligner, which uses a bracketed format for the tree structures and expresses links as references to the nodes in these trees. Here is an example (from `europarl/nl-en_125.dublin`):
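
Schematically (the trees and node numbers below are made up, and the third line pairs source and target node numbers; check the sample file for the exact link notation):

```
(S (NP (PRP I))(VP (VBP agree)))
(smain (vnw ik)(ww stem)(vz mee))
1 1 2 2 4 3
```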

The first row is the source language tree, the second the target language tree, and the third contains the links between source and target nodes. This format is not entirely compatible with the Dublin Tree Aligner format, as it does not support conflated unary productions.

If you want to use this format for your training data, you can call the tree-aligner script with an extra parameter (`-A`) specifying the alignment format, for example:
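
For example (using the assumed `-a`/`-n`/`-e` options from above; the format name `dublin` is also a guess based on the file extension):

```
treealign -a nl-en_125.dublin -A dublin -f catpos -n 10 -e 10 > aligned.xml
```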

Features & parameters

First of all, you need to decide what kind of features should be used in the classifier model. Quite a lot of features are supported by Lingua::Align, and you can easily add new ones; look at Lingua::Align::Features for more information on classification features. Features are given as a list of feature type names separated by ':' using the command-line flag '-f'. Here are some simple examples, assuming that the training corpus is stored in Stockholm Tree Aligner format in a file called `nl-en_125.xml` (from `europarl/`):
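
For instance (again with the assumed `-a`/`-n`/`-e` options):

```
treealign -a nl-en_125.xml -f catpos -n 10 -e 10 > aligned.xml
```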

This will train a model on the first 10 tree pairs using the `catpos` feature (pairs of category or POS labels) and then align the following 10 tree pairs. The classification model is stored in the default file `treealign.megam`. Have a look at the parameters if you like (it's just a plain text file). For the other training options, default settings are used (check the Lingua::treealign script for more details), and standard settings are also applied for alignment: the tree aligner will perform a two-step strategy using local classification and greedy alignment inference.

The result is printed to STDOUT and redirected to `aligned.xml` in the example above. Use the following command to evaluate the alignment just produced:
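
Presumably along these lines, with the gold standard file first and the system output second (the argument order is an assumption; see Lingua::treealigneval):

```
treealigneval nl-en_125.xml aligned.xml
```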

This should give you very low precision and recall values (around 25%). Note that the individual recall values for terminal and non-terminal nodes are not correct, because node type information is not available in the gold standard and the IDs do not follow the convention that would make a clear distinction possible.

Another simple example using two feature types ("tree level similarity" and "tree span similarity") is the following:
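
For example (assuming the feature type names are `treelevelsim` and `treespansim`; the full inventory is in Lingua::Align::Features):

```
treealign -a nl-en_125.xml -f treelevelsim:treespansim -n 10 -e 10 > aligned.xml
```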

The classification model will look something like this:
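
Roughly, one feature name and its learned weight per line (the weights below are made up):

```
treelevelsim    2.38421
treespansim     1.09640
```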

Still, these features are not very informative, and the scores will still be very low. Now try a combination of the three feature types mentioned above:
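
Again with the assumed options from above:

```
treealign -a nl-en_125.xml -f catpos:treelevelsim:treespansim -n 10 -e 10 > aligned.xml
```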

This will give you much better results already (around 50% F-scores).

Now you can start to experiment with contextual features, for example, `catpos` features of parent and children nodes:
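
The `parent_` and `children_` prefixes attach a feature template to the context of the current node pair:

```
treealign -a nl-en_125.xml -f catpos:parent_catpos:children_catpos -n 10 -e 10 > aligned.xml
```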

And, surprise, this gives you another improvement (around 60% F-score). You can also get features from neighboring nodes using 'sister_' and 'neighborXY_' as prefixes. With 'sister_', features will be extracted from ALL sister nodes, i.e. nodes that have the same parent. For real-valued features, the average (arithmetic mean) of the feature values of these sister nodes will be used; for binary feature templates (for example 'catpos'), all of them will be included. ('children_' behaves in exactly the same way.)

The 'neighborXY_' prefix is more flexible. You can specify neighbors using X as the distance in the source language tree and Y as the distance in the target language tree. Negative values are interpreted as left neighbors and positive values (don't write an explicit '+') as right neighbors. For terminal nodes, all surface words are considered when retrieving neighbors; for nonterminals, only neighboring nodes with the same parent as the current node are considered! Note that the distances have to be less than 10, because the pattern only allows single digits. Here is an example of the use of neighbor features:
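
Following the digit-pair encoding just described, `neighbor-10` pairs the left source neighbor (X=-1) with the current target node (Y=0), and `neighbor-11` pairs it with the right target neighbor (Y=1):

```
treealign -a nl-en_125.xml -f catpos:neighbor-10_catpos:neighbor-11_catpos -n 10 -e 10 > aligned.xml
```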

This retrieves the 'catpos' feature from the current node pair, from the left source tree neighbor together with the current target node, and from the left source tree neighbor together with the right target node neighbor.

Note that the models so far do not use any information other than the features directly extracted from the parse trees and the alignment information available in the training data. There are also features that need external resources. For example, you may include word alignment information in the tree alignment. For this you need to run automatic word alignment first (on the treebank sentences you're using in your experiments), and you need to store the information in a format supported by Lingua::Align. You may use the Viterbi alignments produced by Giza++ (`moses/giza.src-trg/src-trg.A3.final.gz` and `moses/giza.trg-src/trg-src.A3.final.gz`):
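
Something like this, using the `gizae2f`/`gizaf2e` features; the `-g`/`-G` options for pointing at the two Viterbi alignment files are assumptions (the paths above are the defaults anyway, see below):

```
# -g/-G = GIZA++ alignment files in each direction (assumed option letters)
treealign -a nl-en_125.xml -f gizae2f:gizaf2e -n 10 -e 10 \
          -g moses/giza.src-trg/src-trg.A3.final.gz \
          -G moses/giza.trg-src/trg-src.A3.final.gz > aligned.xml
```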

You can see how effective these features are for tree alignment (well, at least they already give you around 55% F-scores with the tiny training data we are using in our examples). Of course, you can use word alignment features from context nodes as well (giving you around 65% F-scores):
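
For instance (assuming `giza` is the name of the feature combining both directions, as the next paragraph suggests):

```
treealign -a nl-en_125.xml -f gizae2f:gizaf2e:parent_giza:children_giza -n 10 -e 10 > aligned.xml
```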

Note that we use a combination of `gizae2f` and `gizaf2e` for the context nodes. Now try a combination of all features we mentioned so far. You should get a decent score of around 74% F-score. Nice, isn't it?

Another word alignment feature is based on the symmetrized alignments produced by Moses. Use them in the following way:
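
Perhaps like this; `-y` is the documented option for the Moses alignment file, while the feature name `moses` and the default path are assumptions:

```
# feature name and file path are assumptions
treealign -a nl-en_125.xml -f moses -n 10 -e 10 \
          -y moses/model/aligned.grow-diag-final > aligned.xml
```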

Don't ask me why the parameter is '-y' for the Moses alignment file. (It's basically because treealign only uses short command-line options and I was running out of letters ....)

Actually, you could leave out the file specifications in the examples above, because we were just using the default names and paths. You can use the flag `-M moses-dir` if the file names and sub-directories are the same but the main Moses working directory is different (for example `my-moses-dir`):
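
For example (again with the assumed `moses` feature name):

```
treealign -a nl-en_125.xml -f moses -n 10 -e 10 -M my-moses-dir > aligned.xml
```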

Finally, we should introduce history features. So far we have only done local classification, without considering alignment decisions on other nodes. The classifier can also be trained with so-called history features -- features based on previous decisions. Using such features forces the tree aligner to use a sequential classification procedure, either bottom-up or top-down. In top-down mode, classification starts at the root nodes, and the classifier uses the alignment decisions on the parent nodes as additional features. You can use these so-called parent features like this (adding the flag `-P` to the command line):
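
For example:

```
treealign -a nl-en_125.xml -f catpos:parent_catpos:children_catpos -n 10 -e 10 -P > aligned.xml
```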

Compare this to the alignment without the `-P` flag and you will see the difference when running the evaluation. In bottom-up classification, two types of history features are supported: the proportion of links between immediate children nodes (`-C`) and the proportion of links between all children nodes in the entire subtrees (`-U`).
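
The bottom-up variants look the same, with `-C` or `-U` instead of `-P`:

```
treealign -a nl-en_125.xml -f catpos -n 10 -e 10 -C > aligned.xml   # immediate children links
treealign -a nl-en_125.xml -f catpos -n 10 -e 10 -U > aligned.xml   # links in entire subtrees
```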

Note that history features coming from parent links and those coming from children cannot be combined (for obvious reasons). And don't expect improvements in all cases: especially for rich feature sets, no big gains can be expected, and alignment will also be (even) slower.

Alignment strategies

In the default settings, a two-step procedure is used. First, all node pairs are classified using the local classifier, possibly including history features. The second step comprises the actual alignment (inference), in which nodes are linked to each other according to the link likelihoods assigned by the local classifier in the first step. The default strategy is a "greedy" alignment procedure, starting with the node pair with the highest link likelihood and running greedily through the set of candidates. A necessary constraint is that each node is aligned at most once (on both sides).

You can use other strategies, for example one with an additional well-formedness constraint:
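
Possibly like this; the strategy name is a guess, so check Lingua::Align::LinkSearch for the names that are actually accepted:

```
# strategy name is an assumption
treealign -a nl-en_125.xml -f catpos:treelevelsim:treespansim -n 10 -e 10 \
          -l greedyWellFormed > aligned.xml
```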

Compare this to the result obtained with the standard strategy (`-l greedy`). Another common technique is to model tree alignment as a maximum weighted bipartite matching problem and solve it with graph-theoretic algorithms. Lingua::Align includes a free implementation of the Hungarian algorithm (Kuhn-Munkres), which solves this problem in polynomial time.
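
To try the bipartite matching search, something along these lines should work (the strategy name is again a guess):

```
# strategy name is an assumption
treealign -a nl-en_125.xml -f catpos:treelevelsim:treespansim -n 10 -e 10 \
          -l assignment > aligned.xml
```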

Several other inference strategies can be used. Documentation for the ones included in Lingua::Align is still rather sparse; look at the code in the module Lingua::Align::LinkSearch for more information.

You can also skip the inference step and simply use the decisions of the local classifier (by default, scores above 0.5 indicate a link):
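
Presumably via a threshold-type search (the name is assumed, as above):

```
# strategy name is an assumption
treealign -a nl-en_125.xml -f catpos:treelevelsim:treespansim -n 10 -e 10 \
          -l threshold > aligned.xml
```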

With simple feature sets, the scores will be much lower this way. Alignment constraints such as the one-to-one link constraint and the well-formedness of links are important in those cases; for richer feature sets this difference fades away.

One problem with the greedy strategies is that alignment is slow, because all node pairs have to be considered as candidates for classification (and alignment), and feature extraction is actually the bottleneck of the entire alignment procedure (not classification, nor alignment inference). There is a way to speed this up by combining local classification with a greedy alignment strategy. This is (again) called 'bottom-up' alignment, but this time classifier scores are used immediately for establishing links between nodes. Alignment starts at the leaf nodes, and each node pair that receives a score above 0.5 is aligned immediately (and not considered again afterwards). After this greedy bottom-up procedure, the chosen alignment inference strategy is applied to the remaining unlinked nodes. Use the option `-b`:
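
For example, together with a children-history feature (`-P` is not available here):

```
treealign -a nl-en_125.xml -f catpos:treelevelsim:treespansim -n 10 -e 10 \
          -b -C -v > aligned.xml
```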

Observe that we can use history features again (but not `-P`). This should speed up the alignment process a bit (though not as much as you might expect ...). You can get information about the runtime by including the verbose output flag `-v`.

Library structure

There are several options that can be set. For further information, have a look at the man pages linked below or just look at the source code. Extending the code is quite straightforward, even though the documentation is not perfect and the code is partially awful (well, it's Perl ....). Here is a (hopefully up-to-date) list of modules (many of them are under-developed / experimental / non-functioning possible projects for the future):

 * top-level modules (Lingua::Align, Lingua::Align::Trees)
 * modules for feature extraction (Lingua::Align::Features)
 * modules for classification
 * modules for alignment inference (Lingua::Align::LinkSearch)
 * modules for data manipulation (Lingua::Align::Corpus)

How to do word alignment

Lingua::Align can, of course, also be used for word alignment. This is straightforward if you have parse trees available: just specify the flag `-L` (leafs only) to consider only terminal nodes during training and alignment. (Note that you can also align only non-terminal nodes using the flag `-N`; if you use both flags, `-N -L`, only nodes of the same type will be aligned.)

Furthermore, you can also use the software to do word alignment on plain text files (this is still quite experimental). Look at the example in `europarl/wpt03` to see how to run the aligner. Again, you need some training data and you have to specify some features to be used for classification. Training data can be in the format of the shared task on word alignment WPT 2003/2005 (http://www.cse.unt.edu/~rada/wpt/):
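
Each line of a WPT-style file links one source token position to one target token position within a sentence pair, optionally marked as S(ure) or P(ossible). Schematically (made-up numbers):

```
0001 1 1 S
0001 2 3 P
0002 1 2 S
```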

As features you may use, for example, string similarity measures such as the LCSR score (longest common subsequence ratio), Dice scores based on co-occurrence frequencies, Moses/Giza++ alignments, binary features such as the occurrence of suffix pairs, etc. Run `make` in the `europarl/wpt03` directory to see an example alignment experiment.

To run your own experiments you can specify your own setup. Here is a simple example:
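
Possibly something like this; the `lcsr=3` and `suffix=4` parameter syntax mirrors the documented `dice=...` syntax, and the `-s`/`-t` options for the two plain text files are assumptions:

```
# -s/-t and the sentence-count options are assumptions
treealign -a test.wa.nullalign -A wpt -s test.e -t test.f \
          -f lcsr=3:suffix=4:treespansim -n 100 -e 100 > aligned
```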

This uses the file `test.wa.nullalign`, which is in WPT format (`-A`), for training and testing, and aligns the source language texts (`test.e`) to the target language texts (`test.f`), both in plain text format. The features are string similarity (LCSR) between tokens that are at least 3 characters long, pairs of 4-character suffixes, and "tree span similarity", which in the case of word alignment is the relative position difference between the tokens within their sentences.

For evaluation you can use the standard evaluation script, just specifying that the gold standard is in WPT format:
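
Presumably like this, assuming `treealigneval` accepts the same `-A` format flag:

```
treealigneval -A wpt test.wa.nullalign aligned
```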

If you want to use Moses/Giza++ alignments as features, just use the same parameters as for tree alignment.

Another common feature is co-occurrence, which can be measured in various ways. You can use the script `bin/coocfreq` to generate co-occurrence frequencies from arbitrary parallel corpora, which can then be plugged into the aligner as a feature. Here is an example computing co-occurrence frequencies from tokens in the test corpus (which is much too small to produce reliable scores):
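
A sketch only -- the option letters below are stand-ins, so check the coocfreq manpage or the `europarl/wpt03` Makefile for the real interface:

```
# -s/-t = the two sides of the parallel corpus (assumed option letters!)
# -S/-T/-c = output files for source, target and co-occurrence frequencies
coocfreq -s test.e -t test.f -S word.src -T word.trg -c word.cooc
```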

This uses the parallel corpus `test.e` and `test.f` (in Moses/Giza++ plain text format -- corresponding lines are aligned to each other) to count frequencies, which will be stored in `word.src` (source language tokens), `word.trg` (target language tokens) and `word.cooc` (co-occurrence frequencies). These scores can then be used in the aligner as a feature:
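
The `dice=<cooc-file>` syntax points the Dice feature at the frequency file just produced (treealign options assumed as before):

```
treealign -a test.wa.nullalign -A wpt -s test.e -t test.f \
          -f dice=word.cooc -n 100 -e 100 > aligned
```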

Don't expect too much, as Dice scores computed from such a small corpus are not reliable! Of course, you can combine these scores with any other feature described above.

Co-occurrence frequencies can be computed for various kinds of features and feature combinations. For example, you can compute frequencies of word suffixes with the following command:
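
Schematically (assuming coocfreq takes a feature specification similar to the aligner's; the option letters are again stand-ins):

```
# option letters and the suffix=4 feature specification are assumptions
coocfreq -s test.e -t test.f -f suffix=4 \
         -S suffix.src -T suffix.trg -c suffix.cooc
```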

In order to use several Dice scores in alignment you can give these feature types different names (they have to start with 'dice'):
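
For example, a word-level Dice score next to a suffix-based one:

```
treealign -a test.wa.nullalign -A wpt -s test.e -t test.f \
          -f dice=word.cooc:dicesuffix=suffix.cooc -n 100 -e 100 > aligned
```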

It is perhaps worth mentioning that these feature types (Dice, LCSR, suffix pairs, etc.) can also be used for tree alignment, as explained earlier. In particular, Dice scores can be calculated for any feature connected to arbitrary nodes in a tree. Examples of such co-occurrence measures can be seen in the `smultron/` directory. Here is an example of computing co-occurrence frequencies for POS labels and parent category labels from parse tree pairs:
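
Entirely schematic (file names, option letters and feature names are stand-ins; the `smultron/` Makefile shows a real invocation):

```
# all names below are stand-ins
coocfreq -s smultron.en.tiger -t smultron.de.tiger -f pos -c pos.cooc
coocfreq -s smultron.en.tiger -t smultron.de.tiger -f parent_cat -c cat.cooc
```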

In tree alignment it would also make sense to use the contextual co-occurrence features, for example, `-f parent_dicecat=cat.cooc` (if `cat.cooc` includes the co-occurrence frequencies of category labels).

Finally, you can also visualize word alignments using a little tool in the `bin/` directory of Lingua::Align:

This will print link matrices comparing the proposed links with the gold standard links. It also computes cumulative evaluation measures (precision, recall, AER). It looks like this:
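
Schematically (not verbatim output; here 'x' marks a proposed link that is also in the gold standard and 'o' a gold link missed by the system):

```
         the  cat  sleeps
  de      x    .    .
  kat     .    o    .
  slaapt  .    .    x

  precision = ...  recall = ...  AER = ...
```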

Documentation & References

There are several man-pages generated from the "pod" information in the Perl modules and scripts included in Lingua::Align. Look at the following files:

 * Lingua::treealign
 * Lingua::treealigneval
 * Lingua::Align
 * Lingua::Align::Trees
 * Lingua::Align::Features
 * Lingua::Align::LinkSearch
 * Lingua::Align::Corpus
 * Lingua::coocfreq
 * Lingua::convert_treebank
 * Lingua::sta2moses
 * Lingua::sta2phrases

Here are some publications (please cite if you use the software):

 * Tiedemann, J. (2010). **Lingua-Align: An Experimental Toolbox for Automatic Tree-to-Tree Alignment.** In *Proceedings of the 7th International Conference on Language Resources and Evaluation* (LREC 2010). http://stp.lingfil.uu.se/~joerg/published/lrec2010.pdf
 * Tiedemann, J. and Kotzé, G. (2009). **Building a Large Machine-Aligned Parallel Treebank.** In *Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories* (TLT8), pages 197-208, EDUCatt, Milano/Italy. http://stp.lingfil.uu.se/~joerg/published/tlt09.pdf
 * Tiedemann, J. and Kotzé, G. (2009). **A Discriminative Approach to Tree Alignment.** In *Proceedings of the International Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography and Language Learning* (in connection with RANLP 2009), pages 33-39. http://stp.lingfil.uu.se/~joerg/published/ranlp09_tree.pdf

Author

Joerg Tiedemann

Copyright and License

Copyright (C) 2009, 2010 by Joerg Tiedemann, Gideon Kotzé

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

Copyright for MegaM by Hal Daumé III; see http://www.cs.utah.edu/~hal/megam/ for more information. Reference: Hal Daumé III, *Notes on CG and LM-BFGS Optimization of Logistic Regression*, 2004. http://www.cs.utah.edu/~hal/docs/daume04cg-bfgs.pdf

More Information

 * [[Lingua/Align.wiki]]
 * [[Lingua/Align/Corpus.wiki]]
 * [[Lingua/Align/Corpus/Parallel.wiki]]
 * [[Lingua/Align/Features.wiki]]
 * [[Lingua/Align/LinkSearch.wiki]]
 * [[Lingua/Align/Trees.wiki]]
 * [[convert_treebank.wiki]]
 * [[coocfreq.wiki]]
 * [[sta2moses.wiki]]
 * [[sta2phrases.wiki]]
 * [[treealign.wiki]]
 * [[treealigneval.wiki]]