PORTAGE_sharedWordAlignmentFormats - SamuelLarkin/LizzyConversion GitHub Wiki
Up: PortageII Previous: TextFileFormats Next: ConstructingModels
Using Word Alignments
When you use the framework to train PORTAGE shared systems, the word alignments are stored in *.align.gz files in the directory models/wal. They are written in the sri format (see below), which is the most common format in the SMT community. We used to recalculate alignments on the fly each time we needed them, because IBM1 and IBM2 alignments are quick to generate. With HMM and IBM4 alignments, which are slower to compute, that strategy is no longer reasonable.
The sri format can be hard to read, so we provide tools to generate word alignments in several formats and to manipulate them.
To inspect, evaluate, or use word alignments directly, one can use the program align-words to output word alignments in any of the formats documented below.
The program eval_word_alignment can be used to evaluate word alignments against a reference alignment. As input, it supports the sri, hwa and green formats shown below.
The program word_align_tool can be used to manipulate word alignment files. It can convert from sri, green and hwa to any of the other formats shown below, and it can also perform a couple of transformations on the alignments themselves.
The program gen_phrase_tables can read GIZA++ alignment files, as an alternative to generating alignments on the fly. This is intended to allow the use of word alignment models trained with GIZA++ in the PORTAGE shared phrase extraction process. The align-words program can also be used to convert GIZA++ alignment files to any of the formats listed below.
The program word_align_phrasepairs produces word alignments that are consistent with a given phrase alignment (as generated during decoding, for example).
Word Alignment Formats
We illustrate the formats produced by align-words with an example. This example is found in test-suite/unit-testing/word_align_tool (run make to generate all the output formats).
English input:
For years the Canadian Alliance has attempted to get the former finance minister to treat municipalities with respect .
French input:
L' Alliance canadienne tente depuis des années d' obtenir que l' ancien ministre des Finances traite les municipalités avec respect .
aachen
Format:
SENT: 0
S 0 0
S 1 1
S 2 2
S 3 6
S 4 3
S 4 5
S 5 4
S 6 6
S 7 7
S 8 8
S 9 8
S 10 9
S 11 10
S 11 11
S 12 12
S 13 11
S 14 12
S 15 14
S 16 13
S 17 15
S 18 16
S 19 17
S 20 18
hwa
Format: (Output is in a file, aligned.NNN.)
0 0 (L', For)
1 1 (Alliance, years)
2 2 (canadienne, the)
3 6 (tente, attempted)
4 3 (depuis, Canadian)
4 5 (depuis, has)
5 4 (des, Alliance)
6 6 (années, attempted)
7 7 (d', to)
8 8 (obtenir, get)
9 8 (que, get)
10 9 (l', the)
11 10 (ancien, former)
11 11 (ancien, finance)
12 12 (ministre, minister)
13 11 (des, finance)
14 12 (Finances, minister)
15 14 (traite, treat)
16 13 (les, to)
17 15 (municipalités, municipalities)
18 16 (avec, with)
19 17 (respect, respect)
20 18 (., .)
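As a reading aid, the following minimal Python sketch rebuilds the hwa entries shown above from a list of index pairs and the two tokenized sentences. It is not part of PORTAGE shared: the function name `links_to_hwa` is hypothetical, and the layout (first index and first word taken from the French sentence, second from the English one) is inferred from this single example.

```python
# Hypothetical helper, not a PORTAGE shared tool: renders "i-j" index pairs
# as hwa-style entries "i j (word_i, word_j)", following the example above.

FRENCH = ("L' Alliance canadienne tente depuis des années d' obtenir que l' "
          "ancien ministre des Finances traite les municipalités avec respect .").split()
ENGLISH = ("For years the Canadian Alliance has attempted to get the former "
           "finance minister to treat municipalities with respect .").split()


def links_to_hwa(pairs, side1, side2):
    """Return one hwa-style string per "i-j" pair in the input line."""
    entries = []
    for pair in pairs.split():
        i, j = (int(x) for x in pair.split("-"))
        # Assumption: the first index points into side1, the second into side2.
        entries.append(f"{i} {j} ({side1[i]}, {side2[j]})")
    return entries


hwa_entries = links_to_hwa("0-0 3-6 4-3 4-5 20-18", FRENCH, ENGLISH)
for entry in hwa_entries:
    print(entry)
```

Because the words are printed next to the indices, this form is convenient for checking an alignment by eye.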
matrix
Format:
|F|y|t|C|A|h|a|t|g|t|f|f|m|t|t|m|w|r|.|
|o|e|h|a|l|a|t|o|e|h|o|i|i|o|r|u|i|e| |
|r|a|e|n|l|s|t| |t|e|r|n|n| |e|n|t|s| |
| |r| |a|i| |e| | | |m|a|i| |a|i|h|p| |
| |s| |d|a| |m| | | |e|n|s| |t|c| |e| |
| | | |i|n| |p| | | |r|c|t| | |i| |c| |
| | | |a|c| |t| | | | |e|e| | |p| |t| |
| | | |n|e| |e| | | | | |r| | |a| | | |
| | | | | | |d| | | | | | | | |l| | | |
| | | | | | | | | | | | | | | |i| | | |
| | | | | | | | | | | | | | | |t| | | |
| | | | | | | | | | | | | | | |i| | | |
| | | | | | | | | | | | | | | |e| | | |
| | | | | | | | | | | | | | | |s| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|x| | | | | | | | | | | | | | | | | | |L'
| |x| | | | | | | | | | | | | | | | | |Alliance
| | |x| | | | | | | | | | | | | | | | |canadienne
| | | | | | |x| | | | | | | | | | | | |tente
| | | |x| |x| | | | | | | | | | | | | |depuis
| | | | |x| | | | | | | | | | | | | | |des
| | | | | | |x| | | | | | | | | | | | |années
| | | | | | | |x| | | | | | | | | | | |d'
| | | | | | | | |x| | | | | | | | | | |obtenir
| | | | | | | | |x| | | | | | | | | | |que
| | | | | | | | | |x| | | | | | | | | |l'
| | | | | | | | | | |x|x| | | | | | | |ancien
| | | | | | | | | | | | |x| | | | | | |ministre
| | | | | | | | | | | |x| | | | | | | |des
| | | | | | | | | | | | |x| | | | | | |Finances
| | | | | | | | | | | | | | |x| | | | |traite
| | | | | | | | | | | | | |x| | | | | |les
| | | | | | | | | | | | | | | |x| | | |municipalités
| | | | | | | | | | | | | | | | |x| | |avec
| | | | | | | | | | | | | | | | | |x| |respect
| | | | | | | | | | | | | | | | | | |x|.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
compact
Format: 0 1;2;3;7;4,6;5;7;8;9;9;10;11,12;13;12;13;15;14;16;17;18;19;
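The compact line above can be unpacked with a few lines of Python. This is only a sketch based on what the example suggests, not on a documented specification: the leading field is assumed to be a sentence counter, each semicolon-terminated field is assumed to list the 1-based English positions aligned to the French word at that position, and an empty field is assumed to mean "no link". The function name `compact_to_pairs` is hypothetical.

```python
# Sketch under stated assumptions (see the paragraph above); not a PORTAGE tool.

COMPACT = "0 1;2;3;7;4,6;5;7;8;9;9;10;11,12;13;12;13;15;14;16;17;18;19;"


def compact_to_pairs(line):
    """Return (leading_field, "i-j i-j ...") with 0-based indices."""
    leading, body = line.split(None, 1)
    pairs = []
    for i, field in enumerate(body.strip().split(";")):
        if not field:
            # Empty field: the terminator after the last word, or (assumed)
            # a word with no alignment link.
            continue
        for j in field.split(","):
            # Assumption: positions in this format are 1-based.
            pairs.append(f"{i}-{int(j) - 1}")
    return int(leading), " ".join(pairs)


sentence_id, pair_line = compact_to_pairs(COMPACT)
print(sentence_id)
print(pair_line)
```

Under these assumptions, the resulting pairs are the same links that the other formats on this page show for this sentence pair.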
ugly
Format: Warning: this format does not contain all alignment information.
L'/For Alliance/years canadienne/the tente/attempted depuis/Canadian depuis/has des/Alliance années/attempted d'/to obtenir/get que/get l'/the ancien/former ancien/finance ministre/minister des/finance Finances/minister traite/treat les/to municipalités/municipalities avec/with respect/respect ./.
green
Format: 0 1 2 6 3,5 4 6 7 8 8 9 10,11 12 11 12 14 13 15 16 17 18
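To make the relationship between the green line and the index-pair notation concrete, here is a small Python sketch. It assumes, from this example only, that the k-th whitespace-separated field lists the 0-based English positions aligned to French word k, with commas separating multiple positions. How unaligned words are written is not visible in the example, so the sketch does not handle them. The function name `green_to_pairs` is hypothetical.

```python
# Sketch inferred from the single example on this page; not a PORTAGE tool.

GREEN = "0 1 2 6 3,5 4 6 7 8 8 9 10,11 12 11 12 14 13 15 16 17 18"


def green_to_pairs(line):
    """Convert a green-style line into an "i-j i-j ..." line of index pairs."""
    pairs = []
    for i, field in enumerate(line.split()):
        for j in field.split(","):
            # int() raises ValueError on anything that is not a plain index,
            # e.g. a marker for an unaligned word (unknown to this sketch).
            pairs.append(f"{i}-{int(j)}")
    return " ".join(pairs)


pair_line = green_to_pairs(GREEN)
print(pair_line)
```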
sri
Format: 0-0 1-1 2-2 3-6 4-3 4-5 5-4 6-6 7-7 8-8 9-8 10-9 11-10 11-11 12-12 13-11 14-12 15-14 16-13 17-15 18-16 19-17 20-18
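Since sri is the format stored in the *.align.gz files, a short Python sketch for reading one line may be useful. The function names `parse_sri` and `partners` are hypothetical. Judging from the hwa output above (for example 3 6 (tente, attempted)), the first index of each pair refers to the French sentence and the second to the English one in this example; that reading is an inference, not a documented rule.

```python
from collections import defaultdict

# Sketch only: reads one sri-format line; not taken from the PORTAGE sources.


def parse_sri(line):
    """Turn one sri-format line into a list of (i, j) integer pairs."""
    links = []
    for token in line.split():
        left, right = token.split("-")
        links.append((int(left), int(right)))
    return links


def partners(links):
    """Group links by their first index: {i: [j, ...]}, in input order."""
    grouped = defaultdict(list)
    for i, j in links:
        grouped[i].append(j)
    return dict(grouped)


sample = "3-6 4-3 4-5 5-4"  # a slice of the example line above
links = parse_sri(sample)
by_first_index = partners(links)
print(links)
print(by_first_index)
```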
uli
Format: 1 0:0:unspec 1:1:unspec 2:2:unspec 3,6:6:unspec 4:3,5:unspec 5:4:unspec 7:7:unspec 8,9:8:unspec 10:9:unspec 11,13:10,11:unspec 12,14:12:unspec 15:14:unspec 16:13:unspec 17:15:unspec 18:16:unspec 19:17:unspec 20:18:unspec
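Comparing the uli line with the sri line for the same sentence pair suggests that links sharing a word are merged into groups, written as first-side positions, second-side positions, and a label. The Python sketch below reproduces that grouping by collecting connected sets of links. This is an interpretation of the example, not a documented definition of the format. The leading 1 and the meaning of unspec are not interpreted here, and the function names are hypothetical.

```python
from collections import defaultdict

# Sketch of one possible reading of the uli example; not a PORTAGE tool.

SRI = ("0-0 1-1 2-2 3-6 4-3 4-5 5-4 6-6 7-7 8-8 9-8 10-9 11-10 11-11 "
       "12-12 13-11 14-12 15-14 16-13 17-15 18-16 19-17 20-18")


def group_links(sri_line):
    """Return [(first_side_positions, second_side_positions), ...]."""
    links = [tuple(int(x) for x in tok.split("-")) for tok in sri_line.split()]
    # Bipartite graph: nodes are tagged with their side so indices don't clash.
    adjacent = defaultdict(set)
    for i, j in links:
        adjacent[("a", i)].add(("b", j))
        adjacent[("b", j)].add(("a", i))
    seen = set()
    groups = []
    for i, _ in sorted(links):
        start = ("a", i)
        if start in seen:
            continue
        # Depth-first walk collecting every word reachable through links.
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacent[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        first = sorted(n for side, n in component if side == "a")
        second = sorted(n for side, n in component if side == "b")
        groups.append((first, second))
    return groups


def format_groups(groups, label="unspec"):
    """Write groups as "i,i:j,j:label" fields separated by spaces."""
    return " ".join(
        ",".join(map(str, a)) + ":" + ",".join(map(str, b)) + ":" + label
        for a, b in groups
    )


grouped_line = format_groups(group_links(SRI))
print(grouped_line)
```

For the example sentence, the printed fields match the uli line above apart from its leading 1.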