Samples in HDF5 format - presemt-ntnu/transglobal GitHub Wiki
Samples are stored in HDF5 format, one file per language pair. They can be inspected with a tool such as HDFView. Each samples file contains two top-level entries, vocab and samples.
Vocab
Vocabulary that maps target-language lemmas to integer indices. It is stored as an array of variable-length strings.
Samples
Samples for each target-language lemma + POS tag pair that occurs as a translation candidate in an ambiguous entry in the translation dictionary. For example, the lemma beam in de-en_samples.hdf5 has:
/samples/beam/n
/samples/beam/v
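The group layout above can also be explored from code. The following is a minimal sketch, assuming the file is opened with a library such as h5py, whose file and group objects support dict-style indexing and keys(). A nested dict stands in for the opened file here so the snippet runs without an HDF5 file; the helper name pos_tags and the toy contents are hypothetical.

```python
def pos_tags(root, lemma):
    """Return the POS tags stored under /samples/<lemma>, sorted."""
    return sorted(root["samples"][lemma].keys())


# Toy stand-in for an opened samples file (assumption: same nesting
# as the /samples/<lemma>/<pos> paths shown above).
toy_file = {
    "vocab": ["grip", "tractor"],
    "samples": {
        "beam": {
            "n": {"ij": [[0, 0], [1, 0]], "data": [1, 1]},
            "v": {"ij": [[0], [0]], "data": [1]},
        }
    },
}

print(pos_tags(toy_file, "beam"))
```

With a real file, an opened h5py.File object could presumably be passed in place of toy_file.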
Samples are stored in sparse matrix format with
- ij, a two-dimensional array containing the row (i) and column (j) indices, and
- data, a one-dimensional array containing the counts of context terms.
For example,
/samples/beam/n/ij = [[0,0,...],[9689,3859,...]]
/samples/beam/n/data = [1,1,...]
means that in the first sample context (row index 0) for the noun beam, the vocabulary term with index 9689 (= tractor) occurred once, the term with index 3859 (= grip) occurred once, and so on for each term in each sample context.
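To make the encoding concrete, here is a minimal sketch that decodes an ij/data pair into one term-count mapping per sample context. It uses plain Python lists instead of arrays read from the file. The function name decode_samples is hypothetical, the vocabulary is a two-entry stand-in for the real vocab array, and the second sample row is made up for illustration.

```python
from collections import defaultdict


def decode_samples(ij, data, vocab):
    """Turn (ij, data) triplets into one {term: count} dict per sample row."""
    rows, cols = ij
    samples = defaultdict(dict)
    for i, j, count in zip(rows, cols, data):
        term = vocab[j]
        # Add up counts in case the same (row, column) pair appears twice.
        samples[i][term] = samples[i].get(term, 0) + count
    # Rows without any entries are not represented in the triplets,
    # so only non-empty samples are returned.
    return [samples[i] for i in sorted(samples)]


# Toy data: only the two indices mentioned in the text are real.
vocab = {9689: "tractor", 3859: "grip"}
ij = [[0, 0, 1], [9689, 3859, 3859]]
data = [1, 1, 2]

for n, sample in enumerate(decode_samples(ij, data, vocab), start=1):
    terms = ", ".join(f"{t}:{c}" for t, c in sorted(sample.items()))
    print(n, ":", terms)
```

The first printed line reads `1 : grip:1, tractor:1`, which matches the interpretation of row 0 given above.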
print_samp.py
The print_samp.py script can be used to inspect samples for certain lemma + pos combinations:
print_samp.py -l beam -p n de-en_samples_filtered.hdf5 | head
==============================================================================
beam/n
==============================================================================
1 : grip:1, main:1, rock:1, screen:1, ship:1, tractor:1
2 : cause:1, destructively:1, difference:1, half:1, interfere:1, interferometer:1, introduce:1, mirror:2, move:1, one:1, optical:1, path:1, range:1, reflect:1, scan:1, time:1, use:1, various:1, wavelength:1
3 : assign:1, beam:3, chamber:1, component:1, enclose:1, focus:1, incident:1, monochromator:1, obscure:1, onto:2, optical:1, overall:1, reflect:1, require:1, space:1, structure:1, support:1, tilt:1, vacuum:1, within:2, without:1
4 : project:1
5 : accurately:1, automate:1, basically:1, chapter:1, common:1, describe:1, easily:1, exception:1, factory:1, fiber:1, ion:1, laser:1, method:1, solid:1, state:1, technique:1, type:1, use:1, variation:1, various:1, visible:1
6 : campus:1, collide:1, department:1, experiment:1, facility:1, faculty:1, focus:1, join:1, locate:1, physics:1, primarily:1, research:1
7 : asunder:1, bright:1, harshly:1, heart:1, heaven:1, meet:1, mercy:1, prayer:1, reach:1, rive:1, store:1, unite:1, us:1
.
.
.
print_freq.py
The print_freq.py script shows the frequencies of the context terms for particular lemma + pos combinations:
print_freq.py -l beam -p n de-en_samples_filtered.hdf5 | head -50
==============================================================================
beam/n
==============================================================================
2158.0 1.54641022% beam
1459.0 1.04551090% laser
1365.0 0.97815104% light
1277.0 0.91509076% use
1038.0 0.74382475% will
841.0 0.60265570% can
840.0 0.60193910% one
637.0 0.45647049% two
612.0 0.43855563% power
520.0 0.37262897% see
510.0 0.36546303% make
475.0 0.34038223% take
470.0 0.33679926% electron
466.0 0.33393288% time
449.0 0.32175078% energy
426.0 0.30526912% high
414.0 0.29666999% system
411.0 0.29452020% mirror
405.0 0.29022064% length
405.0 0.29022064% like
390.0 0.27947173% eye
382.0 0.27373897% focus
378.0 0.27087260% say
374.0 0.26800622% also
344.0 0.24650839% get
327.0 0.23432629% produce
323.0 0.23145992% point
306.0 0.21927782% small
300.0 0.21497825% may
300.0 0.21497825% first
290.0 0.20781231% output
289.0 0.20709572% angle
286.0 0.20494593% speed
282.0 0.20207956% PM
279.0 0.19992977% single
278.0 0.19921318% line
276.0 0.19777999% begin
270.0 0.19348043% ton
270.0 0.19348043% position
269.0 0.19276383% target
265.0 0.18989746% large
264.0 0.18918086% go
263.0 0.18846427% reflect
261.0 0.18703108% just
258.0 0.18488130% foot
256.0 0.18344811% back
253.0 0.18129833% show
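The listing above appears to give, for each context term, its total count over all sample contexts and that count as a percentage of the sum of all counts. The following minimal sketch computes such a table from an ij/data pair under that assumption. It is not the actual print_freq.py implementation; the function name term_frequencies and the toy data are hypothetical.

```python
from collections import Counter


def term_frequencies(ij, data, vocab):
    """Return (count, percent, term) tuples, most frequent term first."""
    _, cols = ij
    totals = Counter()
    for j, count in zip(cols, data):
        totals[vocab[j]] += count
    grand_total = sum(totals.values())
    return [
        (count, 100.0 * count / grand_total, term)
        for term, count in totals.most_common()
    ]


# Toy data: three vocabulary terms spread over three sample contexts.
vocab = ["beam", "laser", "light"]
ij = [[0, 0, 1, 1, 2], [0, 1, 0, 2, 0]]
data = [3, 1, 2, 1, 1]

for count, percent, term in term_frequencies(ij, data, vocab):
    print(f"{float(count):.1f} {percent:.8f}% {term}")
```

Only the column indices and the counts are needed here; the row indices, which identify the individual sample contexts, are ignored.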