Samples in HDF5 format - presemt-ntnu/transglobal GitHub Wiki

Samples are stored in HDF5 format, one file per language pair. The files can be inspected with a tool such as HDFView. Each samples file contains two parts: vocab and samples.

Vocab

The vocabulary maps target-language lemmas to integer indices. It is stored as an array of variable-length strings.

Samples

Samples are stored for each target-language lemma + POS tag pair that occurs as a translation candidate in an ambiguous entry in the translation dictionary. E.g. for the lemma beam in de-en_samples.hdf5:

/samples/beam/n
/samples/beam/v

Samples are stored in sparse matrix format with

  • ij, a two-dimensional array containing the row (i) and column (j) indices, where each row is a sample context and each column is a vocabulary index, and
  • data, a one-dimensional array containing the counts of context terms.

For example,

/samples/beam/n/ij = [[0,0,...],[9689,3859,...]]
/samples/beam/n/data = [1,1,...]

means that in the first sample context for the noun beam the vocabulary term with index 9689 (= tractor) occurred once, the term with index 3859 (= grip) occurred once, and so on for each term in each sample context.
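
The decoding described above can be sketched in plain Python. This is only an illustration, not the code used by the scripts in this repository: the function name decode_samples is hypothetical, the vocabulary is a toy dict rather than the full vocab array, and index 17 (= project) is made up for the example. Only the indices 9689 and 3859 come from the text above.

```python
from collections import defaultdict


def decode_samples(ij, data, vocab):
    """Group sparse (row, column, count) triples into one term->count dict per sample.

    ij    -- pair of equal-length sequences: row indices and column indices
    data  -- counts, aligned with the entries of ij
    vocab -- mapping from column index to vocabulary term
    """
    rows, cols = ij
    samples = defaultdict(dict)
    for i, j, count in zip(rows, cols, data):
        term = vocab[j]
        # Add up counts in case the same (row, column) pair occurs more than once.
        samples[i][term] = samples[i].get(term, 0) + count
    return dict(samples)


# Toy data: 9689 and 3859 are taken from the example above; 17 is hypothetical.
vocab = {9689: "tractor", 3859: "grip", 17: "project"}
ij = [[0, 0, 1], [9689, 3859, 17]]
data = [1, 1, 1]

decoded = decode_samples(ij, data, vocab)
for i in sorted(decoded):
    terms = ", ".join(f"{t}:{c}" for t, c in sorted(decoded[i].items()))
    # Assumption: samples are displayed with 1-based numbers (row index + 1).
    print(f"{i + 1:<9}: {terms}")
```

With the real file, ij, data and the vocabulary would be read from the HDF5 datasets instead of being written out by hand.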

print_samp.py

The print_samp.py script can be used to inspect samples for certain lemma + POS combinations:

print_samp.py -l beam -p n de-en_samples_filtered.hdf5 | head
==============================================================================
beam/n
==============================================================================
1        : grip:1, main:1, rock:1, screen:1, ship:1, tractor:1
2        : cause:1, destructively:1, difference:1, half:1, interfere:1, interferometer:1, introduce:1, mirror:2, move:1, one:1, optical:1, path:1, range:1, reflect:1, scan:1, time:1, use:1, various:1, wavelength:1
3        : assign:1, beam:3, chamber:1, component:1, enclose:1, focus:1, incident:1, monochromator:1, obscure:1, onto:2, optical:1, overall:1, reflect:1, require:1, space:1, structure:1, support:1, tilt:1, vacuum:1, within:2, without:1
4        : project:1
5        : accurately:1, automate:1, basically:1, chapter:1, common:1, describe:1, easily:1, exception:1, factory:1, fiber:1, ion:1, laser:1, method:1, solid:1, state:1, technique:1, type:1, use:1, variation:1, various:1, visible:1
6        : campus:1, collide:1, department:1, experiment:1, facility:1, faculty:1, focus:1, join:1, locate:1, physics:1, primarily:1, research:1
7        : asunder:1, bright:1, harshly:1, heart:1, heaven:1, meet:1, mercy:1, prayer:1, reach:1, rive:1, store:1, unite:1, us:1
.
.
.

print_freq.py

The print_freq.py script shows the frequencies of the context terms for particular lemma + POS combinations:

print_freq.py -l beam -p n de-en_samples_filtered.hdf5 | head -50
==============================================================================
beam/n
==============================================================================
          2158.0      1.54641022%     beam
          1459.0      1.04551090%     laser
          1365.0      0.97815104%     light
          1277.0      0.91509076%     use
          1038.0      0.74382475%     will
           841.0      0.60265570%     can
           840.0      0.60193910%     one
           637.0      0.45647049%     two
           612.0      0.43855563%     power
           520.0      0.37262897%     see
           510.0      0.36546303%     make
           475.0      0.34038223%     take
           470.0      0.33679926%     electron
           466.0      0.33393288%     time
           449.0      0.32175078%     energy
           426.0      0.30526912%     high
           414.0      0.29666999%     system
           411.0      0.29452020%     mirror
           405.0      0.29022064%     length
           405.0      0.29022064%     like
           390.0      0.27947173%     eye
           382.0      0.27373897%     focus
           378.0      0.27087260%     say
           374.0      0.26800622%     also
           344.0      0.24650839%     get
           327.0      0.23432629%     produce
           323.0      0.23145992%     point
           306.0      0.21927782%     small
           300.0      0.21497825%     may
           300.0      0.21497825%     first
           290.0      0.20781231%     output
           289.0      0.20709572%     angle
           286.0      0.20494593%     speed
           282.0      0.20207956%     PM
           279.0      0.19992977%     single
           278.0      0.19921318%     line
           276.0      0.19777999%     begin
           270.0      0.19348043%     ton
           270.0      0.19348043%     position
           269.0      0.19276383%     target
           265.0      0.18989746%     large
           264.0      0.18918086%     go
           263.0      0.18846427%     reflect
           261.0      0.18703108%     just
           258.0      0.18488130%     foot
           256.0      0.18344811%     back
           253.0      0.18129833%     show
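
A frequency listing of this kind could be computed by summing the counts of each context term over all samples of a lemma. The sketch below is a guess at the idea, not the actual print_freq.py code: the function name term_frequencies is hypothetical, the three samples are toy data loosely abridged from the print_samp.py output, and it assumes that the percentage column is a term's share of the total count over all terms.

```python
from collections import Counter


def term_frequencies(samples):
    """Sum term counts over all samples and return rows sorted by count.

    Each row is (term, total_count, percentage), where percentage is
    assumed to be the term's share of the grand total of all counts.
    """
    totals = Counter()
    for sample in samples:
        totals.update(sample)  # adds the counts of this sample to the totals
    grand_total = sum(totals.values())
    return [
        (term, count, 100.0 * count / grand_total)
        for term, count in totals.most_common()
    ]


# Toy samples (term -> count), not real data from the samples file.
samples = [
    {"grip": 1, "tractor": 1},
    {"mirror": 2, "reflect": 1},
    {"beam": 3, "reflect": 1, "focus": 1},
]

for term, count, pct in term_frequencies(samples):
    print(f"{count:>16.1f} {pct:>15.8f}%     {term}")
```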