Pympi - langdoc/FRechdoc GitHub Wiki

Pympi is a Python package for working with ELAN files. It works extremely well and is smartly designed.

Example with converting a Praat TextGrid to our ELAN structure

This workflow for automatic segmentation of ELAN files has recently been circulating online:

https://blogs.soas.ac.uk/elar/2017/06/15/elanpraat-machine-segmenting/

It seems to me that the old Praat automatic segmentation may indeed be better than many alternatives, for example those built into ELAN itself. The workflow produces a Praat TextGrid file, and importing that into ELAN is a bit hazardous, as one somehow has to construct the more elaborate tier structure. So I came up with a small Pympi script, not yet very polished, which does that job fairly well. I haven't turned it into a function yet, but that would be easy: basically you would give it a Praat TextGrid and it would spit out an ELAN file with the same name but a different extension.

import pympi

# We have to define who the principal speaker is and who the others are.
# This information is needed because all segments will end up on one tier.

main_speaker = ['ZPF-F-1926']
other_speakers = ['NTP-M-1986', 'MSF-F-1968']

# Here we give the Praat file

praat_file = pympi.TextGrid(file_path="kpv_izva20160622-04-b16.TextGrid")
segment_file = praat_file.to_eaf()

elan_file = pympi.Elan.Eaf(file_path=None, author='Niko Partanen')

# Here we add the linguistic types and other things that only need to be added once

elan_file.add_linguistic_type(lingtype='refT', timealignable=True, graphicreferences=False)
elan_file.add_linguistic_type(lingtype='orthT', timealignable=False, graphicreferences=False, constraints='Symbolic_Association')
elan_file.add_linguistic_type(lingtype='wordT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
elan_file.add_linguistic_type(lingtype='posT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
elan_file.add_linguistic_type(lingtype='lemmaT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
elan_file.add_linguistic_type(lingtype='morphT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
elan_file.add_language(lang_def='http://cdb.iso.org/lg/CDB-00131321-001', lang_id='kpv', lang_label='Komi-Zyrian (kpv)')

# This function adds the tiers for one speaker

def add_speaker(elan_file, participant):
    elan_file.add_tier(tier_id='ref@' + participant, ling='refT')
    elan_file.add_tier(tier_id='orth@' + participant, ling='orthT', parent='ref@' + participant)
    elan_file.add_tier(tier_id='word@' + participant, ling='wordT', parent='orth@' + participant, language='kpv')

for participant in main_speaker + other_speakers:
    add_speaker(elan_file, participant)

# Copy the content of the tier "silences" and merge it into the main speaker's top-level tier

ref_tier = 'ref@' + main_speaker[0]
segment_file.copy_tier(elan_file, 'silences')
elan_file.merge_tiers(tiers=['silences', ref_tier], tiernew=ref_tier)

# These tiers are not needed, so we remove them

elan_file.remove_tiers(['default', 'silences'])

# Link the media file

elan_file.add_linked_file('kpv_izva20160622-04.wav')

# Here we write the file

elan_file.to_file(file_path="kpv_izva20160622-04.eaf")

The result looks like this:

Screenshot

So there is still quite a lot of work left, but maybe it is still faster than doing it manually? All segments are now on one tier, so dragging them onto the right tiers is a necessity, and if you have to listen to the whole tape to fix it anyway, have you saved any time in the end? Who knows. But this is one way to approach it. One could also run an automatic speaker diarization tool afterwards to try to sort the segments onto the right tiers. Maybe.
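
Turning the script into a function, as mentioned earlier, could look roughly like the sketch below. The names `eaf_path_for` and `textgrid_to_eaf` are made up for illustration, and only the reference tier is created to keep it short; the Pympi calls are the same ones the script above uses.

```python
from pathlib import Path


def eaf_path_for(textgrid_path):
    """Return the same path with the extension replaced by .eaf."""
    return Path(textgrid_path).with_suffix(".eaf")


def textgrid_to_eaf(textgrid_path, main_speaker, tier_name="silences"):
    """Convert one TextGrid to an ELAN file with the same base name (sketch)."""
    # Imported here so that eaf_path_for can be used without Pympi installed
    import pympi

    segment_file = pympi.TextGrid(file_path=str(textgrid_path)).to_eaf()
    elan_file = pympi.Elan.Eaf(file_path=None)
    elan_file.add_linguistic_type(lingtype="refT", timealignable=True)

    ref_tier = "ref@" + main_speaker
    elan_file.add_tier(tier_id=ref_tier, ling="refT")

    # Same copy-and-merge step as in the script above
    segment_file.copy_tier(elan_file, tier_name)
    elan_file.merge_tiers(tiers=[tier_name, ref_tier], tiernew=ref_tier)
    elan_file.remove_tiers(["default", tier_name])

    out_path = eaf_path_for(textgrid_path)
    elan_file.to_file(file_path=str(out_path))
    return out_path


print(eaf_path_for("kpv_izva20160622-04-b16.TextGrid"))
```

Calling `textgrid_to_eaf('kpv_izva20160622-04-b16.TextGrid', 'ZPF-F-1926')` would then write the ELAN file next to the input. The remaining tiers, the language and the linked media file would still have to be added as in the script.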

Use example with creating new ELAN files

If you connect something like this to a database or any machine-readable source of session names and speaker IDs, it should be easy to automate the creation of new ELAN files.

import pympi

# session_name is used directly as the output path, so it should end in .eaf
def new_elan_file(session_name='kpv_izva20140318-1skar-b.eaf', speakers=('niko', 'rogier', 'micha', 'marina')):

    elan_file = pympi.Elan.Eaf(file_path=None, author='FU-Lab')

    elan_file.add_linguistic_type(lingtype='refT', timealignable=True, graphicreferences=False)
    elan_file.add_linguistic_type(lingtype='orthT', timealignable=False, graphicreferences=False, constraints='Symbolic_Association')
    elan_file.add_linguistic_type(lingtype='wordT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
    elan_file.add_linguistic_type(lingtype='posT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
    elan_file.add_linguistic_type(lingtype='lemmaT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
    elan_file.add_linguistic_type(lingtype='morphT', timealignable=False, graphicreferences=False, constraints='Symbolic_Subdivision')
    elan_file.add_linguistic_type(lingtype='ft-rusT', timealignable=False, graphicreferences=False, constraints='Symbolic_Association')
    elan_file.add_linguistic_type(lingtype='ft-engT', timealignable=False, graphicreferences=False, constraints='Symbolic_Association')

    # The language only needs to be registered once, not once per speaker
    elan_file.add_language(lang_def='http://cdb.iso.org/lg/CDB-00131321-001', lang_id='kpv', lang_label='Komi-Zyrian (kpv)')

    for speaker in speakers:
        elan_file.add_tier(tier_id='ref@' + speaker, ling='refT')
        elan_file.add_tier(tier_id='orth@' + speaker, ling='orthT', parent='ref@' + speaker)
        elan_file.add_tier(tier_id='word@' + speaker, ling='wordT', parent='orth@' + speaker, language='kpv')

    elan_file.remove_tier(id_tier='default')

    elan_file.to_file(file_path = session_name)

new_elan_file('test1.eaf', ['s1', 's2', 's3'])
new_elan_file('test2.eaf', ['s3', 's5', 's6'])
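
As a rough sketch of the machine-readable source idea, the snippet below groups speaker IDs by session from a small CSV. The column names `session` and `speaker` and the inline example data are assumptions made for illustration; a real setup would read from a file or database instead.

```python
import csv
import io


def speakers_by_session(csv_file):
    """Group speaker ids by session name, keeping first-seen order."""
    sessions = {}
    for row in csv.DictReader(csv_file):
        speakers = sessions.setdefault(row["session"], [])
        if row["speaker"] not in speakers:
            speakers.append(row["speaker"])
    return sessions


# Hypothetical metadata; in practice this could be open("sessions.csv", newline="")
example = io.StringIO(
    "session,speaker\n"
    "test1,s1\n"
    "test1,s2\n"
    "test2,s3\n"
)

sessions = speakers_by_session(example)
for session_name, speakers in sessions.items():
    # Here one would call new_elan_file(session_name + ".eaf", speakers)
    print(session_name + ".eaf", speakers)
```

Each entry of `sessions` then maps directly onto one call of the function defined above.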

Use example with merging tiers

I had a situation where forced alignment software created an individual ELAN file for every speaker, each with one tier called 'tier1'. I wanted to merge those into one ELAN file, which I did with this:

import pympi
import glob
import os

# Find all ELAN files in the current folder
elan_files = glob.glob('./*.eaf')

# Create an empty ELAN file and add the needed linguistic type
new_eaf = pympi.Elan.Eaf()
new_eaf.add_linguistic_type(lingtype='utterance', timealignable=True)

for eaf in elan_files:
    eaf_ob = pympi.Elan.Eaf(eaf)
    # Name the copied tier after the source file, without path or extension
    current_tier = os.path.splitext(os.path.basename(eaf))[0]
    eaf_ob.copy_tier(eaf_obj=new_eaf, tier_name='tier1')
    new_eaf.rename_tier(id_from='tier1', id_to=current_tier)

new_eaf.to_file(file_path='test.eaf')

So it basically does this:

  • Find all ELAN files
  • Create an empty ELAN file
  • Add the needed linguistic type
  • For each ELAN file, take the tier called tier1, copy it to the new file, and rename it after the source file
  • Save to file
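
One detail of these steps: the merged file is written into the same folder that is searched for `.eaf` files, so a second run would most likely pick up `test.eaf` as an input as well. A small sketch of how the inputs and their tier names could be listed while skipping the output file is given here; the helper name `tier_sources` and the example file names are made up for illustration.

```python
import os


def tier_sources(paths, output_name="test.eaf"):
    """Map each input file to the tier name it will get, skipping the output file."""
    sources = {}
    for path in sorted(paths):
        if os.path.basename(path) == output_name:
            continue  # do not merge the merged file into itself
        sources[path] = os.path.splitext(os.path.basename(path))[0]
    return sources


# Hypothetical folder content after a first run of the merge script
found = ["./MONO-002-song1.eaf", "./test.eaf", "./MONO-001-song1.eaf"]
for path, tier_name in tier_sources(found).items():
    print(path, "->", tier_name)
```

The keys could replace the plain `glob` result in the loop, and the values correspond to `current_tier`.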

The merge script produced an ELAN file like this:

Merged ELAN file

This corresponds to the data in this plot:

Matched items

The merge is run as a later step, after this shell script:

# Trim each source recording (MONO-001.WAV, MONO-002.WAV, ...) to its first 284 seconds
for wav in MONO-[0-9][0-9][0-9].WAV
do
  output="${wav%.WAV}-song1.WAV"
  sox "$wav" "$output" trim 0 284
done

# Run forced alignment on each trimmed recording and write the result as an ELAN file
for wav in MONO*song1*WAV
do
  output="${wav%.WAV}.eaf"
  aeneas_execute_task \
     "$wav" \
     song1.txt \
     "task_language=ukr|os_task_file_format=eaf|is_text_type=plain" \
     "$output"
done

So in principle one just runs:

bash sox_and_aeneas.sh
python merge_eaf.py

So the first script cuts the first 284 seconds from each file and runs forced alignment on the result, and the later script accesses the folder song1, populated by the first script.
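
To make the file naming in the shell script easier to follow, here is a small sketch that traces which names the two loops would produce. The example file list is invented, and the pattern is my reading of what the script intends to match (MONO, a hyphen, three digits, then .WAV).

```python
import re

# Intended match of the first loop: MONO-001.WAV, MONO-002.WAV, ...
SOURCE_PATTERN = re.compile(r"^MONO-\d{3}\.WAV$")


def trimmed_name(wav):
    """Name of the trimmed recording written by sox."""
    return re.sub(r"\.WAV$", "-song1.WAV", wav)


def alignment_name(wav):
    """Name of the ELAN file written by the forced alignment step."""
    return re.sub(r"\.WAV$", ".eaf", wav)


# Hypothetical folder content
files = ["MONO-001.WAV", "MONO-002.WAV", "MONO-001-song1.WAV", "notes.txt"]

sources = [name for name in files if SOURCE_PATTERN.match(name)]
plan = [(name, trimmed_name(name), alignment_name(trimmed_name(name))) for name in sources]

for source, trimmed, eaf in plan:
    print(source, "->", trimmed, "->", eaf)
```

The resulting `.eaf` names are also what the merge script later uses as tier names, minus the extension.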

Please see the folder FRechdoc/forced_alignment for the most up-to-date versions of these scripts.