SemFi, SemUr (legacy stuff) - mikahama/uralicNLP GitHub Wiki

What are SemFi and SemUr

SemFi is a database of Finnish words and the syntactic relations between them, including the strength of each relation. SemUr is a collection of automatically translated versions of SemFi for other Uralic languages.

Downloading the models

On the command line:

python -m uralicNLP.download --languages fin --semfi

Use the following script to download the semantic databases in Python:

from uralicNLP import semfi
semfi.download("fin")

Use semfi.supported_languages() to list the supported languages.

Queries

Look a word up

You can look up the information SemFi stores about a word by giving its lemma and part of speech (pos).

semfi.get_word("kissa","N", "fin")
>> {'word': u'kissa', 'compund': 0, 'pos': u'N', 'frequency': 23214, 'relative_frequency': 0.000172062683057, 'id': u'kissa_N'}

You can also list homonyms without explicitly giving the pos.

semfi.get_words("kuusi", "fin")
>> [{'word': u'kuusi', 'compund': 0, 'pos': u'N', 'frequency': 3823, 'relative_frequency': 2.83361608221e-05, 'id': u'kuusi_N'}, {'word': u'kuusi', 'compund': 0, 'pos': u'Num', 'frequency': 19897, 'relative_frequency': 0.000147477005461, 'id': u'kuusi_Num'}]
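
Because get_words returns a plain list of dictionaries, picking one reading out of several homonyms is ordinary list processing. The following is a minimal sketch, assuming the entry layout shown in the output above. The helpers `pick_by_pos` and `most_frequent` are hypothetical and not part of uralicNLP, and the sample list is copied from that output so the snippet runs without the database installed.

```python
# Sample entries copied from the get_words("kuusi", "fin") output above.
# The key 'compund' is spelled as the library returns it.
entries = [
    {"word": "kuusi", "compund": 0, "pos": "N", "frequency": 3823,
     "relative_frequency": 2.83361608221e-05, "id": "kuusi_N"},
    {"word": "kuusi", "compund": 0, "pos": "Num", "frequency": 19897,
     "relative_frequency": 0.000147477005461, "id": "kuusi_Num"},
]


def pick_by_pos(entries, pos):
    """Return the first entry with the given pos tag, or None if there is none.

    Hypothetical helper for illustration, not part of uralicNLP.
    """
    for entry in entries:
        if entry["pos"] == pos:
            return entry
    return None


def most_frequent(entries):
    """Return the entry with the highest corpus frequency (hypothetical helper)."""
    return max(entries, key=lambda entry: entry["frequency"])


noun = pick_by_pos(entries, "N")
print(noun["id"])                    # the noun reading of "kuusi"
print(most_frequent(entries)["id"])  # the reading seen most often in the corpus
```

In real use, `entries` would be the return value of semfi.get_words("kuusi", "fin").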

Find related words

word = semfi.get_word("näätä","N", "fin")
semfi.get_all_relations(word, "fin", sort=True) #lists all related words
>> [{'zscore': 6.84208734905, 'frequency': 9, 'relation': u'ROOT', 'word2': {'word': u'olla', 'compund': 0, 'pos': u'V', 'frequency': 5301968, 'relative_frequency': 0.0392983044525, 'id': u'olla_V'}, 'relative_frequency': 0.1125, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}]

semfi.get_by_relation(word, "dobj", "fin", sort=True) #lists words with a given syntactic relation
>> [{'zscore': 0, 'frequency': 1, 'relation': u'dobj', 'word2': {'word': u'tai', 'compund': 0, 'pos': u'C', 'frequency': 783, 'relative_frequency': 5.80361337268e-06, 'id': u'tai_C'}, 'relative_frequency': 1, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]

word2 = semfi.get_word("syödä","V", "fin")
semfi.get_by_word(word, word2, "fin")
>> [{'zscore': 1.48741029327, 'frequency': 3, 'relation': u'ROOT', 'word2': {'word': u'syödä', 'compund': 0, 'pos': u'V', 'frequency': 128242, 'relative_frequency': 0.000950532549347, 'id': u'syödä_V'}, 'relative_frequency': 0.0375, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]

SemFi provides several methods for finding related words: you can list all relations of a word, only those with a given syntactic relation, or the relations that link it to another word. Pass sort=True to sort the results by their frequency.
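
The relation lists are also plain dictionaries, so they can be filtered further after a query. The following is a minimal sketch, assuming the keys shown in the outputs above (zscore, relation, word2). The helper `strongest_relations` is hypothetical and not part of uralicNLP, the threshold of 1.0 is an arbitrary illustration value, and the sample rows are abridged from the three outputs above.

```python
# Sample rows abridged from the query outputs above (word1 omitted for brevity).
relations = [
    {"zscore": 6.84208734905, "frequency": 9, "relation": "ROOT",
     "relative_frequency": 0.1125,
     "word2": {"word": "olla", "pos": "V", "id": "olla_V"}},
    {"zscore": 0, "frequency": 1, "relation": "dobj",
     "relative_frequency": 1,
     "word2": {"word": "tai", "pos": "C", "id": "tai_C"}},
    {"zscore": 1.48741029327, "frequency": 3, "relation": "ROOT",
     "relative_frequency": 0.0375,
     "word2": {"word": "syödä", "pos": "V", "id": "syödä_V"}},
]


def strongest_relations(relations, min_zscore=1.0):
    """Keep relations whose zscore is at least min_zscore, strongest first.

    Hypothetical helper for illustration, not part of uralicNLP.
    """
    kept = [rel for rel in relations if rel["zscore"] >= min_zscore]
    return sorted(kept, key=lambda rel: rel["zscore"], reverse=True)


for rel in strongest_relations(relations):
    print(rel["relation"], rel["word2"]["word"], rel["zscore"])
```

In real use, `relations` would be the return value of semfi.get_all_relations, semfi.get_by_relation or semfi.get_by_word.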

Cite

If you use SemFi or SemUr, cite the following publication:

Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15).