SemFi, SemUr (legacy stuff)
What are SemFi and SemUr
SemFi is a database of Finnish words and the syntactic relations between them, including the strength of each relation. SemUr is a collection of automatically translated versions of SemFi for other Uralic languages.
Downloading the models
On the command line:
python -m uralicNLP.download --languages fin --semfi
Use the following script to download the semantic databases in Python:
from uralicNLP import semfi
semfi.download("fin")
Use semfi.supported_languages() to list the supported languages.
Queries
Look a word up
You can look up the information SemFi stores about a word by giving its lemma and part of speech (POS).
semfi.get_word("kissa","N", "fin")
>> {'word': u'kissa', 'compund': 0, 'pos': u'N', 'frequency': 23214, 'relative_frequency': 0.000172062683057, 'id': u'kissa_N'}
You can also list all homonyms of a lemma by leaving out the POS.
semfi.get_words("kuusi", "fin")
>> [{'word': u'kuusi', 'compund': 0, 'pos': u'N', 'frequency': 3823, 'relative_frequency': 2.83361608221e-05, 'id': u'kuusi_N'}, {'word': u'kuusi', 'compund': 0, 'pos': u'Num', 'frequency': 19897, 'relative_frequency': 0.000147477005461, 'id': u'kuusi_Num'}]
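Since the homonym lookup returns one dictionary per reading, choosing between readings is ordinary list processing. The following is a minimal sketch that uses the output above as literal data, so it runs without the database; `most_frequent_reading` is a hypothetical helper written for this illustration, not part of uralicNLP.

```python
# Literal copy of the semfi.get_words("kuusi", "fin") output shown above.
homonyms = [
    {"word": "kuusi", "compund": 0, "pos": "N", "frequency": 3823,
     "relative_frequency": 2.83361608221e-05, "id": "kuusi_N"},
    {"word": "kuusi", "compund": 0, "pos": "Num", "frequency": 19897,
     "relative_frequency": 0.000147477005461, "id": "kuusi_Num"},
]


def most_frequent_reading(entries):
    """Hypothetical helper: return the entry with the highest frequency, or None."""
    return max(entries, key=lambda e: e["frequency"], default=None)


best = most_frequent_reading(homonyms)
print(best["id"], best["pos"])  # prints: kuusi_Num Num
```

Picking the most frequent reading is only a rough default; when the context makes the POS clear, passing it to semfi.get_word is the more direct route.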
Find related words
word = semfi.get_word("näätä","N", "fin")
semfi.get_all_relations(word, "fin", sort=True) #lists all related words
>> [{'zscore': 6.84208734905, 'frequency': 9, 'relation': u'ROOT', 'word2': {'word': u'olla', 'compund': 0, 'pos': u'V', 'frequency': 5301968, 'relative_frequency': 0.0392983044525, 'id': u'olla_V'}, 'relative_frequency': 0.1125, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}]
semfi.get_by_relation(word, "dobj", "fin", sort=True) #lists words with a given syntactic relation
>> [{'zscore': 0, 'frequency': 1, 'relation': u'dobj', 'word2': {'word': u'tai', 'compund': 0, 'pos': u'C', 'frequency': 783, 'relative_frequency': 5.80361337268e-06, 'id': u'tai_C'}, 'relative_frequency': 1, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]
word2 = semfi.get_word("syödä","V", "fin")
semfi.get_by_word(word, word2, "fin")
>> [{'zscore': 1.48741029327, 'frequency': 3, 'relation': u'ROOT', 'word2': {'word': u'syödä', 'compund': 0, 'pos': u'V', 'frequency': 128242, 'relative_frequency': 0.000950532549347, 'id': u'syödä_V'}, 'relative_frequency': 0.0375, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]
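SemFi provides several methods for finding related words: get_all_relations lists every word related to the given word, get_by_relation restricts the results to one syntactic relation, and get_by_word returns the relations that hold between two given words. Pass sort=True to sort the results by frequency.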
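Each relation entry in the outputs above carries a `zscore`, a `relation` label and the related word under `word2`, so results can be filtered and grouped after the query. The sketch below assumes that entry shape and uses the first entry of each output above, trimmed to the fields it needs, as literal data. `strongest_by_relation` and the `min_zscore` threshold of 1.0 are illustrative assumptions, not part of uralicNLP.

```python
from collections import defaultdict

# First entry of each output shown above, trimmed to the fields used here.
relations = [
    {"zscore": 6.84208734905, "frequency": 9, "relation": "ROOT",
     "word2": {"word": "olla", "pos": "V", "id": "olla_V"}},
    {"zscore": 0, "frequency": 1, "relation": "dobj",
     "word2": {"word": "tai", "pos": "C", "id": "tai_C"}},
    {"zscore": 1.48741029327, "frequency": 3, "relation": "ROOT",
     "word2": {"word": "syödä", "pos": "V", "id": "syödä_V"}},
]


def strongest_by_relation(entries, min_zscore=1.0):
    """Hypothetical helper: group related-word ids by relation, highest z-score first.

    Entries whose z-score is below min_zscore are dropped.
    """
    grouped = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e["zscore"], reverse=True):
        if entry["zscore"] >= min_zscore:
            grouped[entry["relation"]].append(entry["word2"]["id"])
    return dict(grouped)


print(strongest_by_relation(relations))
# prints: {'ROOT': ['olla_V', 'syödä_V']}
```

With real query results, the same function could be applied directly to the list returned by semfi.get_all_relations, provided the entries have the shape shown in the outputs above.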
Cite
If you use SemFi or SemUr, please cite the following publication:
Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15)