mns meeting
- Time: Jan 30th 2024
- Present: Csilla, Jack, Trond
Agenda:
- Does our working environment work after the infra reorganisation?
- how are things with the lexicon development? (coverage etc)
- the grammar issues
- speller suggesting mechanism
- plans ahead
- next meeting
- Missing wordforms
Does our working environment work after the infra reorganisation?
Evidently, yes.
- We are now down to some verbs not having infinitives recognised, the rest ok.
how are things with the lexicon development? (coverage etc)
- Today’s Luima seripos result: 240130: 1-(37942/708153) = 0.946
There are several unknown naes.
- 91 Ханты-Мансийскат is an exception, а and not ы before -т.
- Russian case forms: 50 России, 49 Севера
See Missing wordforms below.
the grammar issues
Verb prefixes
These are problems in the participle forms.
Case inflected participles
Jack and Csilla will have a look. We look at these before the prefixes.
speller suggesting mechanism
Status:
- tools/spellcheckers/editdist.default.txt
- tools/spellcheckers/strings.default.txt
- tools/spellcheckers/final_strings.default.txt
Letters:
а а̄ 1
е е̄ 1
о о̄ 1
и ӣ 1
у ӯ 1
я я̄ 1
э э̄ 1
ы ы̄ 1
ю ю̄ 1
н ӈ 4
strings:
сь:щ 1
ти:ки 2
ки:ти 2
тӣ:кӣ 2
те:ке 2
те̄:ке̄ 2
рр:р 2
р:рр 2
я:ья
Final strings
на:н 1
ӯв:ув
- To next meeting: Trond to find a working testbench for us to work with
- Csilla could add to
test/data/typos.txt: *конякконьяк*
plans ahead
Meeting with FM
Report to follow.
Beta speller release?
We have the speller online:
https://divvun.org/proofing/online-speller.html
TODO:
- suggestion mechanism (Trond: Fix the testbench)
- abbreviations
- coverage: grammar, names, general vocabulary
- add a speller corpus (Trond)
We will evaluate where we stand at the next meeting.
next meeting
Feb oth 1300 Norw time. Todolist as above.
Missing wordforms
Command:
cat test/data/Luima_Seripos_2013-2017.txt | hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | preprocess --corr=test/data/typos.txt| hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | grep " ?"|cut -d'"' -f2 > misc/luima_seripos.missing.DATE
cat misc/luima_seripos.missing.DATE |sort|uniq -c|sort -nr > misc/luima_seripos.missing.DATE.freq
Many of the missing verbs are OC forms. .. perhaps short/long in the suffix
160 ловиньтэлы̄н
145 намаим past participle (wrong suff?)
126 млн -- abbreviation
117 ханищтахтан present participle ~ ханищтахтын
116 ла̄вме add Px to past participle ptc. ла̄вым - ptc.PxSg3 лавме (missing in paradigm)
103 таса̄вит
93 о̄лме, jf. ла̄вме
92 ханищтыяныл
88 ла̄внэ̄тэ present participle + PxSg3
86 хо̄тпатын
85 ханищтыянэ
83 во̄раим = same case as намаим
83 насати -- missing in lexicon? or ти/ки final sequence
82 янытлылӯв
79 а̄гмыӈмосыӈ
79 маттем
74 финноугорский -- hyphen missing
74 народов
73 финно-угор missing?
73 млрд milliard -- abbreviation
70 ханищтахтам -- past participle ~ханищтахтым
70 янытлыянӯв
67 ялмум = same as ла̄вме, except this is Sg1
66 по
63 ня̄врамытын
61 ма̄ньполь
60 на
59 рӯпитанэ̄тэ present participle + PxSg3
59 ла̄внэ̄ныл
58 ща̄нягеа̄щаге
58 манаса̄вит
57 о̄лнэ̄ныл
57 вуйвес
55 май
53 субсидияолныл
53 о̄ньщис
52 ёмщаквег
51 о̄лнэ̄тэ present participle + PxSg3
51 пӯмщиг
50 ляпатем
115 Маа
91 Ханты-Мансийскат
50 России
49 Севера
49 За
39 Ма̄ньполь typo
38 Титыт typo
35 А̄сугорский typ
35 Марий (mari el)
32 Э̄кваго̄йкаг
32 Акватэ "one" + PxSg3
32 А̄щум
31 Эл = mari el
30 Тыима̄гыс Тыи-ма̄гыс
29 Яты
29 Ята
28 Во̄нтыракве diminutive suff
28 Новости russ
26 Яныгпо̄ль january, should be ok
26 Спасти film festival name
23 А̄гитпыгыт
23 Моще̄ртын
23 О̄йттур
23 День
21 Финноугорский
20 Ща̄нягеа̄щаге
20 Со̄выркве
20 Яа