Mansi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-mns

mns meeting

Agenda:

Does our working environment work after the infra reorganisation?

Evidently, yes.

how are things with the lexicon development? (coverage etc)

There are several unknown naes.

See Missing wordforms below.

the grammar issues

Verb prefixes

These are problems in the participle forms.

Case inflected participles

Jack and Csilla will have a look. We look at these before the prefixes.

speller suggesting mechanism

Status:

Letters:
а        а̄        1
е        е̄        1
о        о̄        1
и        ӣ        1
у        ӯ        1
я        я̄        1
э        э̄        1
ы        ы̄        1
ю         ю̄        1
н        ӈ         4

strings:
сь:щ        1
ти:ки        2
ки:ти        2
тӣ:кӣ        2
те:ке        2
те̄:ке̄        2
рр:р        2
р:рр        2
я:ья

Final strings
на:н        1
ӯв:ув

plans ahead

Meeting with FM

Report to follow.

Beta speller release?

We have the speller online:

https://divvun.org/proofing/online-speller.html

TODO:

We will evaluate where we stand at the next meeting.

next meeting

Feb oth 1300 Norw time. Todolist as above.

Missing wordforms

Command:

cat test/data/Luima_Seripos_2013-2017.txt | hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | preprocess --corr=test/data/typos.txt| hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | grep " ?"|cut -d'"' -f2 > misc/luima_seripos.missing.DATE

cat misc/luima_seripos.missing.DATE |sort|uniq -c|sort -nr > misc/luima_seripos.missing.DATE.freq

Many of the missing verbs are OC forms. .. perhaps short/long in the suffix

 160 ловиньтэлы̄н
 145 намаим past participle (wrong suff?)
 126 млн -- abbreviation
 117 ханищтахтан present participle ~ ханищтахтын
 116 ла̄вме add Px to past participle ptc. ла̄вым - ptc.PxSg3 лавме (missing in paradigm)
 103 таса̄вит
  93 о̄лме, jf. ла̄вме
  92 ханищтыяныл
  88 ла̄внэ̄тэ   present participle + PxSg3
  86 хо̄тпатын
  85 ханищтыянэ
  83 во̄раим = same case as намаим 
  83 насати -- missing in lexicon? or ти/ки final sequence
  82 янытлылӯв
  79 а̄гмыӈмосыӈ
  79 маттем
  74 финноугорский -- hyphen missing
  74 народов
  73 финно-угор  missing?
  73 млрд milliard -- abbreviation
  70 ханищтахтам  -- past participle ~ханищтахтым
  70 янытлыянӯв
  67 ялмум = same as ла̄вме, except this is Sg1
  66 по
  63 ня̄врамытын
  61 ма̄ньполь
  60 на
  59 рӯпитанэ̄тэ  present participle + PxSg3
  59 ла̄внэ̄ныл
  58 ща̄нягеа̄щаге
  58 манаса̄вит
  57 о̄лнэ̄ныл
  57 вуйвес
  55 май
  53 субсидияолныл
  53 о̄ньщис
  52 ёмщаквег
  51 о̄лнэ̄тэ   present participle + PxSg3
  51 пӯмщиг
  50 ляпатем

 115 Маа
  91 Ханты-Мансийскат
  50 России
  49 Севера
  49 За
  39 Ма̄ньполь typo
  38 Титыт typo
  35 А̄сугорский typ
  35 Марий (mari el)
  32 Э̄кваго̄йкаг 
  32 Акватэ "one" + PxSg3
  32 А̄щум
  31 Эл = mari el
  30 Тыима̄гыс Тыи-ма̄гыс
  29 Яты
  29 Ята
  28 Во̄нтыракве diminutive suff
  28 Новости russ
  26 Яныгпо̄ль january, should be ok
  26 Спасти film festival name
  23 А̄гитпыгыт
  23 Моще̄ртын
  23 О̄йттур
  23 День
  21 Финноугорский
  20 Ща̄нягеа̄щаге
  20 Со̄выркве
  20 Яа