Mansi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-mns

mns meeting

Agenda

Status

fst

Fst went slightly up after last night work on verbs, from 0.95259 to 0.95267.

Results from the suggestion mechanism, on the other hand, as gone down.

             typos      Avrg pos        % missp        % missp
             .txt       for corr        in 1st         in top-5     
-----------------------------------------------------------------
240405:       395        1.15          59.80          63.10          
240410:       473        1.08          54.59          57.72        
240411:       473        1.08          53.42          56.41        

The difference between the first two is probably due to the added typos, but the second difference is harder to explain. In severeal instances, the speller does not behave in a predictable manner, which makes improvment difficult.

Now it turns ot there is a bug in the speller building mechanism.

Trond will follow this up.

Csilla

has been working on typos and on Jack’s words. Thereafter the missing list:

test/data/Luima_Seripos_2013-2017.missing.freq.240208

Extract:

     1   219 а̄гитпыгыт  typo а̄гит-пыгыт
     2   209 арыгтем    should be analysed as арыгкем
     3   160 ловиньтэлы̄н        should be analysed as ловиньтаӈке+V+Act+Imprt+ScPl2+OcSg
     4   145 намаим     should be analysed as намаяӈкве+V+Ger
     5   129 матыра̄ти   typo as матыр-а̄ти
     6   125 млн        added to abbreviations
     7   117 ханищтахтан        should be analysed as ханищтахтуӈкве+V+PrsPrc
     8   116 ла̄вме      should be analysed as ла̄вуӈкве+V+PrfPrc+PxSg3
     9   107 а̄стламе    а̄стлаӈкве+V+PrfPrc+PxSg3
    10   100 а̄гирищитпыгрищит   typo а̄гирищит-пыгрищит
 ...
   243    17 иӈыт       added to adverbs
   244    17 дума       added to foreign
   245    17 Евра       added to proper nouns
   246    16 уральтахтуӈкве     added to verbs
   247    16 та̄нкина̄нылн        та̄нкина̄ныл+Lat (reflexive pronoun) 
   248    16 потыртаманыл       потыртаӈкве+V+PrfPrc+PxPl3
   249    16 Ня̄врамытын Ня̄врам+Pl+Lat
   250    16 ӯргалыяныл ӯргалаӈкве+V+Act+Ind+Prs+ScPl3+OcSg
   251    16 хусапсовыт 

If the lines are marked consistently, we may extract different tasks from the file:

grep "should be" test/data/Luima_Seripos_2013-2017.missing.freq.240208 |wc -l
      95
grep "added" test/data/Luima_Seripos_2013-2017.missing.freq.240208 |wc -l
      75
...

Work forward, for Csilla:

Jack

Jack has been adding derivational endings to -уӈкве verb types from Csilla’s work in docs/. Also, he has been working on gospel texts, there are deviances compared to LS, partly due to the orthpgraphy reform.

Gospel of Matt:

total             13426
unique tokens      3125
unrecognized tokens 704

Coverage: 1-(704/13426) = .94756. This is a new domain, and the coverage result is unexpectedly good.

Trond

Has been testing + trying to improve the linguistic basis for the suggestion mechanism.

Speller

Typos and speller suggestions

Csilla will continue adding typos, but systematic evaluation must await debugging of the speller (by Divvun), see above.

Plan for speller release

Important when awaiting testing:

Suggestion for Csilla: Copy the textbook into the online speller (after Trond has updated it) and get a feeling of how it works + report back on this feeling.

Grammar

аӈкве and уӈкве verbs

Most derived уӈкве forms were lexicalised already. There were not that many аӈкве вербс

Csilla has literature on this, from Rombandeeva. She is not too systematic, but has very many forms (for the western grammarians, the situation is the opposite).

More texts

LS

We should start with LS after 2017.

Book

Trond will discuss inclusion of books with Børre, and return to the issue.

Religious texts

So far Gospel of Mark.

Next meeting

April 22nd, 1300 Norwegian time.