Mansi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources


mns meeting

Agenda:

status

Composed у + macron

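Some words were not analysed because the long u in the text is not encoded the same way as the long u our analyser expects (a у combined with a separate macron versus a single ӯ character). There are 4000 instances of this variant long u.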

  HINTRSCT .generated/generator-raw-gt-desc.tmp1.hfst
/usr/local/bin/hfst-compose-intersect: warning: 
Found output multi-char symbols ("ӯ") in 
transducer in file <stdin> which are not found on the
input tapes of transducers in file morphology/.generated/phonology.rev.hfst.
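The mismatch between the two long u encodings can be illustrated with Unicode normalisation. This is a minimal sketch only: the helper names are hypothetical, and the notes do not say which of the two encodings the FST alphabet uses, so both directions are shown.

```python
import unicodedata

# Two encodings of "long u" that look the same on screen.
DECOMPOSED = "\u0443\u0304"   # у + COMBINING MACRON (two code points)
PRECOMPOSED = "\u04ef"        # ӯ, CYRILLIC SMALL LETTER U WITH MACRON (one code point)


def count_decomposed_long_u(text: str) -> int:
    """Count у/У immediately followed by a combining macron."""
    return text.count("\u0443\u0304") + text.count("\u0423\u0304")


def to_precomposed(text: str) -> str:
    """NFC: combine base letter + macron wherever a single code point exists."""
    return unicodedata.normalize("NFC", text)


def to_decomposed(text: str) -> str:
    """NFD: split precomposed letters into base letter + combining mark."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    word = "Л" + DECOMPOSED + "има"
    print(count_decomposed_long_u(word))                        # decomposed instances
    print(to_precomposed(word) == "Л" + PRECOMPOSED + "има")    # same word, other encoding
```

Note that NFC and NFD affect every letter in the text, not only long u (for example, и + macron also has a precomposed form), so whichever direction is chosen should match the alphabet declared in the lexicon and twolc files.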

FST

Lemmas have been added. Jack has looked at the twolc files (work underway).

Lexicons

There are many doublets. Jack will fix them:

1. totally identical entries: unify
2. identical lemma: unify with +v1, +v2
3. thereafter, Csilla to go through and relegate erroneous +v2s to +Err/Orth
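Steps 1 and 2 above could be sketched roughly as follows. This is an illustration under assumptions, not Jack's actual procedure: the function name is hypothetical, entries are treated as plain lexc lines of the form `upper:lower CONTLEX ;`, "identical lemma" is taken to mean an identical upper side, and the +vN tag is assumed to go directly after the lemma, before the first existing tag. The sample entries in the usage part are invented placeholders, not Mansi data.

```python
def unify_doublets(lines):
    """Collapse identical lexc entries and number the remaining doublets.

    Step 1: totally identical lines are kept only once.
    Step 2: entries sharing the same upper side get +v1, +v2, ... inserted
            after the lemma (assumed placement).
    """
    # Step 1: remove exact duplicates, keeping first-seen order.
    unique = list(dict.fromkeys(l.strip() for l in lines if l.strip()))

    # Group the surviving entries by their upper side (text before the first ':').
    groups = {}
    for entry in unique:
        upper = entry.split()[0].split(":", 1)[0]
        groups.setdefault(upper, []).append(entry)

    # Step 2: number the members of every group that still has several entries.
    result = []
    for upper, entries in groups.items():
        if len(entries) == 1:
            result.append(entries[0])
            continue
        lemma, plus, tags = upper.partition("+")
        for n, entry in enumerate(entries, start=1):
            numbered_upper = f"{lemma}+v{n}{plus}{tags}"
            result.append(numbered_upper + entry[len(upper):])
    return result


if __name__ == "__main__":
    sample = [
        "foo+N:foo N_A ;",
        "foo+N:foo N_A ;",    # exact duplicate, dropped
        "foo+N:fooh N_B ;",   # same lemma, different stem: becomes +v2
        "bar+V:bar V_U ;",
    ]
    for line in unify_doublets(sample):
        print(line)
```

Step 3 (relegating erroneous +v2 variants to +Err/Orth) needs a linguistic decision per entry and is left as manual work.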

Entry format

@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплести /свить//сплести/" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплетать /сплетать/" ;

==>
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплести /свить//сплести/ ~ 11_odd_переплетать /сплетать" ;

@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "сплетать" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "свить" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U ; ! сплетать

@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "переплести /свить//сплести/ переплетать /сплетать" ; !11_odd 
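The first variant above (one entry whose glosses are joined with ` ~ `) can be produced mechanically. The following is a minimal sketch, assuming each entry is a single line ending in a double-quoted gloss and `;`; the function name and the regular expression are illustrative and not part of the existing infrastructure.

```python
import re

# Everything before the quoted gloss is the "head" (upper:lower + continuation lexicon).
ENTRY = re.compile(r'^(?P<head>.*?)\s+"(?P<gloss>[^"]*)"\s*;\s*$')


def merge_glosses(lines, sep=" ~ "):
    """Merge entries that differ only in their gloss into one entry."""
    merged = {}  # key -> list of glosses, or None for lines kept verbatim
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            # No quoted gloss found: keep the line unchanged.
            merged.setdefault(("raw", line), None)
            continue
        glosses = merged.setdefault(("entry", match.group("head")), [])
        if match.group("gloss") not in glosses:
            glosses.append(match.group("gloss"))

    out = []
    for (kind, text), glosses in merged.items():
        if kind == "raw":
            out.append(text)
        else:
            out.append(f'{text} "{sep.join(glosses)}" ;')
    return out


if __name__ == "__main__":
    entries = [
        '@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплести /свить//сплести/" ;',
        '@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплетать /сплетать/" ;',
    ]
    print("\n".join(merge_glosses(entries)))
```

Entries without a quoted gloss, or with the gloss in a `!` comment as in some of the variants above, pass through unchanged in this sketch.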

To compare, Russian:

прошибать:прошиба́ нсв_1a "weight: 1.6053081314452122" ;
сшибать:сшиба́ нсв_1a "weight: 1.6722549210758255" ;
расшибать:расшиба́ нсв_1a "weight: 1.3042781357812314" ;

Inari Sámi:

tekstâviestâ+Sem/Dummytag:tekstâ#viestâ 2ALGA "tekstiviesti" ; !
aldâkkâs+Sem/Dummytag:aldâkkâss 4KUNAGAS "salama" ;     !¢
klassikko+N+Pl+Nom+Err/Orth+Sem/Dummytag:klassikoh ENDLEX "klassikko" ; ! ij koolgâ suuijâđ klassikko gen. klassiko (PM, MLO)

Test results

coverage etc

We are now over 95 % coverage:

cat test/data/Luima_Seripos_2013-2017.txt | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | grep '" ?' | rev | csort | rev | uniq -c | wc

15190 48527 436315

cat test/data/Luima_Seripos_2013-2017.txt | \
hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | \
preprocess --corr=test/data/typos.txt | \
hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | \
grep " ?" | cut -d'"' -f2 | wc -l

33786

This is embarrassingly impressive, given that we still have only 3700 or so nouns.
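For reference, a coverage figure can be derived from the tokeniser output roughly as follows. This is a sketch under assumptions: it expects CG-style output in which each token starts a cohort line of the form `"<wordform>"` and unknown words get a reading line containing `" ?` (the pattern grepped for above). The function name is hypothetical and the sample tokens are placeholders.

```python
def coverage(cg_lines):
    """Return (tokens, unknown, coverage) for CG-formatted tokeniser output.

    tokens   = number of cohort lines, i.e. lines starting with "<
    unknown  = number of reading lines containing the unknown marker '" ?'
    coverage = share of tokens that received an analysis
    """
    tokens = 0
    unknown = 0
    for line in cg_lines:
        if line.startswith('"<'):
            tokens += 1
        elif '" ?' in line:
            unknown += 1
    share = 1 - unknown / tokens if tokens else 0.0
    return tokens, unknown, share


if __name__ == "__main__":
    sample = [
        '"<foo>"',
        '\t"foo" N Sg Nom',
        '"<xyzzy>"',
        '\t"xyzzy" ?',
    ]
    tokens, unknown, share = coverage(sample)
    print(f"{tokens} tokens, {unknown} unknown, coverage {share:.1%}")
```

In practice the lines would come from the output of `hfst-tokenise -cg`, for instance by iterating over `sys.stdin`.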

Yaml tests

Test results

We are slightly down:

TODO for next meeting: Look at what has happened. Trond will have a look and report.

Probable explanation: Csilla has removed forms from the yaml lists, and these removals have not yet been reflected in the FST. The two should be kept in sync.

Work on past participles

This was what we did not know how to address last week.

Poem, heritage speaker:

Ща̄ньтем э̄ргум э̄рге ам та э̄ргалтамлум.
Ща̄нь-те-м э̄рг-ум э̄рг-е ам та э̄ргалт-ам-лум.
mother-dim-PxSg1 sing-PrfPrc song-PxSg3 Sg.1 thus sing-PrfPrc-PxSg1.OcSg
Here, I was singing the song sung by my mom.

Uneducated native speaker. We are a bit unsure here.

Минаме̄н мус акв а̄гитэ колэ̄н масхатуӈкв ха̄йтталас, ма̄нь писале матрыг ёт а̄лмаямтэ.
Мин-ам-е̄н мус акв а̄ги-тэ кол-э̄-н масхатуӈкв ха̄йттал-ас, ма̄нь писал-е матрыг ёт а̄лмая-м-тэ.
go-PrfPrc-? until one girl-PxSg3(Det) house-PxSg3(Det)-Lat run-Prt.ScSg3 small gun-PxSg3(Det) for.some.reason with pick.up-PrfPrc-PxSg3.OcSg
Before they left, one of the girls ran into the house to grab the small gun with her, for some reason.

Rombandeeva’s niece T.D. Slinkina:

Ты э̄лыпа̄лт нэ̄глум «Лӯима̄ сэ̄рипос» газетат (№22) Людмила Алгадьева хансум статьятэ̄т («Ма̄ньщи газетав янытлаӈкве са̄в хо̄тпа ёхталас») тав ам потрум атнув торгамтамтэ.
Ты э̄лы-па̄л-т нэ̄гл-ум «Лӯима̄ сэ̄рипос» газета-т (№22) Людмила Алгадьева ханс-ум статья-тэ̄-т («Ма̄ньщи газета-в янытлаӈкве са̄в хо̄тпа ёхтал-ас») тав ам потр-ум ат-нув торгамт-ам-тэ.
this before.Loc precede-PrfPrc Luima Seripos newspaper-Loc L. A. write-PrfPrc article-PxSg3-Loc (Mansi newspaper-PxPl1 praise many person arrive-Prt.ScSg3) sg.3 sg.1 speech-PxSg1 no-? understand-PrfPrc-PxSg3.OcSg
In her article ("Many people came to celebrate the jubilee of our Mansi newspaper") in the previous issue of Luima Seripos, Lyudmila Algadeva didn't quite get what I was saying.

Csilla: Present participles have Px suffixes, past participles have Vx ones.


рӯпитаӈкве+V+PrsPrc+PxSg3: рӯпитанэ̄тэ = zero object
рӯпитаӈкве+V+PrsPrc+PxSg3: рӯпитанэ̄тэ = single object

рӯпитаӈкве+V+PrfPrc+PxSg3: рӯпитаме  = zero object (+PxSg3)
рӯпитаӈкве+V+PrfPrc+PxSg3: рӯпитамтэ = single object +PxSg3+... (+Sg/+PdSg/+OcSg/+OxSg)

Four alternatives for tagging:
рӯпитаӈкве+V+PrfPrc+Sg+PxSg3: рӯпитамтэ = it is a singular form, right? <==
рӯпитаӈкве+V+PrfPrc+PdSg+PxSg3: рӯпитамтэ = Possessed
рӯпитаӈкве+V+PrfPrc+OcSg+PxSg3: рӯпитамтэ = Modeled like the finites
рӯпитаӈкве+V+PrfPrc+OxSg+PxSg3: рӯпитамтэ = Ox matches Px, 

We tentatively (!) go for the first of the gang of four.

Trond will change the yaml files; Jaska will synchronise the FSTs :-)

Missing wordforms

Add them to the lexicon.

speller suggestion mechanism

typos

Csilla to add typos markup (errorcorrect) to the file `test/data/Luima_Seripos_2013-2017.missing.freq.240208`

Trond to set up testing infrastructure and test.

plans ahead

Lexicon

Continue the Add Lemmas To Lexicon Project

     681 abbr
    2197 Adjectives
     271 adverbs
      21 con
      17 inter
     364 mns-propernouns
    3698 Nouns  <======== quite bad
      80 Numerals
       2 Participles
      91 Postpositions
     174 Pronouns
    4180 Verbs  <======== not bad
   11776 All told

We aim at 20000 or so. This is for Csilla and Jack. Starting point: Jack’s reverse list.
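As a quick check of the figures above, the per-class counts can be tallied and compared with the target. The counts are copied from the list; the target of 20000 is the approximate goal stated in the notes, and the variable names are only illustrative.

```python
# Lemma counts per lexicon file, as listed in the notes.
lemma_counts = {
    "abbr": 681,
    "Adjectives": 2197,
    "adverbs": 271,
    "con": 21,
    "inter": 17,
    "mns-propernouns": 364,
    "Nouns": 3698,
    "Numerals": 80,
    "Participles": 2,
    "Postpositions": 91,
    "Pronouns": 174,
    "Verbs": 4180,
}

TARGET = 20000  # "20000 or so", an approximate goal

total = sum(lemma_counts.values())
remaining = TARGET - total

if __name__ == "__main__":
    print(f"total lemmas: {total}")
    print(f"still to add for the target: {remaining}")
```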

Solitary personal pronouns

     ам+Pron+Pers+Sg1+Nom: [амкем, амккем, амке̄мт, амкке̄мт] # dialect variation
     ам+Pron+Pers+Du1+Nom: [ме̄нккеме̄н, ме̄нкеме̄н, ме̄нкемнт] # dialect variation

This means “I alone, you alone, …”.

Csilla: Reverse order

Speller suggestion

We have a plan (above)

The big picture is that we are coming closer to a working beta spellchecker. Goal: A beta version by summer.

next meeting

Friday, Feb. 23rd at 1400 Finnish time.