mns meeting
- Time: Feb 16th 2024
- Present: Csilla, Jaska, Trond
Agenda:
- status
- Test results
- Missing wordforms
- speller suggesting mechanism
- plans ahead
- next meeting
status
Composed у + macron
Words were not analysed. The long u is not our long u. There are 4000 instances of this long u.
- ӯ COMBINING MACRON
- ӯрим
HINTRSCT .generated/generator-raw-gt-desc.tmp1.hfst
/usr/local/bin/hfst-compose-intersect: warning:
Found output multi-char symbols ("ӯ") in
transducer in file <stdin> which are not found on the
input tapes of transducers in file morphology/.generated/phonology.rev.hfst.
FST
Lemmas have been added. Jack has looked at the twolc files (work underway).
Lexicons
There are many doublets. Jack will fix them:
1. totally identical entries: unify
2. identical lemma: unify with +v1, +v2, thereafter
3. Csilla to go through and relegate errouneous +v2s to +Err/Orth
Entry format
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплести /свить//сплести/" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплетать /сплетать/" ;
==>
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплести /свить//сплести/ ~ 11_odd_переплетать /сплетать" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "сплетать" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "свить" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U ; ! сплетать
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "переплести /свить//сплести/ переплетать /сплетать" ; !11_odd
To compare, Russian:
прошибать:прошиба́ нсв_1a "weight: 1.6053081314452122" ;
сшибать:сшиба́ нсв_1a "weight: 1.6722549210758255" ;
расшибать:расшиба́ нсв_1a "weight: 1.3042781357812314" ;
Inari Sámi:
tekstâviestâ+Sem/Dummytag:tekstâ#viestâ 2ALGA "tekstiviesti" ; !
aldâkkâs+Sem/Dummytag:aldâkkâss 4KUNAGAS "salama" ; !¢
klassikko+N+Pl+Nom+Err/Orth+Sem/Dummytag:klassikoh ENDLEX "klassikko" ; ! ij koolgâ suuijâđ klassikko gen. klassiko (PM, MLO)
Test results
coverage etc
We are now over 95:
- 240216: 1-(33786/707721) = 0.95226
| cat test/data/Luima_Seripos_2013-2017.txt | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | grep ‘” ?’ | rev | csort | rev | uniq -c | wc |
15190 48527 436315
| cat test/data/Luima_Seripos_2013-2017.txt | \ | |
| hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | \ | |
| preprocess –corr=test/data/typos.txt | \ | |
| hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | \ | |
| grep “ ?” | cut -d’”’ -f2 | wc -l |
33786
This is embarrassingly impressive, given that we still have only 3700 or so nouns.
Yaml tests
Test results
We are slightly down:
- 231129: gt-norm fst(s): PASSES: 8416
- 240208: gt-norm fst(s): PASSES: 8394
- 240216: gt-norm fst(s): PASSES: 8352
TODO for next meeting: Look at what has happened. Trond will have a look and report.
Probable explanation: Csilla has removed forms from the yaml lists that have not been reflected in the FST. The two should be in synch.
Work on past participles
This was what we did not know how to address last week.
Poem, heritage speaker:
Ща̄ньтем э̄ргум э̄рге ам та э̄ргалтамлум.
Ща̄нь-те-м э̄рг-ум э̄рг-е ам та э̄ргалт-ам-лум.
mother-dim-PxSg1 sing-PrfPrc song-PxSg3 Sg.1 thus sing-PrfPrc-PxSg1.OcSg
Here, I was singing the song sung by my mom.
Uneducated native speaker. A bit unsecure here.
Минаме̄н мус акв а̄гитэ колэ̄н масхатуӈкв ха̄йтталас, ма̄нь писале матрыг ёт а̄лмаямтэ.
Мин-ам-е̄н мус акв а̄ги-тэ кол-э̄-н масхатуӈкв ха̄йттал-ас, ма̄нь писал-е матрыг ёт а̄лмая-м-тэ.
go-PrfPrc-? until one girl-PxSg3(Det) house-PxSg3(Det)-Lat run-Prt.ScSg3 small gun-PxSg3(Det) for.some.reason with pick.up-PrfPrc-PxSg3.OcSg
Before they left, one of the girls run into the house to grab the small gun with her, for some reason.
Rombandeeva’s niece T.D. Slinkina:
Ты э̄лыпа̄лт нэ̄глум «Лӯима̄ сэ̄рипос» газетат (№22) Людмила Алгадьева хансум статьятэ̄т («Ма̄ньщи газетав янытлаӈкве са̄в хо̄тпа ёхталас») тав ам потрум атнув торгамтамтэ.
Ты э̄лы-па̄л-т нэ̄гл-ум «Лӯима̄ сэ̄рипос» газета-т (№22) Людмила Алгадьева ханс-ум статья-тэ̄-т («Ма̄ньщи газета-в янытлаӈкве са̄в хо̄тпа ёхтал-ас») тав ам потр-ум ат-нув торгамт-ам-тэ.
this befor.Loc precede-PrfPrc Luima Seripos newspaper-Loc L. A. write-PrfPrc article-PxSg3-Loc (Mansi newspaper-PxPl1 praise many person arrive-Prt.ScSg3 sg.3 sg.1 speech-PxSg1 no-? understand-PrfPrc-PxSg3.OcSg
In her article (Many people came to celebrate the jubilee of our Mansi newspaper) in the previous number of Luima Seripos Lyudmila Algadeva didn't quite get what I was saying.
- Particples are participles that can have Px
- E.R: They are finite forms that are conjugated
Csilla. Present participles have Px suffixes, past participles have Vx ones.
рӯпитаӈкве+V+PrsPrc+PxSg3: рӯпитанэ̄тэ = zero object
рӯпитаӈкве+V+PrsPrc+PxSg3: рӯпитанэ̄тэ = single object
рӯпитаӈкве+V+PrfPrc+PxSg3: рӯпитаме = zero object (+PxSg3)
рӯпитаӈкве+V+PrfPrc+PxSg3: рӯпитамтэ = single object +PxSg3+... (+Sg/+PdSg/+OcSg/+OxSg)
Four alternatives for tagging:
рӯпитаӈкве+V+PrfPrc+Sg+PxSg3: рӯпитамтэ = it is a singular form, right? <==
рӯпитаӈкве+V+PrfPrc+PdSg+PxSg3: рӯпитамтэ = Possessed
рӯпитаӈкве+V+PrfPrc+OcSg+PxSg3: рӯпитамтэ = Modeled like the finites
рӯпитаӈкве+V+PrfPrc+OxSg+PxSg3: рӯпитамтэ = Ox matches Px,
We tentatively (!) go for the first of the gang of four.
Trond will change the yaml files, Jaska will synchronise the fsts :-)
Missing wordforms
Add them to the lexicon.
speller suggesting mechanism
typos
Csilla to add typos markup (error
Trond to set up testing infrastructure and test.
plans ahead
Lexicon
Continue the Add Lemmas To Lexicon Project
681 abbr
2197 Adjectives
271 adverbs
21 con
17 inter
364 mns-propernouns
3698 Nouns <======== quite bad
80 Numerals
2 Participles
91 Postpositions
174 Pronouns
4180 Verbs <======== not bad
11776 All told
We aim at 20000 or so. This is for Csilla and Jack. Starting point: Jack’s reverse list.
Solitary personal pronouns
ам+Pron+Pers+Sg1+Nom: [амкем, амккем, амке̄мт, амкке̄мт] # dialect variation
ам+Pron+Pers+Du1+Nom: [ме̄нккеме̄н, ме̄нкеме̄н, ме̄нкемнт] # dialect variation
This means “I alone, you alone, …”.
Csilla: Reverse order
Speller suggestion
We have a plan (above)
The big picture is that we are coming closer to a working beta spellchecker. Goal: A beta version by summer.
next meeting
Friday, Feb. 23rd at 1400 Finnish time.