Finite state and Constraint Grammar based analysers, proofing tools and other resources
Words were not analysed: the long u in the text is not the same character as our long u. There are 4000 instances of this long u.
HINTRSCT .generated/generator-raw-gt-desc.tmp1.hfst
/usr/local/bin/hfst-compose-intersect: warning:
Found output multi-char symbols ("ӯ") in
transducer in file <stdin> which are not found on the
input tapes of transducers in file morphology/.generated/phonology.rev.hfst.
Lemmas have been added. Jack has looked at the twolc files (work underway).
There are many doublets. Jack will fix them:
1. totally identical entries: unify
2. identical lemma: unify with +v1, +v2, thereafter
3. Csilla to go through and relegate erroneous +v2s to +Err/Orth
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплести /свить//сплести/" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплетать /сплетать/" ;
==>
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "11_odd_переплести /свить//сплести/ ~ 11_odd_переплетать /сплетать" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "сплетать" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "свить" ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U ;
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U ; ! сплетать
@U.VPref.akwan@акван-сагуӈкве+V:@U.VPref.akwan@саг V_U "переплести /свить//сплести/ переплетать /сплетать" ; !11_odd
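The first two steps of the doublet plan could be sketched roughly as follows. This is only an illustration, not the script Jack uses: the `ENTRY` pattern is a simplified, hypothetical reading of a lexc line (`upper:lower CONTLEX "gloss" ;`), the function name `unify_doublets` is made up, and placing `+v1`/`+v2` directly after the lemma is an assumption about where the variant tag should go.

```python
import re
from collections import defaultdict

# Simplified, hypothetical pattern for one lexc entry: upper:lower CONTLEX "gloss" ;
ENTRY = re.compile(
    r'^(?P<upper>[^\s:]+):(?P<lower>\S+)\s+(?P<contlex>\S+)'
    r'(?:\s+"(?P<gloss>[^"]*)")?\s*;'
)

def unify_doublets(lines):
    """Step 1: merge entries that are identical apart from the gloss.
    Step 2: number entries sharing the same upper side with +v1, +v2, ..."""
    merged = {}  # (upper, lower, contlex) -> distinct glosses, in order seen
    for line in lines:
        m = ENTRY.match(line.strip())
        if m is None:
            continue  # skip anything the simplified pattern does not cover
        glosses = merged.setdefault((m["upper"], m["lower"], m["contlex"]), [])
        if m["gloss"] and m["gloss"] not in glosses:
            glosses.append(m["gloss"])

    by_upper = defaultdict(list)
    for (upper, lower, contlex), glosses in merged.items():
        by_upper[upper].append((lower, contlex, glosses))

    out = []
    for upper, variants in by_upper.items():
        # Assumption: the variant tag is inserted before the first tag.
        lemma, plus, tags = upper.partition("+")
        for i, (lower, contlex, glosses) in enumerate(variants, start=1):
            tagged = upper if len(variants) == 1 else f"{lemma}+v{i}{plus}{tags}"
            gloss = ' "' + " ~ ".join(glosses) + '"' if glosses else ""
            out.append(f"{tagged}:{lower} {contlex}{gloss} ;")
    return out
```

Step 3 (moving erroneous +v2 entries to +Err/Orth) needs a human decision and is deliberately left out of the sketch.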
To compare, Russian:
прошибать:прошиба́ нсв_1a "weight: 1.6053081314452122" ;
сшибать:сшиба́ нсв_1a "weight: 1.6722549210758255" ;
расшибать:расшиба́ нсв_1a "weight: 1.3042781357812314" ;
Inari Sámi:
tekstâviestâ+Sem/Dummytag:tekstâ#viestâ 2ALGA "tekstiviesti" ; !
aldâkkâs+Sem/Dummytag:aldâkkâss 4KUNAGAS "salama" ; !¢
klassikko+N+Pl+Nom+Err/Orth+Sem/Dummytag:klassikoh ENDLEX "klassikko" ; ! ij koolgâ suuijâđ klassikko gen. klassiko (PM, MLO)
We are now over 95:
cat test/data/Luima_Seripos_2013-2017.txt | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | grep '" ?' | rev | csort | rev | uniq -c | wc
15190 48527 436315
cat test/data/Luima_Seripos_2013-2017.txt | \
hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | \
preprocess --corr=test/data/typos.txt | \
hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | \
grep " ?" | cut -d'"' -f2 | wc -l
33786
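For turning counts like these into a coverage figure, a small helper along the following lines might be convenient. It is a sketch under assumptions: it expects CG-style text in which each token starts a line with `"<form>"` and an unanalysed reading is an indented line ending in `" ?` (the same pattern the grep commands look for), and the function name `coverage` is hypothetical.

```python
def coverage(cg_text):
    """Return (tokens, unknown tokens, share of analysed tokens)
    for CG-style text as described in the assumptions above."""
    total = unknown = 0
    for line in cg_text.splitlines():
        if line.startswith('"<'):
            # A wordform line opens a new token.
            total += 1
        elif line.startswith(("\t", " ")) and line.rstrip().endswith('" ?'):
            # An indented reading line marked as unknown.
            unknown += 1
    ratio = (total - unknown) / total if total else 0.0
    return total, unknown, ratio
```

The text could be read from the tokeniser output saved to a file, so the same figure can be recomputed after each lexicon update.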
This is embarrassingly impressive, given that we still have only 3700 or so nouns.
We are slightly down:
TODO for next meeting: Look at what has happened. Trond will have a look and report.
Probable explanation: Csilla has removed forms from the yaml lists, and these removals have not been reflected in the FST. The two should be kept in sync.
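One way to see where the yaml lists and the FST have drifted apart is to compare, per analysis string, the forms each side has. The sketch below assumes both sides have already been loaded into plain dictionaries mapping an analysis to a list of forms; how they are loaded (yaml parsing, running the generator) is not shown, and the name `sync_report` is hypothetical.

```python
def sync_report(yaml_forms, fst_forms):
    """Compare two {analysis: [forms]} mappings and report the differences."""
    report = {}
    for analysis in sorted(set(yaml_forms) | set(fst_forms)):
        expected = set(yaml_forms.get(analysis, ()))
        generated = set(fst_forms.get(analysis, ()))
        if expected != generated:
            report[analysis] = {
                "only_in_yaml": sorted(expected - generated),
                "only_in_fst": sorted(generated - expected),
            }
    return report
```

An empty report would mean the two sources agree for every analysis that was compared.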
This was what we did not know how to address last week.
Poem, heritage speaker:
Ща̄ньтем э̄ргум э̄рге ам та э̄ргалтамлум.
Ща̄нь-те-м э̄рг-ум э̄рг-е ам та э̄ргалт-ам-лум.
mother-dim-PxSg1 sing-PrfPrc song-PxSg3 Sg.1 thus sing-PrfPrc-PxSg1.OcSg
Here, I was singing the song sung by my mom.
Uneducated native speaker. A bit unsure here.
Минаме̄н мус акв а̄гитэ колэ̄н масхатуӈкв ха̄йтталас, ма̄нь писале матрыг ёт а̄лмаямтэ.
Мин-ам-е̄н мус акв а̄ги-тэ кол-э̄-н масхатуӈкв ха̄йттал-ас, ма̄нь писал-е матрыг ёт а̄лмая-м-тэ.
go-PrfPrc-? until one girl-PxSg3(Det) house-PxSg3(Det)-Lat run-Prt.ScSg3 small gun-PxSg3(Det) for.some.reason with pick.up-PrfPrc-PxSg3.OcSg
Before they left, one of the girls ran into the house to grab the small gun to take with her, for some reason.
Rombandeeva’s niece T.D. Slinkina:
Ты э̄лыпа̄лт нэ̄глум «Лӯима̄ сэ̄рипос» газетат (№22) Людмила Алгадьева хансум статьятэ̄т («Ма̄ньщи газетав янытлаӈкве са̄в хо̄тпа ёхталас») тав ам потрум атнув торгамтамтэ.
Ты э̄лы-па̄л-т нэ̄гл-ум «Лӯима̄ сэ̄рипос» газета-т (№22) Людмила Алгадьева ханс-ум статья-тэ̄-т («Ма̄ньщи газета-в янытлаӈкве са̄в хо̄тпа ёхтал-ас») тав ам потр-ум ат-нув торгамт-ам-тэ.
this before.Loc precede-PrfPrc Luima Seripos newspaper-Loc L. A. write-PrfPrc article-PxSg3-Loc (Mansi newspaper-PxPl1 praise many person arrive-Prt.ScSg3) sg.3 sg.1 speech-PxSg1 no-? understand-PrfPrc-PxSg3.OcSg
In her article (Many people came to celebrate the jubilee of our Mansi newspaper) in the previous issue of Luima Seripos, Lyudmila Algadeva didn't quite get what I was saying.
Csilla. Present participles have Px suffixes, past participles have Vx ones.
рӯпитаӈкве+V+PrsPrc+PxSg3: рӯпитанэ̄тэ = zero object
рӯпитаӈкве+V+PrsPrc+PxSg3: рӯпитанэ̄тэ = single object
рӯпитаӈкве+V+PrfPrc+PxSg3: рӯпитаме = zero object (+PxSg3)
рӯпитаӈкве+V+PrfPrc+PxSg3: рӯпитамтэ = single object +PxSg3+... (+Sg/+PdSg/+OcSg/+OxSg)
Four alternatives for tagging:
рӯпитаӈкве+V+PrfPrc+Sg+PxSg3: рӯпитамтэ = it is a singular form, right? <==
рӯпитаӈкве+V+PrfPrc+PdSg+PxSg3: рӯпитамтэ = Possessed
рӯпитаӈкве+V+PrfPrc+OcSg+PxSg3: рӯпитамтэ = Modeled like the finites
рӯпитаӈкве+V+PrfPrc+OxSg+PxSg3: рӯпитамтэ = Ox matches Px,
We tentatively (!) go for the first of the gang of four.
Trond will change the yaml files, Jaska will synchronise the fsts :-)
Add them to the lexicon.
Csilla to add typos markup (error
Trond to set up testing infrastructure and test.
Continue the Add Lemmas To Lexicon Project
681 abbr
2197 Adjectives
271 adverbs
21 con
17 inter
364 mns-propernouns
3698 Nouns <======== quite bad
80 Numerals
2 Participles
91 Postpositions
174 Pronouns
4180 Verbs <======== not bad
11776 All told
We aim at 20000 or so. This is for Csilla and Jack. Starting point: Jack’s reverse list.
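As a quick check of the totals above and of the distance to the target, the arithmetic can be redone in a few lines. The counts are copied from the list; the target of 20000 is the approximate figure mentioned in the notes.

```python
# Lemma counts per part of speech, copied from the list above.
counts = {
    "abbr": 681,
    "Adjectives": 2197,
    "adverbs": 271,
    "con": 21,
    "inter": 17,
    "mns-propernouns": 364,
    "Nouns": 3698,
    "Numerals": 80,
    "Participles": 2,
    "Postpositions": 91,
    "Pronouns": 174,
    "Verbs": 4180,
}

TARGET = 20000  # approximate goal stated in the notes

total = sum(counts.values())   # current number of lemmas
remaining = TARGET - total     # lemmas still to add to reach the target
print(f"total: {total}, remaining to target: {remaining}")
```

The sum agrees with the "All told" figure of 11776, which leaves roughly 8200 lemmas to add.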
ам+Pron+Pers+Sg1+Nom: [амкем, амккем, амке̄мт, амкке̄мт] # dialect variation
ам+Pron+Pers+Du1+Nom: [ме̄нккеме̄н, ме̄нкеме̄н, ме̄нкемнт] # dialect variation
This means “I alone, you alone, …”.
Csilla: Reverse order
We have a plan (above)
The big picture is that we are coming closer to a working beta spellchecker. Goal: A beta version by summer.
Friday, Feb. 23rd at 1400 Finnish time.