Finite state and Constraint Grammar based analysers, proofing tools and other resources
Fst went slightly up after last night work on verbs, from 0.95259 to 0.95267.
Results from the suggestion mechanism, on the other hand, as gone down.
typos Avrg pos % missp % missp
.txt for corr in 1st in top-5
-----------------------------------------------------------------
240405: 395 1.15 59.80 63.10
240410: 473 1.08 54.59 57.72
240411: 473 1.08 53.42 56.41
The difference between the first two is probably due to the added typos, but the second difference is harder to explain. In severeal instances, the speller does not behave in a predictable manner, which makes improvment difficult.
Now it turns ot there is a bug in the speller building mechanism.
Trond will follow this up.
has been working on typos and on Jack’s words. Thereafter the missing list:
test/data/Luima_Seripos_2013-2017.missing.freq.240208
Extract:
1 219 а̄гитпыгыт typo а̄гит-пыгыт
2 209 арыгтем should be analysed as арыгкем
3 160 ловиньтэлы̄н should be analysed as ловиньтаӈке+V+Act+Imprt+ScPl2+OcSg
4 145 намаим should be analysed as намаяӈкве+V+Ger
5 129 матыра̄ти typo as матыр-а̄ти
6 125 млн added to abbreviations
7 117 ханищтахтан should be analysed as ханищтахтуӈкве+V+PrsPrc
8 116 ла̄вме should be analysed as ла̄вуӈкве+V+PrfPrc+PxSg3
9 107 а̄стламе а̄стлаӈкве+V+PrfPrc+PxSg3
10 100 а̄гирищитпыгрищит typo а̄гирищит-пыгрищит
...
243 17 иӈыт added to adverbs
244 17 дума added to foreign
245 17 Евра added to proper nouns
246 16 уральтахтуӈкве added to verbs
247 16 та̄нкина̄нылн та̄нкина̄ныл+Lat (reflexive pronoun)
248 16 потыртаманыл потыртаӈкве+V+PrfPrc+PxPl3
249 16 Ня̄врамытын Ня̄врам+Pl+Lat
250 16 ӯргалыяныл ӯргалаӈкве+V+Act+Ind+Prs+ScPl3+OcSg
251 16 хусапсовыт
If the lines are marked consistently, we may extract different tasks from the file:
grep "should be" test/data/Luima_Seripos_2013-2017.missing.freq.240208 |wc -l
95
grep "added" test/data/Luima_Seripos_2013-2017.missing.freq.240208 |wc -l
75
...
Work forward, for Csilla:
Jack has been adding derivational endings to -уӈкве verb types from Csilla’s work in docs/
. Also, he has been working on gospel texts, there are deviances compared to LS, partly due to the orthpgraphy reform.
Gospel of Matt:
total 13426
unique tokens 3125
unrecognized tokens 704
Coverage: 1-(704/13426) = .94756
. This is a new domain, and the coverage result is unexpectedly good.
Has been testing + trying to improve the linguistic basis for the suggestion mechanism.
Csilla will continue adding typos, but systematic evaluation must await debugging of the speller (by Divvun), see above.
Important when awaiting testing:
Suggestion for Csilla: Copy the textbook into the online speller (after Trond has updated it) and get a feeling of how it works + report back on this feeling.
Most derived уӈкве forms were lexicalised already. There were not that many аӈкве вербс
Csilla has literature on this, from Rombandeeva. She is not too systematic, but has very many forms (for the western grammarians, the situation is the opposite).
We should start with LS after 2017.
Trond will discuss inclusion of books with Børre, and return to the issue.
So far Gospel of Mark.
April 22nd, 1300 Norwegian time.