Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-mns
Present: Csilla, Jack, Trond
Agenda
Csilla working with missing from textbook. The ones that were on the list due to the stems missing are added. The rest are partly typos, partly grammatical errors.
Verbs: Many verbforms missing, since the yaml paradigm is missing.
Verbs with prefixes: Jack discussed with Sjur and later with Csilla.
Prefixes are hyphened with stems. This hyphen may ahve disappeared in conversion pdf to lexc.
Hyphen is not present if there is neg part or disc particle between prefix and verb the hyphen is disappeared.
Trond has been testing.
We want:
pref-stem-infsuff:pref-stem contlex ;
We a bit too often have (due to conversion errors from pdf):
prefstem-infsuff:prefstem contlex ;
As for negation and discourse particles, we should be able to get
prefix (negation) particle/disc VerbstemVerbinfl
хансыглаӈкве+V:хансыгл V_A ;
хансыглаӈкве+V+Pss+Ind+Prt+ScPl3+Pref/jol+Neg+Disc/ta
ёл ат та хансыглаве̄сыт
Does the former string contain 4, 3 or 1 words?
ёл-хансыглаӈкве in lexc like this:
ёл-хансыглаӈкве+V:ёл-хансыгл V_A ;
ёл-хансыглаӈкве+V:хансыгл V_A ; ! not dict
ёл-хансыглаӈкве+V:ёл-хансыгл V_A ;
(1) @U.Pref.jol@+Pref/jol:@U.Pref.jol@ёл-
at, ta
(1.a) @U.Pref.jol@ёл-хансыглаӈкве+V:@U.Pref.jol@хансыгл V_A ;
(1.b) @U.Pref.jol@+Pref/jol:@U.Pref.jol@
1 word:
ёл ат та хансыглаве̄сыт
ёл-хансыглаӈкве+V+Pss+Ind+Prt+ScPl3+Pref/jol+Neg+Disc/ta
Strings:
ёл вос паты ёл ул поваре̄н ёл ул вос щё̄питаве ?ёл ул вос та пуваве
There are thus 3 slots before the verb stem (where any line
from each column may combine with any line from the next)
| prefix | negation | discourse | infinitive
| ------ | -------- | --------- | ----------
| (empty)| (empty) | (empty) | the infinitives
| ёл | ат | та | ...
| хот | ул | вос | ...
| хот | | вос та | ...
| ... | ... | ... | ...
This is our grammar for now. Jack should have a go at it.
Two alternative:
1. Make the string *prefix (neg) (particle) stem* as one lemma + wordform
2. Let them be one lemma each
Ad (1). The verb should have an optional neg tag and an optional particle tag, so that the wordform *prefix (neg) (particle) stem* gets the analysis
`prefix-verbstemInfsuff+V+Ind+Prs+Sg3+Neg+DiscourseParticleA`
The easy solution is (2). We (1).
## Missing words in textbook:
4 ха̄пкве # dim 4 ма̄кве # dim 3 областьт # parallel form 3 сарилум # missing form in paradigm 3 хумна # dialect form 3 мувл # typo 2 та̄нкина̄ныл # etc. 2 котьле̄нт 2 нё̄тнэ̄г 2 коныпа̄л 2 холасыт 2 тӯрытын 2 ӯльтта 2 поляве 2 ма̄тем 2 Мощхум 2 холас 2 минэн 2 во̄йл 2 ӯйит 2 хот 2 мин 2 кг
Over 100 for LS:
680 палт # typo 473 та̄нти # variant of pron та̄нки 373 о̄вылтыт # ordinal !! 266 Мо̄лты # Missing adjective 249 ма̄вит # typo hyph 218 а̄гитпыгыт # typo hyph 216 ма̄во̄й # typo 209 арыгтем # missing from stems 196 арыгкем 189 сыресыр 181 акванатхатыгласыт 173 места 160 ловиньтэлы̄н 159 ӯнттувес 157 акваннё̄тхатым 145 намаим 131 мӯсхалыг 129 матыра̄ти 125 млн 121 мирхал 117 ханищтахтан 116 ла̄вме 112 сапра̄ни 103 таса̄вит 101 тотыяныл 100 а̄гирищитпыгрищит ```
Many of these are in the typos.txt already:
ма̄вит ма̄-вит
Todo for Trond: Include typos.txt in the missing words procedure.
If you have preprocess
(check with preprocess --help
)
cat test/data/Readings_20230901.txt |\ #1 hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |\ #2 preprocess –typos=test/data/typos.txt|\ #3 hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst|\ #4 grep “ ?” #5
Csilla: Still do missing lists:
- missing stems where they belong
- typos in typos.txt
- report grammar issues to J&T
Jack: The prefixes
Trond: Look at testing pipelines, perhaps also suggestion mechanism.
Wednesday 29 at 14 finnish time.