Mansi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-mns

Page Content

Working meeting on Mansi, in Hki

Agenda

Lexicon

Now, we have 18000 or so words in the LS missing list, and 11000 stems in src/fst/stems/.

Goal: Extend the stems lexicon. Doubling it would be a natural goal.

Commands along the line of this:

cat test/data/Luima_Seripos_2013-2017.missing.freq.231129 |grep "ӈкве$"|cut -c6-|rev|csort|rev > x
paste -d":" x x|rev|csort|rev|sed 's/:/+V:/'|sed 's/юӈкве$/й V_U ;/'|sed 's/яӈкве$/й V_A ;/'|sed 's/.ӈкве$/й V_ ;/'

Grammar

This is the prefixes-as-flag. We postpone the issue for today.

Suggestions for the speller

Csilla to look for likely (final) string substitutions. Filenames to edit:

tools/spellcheckers/editdist.default.txt
tools/spellcheckers/final_strings.default.txt
tools/spellcheckers/strings.default.txt

Speller release

We already are above 92 % for newspaper corpus coverage. Removing half of the missing list should give us several percentages more.

We should also look for remaining holes in the grammar.

Then, the suggestion mechanism should give good results for the typos.txt list (we will have to return to the threshold.).

After these are done we will have a speller ready for first release. We should be able to do that in the spring.

After that next steops would be more work on the lexicon, using dictionaries and corpus texts outside LS (e.g. school text books).