Finite state and Constraint Grammar based analysers, proofing tools and other resources
Number of words (run from the lang-mns directory):
cat test/data/Readings_20230901.txt |\
hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |wc -l
Number of unknown words:
cat test/data/Readings_20230901.txt |\
hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |\
preprocess --corr=test/data/typos.txt|\
hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |\
grep " ?"|cut -d'"' -f2|wc -l
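The two counts above can be combined into a single coverage figure. The sketch below assumes that coverage means the share of tokens not marked as unknown. The function name `coverage` and the numbers in the example call are hypothetical placeholders, not results measured on the Mansi data.

```shell
# Hypothetical helper: coverage in percent, given the total number of
# tokens ($1) and the number of unknown tokens ($2).
coverage() {
    awk -v total="$1" -v unknown="$2" 'BEGIN {
        if (total > 0)
            printf "%.2f\n", 100 * (total - unknown) / total
        else
            print "n/a"   # avoid division by zero on empty input
    }'
}

# Placeholder counts, for illustration only.
coverage 200 30
```

To use it, paste the output of the two `wc -l` pipelines as the first and second argument. Note that the two pipelines tokenise slightly differently (the second one applies the typo corrections first), so the resulting percentage is an approximation.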
The file is: Mansi readings (test/data/Readings_20230901.txt)
Coverage:
The file was: Mansi readings (test/data/Mansi_readings.txt). This is an old version of the textbook; the data is kept here for reference.
Coverage:
cat test/data/Readings_20230901.txt |\
hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |\
vislcg3 -g src/cg3/disambiguator.cg3 |\
grep -v '^[:]' | cut -d'"' -f2 | uniq | grep -v '^<' |\
sort | uniq -c | sort -nr | cut -c6- |\
grep '[яшертыуиопюжасдфгчйкльъэщзхцвбнмм]' |\
mnshun | grep "+?" | cut -f1 | wc -l
231012: 1193 231116: 1178
(probably delete these)
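The text-filtering steps of the coverage pipeline above can be tried out without the analyser. The sketch below runs them on a toy input that imitates the CG stream format; the input is invented for illustration and is not real analyser output. The names `cg_sample` and `lemma_freq` are hypothetical, and `awk '{ print $2 }'` stands in for `cut -c6-` because the width of the count column printed by `uniq -c` differs between systems.

```shell
# Toy CG-style input (invented): cohort lines "<wordform>", indented
# reading lines with a quoted lemma, and one line starting with ":".
cg_sample() {
    printf '"<ab>"\n\t"ab" N Sg\n: \n"<cd>"\n\t"cd" V\n"<ab>"\n\t"ab" N Sg\n'
}

# Drop ":" lines, keep the text inside the first pair of double quotes,
# discard the "<wordform>" lines, then list lemmas by falling frequency.
lemma_freq() {
    grep -v '^[:]' | cut -d'"' -f2 | uniq | grep -v '^<' |
    sort | uniq -c | sort -nr | awk '{ print $2 }'
}

cg_sample | lemma_freq
```

On the toy input this prints `ab` before `cd`, since `ab` occurs twice. In the real pipeline the resulting list is then restricted to Cyrillic strings and passed on to `mnshun`.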
The table shows the number of typos tested, together with the speller results, for hfst-ospell. The latest results, and perhaps all of them, were affected by technical errors with composed long vowels.
```
          typos   Avrg pos   % missp   % missp
          .txt    for corr   in 1st    in top-5
----------------------------------------------
hfst-ospell:
240405:   395     1.15       59.80     63.10
240410:   473     1.08       54.59     57.72
240411:   473     1.08       53.42     56.41
240422:   547     1.09       46.75     49.54
240422:   579     1.16       45.71     48.86
----------------------------------------------