Balkan Romani NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-rmn

Test diary

This document writes down test statistics

The overal test command: make check

yaml

The command:

sh test/yaml-check.sh

(data forthcoming)

Lexical coverage

rmn Number of words (standing in lang-rmn):

cat test/data/freecorpus.txt |\
hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |wc -l

Number of unknown words:

cat test/data/freecorpus.txt |\
 hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |\
 preprocess --corr=test/data/typos.txt|\
 hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |\
 grep " ?"|cut -d'"' -f2|wc -l

Test with the full corpus (free + bound):

Lexical coverage of freecorpus

The file is test/data/freecorpus.txt.

Coverage: