Finite state and Constraint Grammar based analysers, proofing tools and other resources
This document contains test results for the Lule Saami parser. We will move to an automatic test regime, but while waiting for that, the initial steps are documented here.
The following table records recall for word forms in various texts. Here we measure coverage of the vocabulary by recording all word forms that are not recognised.
-----------------------------------------------------------------------------------
zcorp/gt/smj/bible/nt/lule_sami_new_testament.html.xml
                    Token recall testing        Type recall testing
-----------------------------------------------------------------------------------
Test 1   lex     Wf-total  Wf-tkn   %-recall  Ty-tot  Wf-typ  %-recall
070627           120070    119752   99,7 %
060228   19742   135662    131212   96,7 %    13289   11831   89,0 %   ← 978 inc missing.
060228   18307   135662    125367   92,4 %    13289   11385   85,7 %   ← More rare words.
060227   17997   135662    123368   90,1 %    13289   9938    74,8 %   ← More Kintel, äöŋ fix
060226   17723   135662    108573   80,0 %    13289   8952    67,4 %   ← More Kintel
060222           135662    82748    70,0 %    13289   2195    16,5 %   ← First Kintel import
060124   3368    135662    75018    55,3 %    13289   2084    15,6 %   ← Still no lexicon
-----------------------------------------------------------------------------------
A lower token than type percentage would indicate that the parser misses common words more often than rare ones. A lower type than token percentage, which is what we see here, indicates that the parser handles the core vocabulary well but has lower coverage of rarer words.
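As an illustration of how the two figures can be counted, here is a minimal sketch. It is not the project's own recipe (the actual commands are given at the end of this page), and the analyser path src/analyser-gt-desc.hfstol and the file names are hypothetical:

    # Sketch only: assumes an HFST analyser (hypothetical path) and a plain-text
    # corpus; hfst-lookup marks unrecognised word forms with "+?".
    tr -s '[:space:]' '\n' < corpus.txt | grep -v '^$' > tokens.txt

    # token recall: recognised tokens / all tokens
    total=$(wc -l < tokens.txt)
    unknown=$(hfst-lookup -q src/analyser-gt-desc.hfstol < tokens.txt | grep -c '+?')
    awk -v t="$total" -v u="$unknown" \
        'BEGIN { printf "token recall: %.1f %%\n", 100 * (t - u) / t }'

    # type recall: the same counts over the deduplicated word form list
    sort -u tokens.txt > types.txt
    total=$(wc -l < types.txt)
    unknown=$(hfst-lookup -q src/analyser-gt-desc.hfstol < types.txt | grep -c '+?')
    awk -v t="$total" -v u="$unknown" \
        'BEGIN { printf "type recall:  %.1f %%\n", 100 * (t - u) / t }'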
Each text is given a separate section in the table, ordered chronologically, with the oldest test case (Test 1) at the bottom. The first line of each section gives the name of the file, and each subsequent line represents a test run: the first column gives the test date (in the format yymmdd), Wf-total the total number of word form tokens in the file in question, Wf-tkn the number of recognised word form tokens, and %-recall the percentage of recognised tokens relative to the total. The next columns (Ty-tot, Wf-typ, %-recall) give the same figures for word form types (cf. below for the commands used to calculate the numbers).
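As a concrete reading of one row: in the topmost 060228 run, 131212 of the 135662 word form tokens and 11831 of the 13289 word form types were recognised, and the percentage columns are simply these ratios:

    # recomputing the 96,7 % and 89,0 % figures of the topmost 060228 row
    awk 'BEGIN { printf "%.1f %%   %.1f %%\n", 100*131212/135662, 100*11831/13289 }'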
Test 1 does not cover proper nouns, as they have not yet been added to the lexicon. The commands used to get the numbers are: