North Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sme

Test results for the morphology and lexicon files

This document documents the testing of the parser and disambiguator. Background info and test plan is found in the test plan document.What is found here is an overview of what has been tested, both vocabulary testing, testing of the disambiguator, and testing of the morphological analysis.

Test results for morphology and lexicon

Vocabualry testing

The following table records recall for word forms in various texts. Here we measure coverage of the vocabulary, by recording all word forms that are not recognised.

---------------------------------------------------
zcorp/gt/sme/facta/AGOR-15.06.05_2.doc.xml 
Test 9   Wftot Wf-tkn %-recall Tytot  Wf-typ %-recall
051204   40030  38880 96.9 %    6974    6221 89,2 %

minaigiweb-050824.txt
Test 8  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
050916 276798 255144  92.2 %   28480   22176 77.9 % Very many formatting errors.

callinravvagat 2000.xml
Test 7  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
050518  26948  25168  93.4 %    6686   5435  81.3 %  w/o caps on
050518  26948  25208  93.5 %    6686   5455  81.6 %  w caps on
---------------------------------------------------
nisson_ovddasteapmi.txt
Test 6  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
040903  38360  35704  93.0 %   20660  19102 92.4 %
---------------------------------------------------
hjh-nod1iid.txt
Test 5  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
040903   1580   1532  96.7 %     683   636  93.1 %
---------------------------------------------------
sd-divas-2002-{1,2}.txt
Test 4  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
041005  32835  31834  96.9 %    6664  6054  90.8 %
040913  32883  31255  95.0 %    6759  5856  86.6 %
---------------------------------------------------
sd-divas-2001-1.txt
Test 4  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
040903  60522  58549  96.7 %    8610  7610  88.4 %
040329  62459  60159  95.3 %    8496  7406  87.2 %
---------------------------------------------------
handlingsplan_samisk.txt
Test 3  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
031120   2148   2053  95,6 %    1044   984  94.3 %
040329   2461   2389  97.1 %     955   898  94.0 % (new preprocessor)
---------------------------------------------------
Test 2  Wftot Wf-tkn  %-recall Tytot  Wf-typ %-recall
Collection    225355                 32467 (test closed)
020815        203080  90.1 %         22721  70.0 %
020918        204315  90.7 %         22956  70.7 %
030210 227062~214845  94.6 %   31474~24398  77.5 %
---------------------------------------------------
Test 1  Wftot Wf-tkn  %-recall    Wf-types %-recall
New Testament 139681                 14888 (test closed)
011110         36471  26.1 %          4983  33.5 %
011116         36980  26.5 %          5050  33.9 %
011214         37736  27.0 %          5177  34.8 %
011218         40741  29.2 %          5955  40.0 % (closed classes added)
020129        126765  90.6 %         11676  78.4 % (proper names added)
020205        128702  92.1 %         12340  82.9 %
020206        129857  92.9 %         12328  82.8 % (nom_nom compound)
020207        131846  94.4 %         12500  84.0 %
020212        132394  94.8 %         12621  84.8 %
020213        132878  95.1 %         12652  85.0 %
020217        132993  95.2 %         12674  85.1 %
020306        133791  95.8 %         12850  86.4 %
020307        133821  95.8 %         12878  86.5 %
020318        134042  95.9 %         12914  86.7 %
020321        135446  97.0 %         13292  89.3 %
020323        136120  97.5 %         13373  89.8 %
020404        136621  97.8 %         13524  90.8 %
020410        136974  98.1 %         13609  91.4 %
020417        137435  98.4 %         13762  92.4 %
020418        137977  98.8 %         13875  93.2 %
020423        138101  98.9 %         13964  93.8 %
021104        138254  99.0 %         14003  94.1 %
040216 166194 165330  99.5 %  14916  14298  95.9 %
050726 166191 165902  99.8 %  14920  14666  98.3 %
---------------------------------------------------

Explaining the table

Lower token than type percentage indicates that the parser misses common words more often than seldom ones.

Lower type than token percentage (which is the case) indicates that the parser is good at the core vocabulary, but has

Each text is given a separate section in the table, ordered chronologically, with the oldest test case (Test 1) at the bottom. The first line of each section gives the name of the file (note: the files of the test cases 2 and 3 are so changed that these two test cases are closed). Each line represents a test run. The first colum gives the test date (in the format ddmmyy), the second (WFtot) the total number of words in the file question, the third (Wf-tkn) the number of recognised word form tokens, and the percentage compared to the total. The next columns does the same for wordform types (cf. below for the commands used to calculate the numbers).

-------------------------------------------------------------------------
Wftot:
cat filename | preprocess --abbr=bin/abbr.txt | wc -l

Non_recognised_wf:
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst
 | grep '\?' | grep -v CLB | wc -l

Wf-tkn = Wftot - Non_recognised_wf

%-recall = Wf-tkn * 100 / Wftot
-------------------------------------------------------------------------
Tytot (Total number of wordform types):
cat filename | preprocess --abbr=bin/abbr.txt | sort | uniq | wc -l

Non_recognised_wt (Number of non-analysed wordform types:
cat filename | preprocess --abbr=bin/abbr.txt | sort | uniq |
lookup -flags mbTT -utf8 bin/sme.fst | grep '\?' | grep -v CLB | wc -l

Wf-typ (Number of recognised wordform types)
Wf-typ = Tytot - Non_recognised_wt

%-recall = Wf-typ * 100 / Tytot

If the text is taken from our new /usr/local/share/sme/gt corpus, then
the "cat filename" part should be replaced with
catxml --title --input /usr/local/share/sme/gt/
and thereafter catalogue name and file name.
-------------------------------------------------------------------------

Grammatical testing

February 2005: Here we will fill in an overview of which grammatical paradigms and wich sublexica we have tested.

Part of speech testing

Adjectives

May 05. Adjectives are tested, both morphologically and lexically.

Nouns

May 05: Nominal morphology is tested

Verbs

May 05: Verbs are about to be gone through

Testing the disambiguator

Test sentences

Here we test accuracy, precision and recall

Accuracy # Here we list tests that are not manually checked.
Date   Tokens  Parses Ambiguity Type
051204  40753         


Syntax
Date   Tokens Parses CorrTag Ambiguity Precision Recall Text           Type
050429   9219   9736    9092    1.056      0.93   0.99 Nickel          Grammar
050512   1972   2089    1928    1.059      0.92   0.98 Luossa          Traditional
050719   9179  14749    9044    1.607      0.61   0.98 Nickel          Grammar
051025   8770  11498            1.311                  Hálddášanáššiid Admin
051027   2578   2686    2498    1.042      0.93   0.97 TaleE036.txt    Admin
070207   2059   2106    1973    1.023      0.94   0.96 MinÁigi070202   news


  
Morphology
Date   Tokens Parses CorrTag Ambiguity Precision Recall Text           Type
050429   9219   9736    9152    1.056      0.94   0.99 Nickel          Grammar
050512   1972   2089    1955    1.059      0.94   0.99 Luossa          Traditional
050719   9179  14749    9100    1.607      0.62   0.99 Nickel          Grammar
051025   8770  11498            1.311                  Hálddášanáššiid Admin
051027   2578   2686    2519    1.042      0.94   0.98 TaleE036.txt    Admin
070207   2059   2106    1999    1.023      0.95   0.97 MinÁigi070202   news

Where:
Tokens =     Number of tokens in the text 
             (cat file.txt | preprocess --abbr=bin/abbr.txt | wc -l)
             (or: "catxml -i" instead of "cat", the catxml command can only be used
             on victorio, on the files in /usr/local/share/corp/ )
Parses =     Number of parses given (number of the following command, minus the
             number of tokens)
             (cat file.txt | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT
             -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle | wc -l
CorrTag =    Number of tokens that did not have their correct tag removed
             (This number must be manually arrived at: Tokens - errouneous_analyses
Ambiguity =  #Parses / #Tokens
Precision =  #CorrTag / #Parses
Recall =     #CorrTag / #Tokens
Syntax =     Tested wrt. the syntactic (@XXX) tags
Morphology = Tested wrt. the morphological tags (all except the @XXX ones), but run
             with the MAPPING section activated (standard mode, that is)