Finite state and Constraint Grammar based analysers, proofing tools and other resources
Texts from various domains should be tested at regular intervals.
The corpus should be inspected, e.g. with the following command (with ~/gt/sme/ as the working directory):
cat corp/* | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | sort | uniq -c | sort -nr | less
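Stage by stage, the pipeline does the following (the comments are explanatory paraphrases and not part of the command; the reading of CLB as a clause boundary tag is an assumption):

cat corp/* |                        # concatenate the corpus files
  preprocess --abbr=bin/abbr.txt |  # tokenise, one token per line
  lookup -flags mbTT bin/sme.fst |  # analyse each token morphologically
  grep '\?' |                       # keep only unrecognised tokens
  grep -v CLB |                     # drop clause boundary (punctuation) lines
  sort | uniq -c | sort -nr |       # one line per type, sorted by frequency
  less                              # browse the result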
The result is a frequency-sorted list of the words not recognised by the analyser, and it will also contain non-Saami words, since these are not in the lexicon. In order to remove these foreign words from the list, the following command may be used:
cat corp/* | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | cut -f1 | lookup -flags mbTT bin/foreign.fst | grep '\?' | sort | uniq -c | sort -nr | less
The resulting list is an overview of words not recognised by the parser. All-capital words should be ignored, or they could be tested separately, with the command
... | lookup -flags mbTT -f bin/cap-sme | ...
By using this script, words written in CAPITALS are analysed as well, but run in this mode, the parser is too slow to analyse the full one-million-word corpus.
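Spelled out along the lines of the command above (a sketch only; it assumes that bin/cap-sme simply replaces the normal analyser in the same pipeline), the full command would be:

cat corp/* | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f bin/cap-sme | grep '\?' | grep -v CLB | sort | uniq -c | sort -nr | less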
The remaining words should be inspected. Failure to recognise a word has one of three causes: the word is misspelled or mis-formatted in the corpus, it is missing from the lexicon, or it is listed in the lexicon but one or more of its word forms is not analysed.
In simple cases, errors should just be corrected. Otherwise they should be reported to the Bugzilla database. Misspellings may be ignored, or, if they are frequent, they should be added to the lexicon with a tag (at present the tag is “!SUB”). When developing a spell checker, misspellings become interesting in their own right, but for the development of the disambiguator we are more interested in actually analysing the words than in pointing out that they are misspelled.
Clear formatting errors may be corrected in the corpus files, with the following command:
perl -i -pe 's/formatting_error/corrected_formatting/g' corp/filename
This should be done only in our old corpus, and only when it is totally clear that the input string cannot be interpreted as anything other than a formatting error. In our common corpus database, formatting errors are dealt with by our file-specific conversion tools.
Words missing from the lexicon should be added, each to its proper lexicon.
Words listed in the lexicon, but with one or more word forms not analysed, are the most challenging ones. This implies that there is an error either in the morphophonological file twol-sme.txt or, more probably, in the morphological section (for nouns, verbs and adjectives this means sme-lex.txt). In the case of morphological errors, the path through the morphological derivation should be traced and inspected. In the case of morphophonological errors, there are procedures within twolc for detecting them (see the twolc manual).
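When tracing such a case, it may help to feed the suspect word form directly to the analyser; the word form in this sketch is only an example:

echo jogat | lookup -flags mbTT bin/sme.fst

A form answered with +? is not recognised, and its path through the lexicon should then be traced as described above.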
We want to test the following:
There is a discussion of this on the newsgroup. TODO: copy that discussion into this document.
Status quo and directions for actively testing the parser:
The best way of testing the morphology is perhaps the command

make n-paradigm WORD=johka

as described in the testing tools. This method is fine for the inflection of nouns, verbs and adjectives. As of September 2004, the basic noun paradigms in Nickel have all been tested, as have the CG patterns. Priority should now be given to the adjectives and to the verbs. The sublexica should all be run through the generator, as sketched below.
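A minimal sketch of such a generator run, assuming that bin/isme.fst is the generator, that nouns.txt holds one lemma per line, and that the tag string matches the tagset in use:

#!/bin/sh
# Ask the generator for one test form per lemma; print the failures.
while read lemma; do
    echo "$lemma+N+Sg+Ill"
done < nouns.txt | lookup -flags mbTT bin/isme.fst | grep '+?'

Any line in the output is a lemma for which the generator failed to produce the requested form.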
As for the adjectives, there are several subtypes that are not covered by the existing lexica. One possible way of monitoring the situation would be to write a perl script (or shell script) that takes a list of adjectives as input and gives their nom.sg., attributive, gen.sg., comparative nominative, comparative genitive, superlative and superlative genitive forms, and then to run representative lists of adjectives through the script.
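A minimal sketch of such a script, here in shell rather than perl; the generator bin/isme.fst and the exact tag strings are assumptions and must be adapted to the actual tagset:

#!/bin/sh
# For each adjective lemma, ask the generator for seven key forms.
while read lemma; do
    for tags in '+A+Sg+Nom' '+A+Attr' '+A+Sg+Gen' \
                '+A+Comp+Sg+Nom' '+A+Comp+Sg+Gen' \
                '+A+Superl+Sg+Nom' '+A+Superl+Sg+Gen'; do
        echo "$lemma$tags"
    done
done < adjectives.txt | lookup -flags mbTT bin/isme.fst

Reading the output against Nickel would then reveal the subtypes that are not yet covered.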
As for the verbs, the verb file should be read through and checked for transitivity (the question is whether the verbs are assigned to the correct sublexicon).
TODO for a person with Saami as mother tongue: Read through the pp-sme-lex.txt and adv-sme-lex.txt files and evaluate the division into prepositions, postpositions, adpositions and adverbs.
Perhaps a script could be made to run all pronouns through a test.
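One possibility is a round-trip test: generate the full paradigm of each pronoun and feed the resulting forms back to the analyser. A sketch, assuming that pron-forms.txt holds one lemma+tag string per line and that bin/isme.fst is the generator:

lookup -flags mbTT bin/isme.fst < pron-forms.txt | grep -v '+?' | cut -f2 | grep -v '^$' | lookup -flags mbTT bin/sme.fst | grep '\?'

Any output indicates a generated form that the analyser itself does not recognise.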
The chapter on numerals is still not properly written. Postpone testing it until the code is more stable.
When we test whether words are let through or not, we do not test whether the parser actually gives correct analyses. A word may thus be misanalysed in two ways: a wrong word form may be accepted by the analyser, or a correct word form may be given a wrong analysis, or lack the correct one among its readings.
The first issue is of major concern to the spell checker project, and will not be dealt with here.
The second issue is of great importance to the disambiguator, and to the form generator isme.fst. Errors of this type pop up in two contexts: when the parser's output is used as input to the disambiguator (and the correct reading is missing from the input), and when the analysis of a shorter, non-disambiguated text is read through regularly.
Disambiguation is tested in the following way:
A token is correctly disambiguated when at least one of the readings (parses) of the token is correct.
In the ideal case, each token is uniquely and correctly disambiguated with the correct parse; here, both recall and precision will be 1.0. In a text where each token is annotated with all possible parses, the recall will be 1.0, but the precision will be low. A high recall thus comes at the price of low precision. In other words: a recall of less than 100% indicates that some correct analyses were removed, and a precision of less than 100% indicates that some wrong analyses were not removed. The goal is to have both recall and precision as high as possible.
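In the terms used here, the two measures may be stated as follows (a restatement for clarity, counting readings over a whole text):

recall    = correct readings retained in the output / correct readings in the input
precision = correct readings retained in the output / all readings in the output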
Testing procedure:
The number of readings in the analysed, not yet disambiguated text is given by the command

cat file.txt | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | egrep '\t' | wc -l
The number of tokens is given by the command
cat file.txt | preprocess --abbr=bin/abbr.txt | wc -l
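Dividing the first number by the second gives the average ambiguity, i.e. the number of readings per token. A small helper along these lines (a sketch, assuming standard Unix tools) performs the division:

#!/bin/sh
# Average ambiguity of file.txt: number of readings divided by number of tokens.
readings=`cat file.txt | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | egrep '\t' | wc -l`
tokens=`cat file.txt | preprocess --abbr=bin/abbr.txt | wc -l`
echo "scale=2; $readings / $tokens" | bc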
The disambiguated text itself is inspected with the command

cat file.txt | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle | less

During parser construction, the recall and precision data need not be a goal in themselves. Another, equally important goal is to identify errors and try to correct them. Deleting a correct reading is a more serious error than leaving the token ambiguous.
At regular intervals, new, previously unseen texts should be tested for type and token recall. The test procedure, as well as the test results, are explained in the sme test diary.
Although the parser might give correct output, the internal lexicon structure may not be optimal. At some point, the code should be read through with this in mind.