GiellaLT provides rule-based language technology aimed at minority and indigenous languages.
This page explains how to test the language model, so that you stay in control of the development work.
There are in principle two types of testing:

- tests with a predefined correct answer
- tests without a predefined correct answer, where the linguist inspects the output

The former method is good for regression testing (ensuring that your model does not get worse); the latter requires knowledge of the language in question. We look at the two methods in turn.
Most regression tests in the GiellaLT infrastructure may be run in one go, with the command `make check`.
Depending upon your setup, the `make check` procedure will test the following. The headlines below correspond to the output of the `make check` command in the terminal. Each text snippet `Making check in` refers to a folder under `lang-XXX`. Some of them contain tests, others do not. We skip the ones that typically contain no tests.
When scrolling through the output of `make check`, you will see summaries in green, like this one:

```
All 5 tests behaved as expected (3 expected failures)
```
The test in question is summarised above the green message, offering more detail about what has happened. The following text goes through the different tests:
These tests are written in the `phonology.twolc` file. The tests are of the format shown here (€ = the euro sign), where the upper line is input from lexc and the lower line is output text:

```
!!€ example^DELVOW
!!€ exampl00
```

`make check` will pick up these tests from `phonology.twolc` and report whether each rule has worked or not.
The `orthography` folder contains rules for turning initial capital letters into lowercase ones (thus, both Tables and tables are analysed as plurals of table), and the `inituppercase` test tests for this.
This test finds all tags of the format `+Tag` in the `*.lexc` files, and checks whether they are declared in `root.lexc`. If not, they are listed here. The error is one of two: either the tag contains a typo, or it is missing a declaration in `root.lexc`; in the latter case, declare it. The goal is that no tags should be listed; the test will fail until the list is empty.
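The idea behind this check can be sketched in plain shell. This is a simplified illustration, not the actual GiellaLT test code; the file names and lexicon content are fabricated:

```shell
# Simplified sketch of the tag-declaration check (illustrative file
# names and content, not the actual GiellaLT test code).

# A fabricated lexc fragment using four tags:
printf 'word+N+Sg:word NOUNS ;\nrun+V+Inf:run VERBS ;\n' > used.lexc

# Tags declared in a (fabricated) root.lexc:
printf '+N\n+V\n+Sg\n' | sort > declared.txt

# Extract every +Tag actually used, one per line, deduplicated:
grep -o '+[A-Za-z]*' used.lexc | sort -u > used.txt

# Tags used but never declared -- the list the real test reports:
comm -23 used.txt declared.txt   # prints +Inf, which lacks a declaration
```

Here `+Inf` is used in the lexicon but missing from the declarations, so it is the one tag reported.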
The test routine will list tests like
You can add or remove tests for adjectives, nouns, proper nouns and verbs in the `GENERATION_TESTS_IN` variable:

```
GENERATION_TESTS_IN=generate-adjective-lemmas.sh.in \
                    generate-noun-lemmas.sh.in \
                    generate-propernoun-lemmas.sh.in \
                    generate-verb-lemmas.sh.in
```
List files that you know do not pass under `XFAIL_TESTS=` further down in the file `Makefile.am` (thereby making them green in the test report).
The standard setup for this test assumes that the language behaves like the Uralic languages: the base form is the nominative, there is no gender, and verbs are cited in the infinitive. If a language deviates from this (as e.g. Norwegian or Romani does), the setup for this test must be done for each language separately, by editing the relevant files.
Similar tests may be set up for lexc. See `lang-sma` for examples.
Make so-called yaml files in
Examples are found for all the Saami languages.
For some of the tests, we have separate commands for running them standalone (these tests are covered by the `make check` command as well):
- Test that all tags are declared and written correctly:
- Test that lemmas can be generated:
- Run yaml tests:
By this impressive title we mean tests without a predefined correct answer. There will thus not be any report of `PASS`; here the linguist must check the output herself or himself.
We have a set of routines generating lemmas for words or classes of words:

```
sh devtools/verb_minip.sh '^lemma[:+]'
sh devtools/noun_minip.sh '^lemma[:+]'
sh devtools/prop_minip.sh '^lemma[:+]'
sh devtools/adj_minip.sh '^lemma[:+]'
```
You can also look at the generation of all members of one continuation lexicon:
You can edit the list of forms in the paradigm files which are mentioned in the scripts, e.g.
We have a routine for generating tables of large classes of words. The result is an html file giving a bird's-eye view of the analyser.
The command is as follows, one command for each part of speech:

```
sh devtools/generate-adj-wordforms.sh
sh devtools/generate-noun-wordforms.sh
sh devtools/generate-prop-wordforms.sh
sh devtools/generate-verb-wordforms.sh
```
NOTE! For languages with gender we typically split the noun file into `generate-msc-wordform.sh`, etc.
You can edit the list of forms (include as many or as few forms as you like):

```
morf_codes="+N+Sg+Nom \
            +N+Sg+Gen"
```
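As a minimal sketch of how such a variable is consumed, the loop below expands each code for a sample lemma. The loop and the placeholder lemma are illustrative only; the real scripts pass such strings on to the generator instead of printing them:

```shell
# Illustrative sketch: expand each morphological code for a sample lemma.
# "lemma" is a placeholder; the real generate-*-wordforms.sh scripts feed
# such strings to the generator transducer instead of echoing them.
morf_codes="+N+Sg+Nom \
+N+Sg+Gen"

for code in $morf_codes; do
  echo "lemma$code"
done
```

Unquoted expansion of `$morf_codes` word-splits the string, so each `+Tag+Tag+Tag` sequence becomes one generation request.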
You can edit which cont.lexes to test:
You can edit how many lemmas of each cont.lex to test:
The following test setup may be used to test for lexical coverage:
Here is how it is done:
For reference text, you may use `test/data/freecorpus.txt` (if it exists), or alternatively pick and save a text yourself. Analyse it with the following command (change `todaysdate` to just that, optionally with a, b, … appended if you plan to test several versions today):
```
cat test/data/freecorpus.txt \
  | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
  | grep '?' | sort \
  | uniq -c | sort -nr > misc/freecorpus.missing.todaysdate
```
The resulting file will be what we refer to as a missing list: a frequency-sorted list of unknown wordforms. These should be added to the analyser.
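The counting end of the pipeline can be illustrated with a self-contained toy example. The wordforms are fabricated; in real use the input comes from the analyser output:

```shell
# Fabricated unknown wordforms, as if extracted from analyser output:
printf 'guolli\nbiila\nguolli\nguolli\nbiila\nmanne\n' \
  | sort | uniq -c | sort -nr
# The most frequent unknown wordform is printed first.
```

`uniq -c` only collapses adjacent duplicates, hence the initial `sort`; the final `sort -nr` orders the counts descending.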
After having worked on the analyser for a while, repeat the procedure. The result is then two files (the old and the new). These may then be compared as follows.
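One way to do the comparison is to strip the frequency counts and compare the wordform columns with `comm`. This is a sketch with fabricated file contents; in real use you would read your two `misc/freecorpus.missing.*` files instead:

```shell
# Fabricated old and new missing lists in "uniq -c" format:
printf '      3 guolli\n      2 biila\n' > missing.old
printf '      2 biila\n      1 manne\n' > missing.new

# Keep only the wordform column and sort, as comm requires sorted input:
awk '{print $2}' missing.old | sort > old.words
awk '{print $2}' missing.new | sort > new.words

comm -23 old.words new.words   # fixed since the old run: guolli
comm -13 old.words new.words   # newly missing wordforms: manne
```

`comm -23` keeps lines only in the first file (wordforms you have fixed), while `comm -13` keeps lines only in the second (newly appearing misses).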
These adjustments are for the yaml tests referred to in the section on regression testing above.
Remove all yaml tests (check in your local modifications first!):
Get the yaml file you want to test, e.g.:

```
svn up test/src/gt-norm-yamls/V-mato_gt-norm.yaml
sh test/yaml-check.sh
```
This example adds all verbs into one file:

```
head -11 test/src/gt-norm-yamls/V-AI-matow_gt-norm.yaml > test/src/gt-norm-yamls/U-all_gt-norm.yaml
tail -n +11 test/src/gt-norm-yamls/V* | grep -v "==" >> test/src/gt-norm-yamls/U-all_gt-norm.yaml
```
This example adds all nouns with final -y into one file:

```
head -11 test/src/gt-norm-yamls/N-AN-amisk_gt-norm.yaml > test/src/gt-norm-yamls/A-Ny-all_gt-norm.yaml
tail -n +11 test/src/gt-norm-yamls/N*y_gt-norm.yaml | grep -v "==" >> test/src/gt-norm-yamls/A-Ny-all_gt-norm.yaml
```
The example is for the inanimate noun ôtênaw. Use an already functioning yaml file as a starting point (here `N-AN-amiskw_gt-norm.yaml`). You still have to do a little editing afterwards, like correcting the documentation about the lemma and making the file more readable by adding empty lines. And you must of course correct the output.
```
head -12 test/src/gt-norm-yamls/N-AN-amisk_gt-norm.yaml \
  > test/src/gt-norm-yamls/N-IN-otenaw_gt-norm.yaml
cat test/data/NI-par.txt | sed 's/^/ôtênaw/' | dcrk | \
  tr '\t' ':' | sed 's/:/: /' | grep -v '^$' | \
  sed 's/^/     /' >> test/src/gt-norm-yamls/N-IN-otenaw_gt-norm.yaml
```

Comment: the last sed command should insert five whitespaces (as shown above), to match the indentation of the yaml file.