Automatic testing
We now have a test bench for automatic testing of the spellers, using different data sets. The data sets/tests serve different purposes, and are the following:
- regression-test: tests the speller against a set of known problematic misspellings and correct words, to check that newer versions don't break earlier fixes; the data set will often contain "constructed" words made to highlight certain morphological constructions
- typos-test: tests the speller against a collection of real typographic errors and their corrections, as found in our corpus documents or elsewhere in real texts; the purpose of the test is to see how well the speller handles real errors, both when it comes to detecting them and to giving the correct suggestion; the data set should never contain constructed errors
- baseform-test: extracts all baseforms (= nominative singular, infinitive) found in our lexicons, and sends them through the speller, to ensure that the lexicon is well-formed and that the speller actually recognises all (baseforms of the) words it should
- correct-test: runs complete documents manually marked up with error/correction tags through the speller; this test will check lexical coverage, calculate precision, recall and coverage, and give an idea of the quality of the suggestions; for details on marking up documents to be used as input, see this page
Below we have briefly described how to run these automatic tests, how to read the test reports, and then some more details on each test.
Running automatic tests, storing results
To run each of the automatic tests above, just make the test name as given, and give the TARGET as usual (in the gt/ directory), e.g.:
make regression-test TARGET=sme
There is one exception, and that is the correct-test, which also requires a DOC input parameter - the correct document used as input data:
make correct-test TARGET=sme DOC=somedoc.correct.doc.xml
There is a short-cut make target that will run all but the correct-test at once:
make spelltest TARGET=smj #will run regression, typos & baseform
In addition, it is possible to specify the tool used for the actual testing, that is, the speller engine, by giving make the parameter TESTTOOL, with one of the following values:
- pl: the Polderland command-line speller
- mw: Microsoft Word as the engine, iterating over each of the words in the input data and asking Word about its spelling status; AppleScript is used to tell Word what to do, and to collect the response from Word
In the future, more spelling engines will be added, like hunspell (hu) and possibly aspell (as).
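For example, a typos test run using Microsoft Word as the engine could look like this (an illustrative command; pick the test name and TARGET as appropriate):
make typos-test TARGET=sme TESTTOOL=mw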
The mw test engine has some shortcomings due to Word's AppleScript implementation (or our inability to find our way through the Word AppleScript dictionary), but it also has the nice feature of being completely independent of the real speller engine behind Word. This means that it is possible to test other spellers than ours, and compare the test results across languages and speller engines (given reasonably similar input data).
It is also possible to add the date of the test run as a parameter to make, if one for example would like to update an earlier test run with corrected test data. This is done with the parameter DATE. A full make command for the future hunspell tool would then look something like:
make correct-test TARGET=sme DOC=somedoc.correct.doc.xml TESTTOOL=hu DATE=20071020
The output from each test is two xml files, both stored in gt/doc/proof/spelling/testing/. One is a bare-bones standardised xml representation of the speller output, the other is a Forrest-doc xml file presenting both the direct test results and some calculated statistics. To save the test results for the future, and at the same time make them available to others, the xml files should be checked into cvs.
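A minimal sketch of this step, with made-up file names:
cvs add somedoc.spellout.xml somedoc.report.xml # hypothetical file names
cvs commit -m "Add speller test results"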
Finally, to properly include the test results in our Forrest-driven site, the forrest-doc files should also be added to the menu system by including a reference in the file gt/doc/site-proof-frag.xml.
regression-test
Input data
The regression-test input data is stored in the file $TARGET/polderland/regressions.txt. The format is quite simple, and has two forms:
error<TAB>correction<TAB>#comment
correct<TAB><TAB>!comment
Comments can either start with # or !. The first variant is a so-called negative test, where the speller should detect the error and give the correction as one of its suggestions. The other variant is consequently a positive test, where we check that the speller actually recognises correct word forms. Often, missing correct suggestions or false negatives are caused by the correct form not being recognised. The positive tests will help in detecting such cases.
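For illustration, a couple of hypothetical entries (the words, correction and bug number are made up) could look like this:
missspeling<TAB>misspelling<TAB>#530 double consonant not flagged
misspelling<TAB><TAB>!known correct form
Starting the comment of a test pair with a bug ID also makes the pair show up in the "Grouped by bug #" section of the test report (see below).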
Reading the test report
The test report for the regression tests has seven main sections:
- Overview
- True positives
- False positives
- False negatives
- True negatives
- Grouped by bug #
- Testpairs not in bugs
Each section is briefly described below.
Overview
This gives some basic statistics about the regression test. The most important figures here are the false negatives and false positives - they indicate how many testpairs are still failing.
True positives
Normally not very relevant reading - these are the correctly recognised misspellings.
False positives
This section lists correct input flagged as misspellings. Check this briefly to see if there are any patterns in the incorrectly flagged words. Often a few bugs are failing, so further investigation should be directed there.
False negatives
These are misspellings not detected by the speller. Again, check whether there is a pattern among the undetected misspellings.
True negatives
Normally not very relevant reading - these are the correctly recognised correct words.
Grouped by bug #
This is really the most relevant section. Here, all failing test pairs have a light red background, to make them stand out visually and be easy to spot. To get an overview of the situation for reported bugs, go directly to this section, and scroll through it looking for red rows.
All bugs with no red rows can be closed (or should be already), whereas bugs with red rows (ie broken tests) need further investigation.
For a test pair to show up in this section, the comment column in the test data has to start with the bug ID.
Testpairs not in bugs
This last section contains all test pairs not covered by the previous section, and uses the same reddish background colour to indicate failed tests. It should be as small as possible, as we want most or all test pairs to be associated with a bug.
typos-test
Input data
The typos-test input data is stored in the file $TARGET/src/typos.txt. The format is similar to the regression data file:
error<TAB>correction<TAB>#comment
Comments can either start with # or !.
The data is a collection of true misspellings found in different sources. It should NOT contain any made-up examples (they can be put in the regressions.txt file if relevant, otherwise don't use such data).
As part of the testing, all the correct words are also extracted and used as input to the speller. These should all be accepted, and serve as positive test cases for the typos-test.
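Conceptually, this extraction just picks the correction column of typos.txt; a rough sketch (not the actual make rule) could be:
grep -v '^#' $TARGET/src/typos.txt | cut -f2 | grep -v '^$' # keep the correction column, drop comment-only and empty lines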
Reading the test report
The test report for typos-test contains the same first five sections as the regression-test report. The most important things to look at are the following points:
- true positives without (correct) suggestions: why are the suggestion(s) missing?
- false negatives: any pattern in the undetected misspellings?
- false positives: any pattern in the wrongly flagged words?
- overall statistics: our target is to detect and correct as many of the known typos as possible
baseform-test
Input data
The baseform-test input data is generated as an extraction of all lexical entries in our LexC files, and is used to ensure that we actually recognise all the words that we put into the speller. Further, since we're really not interested in seeing the long list of recognised baseforms, the data is sent through the speller twice. The first round is used to identify all negative hits (ie all rejected baseforms), and the second round analyses only those, to get some statistics and suggestions (the suggestions can be quite telling about why a certain word was rejected), and to filter out some cases that are actually recognised (the first filtering is a little over-active).
Reading the test report
The test report for baseform-test contains the same first five sections as the regression-test report. The most important things to look at are the following points:
- number of false negatives: this should really go down to zero
- false negative patterns: use any patterns to try to identify why groups of baseforms are rejected
- single entries: a substantial part of the unrecognised baseforms will be undetected errors in the lexicon; they should just be corrected
correct-test
Input data
The correct-test input data is an xml document with errors and corrections marked up. The xml document is a conversion from a similarly marked-up corpus document, and represents our real-world test scenario for our spellers (the other test cases are different types of more technical testing).
This test will usually have to be run several times on a new test document, as the first run will reveal inconsistencies and mistakes in the error/correction markup that need to be fixed before we get reliable test results.
The test document is read by ccat, which produces test data in a format identical to the other test types, that is:
error<TAB>correction<TAB>
correct<TAB><TAB>
Since the input data is a complete document, it is possible to calculate reliable statistics on precision and recall.
Reading the test report
The test report for correct-test contains the same first five sections as the regression-test report. The most important things to look at are the following points:
- test statistics: in the correct-test, the precision and recall figures are real measures of the quality of our speller, and should be followed closely between speller versions
- false negatives: that is, undetected spelling errors; these should be as few as possible
- false positives: this number should also be low, although it is normally not possible to get it down to zero
- true positives without (correct) suggestions: we want to be able to correct as many of the detected misspellings as possible, which makes this category an interesting study object; it should be as small as possible
Manual testing
Program Settings
In order to obtain measurable results, we set up the programs in the same way:
- Common settings:
- Check Upper case words (turn off “Ignore Upper case”)
- Check words with numbers (turn off “Ignore words with numbers”)
- Ignore words with numbers (leave this option on)
- MS Off/Mac: Word > Preferences > Spelling and Grammar
- MS Off/Win: In the same location?
Types of testing
- Technical testing
- Linguistic testing
- Testing the proofing
- Testing the suggestions
Technical testing
Linguistic testing: Testing the proofing
We test precision, recall and accuracy. Precision measures the actions of the program: given that it indicates an error, can we trust that it actually is an error? Recall measures the robustness of the program: given that we have written a misspelled word, what are the chances that the program finds it? These two measures are interlinked: a strict program will flag errors often and find many, but will also flag too many. On the contrary, a program acting on the safe side will flag an error only when it is sure to have found one, at the expense of letting through some errors. The former is better when users really want a correct text, and the latter is better when the user is annoyed by false alarms and really just wants to get rid of the worst errors, at a minimal cost. Accuracy measures the overall performance, and takes both the other measures into account.
To obtain these measures we need the following data:
- words (wds): The number of words in the text
- true positives (tp): The number of true errors found by the speller (red errors)
- false positives (fp): The number of correctly written words claimed to be errors by the program (correct words in red)
- true negatives (tn): The number of correctly written words recognised as such (correct word, no red line)
- false negatives (fn): The number of errors not found by the speller (misspelling without red line)
We count wds, tp, fp, fn, and calculate tn as wds - (tp + fp + fn). The test values are calculated as follows (there is a spreadsheet available to do this automatically):
- precision = tp/(tp+fp)
- recall = tp/(tp+fn)
- accuracy = (tp+tn)/wds
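As a worked example with made-up counts: for wds = 1000, tp = 40, fp = 10 and fn = 15, we get tn = 1000 - (40 + 10 + 15) = 935, and thus precision = 40/50 = 0.80, recall = 40/55 ≈ 0.73 and accuracy = (40 + 935)/1000 = 0.975.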
Linguistic testing: Testing the suggestions
Also here, we test for precision, recall and accuracy.
To obtain these measures we need the following data:
- errors (err): The number of errors in the text
- true positives (tp): The number of true suggestions
- false positives (fp):
- true negatives (tn):
- false negatives (fn):
- precision = tp/(tp+fp)
- recall = tp/(tp+fn)
- accuracy = (tp+tn)/all