Preparing annotated text for testing
This document presents the pipeline for adding an error-marked text to the corpus and running it through grammar checker testing for precision and recall.
Documents for testing should represent the target group of the grammar checker and potentially contain errors. They should be stored in `*corpus/orig/$LANG/cataloguename/`, where `cataloguename` (and any subcatalogues) is a catalogue reserved for annotated files for grammar checker testing.
- Mark errors as explained in the Principles of error markup document.
- Save the file as `filename.correct.txt` (i.e. the filename must end in `.correct.txt`).
- Add the file to the corpus with the command `convert2xml filename.correct.txt`.
- This creates a file `filename.correct.txt.xsl`. In this file, change `conversion_status` from `standard` to `correct`, and add other metadata. A reference to the original file may, for example, be given in the filename slot.
- Convert the corrected file to goldstandard with the command `convert2xml --goldstandard filename.correct.txt`. Given an original file `orig/smn/testcorp/wiki/filename.correct.txt`, this command stores the resulting file as `goldstandard/converted/smn/testcorp/wiki/filename.correct.txt.xml`.
- Supposing you have one or several files and/or catalogues under `goldstandard/converted/smn`, you may then run the command (with `smn` as an example): `gtgramtool test -s $GTLANGS/lang-smn/tools/grammarcheckers/smn.zcheck xml goldstandard/converted/smn > <testfile-output>`
- Alternatively, send the result to standard output with nice colours: `gtgramtool test -c -s $GTLANGS/lang-smn/tools/grammarcheckers/smn.zcheck xml goldstandard/converted/smn`
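
The naming rule and the path mapping in the steps above can be sketched as two small shell helpers. This is only an illustration: the function names `is_correct_file` and `goldstandard_path` are hypothetical and not part of `convert2xml` or `gtgramtool`, and the mapping is inferred from the single `orig/smn/...` example given in the text, so it is assumed rather than confirmed to hold for other layouts.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of convert2xml or gtgramtool.

# Succeeds only if the file name ends in .correct.txt, as the steps require.
is_correct_file() {
    case "$1" in
        *.correct.txt) return 0 ;;
        *) return 1 ;;
    esac
}

# Maps orig/<rest> to goldstandard/converted/<rest>.xml.
# Assumption: the mapping follows the single example given in the text.
goldstandard_path() {
    case "$1" in
        orig/*) printf 'goldstandard/converted/%s.xml\n' "${1#orig/}" ;;
        *) echo "expected a path starting with orig/: $1" >&2; return 1 ;;
    esac
}

src="orig/smn/testcorp/wiki/filename.correct.txt"
if is_correct_file "$src"; then
    # Prints where the goldstandard file is expected to end up.
    goldstandard_path "$src"
fi
```

Running such a check before `convert2xml --goldstandard` may help catch a misnamed file before it is converted.
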
You may at any point reopen the file `filename.correct.txt`, add or revise the error marking, and run the procedure again.
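
Since the procedure is meant to be repeated after each revision, the last two commands could be wrapped in a small script. The following is a minimal sketch: the `run` and `retest` functions and the `DRY_RUN` variable are hypothetical names introduced here, while the two commands and their options are copied from the steps above. `GTLANGS` is assumed to be set in your environment, as in those commands.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration; run, retest and DRY_RUN are not part
# of convert2xml or gtgramtool.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Regenerate the goldstandard file, then test the converted files.
# $1: annotated file, $2: language code, $3: directory with converted files
retest() {
    run convert2xml --goldstandard "$1" &&
    run gtgramtool test -c -s "$GTLANGS/lang-$2/tools/grammarcheckers/$2.zcheck" xml "$3"
}

# With DRY_RUN=1 the commands are printed instead of executed;
# set DRY_RUN=0 to actually run them.
DRY_RUN=1
retest filename.correct.txt smn goldstandard/converted/smn
```

The dry-run mode lets you check the file name and language code before anything is executed.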