GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started, and our Privacy document.

View GiellaLT on GitHub giellalt/giellalt.github.io

Extracting sentences for precision testing

This document explains how you may harvest sentences that get error tags in large corpora. The first command gives all the sentences, and the second splits them according to tag and formats them for yaml testing.

Commands

The commands assume you stand in lang-$lang/tools/grammarchecker/ and have compiled the grammarchecker.

Take all sentences from text and collect the positives

You need to preprocess the corpus so that you get one sentence on each line. With hfst-tokenise in place you do this as follows:

The corpustext used as input will vary from language to language. Be careful not to include texts used in goldcorpus markup. The texts to use here are documented on the grammarchecker documentation page for the language in question. We assume you stand in lang-smn/misc and have your test corpus in a file corpustext.txt (exchange smn with your own language, note that the xxx.zcheck uses 2-letter codes if available, thus sme but se, gle but ga)):

## in lang-smn, exchange smn with your code below.
cat corpustext.txt |\
hfst-tokenise -i tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |\
sed 's/ \([.?!] \)/\1£/g;'|\
tr '£' '\n'|\
sed 's/ \([:;,]\)/\1/g;'|\
divvun-checker -a ../tools/grammarcheckers/smn.zcheck -n smngram|\
grep -v '{"errs":\[\],"text":"' > positives.csv

The file positives.csv will then contain all sentences where the grammarchecker has given an alarm (hence naming it positives).

Order the positives according to rule type

Each rule (type) has its tag. In order to test the effect of one specific rule we extract all sentences marked with the tag assigned by the the rule or rules in question. Here we use msyn-posspl-ill-gen as a tag example). We use the output from the last command as input, and store it in a file we call candidates-... (candidates to yaml tests), but you may of course choose any name.

cat misc/positives.csv |\
grep  '"msyn-posspl-ill-gen"'|\
rev|\
cut -d'"' -f2|\
rev|\
sed 's/^/  - "/'|\
sed 's/$/"/' | grep -v '""' \
> tools/grammarcheckers/tests/candidates-posspl-ill-gen.yaml

This command greps the tag from the positives.csv file. The sentence is at the end of the line. The number of fields may change from rule to rule, the command thus cuts the sentence from behind. The sentence is formatted so that it can be added to the yaml fileset in the grammarchecker/tests catalogue.

You may then make a list of all rule tags in the grammarchecker, search for each tag in positives.csv and store the result in one file for each tag, with a for loop. First, make a list taglist.txt containing all error tags used in the file grammarchecker.cg3 and store it in misc. Then copy the following (Note in the line beginning with Variant: you should exchange smn with the relevant language code), store it in misc, e.g. as candidates.sh, and run it (stand in misc and type the command sh candidates.sh):

#!/bin/bash

for i in `cat taglist.txt`
do
    echo 'Config:
  Spec: ../pipespec.xml
  Variant: smngram-dev

Tests:' > ../tools/grammarcheckers/tests/candidates-$i.yaml
    grep "\"$i\"" positives.csv | rev| \
        cut -d'"' -f2| rev| sed 's/$/"/'| \
        sed 's/^/  - "/' \
        >> ../tools/grammarcheckers/tests/candidates-$i.yaml
done

Integrating the result in regression testing

The resulting files are now stored in tools/grammarchecker/tests/candidates-*.yaml files, one for each tag. Some files may be empty (the smallest ones), delete them.. Then you may test for regression, e.g. with make check or (file by file) with the usual command (standing in tools/grammarchecker):

gtgramtools test -c yaml tests/candidates-neg-posspl-ill-gen.yaml

The sentences you may then (throw away or) edit and store in *-FAIL.yaml (for false positives) or *-PASS.yaml (for true positives). The false positives should then be fixed by editing the rule file.

Sitemap