GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
Tokens = Number of tokens in the text
Alternative 1: Process with hfst:
cat file.txt | hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
Alternative 2: Process with perl:
cat file.txt | preprocess --abbr=bin/abbr.txt | wc -l
(or use "ccat -l sme" instead of "cat" if the input is corpus files in our XML format)
Parses = Number of parses given (the number of lines output by the following
command, minus the number of tokens)
Alternative 1: Process with hfst:
cat file.txt | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
Alternative 2: Process with perl:
cat file.txt | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 src/analyser-gt-desc.xfst | lookup2cg | vislcg3 --g=src/syntax/disambiguation.cg3 | wc -l
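As an alternative to subtracting the token count from the line count, tokens and parses can be counted directly from the disambiguator output. The following is only a sketch: it assumes the usual CG stream layout, where each token is a line starting with "< and each parse is an indented line starting with a quoted lemma. The function name count_cg is made up for this example and is not part of the GiellaLT tools.

```shell
# Count tokens (cohorts) and parses (readings) in CG-format text on stdin.
# Assumption: cohort lines start with "< and reading lines are indented.
# Prints "TOKENS PARSES" on one line.
count_cg() {
    awk '
        /^"</       { tokens++ }   # cohort line, e.g. "<wordform>"
        /^[ \t]+"/  { parses++ }   # reading line: indented "lemma" plus tags
        END         { printf "%d %d\n", tokens, parses }
    '
}
```

It would be appended to either pipeline above in place of "wc -l", for example "... | vislcg3 --g=src/syntax/disambiguation.cg3 | count_cg".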
CorrTag = Number of tokens that did not have their correct tag removed
(This number must be arrived at manually: Tokens - erroneous_analyses)
Ambiguity = #Parses / #Tokens
Precision = #CorrTag / #Parses
Recall = #CorrTag / #Tokens
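Once the three counts are known, the ratios above can be computed in one step. This is a minimal sketch, not an existing GiellaLT script: the function name disamb_metrics is hypothetical, and the counts in the example call are invented purely to show the arithmetic.

```shell
# Print Ambiguity, Precision and Recall from three counts.
# Usage: disamb_metrics TOKENS PARSES CORRTAG
disamb_metrics() {
    awk -v tokens="$1" -v parses="$2" -v corrtag="$3" 'BEGIN {
        # Guard against division by zero
        if (tokens <= 0 || parses <= 0) {
            print "tokens and parses must be positive" > "/dev/stderr"
            exit 1
        }
        printf "Ambiguity = %.3f\n", parses / tokens
        printf "Precision = %.3f\n", corrtag / parses
        printf "Recall = %.3f\n", corrtag / tokens
    }'
}

# Invented counts for illustration only:
# 1000 tokens, 1250 parses, 20 erroneous analyses found by manual checking
tokens=1000
parses=1250
corrtag=$(( tokens - 20 ))   # CorrTag = Tokens - erroneous_analyses
disamb_metrics "$tokens" "$parses" "$corrtag"
```

With these invented counts the call prints Ambiguity = 1.250, Precision = 0.784 and Recall = 0.980.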