GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

Tokens =    Number of tokens in the text

            Alternative 1: Process with hfst:
            cat file.txt | hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

            Alternative 2: Process with perl:
            cat file.txt | preprocess --abbr=bin/abbr.txt | wc -l
            (or: "ccat -l sme" instead of "cat", if the input is corpus files in our xml format)

Parses =    Number of parses given (the number of lines output by the following
            command, minus the number of tokens)

            Alternative 1: Process with hfst:
            cat file.txt | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

            Alternative 2: Process with perl:
            cat file.txt | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 src/analyser-gt-desc.xfst | lookup2cg | vislcg3 --g=src/syntax/disambiguation.cg3 | wc -l

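As a rough sketch only, tokens and parses could also be counted directly from CG-format output instead of subtracting line counts. This is a different technique from the one described above. It assumes the usual CG stream layout, where each token (cohort) line starts with "< and each parse (reading) line is indented and starts with a double quote. All other lines are ignored. The function name count_cg is hypothetical, and indented sub-readings, if present, would also be counted as parses.

```shell
# count_cg (hypothetical helper): read a CG-format stream on stdin and
# print "<tokens> <parses>".
count_cg() {
  awk '
    /^"</      { tokens++ }   # cohort line, e.g. "<walk>"
    /^[ \t]+"/ { parses++ }   # reading line: indented "lemma" plus tags
    END        { printf "%d %d\n", tokens, parses }
  '
}

# Illustrative input (made up for this sketch): two tokens, three parses.
printf '"<They>"\n\t"they" Pron\n"<walk>"\n\t"walk" V\n\t"walk" N\n' | count_cg
# prints: 2 3
```

It could be used at the end of either pipeline above in place of "wc -l", provided the output at that point is in CG format.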
CorrTag =   Number of tokens that did not have their correct tag removed
            (This number must be arrived at manually: Tokens - erroneous_analyses)

Ambiguity = #Parses / #Tokens
Precision = #CorrTag / #Parses
Recall =    #CorrTag / #Tokens
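
Once the three counts are known, the ratios can be computed in one step. The following is a minimal sketch, not part of the GiellaLT tools: the function name metrics and the sample numbers are made up for illustration, and awk is used only because the shell has no floating-point arithmetic.

```shell
# metrics (hypothetical helper): metrics TOKENS PARSES CORRTAG
# Prints Ambiguity, Precision and Recall as defined above.
metrics() {
  LC_ALL=C awk -v tokens="$1" -v parses="$2" -v corrtag="$3" 'BEGIN {
    if (tokens <= 0 || parses <= 0) {
      print "tokens and parses must be positive" > "/dev/stderr"
      exit 1
    }
    printf "Ambiguity=%.3f Precision=%.3f Recall=%.3f\n",
           parses / tokens, corrtag / parses, corrtag / tokens
  }'
}

# Illustrative numbers only: 1000 tokens, 1250 parses, 980 correct tags.
metrics 1000 1250 980
# prints: Ambiguity=1.250 Precision=0.784 Recall=0.980
```

An ambiguity of 1.000 means every token was left with exactly one parse. In that case precision and recall are equal.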