GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

Error markup for ISL - Icelandic

We want to extend (some of) the corpus files with markup for spelling and other errors, to use them as gold standards for testing our spellers (and in the future other tools as well). The markup is done manually, and needs to follow certain rules.

Description of the error classification for ISL:

1. Unclassified errors

These are errors of an unknown type.

2. Orthographic errors, non-words

These are traditional misspellings confined to a single string, that is, errors that can be detected and corrected without analysing the surrounding words. In the resulting XML, the element is named <errorort>. These errors always produce non-words, so a speller should be able to detect them.
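As a rough illustration of the resulting XML, the sketch below wraps one misspelling in an <errorort> element using Python's standard library. Only the element name comes from this page; the attribute names `correct` and `errorinfo`, the helper `make_errorort`, and the sample word pair are assumptions made for the example, not the documented output of the conversion.

```python
import xml.etree.ElementTree as ET


def make_errorort(error: str, correct: str, info: str) -> ET.Element:
    """Wrap a misspelled string in an <errorort> element.

    The attribute names are hypothetical; only the element name is
    taken from the documentation.
    """
    element = ET.Element("errorort")
    element.set("correct", correct)   # assumed attribute name
    element.set("errorinfo", info)    # assumed attribute name
    element.text = error
    return element


# Invented sample: two letters swapped in "Ísland".
# "position" and "subtype" are left as literal placeholders.
node = make_errorort("Íslnad", "Ísland", "meta,position,subtype")
print(ET.tostring(node, encoding="unicode"))
```

The error string is kept as the element text so that the surrounding sentence still reads as it was originally written, with the correction carried alongside it.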

Error types

vow - {error}${vow,position,subtype|correct}

Errors involving an incorrect vowel.

con - {error}${con,position,subtype|correct}

Errors involving an incorrect consonant.

typo - {error}${typo,position,subtype|correct}

Typographical errors: slips of the hand or fingers while typing, as distinct from spelling errors.

cap - {error}${cap,position,subtype|correct}

An error in capitalization.

meta - {error}${meta,position,subtype|correct}

Metathesis (transposition) of letters. Two or more letters may be involved, though three is the most common.

abp - {error}${abp,subtype|correct}

Incorrect punctuation in an abbreviation, resulting in a non-word.

cmp type 1 - {error}${cmp,subtype|correct}

Compounding error: the wrong form of the first word in the compound is used, resulting in a non-word.

cmp type 2 - {error}${cmp,wordclasses,subtype|correct}

Compounding error: two or more words are incorrectly written together as one word, resulting in a non-word.

cmp type 3 - {error}${cmp,slash,subtype|correct}

Compounding error: a slash is used to join words that should be written separately, resulting in a non-word.
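The markup patterns listed above share the shape `{error}${fields|correct}` and differ only in which comma-separated fields follow the error type. A minimal Python sketch of a parser for them might look like the following. It assumes that `slash` in cmp type 3 is a literal keyword, that a three-field `cmp` entry without it is type 2, and that fields contain no braces, commas, or pipes. The names `FIELD_LAYOUTS`, `layout_for`, `parse_markup`, and the sample sentence are hypothetical.

```python
import re

# {error}${type,field,...|correct}; assumes no nested braces or escaped pipes.
MARKUP = re.compile(
    r"\{(?P<error>[^{}]+)\}\$\{(?P<info>[^{}|]+)\|(?P<correct>[^{}]+)\}"
)

# Field names that follow the type, per the patterns listed above.
FIELD_LAYOUTS = {
    "vow": ("position", "subtype"),
    "con": ("position", "subtype"),
    "typo": ("position", "subtype"),
    "cap": ("position", "subtype"),
    "meta": ("position", "subtype"),
    "abp": ("subtype",),
}


def layout_for(error_type, values):
    """Return the field names for one entry, resolving the three cmp variants."""
    if error_type != "cmp":
        return FIELD_LAYOUTS[error_type]   # KeyError for an unlisted type
    if len(values) == 1:
        return ("subtype",)                # cmp type 1
    if values[0] == "slash":
        return ("slash", "subtype")        # cmp type 3 (assumed literal keyword)
    return ("wordclasses", "subtype")      # cmp type 2


def parse_markup(text):
    """Yield one dict per marked-up error found in text."""
    for match in MARKUP.finditer(text):
        error_type, *values = [part.strip() for part in match["info"].split(",")]
        names = layout_for(error_type, values)
        if len(names) != len(values):
            raise ValueError(f"unexpected field count in {match.group(0)!r}")
        entry = {
            "error": match["error"],
            "correct": match["correct"],
            "type": error_type,
        }
        entry.update(zip(names, values))
        yield entry


# Invented sample; "position" and "subtype" stand in for real values.
sample = "Ég bý á {Íslnadi}${meta,position,subtype|Íslandi}."
for entry in parse_markup(sample):
    print(entry)
```

Raising on an unexpected field count, rather than silently accepting it, suits manual markup: a slip in the annotation itself is reported at the entry where it occurs.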