GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

Error markup

We want to extend (some of) the corpus files with markup for spelling and other errors, to use them as gold standards for testing our spellers (and in the future other tools as well). The markup is done manually, and needs to follow certain rules.

Language-specific markup

Markup TYPES

We differentiate between different types of errors that people make, depending on the type of analysis needed to detect and correct the error. We also use the annotation for errors in learner texts.

Synopsis (explanations, see below)

Unclassified errors

TEMPLATE: {wrong}§{correct}

Errors of unknown type. By default such errors will be treated as spelling errors (see below). In the resulting xml, the name of the element will be <error>.


    Hm. maahta {son}${pcle,vowc|sån} ahte tjoeverem {{daab}${dem,con|daam} bloggen
    {{darjoedh}${verb,vow|darjodh}}}£{noun,x,acksg,gensg,case|daam bloggem darjodh}
    {{vytnije}${noun,mix|vætnoe} {bloggine}}§{x,x|vætnoebloggine}.

Orthographic errors, non-words

TEMPLATE: {wrong}${error classification|correct}

Traditional misspellings confined to single (error) strings, that is, errors that can be detected and corrected without analysing the surrounding words. In the resulting xml, the element is named <errorort>. These errors always produce non-words, so a speller should be able to detect them.

Orthographic errors, real-words

TEMPLATE: {wrong}¢{error classification|correct} (almost same as for non-words, see above)

Misspellings confined to single words that nevertheless need an analysis of the surrounding words to be detected and corrected. In the resulting xml, the element is named <errorortreal>. These errors, although orthographic in nature, produce other, real words, so a traditional speller is unable to detect them.

Morpho-syntactic errors

TEMPLATE: {wrong form}£{pos,gf,cat,orig,errtype|correct form}

Errors that require an analysis of (parts of) the sentence or surrounding words to be detected and corrected. In the resulting xml, the element is named <errormorphsyn>.

Syntactic errors

TEMPLATE: {redundantword}¥{pos,redun|} OR {word}¥{pos, missing|word missingword} OR word order errors {word1 word2}¥{pos_word1,wo|word2 word1} OR wrong clause type

These errors also require a partial or full analysis of (parts of) the sentence or surrounding words to be detected and corrected. In the resulting xml, the element is named <errorsyn>.

Lexical errors

TEMPLATE: {wrong}€{wrong PoS,correct PoS|correct}

Errors where the only real error is the choice of word, that is, another word would be better or correct. To detect and correct such errors we need, in addition to syntactic analysis, a dictionary component with sufficiently rich syntactic and semantic markup of the entries, as well as syntactic and semantic disambiguation. Detecting and correcting this type of error is probably not possible in the near future, but the need to mark up texts for these errors is real now. In the resulting xml, the element is named <errorlex>.

Formatting errors

TEMPLATE: {wrong}‰{error classification|correct}

Formatting errors include punctuation, hyphens, quotation marks and spacing.

Foreign language errors

TEMPLATE: {wrong}∞{error classification}

Foreign language errors include text in a foreign language and URLs.
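The templates above share one shape: a braced error span, a one-character symbol, and a braced group holding an optional classification before `|` and the correction after it. The following is a minimal Python sketch of reading such flat (non-nested) markup. The names `ELEMENT_BY_SYMBOL`, `FLAT_ERROR` and `find_errors` are hypothetical and not part of the GiellaLT tools. Element names are taken from the descriptions above, and the two symbols whose element name is not stated in this text map to `None`.

```python
import re

# Symbol -> XML element name, as stated in the type descriptions.
# None marks symbols whose element name is not given in this text.
ELEMENT_BY_SYMBOL = {
    "§": "error",
    "$": "errorort",
    "¢": "errorortreal",
    "£": "errormorphsyn",
    "¥": "errorsyn",
    "€": "errorlex",
    "‰": None,
    "∞": None,
}

# Flat markup only: neither brace group may itself contain braces.
FLAT_ERROR = re.compile(
    r"\{(?P<wrong>[^{}]*)\}"
    r"(?P<symbol>[§$¢£¥€‰∞])"
    r"\{(?P<body>[^{}]*)\}"
)


def find_errors(text):
    """Return (element, wrong, classification, correct) for each flat error."""
    found = []
    for match in FLAT_ERROR.finditer(text):
        symbol = match.group("symbol")
        classification, sep, correct = match.group("body").partition("|")
        if not sep and symbol != "∞":
            # No "|": the group holds only a correction, as in {wrong}§{correct}.
            # For "∞" the group holds only a classification, so nothing is swapped.
            classification, correct = "", classification
        found.append(
            (ELEMENT_BY_SYMBOL[symbol], match.group("wrong"), classification, correct)
        )
    return found
```

Calling `find_errors("Mun barggan nu {dábálaš}€{adv,adj,der|dábálaččat}.")` yields one tuple with the element name `errorlex`. Nested markup is deliberately out of scope for this pattern.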

Markup SCOPE

Nesting

All types can be nested, although the details are still undecided and this section will be updated. For now, the following nesting order is allowed: formatting > syntactic > morpho-syntactic > lexical > spelling > syntactic compound.

Parentheses (the curly braces in the markup) are used to identify the range of the error. When nesting error markup, they are required. They are also required when the error is followed by punctuation that is not part of the error or correction: the closing brace makes sure the punctuation stays outside the error correction markup.
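Nested markup only works if every opening brace has a matching close, so a simple balance check can catch many slips before conversion. This is a minimal sketch, not part of the GiellaLT tools, and the function name `check_braces` is hypothetical.

```python
def check_braces(line):
    """Return None if the braces balance, else a short description of the problem."""
    depth = 0
    for position, char in enumerate(line):
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth < 0:
                # A close with no matching open before it.
                return f"unexpected '}}' at position {position}"
    if depth > 0:
        return f"{depth} unclosed '{{'"
    return None
```

The check only counts braces. It does not verify that each braced span is followed by an error symbol and a correction.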

What is a token?

1) Whatever is one token in our lexicon, i.e. usually one word, but in the case of multi word expressions, it can be several words
2) As many tokens/words as need to be changed to correct the error

In the case of “eara beaivi”, only “eara” should be marked:

{eara}${error classification|eará} beaivi

In the case of “earret eara”, the whole expression should be marked, as it is a multi word expression:

{earret eara}${error classification|earret eará}

Alternative corrections

If an error can be corrected in different ways, we order the corrections from more likely to less likely and separate the alternatives by three slashes (///).

The following error can be corrected in two ways: 1) change period into comma 2) leave the period and capitalize the subsequent word:

— Leaibevuona sápmelaččaid váttisvuođaid{{.}‰{punct|,} muhto}///{. {muhto}‰{cap|Muhto}} dat lea sis boastut gáđaštit boazosápmelaččaid {dušse}${adv,typo|dušše} dainna go sii leat veaháš doarjaga ožžon.

Here the same span is corrected in two ways. Make sure to repeat the error type symbol after ///:

ja geas {ii leat mangelágan čanastagat}£{noun,spred,nomsg,nompl,kongr|ii leat mangelágan čanastat}///£{noun,spred,nompl,nomsg,kongr|eai leat mangelágan čanastagat}báikái dahje beroštupmi dan buresbirgejupmái.

not like this:

ja geas {ii leat mangelágan čanastagat}£{noun,spred,nomsg,nompl,kongr|ii leat mangelágan čanastat}///{noun,spred,nompl,nomsg,kongr|eai leat mangelágan čanastagat}báikái dahje beroštupmi dan buresbirgejupmái.
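The difference between the two lines above is easy to miss by eye: the second one lacks the error type symbol after `///`. A minimal sketch of a check for that slip follows. The names `MISSING_SYMBOL` and `alternatives_missing_symbol` are hypothetical, and the pattern assumes that a flat `{classification|correction}` group directly after `///` is always a mistake.

```python
import re

# "///" directly followed by a flat "{classification|correction}" group
# that is not itself followed by an error symbol.
MISSING_SYMBOL = re.compile(r"///\{[^{}|]*\|[^{}]*\}(?![§$¢£¥€‰∞])")


def alternatives_missing_symbol(line):
    """Return every alternative correction that lacks its error type symbol."""
    return [match.group(0) for match in MISSING_SYMBOL.finditer(line)]
```

An alternative that starts a new marked-up span, such as `///{. {muhto}‰{cap|Muhto}}`, is not flagged, because its first brace group contains further braces.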

Markup EXAMPLES

Here are some examples of error/correction markup and how they are converted to xml:

{nourra}${a,meta|nuorra}

<errorort pos="a" errtype="meta" corr="nuorra">nourra</errorort>

{Nieiddat leat nuorra}£{a,spred,nompl,nomsg,agr|Nieiddat leat nuorat}.

<errormorphsyn cat="nompl" const="spred" correct="Nieiddat leat nuorat" errtype="agr" orig="nomsg" pos="a">Nieiddat leat nuorra</errormorphsyn>.

Mun riŋgen {nieidda lusa}¥{x,pph|niidii} ihttin.

Mun riŋgen <errorsyn pos="x" errtype="pph" corr="niidii">nieidda lusa</errorsyn> ihttin.

Son lei {ovtta}¥{num,redun| } viesus.

Son lei <errorsyn pos="num" errtype="redun" corr="">ovtta</errorsyn> viesus.

Mun barggan nu {dábálaš}€{adv,adj,der|dábálaččat}.

Mun barggan nu <errorlex pos="adv" origpos="adj" errtype="der" corr="dábálaččat">dábálaš</errorlex>.
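The examples on this page suggest that the comma-separated classification fields become XML attributes in a fixed order per error type. The sketch below builds one element from an already split, flat error under that assumption. `SCHEMA` and `to_element` are hypothetical names, not the actual conversion script. The examples here use both `corr` and `correct` for the correction attribute, and this sketch arbitrarily uses `corr`.

```python
import xml.etree.ElementTree as ET

# Element name and classification field names per symbol, read off the
# examples on this page. Only symbols that have an XML example are listed.
SCHEMA = {
    "$": ("errorort", ["pos", "errtype"]),
    "£": ("errormorphsyn", ["pos", "const", "cat", "orig", "errtype"]),
    "¥": ("errorsyn", ["pos", "errtype"]),
    "€": ("errorlex", ["pos", "origpos", "errtype"]),
}


def to_element(wrong, symbol, classification, correct):
    """Build the XML element for one flat error (assumed attribute layout)."""
    name, fields = SCHEMA[symbol]
    element = ET.Element(name)
    # Pair each classification value with its attribute name, in order.
    for field, value in zip(fields, classification.split(",")):
        element.set(field, value)
    element.set("corr", correct)  # assumption: "corr" rather than "correct"
    element.text = wrong
    return element
```

`ET.tostring(to_element("dábálaš", "€", "adv,adj,der", "dábálaččat"), encoding="unicode")` then serialises the element to a string.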

Nesting:

{Nieiddat leat {nourra}${adj,meta|nuorra}}£{adj,spred,nompl,nomsg,agr|Nieiddat leat nuorat}.

<errormorphsyn pos="adj" const="spred" cat="nompl" orig="nomsg" errtype="agr" corr="Nieiddat leat nuorat">
Nieiddat leat <errorort pos="adj" errtype="meta" corr="nuorra">nourra</errorort></errormorphsyn>.

Mus leat {guokte {ganddat}§{n,á|gánddat}}£{n,nump,gensg,nompl,case|guokte gándda}.

Mus leat <errormorphsyn cat="gensg" const="nump" correct="guokte gándda" errtype="case" orig="nompl" pos="n">
guokte <error correct="gánddat">ganddat</error></errormorphsyn>.

Mus {leat {okta máná}£{n,spred,nomsg,gensg,case|okta mánná}}£{v,v,sg3prs,pl3prs,agr|lea okta mánná}.

Mus <errormorphsyn cat="sg3prs" const="v" correct="lea okta mánná" errtype="agr" orig="pl3prs" pos="v">
leat <errormorphsyn cat="nomsg" const="spred" correct="okta mánná" errtype="case" orig="gensg" pos="n">
okta máná</errormorphsyn></errormorphsyn>.

Markup CHALLENGES

Markup RULES

The following rules should be followed when marking up texts:

  1. The correction is always done in the original format - never in the xml file! That is, make a copy of the original doc, txt or html file, and name it corr.doc, corr.txt, or corr.html, and add the correction markup in this new file. This will create a “new” original, which is identical to the “real” original, except for the additional correction markup. The “new” original will be converted to xml by the script convert2xml.pl, which is run automatically every night. Corrections done to the converted xml files will be lost upon next conversion.
  2. $ is the spelling correction mark - use it directly after the wrongly spelled word, followed by the correction, as in {error}${correction}. Example: {volvo}${Volvo}. NB! there should be NO space on either side of the correction mark $.
  3. Skip foreign text - we assume that text in other languages is properly detected, or manually marked in the xsl file. That is: DON’T add spelling error markup to passages in Norwegian - instead, try to enforce or add xml markup designating the passage as being in Norwegian. Single words used as part of a Sámi sentence (in situ loans) should NOT be marked either, since we can’t know what the correction should be (and in principle the word isn’t a misspelling if it is correctly spelled Norwegian).
  4. Enclose multiword corrections in parentheses - since the conversion to xml needs a way of knowing where the correction ends, we need to tell it if it is not at the end of the first word after the correction symbol. Example: {Norggabealde}§{Norgga bealde}
  5. Separate punctuation that is not part of the correction with a space, or use parentheses around the correction. Example: “{buolasta}§{buolašta}.” or “{buolasta}§{buolašta} .” (the example text is the text within the quotes, including the punctuation).
  6. Remember the case - the correction should have the same case pattern as the spelling error. Example: {Mannjá}§{Maŋŋá}, NOT {Mannjá}§{maŋŋá} (note the case of the initial letter). The exception is of course when the error is missing capitalisation, as in names spelled lower case, etc.
  7. Always provide a correction! The markup is useless if it isn’t complete.
  8. Both the untouched original and the corrected “original” should be stored in $CORPUSHOME/prooftest/orig/$LANG/$GENRE/. The converted xml file(s) will be found in $CORPUSHOME/prooftest/$CONTRACT/$LANG/$GENRE/. It is important that the untouched original is also stored in the prooftest/ hierarchy, otherwise it can easily be included when making new missing lists, which means that the coverage testing will become misleading without us noticing it.
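Two of the rules above lend themselves to a mechanical check: no space next to the correction mark (rule 2) and matching case in error and correction (rule 6). The following is a minimal sketch with hypothetical names (`lint`, `SPACED_MARK`, `FLAT_PAIR`) and is not part of the GiellaLT tools. The case check is only a hint, since rule 6 allows a difference when the error is the capitalisation itself.

```python
import re

MARK = "[§$¢£¥€‰∞]"
# A correction mark with optional whitespace on either side.
SPACED_MARK = re.compile(r"\}(\s*)" + MARK + r"(\s*)\{")
# A flat error/correction pair with no nested braces.
FLAT_PAIR = re.compile(r"\{([^{}]*)\}" + MARK + r"\{([^{}]*)\}")


def lint(line):
    """Return a list of warnings for rule 2 and rule 6 on one line of markup."""
    warnings = []
    for match in SPACED_MARK.finditer(line):
        if match.group(1) or match.group(2):
            warnings.append("space next to correction mark: " + match.group(0))
    for match in FLAT_PAIR.finditer(line):
        wrong = match.group(1)
        # The correction is whatever follows the last "|", or the whole group.
        correct = match.group(2).rpartition("|")[2]
        if wrong and correct and wrong[0].isupper() != correct[0].isupper():
            warnings.append("initial case differs: " + match.group(0))
    return warnings
```

A line such as `{volvo}${Volvo}` produces a case warning even though it is a legitimate capitalisation fix, so warnings from this sketch need a human decision.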

Error types and their mark-up

Compound error types

MWE written as a compound

these are marked as spelling errors:

{nuppegežiid}${noun,notcmp|nuppe gežiid}

{albmaláhkai}${adv,notcmp|albma láhkai}

{gosaguvlui}${noun,notcmp|gosa guvlui}

{giinu}${indef,notcmp|gii nu}

{Goalmmátoassi}${noun,notcmp|Goalmmát oassi}

this is wrong (it should be marked as a formatting error):

6{.beaivve}${notcmp|. beaivve}

{2.beaivái}${notcmp|2. beaivái}

Case error in the first part of the compound

these are marked as spelling errors:

{stivračoahkkin}${noun,cmp,gensg,nomsg|stivrračoahkkin}

{meahcivaljiservviiguin}${noun,cmp,gensg,nomsg|meahcivalljiservviiguin}

{risko-lágán}${adj,cmp,nomsg,gensg|riskkulágán}

{Soljju-čiŋat}${noun,cmp,gensg,nomsg|Soljočiŋat}

Vowel/consonant error in the first part of the compound

these are marked as spelling errors:

{sámifeasttas}${noun,cmp,svow|sámefeasttas}

{sámiláganat}${noun,cmp,svow|sámeláganat}

{lihkodovdu}${noun,cmp,conc|lihkkodovdu}

{Fylkadikkeáirras}${noun,cmp,mix|Fylkkadiggeáirras}

{árgabeai’eallima}${noun,cmp,notpunkt|árgabeaieallima}

We are not yet sure how to annotate the last one.

Compound written as a MWE

These are marked as syntactic errors, because written as separate words they read as if the words were syntactically related to each other:

{gulahallan olbmožat}¥{noun,cmp|gulahallanolbmožat}

{1600- logu}¥{noun,cmp|1600-logu}

{Gaska Nuortái}¥{prop,cmp|Gaska-Nuortái}

{guovddáš ulbmilin}¥{noun,cmp|guovddášulbmilin}

{80 jahkásačča}¥{adj,cmp|80-jahkásačča}

Here is a nested one (two errors in the same phrase, but with different scopes):

{{blogg}${noun,vow|blogga} čállosa}¥{noun,cmp|bloggačállosa}

Split compounds

These are marked as syntactic errors, because written as separate words they read as if the words were syntactically related to each other:

omd {mánáid}¥{noun,hyph|mánáid-} ja {nuoraiddoaimmaguin}${noun,typo|nuoraiddoaimmaiguin}

not like this:

Ossodagat addet maiddái doarjaga dutkamii, {geahččalan ja ovdánahttinbargui}${noun,punct|geahččalan- ja ovdánahttinbargui}, ja servet riikkaidgaskasaš ovttasbargguide sin fágasurggiineaset.

Summary + new error types

(xml element name after conversion to xml is specified after the symbol used for the actual markup)

If these guidelines are followed, the resulting files should be readily usable for (speller) testing as soon as they are converted to xml.