We want to extend (some of) the corpus files with markup for spelling and other errors, to use them as gold standards for testing our spellers (and in the future other tools as well). The markup is done manually, and needs to follow certain rules.
We differentiate between different types of errors that people make, depending on the type of analysis needed to detect and correct the error. We also use the annotation for errors in learner texts.
TEMPLATE: {wrong}§{correct}
Errors of unknown type. By default such errors will be treated as spelling errors (see below). In the resulting xml, the name of the element will be <error>.
Hm. maahta {son}${pcle,vowc|sån} ahte tjoeverem {{daab}${dem,con|daam} bloggen
{{darjoedh}${verb,vow|darjodh}}}£{noun,x,acksg,gensg,case|daam bloggem darjodh}
{{vytnije}${noun,mix|vætnoe} {bloggine}}§{x,x|vætnoebloggine}.
TEMPLATE: {wrong}${error classification|correct}
Traditional misspellings confined to single (error) strings, that is, errors that don’t need an analysis of the surrounding words to be detected and corrected. In the resulting xml, the element is named <errorort>. These errors always produce non-words in the text, so a speller should be able to detect them.
Detailed mark-up:
{SUV:at}${acr,suf|SUV:t}
TEMPLATE: {wrong}¢{error classification|correct} (almost same as for non-words, see above)
Misspellings confined to single words that still need an analysis of the surrounding words to be detected and corrected. In the resulting xml, the element is named <errorortreal>. These errors, although orthographical in nature, produce other, real words, so a traditional speller is unable to detect them.
TEMPLATE: {wrong form}£{pos,cat,orig-errtype,gf|correct form}
Errors that require an analysis of (parts of) the sentence or surrounding words to be detected and corrected. In the resulting xml, the element is named <errormorphsyn>.
Simple mark-up:
Mun liikon dien {girjji}£{girjái}.
Detailed mark-up:
Mun liikon dien {girjji}£{n,case,acc-ill|girjái}.
Detailed mark-up:
Jus mii nannet {máhttu}£{noun,obj,accsg,nomsg,case|máhtu} ja diđolašvuođa dáid birra dutkamušaid bokte, de lea álgoálbmogiidda álkit ákkastallat áigumušaid ja politihka, muitala Retter.
Detailed mark-up:
Sázo {linjá}£{noun,obj,accsg,nomsg,case|linjjá}
Simple mark-up:
Dat lea {eanemus dábálaš}£{dábálaččamus} váriin ja duoddariin.
Detailed mark-up:
Dat lea {eanemus dábálaš}£{adj,superl,analyt-synt|dábálaččamus} váriin ja duoddariin.
Detailed mark-up:
Mađi váddásut fáddá, dađi {unnit}£{adv,advl,comp,pos,infl|unnibut} gullojit minoritehtajienat.
Detailed mark-up:
{Doalut bistte}£{verb,fin,pl3prs,conneg,infl|Doalut bistet} gitta dii. {16:00}¥{missing|16:00 rádjai}.
Detailed mark-up:
{Mat lea}£{verb,fin,sg3prs,pl3prs,kongr|Mat leat} {tnjealját}${adj,typo|njealját} čiega koordináhtat?
{Illá jáhkken dat lei duohta.}£{Illá jáhkken ahte dat lei duohta.}
SHOULD BE a syntactic error instead:
Illá {jáhkken}¥{missing|jáhkken ahte} dat lei duohta.
TEMPLATE: {redundantword}¥{pos,redun|} OR {word}¥{pos, missing|word missingword} OR word order errors {word1 word2}¥{pos_word1,wo|word2 word1} OR wrong clause type
These errors also require a partial or full analysis of (parts of) the sentence or surrounding words to be detected and corrected. In the resulting xml, the element is named <errorsyn>.
Examples:
gitta dii. {16:00}¥{missing|16:00 rádjai}.
If a subjunction that depends on a verb is missing, the verb should be marked:
Illá {jáhkken}¥{missing|jáhkken ahte} dat lei duohta.
TEMPLATE: {wrong}€{wrong PoS,correct PoS|correct}
Errors where the real error lies only in the choice of word, that is, another word would be better or correct. To detect and correct such errors we need, in addition to syntactic analysis, a dictionary component with sufficiently rich syntactic and semantic markup of the entries, as well as syntactic and semantic disambiguation. Automatic detection and correction of this type of error is probably not possible in the near future, but the need to mark up texts for these errors is real now. In the resulting xml, the element is named <errorlex>.
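The symbols and element names stated so far can be collected in a small lookup table. The following is a minimal Python sketch, not part of the GiellaLT tools: the names `ELEMENT_BY_SYMBOL`, `FLAT_ANNOTATION` and `find_flat_annotations` are made up for illustration, and the pattern only handles flat (non-nested) annotations.

```python
import re

# Markup symbol -> xml element name, as stated in the descriptions above.
ELEMENT_BY_SYMBOL = {
    "§": "error",
    "$": "errorort",
    "¢": "errorortreal",
    "£": "errormorphsyn",
    "¥": "errorsyn",
    "€": "errorlex",
}

# One flat annotation: {wrong}SYMBOL{classification|correct},
# or the simple form {wrong}SYMBOL{correct} without a classification.
FLAT_ANNOTATION = re.compile(
    r"\{(?P<wrong>[^{}]*)\}"
    r"(?P<symbol>[§$¢£¥€])"
    r"\{(?:(?P<classification>[^{}|]*)\|)?(?P<correct>[^{}]*)\}"
)


def find_flat_annotations(text):
    """Return (element, wrong, classification, correct) for each flat annotation.

    classification is None when the simple mark-up form is used.
    """
    results = []
    for match in FLAT_ANNOTATION.finditer(text):
        results.append((
            ELEMENT_BY_SYMBOL[match.group("symbol")],
            match.group("wrong"),
            match.group("classification"),
            match.group("correct"),
        ))
    return results
```

Calling `find_flat_annotations("Mun liikon dien {girjji}£{n,case,acc-ill|girjái}.")` yields one `errormorphsyn` entry with the classification string `n,case,acc-ill`. Nested annotations need a real parser and are outside the scope of this sketch.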
TEMPLATE: {wrong}‰{error classification|correct}
Formatting errors include punctuation, hyphens, citation marks and spacing.
Annotation: Attributes:
errtype { space | notspace | hyph | nothyph | cit | punct
| notpunct | }
Some explanations:
space = there should be a space
notspace = there should not be any space
singlespace = there should only be a single space
hyph = hyphenation is missing
nothyph = hyphenation should not be used
cit = citation
punct = punctuation
notpunct = there should not be punctuation
Space before a comma, period, exclamation mark or question mark:
mark previous word and following word or token:
Jus háliidehpet rievdadit evttohuvvon áiggi dieđihehket midjiide ovdal disdaga {čakčamánu 29.b.}‰{space|čakčamánu 29. b.}
{odne ,}‰{notspace|odne,} ihttin.
not like this:
odne{ ,}‰{notspace|,} ihttin.
too many spaces, where there should be a single space:
{1. Skovvi}‰{singlespace|1. Skovvi}
{1. Skovvi}‰{singlespace|1. Skovvi}
no space between ranges:
{2009- 2010}‰{notspace|2009-2010}
no space before and after brackets (single errors and one word enclosed in brackets with two space errors):
{( Musea )}‰{(Musea)}
{( Museas}‰{(Museas} leat ollu {mánát )}‰{mánát)}
citation mark errors (single errors and one word enclosed in two erroneous citation marks):
Nordkapp Sámiid Searvvi ovdaolmmoš: {«{Mearrasameakšuvdna}${noun,á|Mearrasámeakšuvdna}»}‰{cit|”Mearrasámeakšuvdna”} ii leat ávkin sámiide
Olmmáivákkis, Gáivuonas, lea visti mii lea ožžon nama {"Biru"}‰{cit|”Biru”} viessun.
Olmmáivákkis, Gáivuonas, lea visti mii lea ožžon nama {"}‰{cit|”}Biru baika{"}‰{cit|”} viessun.
punctuation mark errors:
— Leaibevuona sápmelaččaid váttisvuođaid{.}‰{punct|,} muhto dat lea sis boastut gáđaštit boazosápmelaččaid {dušse}${adv,typo|dušše} dainna go sii leat veaháš doarjaga ožžon.
Su mielas váttisvuođaid {buvttalii}${verb,á|buvttálii} ee. gieldda mearehis stuora viidodaga lassin maid sámegielat bálvalusaid {ollašuhttin}‰{punct|ollašuhttin.}
No comma:
{Maŋŋel doaluid, fertebehtet}‰{notcomma|Maŋŋel doaluid fertebehtet} {rehkenastit dietnasa ja čállet dan unna girjjážii}£{verb,infin,infinite,pl3prs,number|rehkenastit dietnasa ja čállit dan unna girjjážii} mii lea ruhta-kássas.
NOT a formatting error:
{1980 logus}‰{hyph|1980-logus}
SHOULD BE:
{1980 logus}¥{noun,cmp|1980-logus}
gullat eambbo {aht’}${cc,svow|ahte}
TEMPLATE: {wrong}∞{error classification}
This category covers text in a foreign language and URLs.
Annotation: Attributes:
errtype { url | }
Some explanations:
url = this is a URL
URL format:
mark the URL and tag it as a URL:
Prošeavttas gávdno lassidiehtu mielčuovvu čujuhusas: {http://www.arcticgovernance.org}∞{url}/ Prošeaktajođiheaddji, dr Robert Corell muitalii prošeavtta duogážis ja mearkkašumis.
All types can be nested. The order is still somewhat undecided, and this section will be updated. For now, the following nesting is allowed:
formatting > syntactic > morpho-syntactic > lexical > spelling > syntactic compound
Parentheses are used to identify the range of the error. When nesting error markup, parentheses are required. Parentheses are also required when the error is followed by punctuation that is not part of the error or correction - the parenthesis will make sure the punctuation stays outside the error correction markup.
Examples:
Here is a nested spelling error and a syntactic compound error:
{njuolggo {linjás}${noun,conc|linjjás}}¥{noun,cmp|njuolggolinjjás}
Here are two morpho-syntactic errors with the same scope:
{{Sis geas lea ovddasvástádus}£{pers,subj,nompl,locpl,case|Sii geas lea ovddasvástádus}}£{rel,hab,nompl,nomsg,kongr|Sii geain lea ovddasvástádus} lágidit kaféa bohtet dii. 12.00 ja {kaféa {rahppasa}¢{verb,conc|rahpasa}}€{der|kaféa rahppojuvvo} dii. 13.00.
Two types of spelling errors and a lexical error:
dat maid dovddan ii leat diet ráhkisvuođa dovdu maid {{{áittoráhkistan}${vowc,á-a|aittoráhkistan}}${verb,notcmp|aitto ráhkistan}}€{verb,trans|aitto ráhkásmuvvan} olmmoš {dovda}${verb,á|dovdá}
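Because nested markup relies on correctly paired delimiters, a quick balance check can catch mistakes before conversion. The sketch below is an illustration in Python, not an existing GiellaLT tool, and `check_braces` is a made-up name. It only counts the curly braces used in the examples above; it does not validate the error symbols or classifications.

```python
def check_braces(line):
    """Return a list of problems; an empty list means the braces are balanced."""
    problems = []
    depth = 0
    for pos, ch in enumerate(line):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                # A closing brace with no matching opening brace.
                problems.append(f"unexpected '}}' at position {pos}")
                depth = 0
    if depth > 0:
        problems.append(f"{depth} unclosed '{{'")
    return problems
```

An empty result for a line such as `{njuolggo {linjás}${noun,conc|linjjás}}¥{noun,cmp|njuolggolinjjás}` means every opening brace has a matching closing brace.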
How to check the hierarchy of nesting:
run the following commands in the terminal:
$> echo "njuolggo linjás"| divvun-checker -a $GTLANGS/langs-sme/tools/grammarcheckers/se.zcheck
{"errs":[["njuolggo linjás",0,17,"double-space-before","Leat guokte gaskka ovdal \"linjás\"",["njuolggo linjás"],"Sátnegaskameattáhus"],["njuolggo linjás",0,17,"typo","Ii leat sátnelisttus",["njuolggo linjás","linjjás"],"Čállinmeattáhus"]],"text":"njuolggo linjás"}
$> echo "njuolggo linjás"| divvun-checker -a $GTLANGS/langs-sme/tools/grammarcheckers/se.zcheck
{"errs":[["linjás",9,15,"typo","Ii leat sátnelisttus",["linjjás"],"Čállinmeattáhus"]],"text":"njuolggo linjás"}
$> echo "njuolggo linjjás"| divvun-checker -a $GTLANGS/langs-sme/tools/grammarcheckers/se.zcheck
{"errs":[["njuolggo linjjás",0,16,"msyn-compound","\"njuolggo linjjás\" orru leamen goallossátni",["njuolggolinjjás"],"Goallosteapmi"]],"text":"njuolggo linjjás"}
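The checker output above is JSON, so it can be read programmatically when comparing several runs. The following Python sketch parses one output line copied from the session above. The field meanings (form, start, end, error tag, message, suggestions, title) are inferred from that output and are an assumption, and `summarize` is a made-up helper name.

```python
import json

# One output line copied from the terminal session above.
output = (
    '{"errs":[["linjás",9,15,"typo","Ii leat sátnelisttus",'
    '["linjjás"],"Čállinmeattáhus"]],"text":"njuolggo linjás"}'
)


def summarize(checker_line):
    """Return (form, start, end, error_tag, suggestions) for every reported error.

    Assumes each entry in "errs" has the seven fields seen in the output above.
    """
    report = json.loads(checker_line)
    summary = []
    for form, start, end, tag, _message, suggestions, _title in report["errs"]:
        summary.append((form, start, end, tag, suggestions))
    return summary
```

Comparing the error tags and spans from successive runs (here `typo` on the inner word first, then `msyn-compound` on the whole phrase once the typo is fixed) shows which error is reported for the narrower scope and which for the wider one.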
The order of nesting is the following:
Other types:
Here is a nested morpho-syntactic error, a lexical error and a word order error (syntactic):
{{vuordedahtte {sjaddá}£{ind-pot|sjattasj}}€{w|dávk sjattasj}}¥{wo|sjattasj dávk}
1) Whatever is one token in our lexicon, i.e. usually one word, but in the case of multi-word expressions it can be several words.
2) As many tokens/words as need to be changed to correct the error.
In the case of “eara beaivi”, only “eara” should be marked
{eara}${error classification|eará} beaivi
In the case of “earret eara”, “earret eara” should be marked as it is a multi word expression
{earret eara}${error classification|earret eará}
If an error can be corrected in different ways, we order the corrections from more likely to less likely and separate the alternatives by three slashes (///).
The following error can be corrected in two ways: 1) change period into comma 2) leave the period and capitalize the subsequent word:
— Leaibevuona sápmelaččaid váttisvuođaid{{.}‰{punct|,} muhto}///{. {muhto}‰{cap|Muhto}} dat lea sis boastut gáđaštit boazosápmelaččaid {dušse}${adv,typo|dušše} dainna go sii leat veaháš doarjaga ožžon.
Here the same span is corrected in two ways; make sure to repeat the error-type symbol (here £) after ///:
ja geas {ii leat mangelágan čanastagat}£{noun,spred,nomsg,nompl,kongr|ii leat mangelágan čanastat}///£{noun,spred,nompl,nomsg,kongr|eai leat mangelágan čanastagat}báikái dahje beroštupmi dan buresbirgejupmái.
not like this:
ja geas {ii leat mangelágan čanastagat}£{noun,spred,nomsg,nompl,kongr|ii leat mangelágan čanastat}///{noun,spred,nompl,nomsg,kongr|eai leat mangelágan čanastagat}báikái dahje beroštupmi dan buresbirgejupmái.
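The rule that every alternative carries its own error-type symbol can be checked mechanically. The following Python sketch covers only the case where the alternatives share one marked span, i.e. the part after `{wrong}` has the shape `SYMBOL{...}///SYMBOL{...}`. The names `CORRECTION` and `split_alternatives` are illustrative and not part of any GiellaLT tool.

```python
import re

# One alternative: an error-type symbol followed by {classification|correct}.
CORRECTION = re.compile(r"([§$¢£¥€‰])\{([^{}]*)\}")


def split_alternatives(correction_part):
    """Split 'SYM{...}///SYM{...}' into (symbol, classification, correct) tuples.

    Raises ValueError when an alternative is missing its error-type symbol.
    """
    alternatives = []
    for chunk in correction_part.split("///"):
        match = CORRECTION.fullmatch(chunk)
        if match is None:
            raise ValueError(f"alternative lacks an error-type symbol: {chunk!r}")
        symbol, body = match.groups()
        # Everything before the last '|' is the classification (may be empty).
        classification, _, correct = body.rpartition("|")
        alternatives.append((symbol, classification, correct))
    return alternatives
```

With this sketch, the correct form above yields two alternatives, while the "not like this" form raises a `ValueError` because the second alternative starts with `{` instead of `£`.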
Here are some examples of error/correction markup and how they are converted to xml:
{nourra}${a,meta|nuorra}
<errorort pos="n" errtype="meta" corr="nuorra">nourra</errorort>
{Nieiddat leat nuorra}£{a,spred,nompl,nomsg,agr|Nieiddat leat nuorat}.
<errormorphsyn cat="nompl" const="spred" correct="Nieiddat leat nuorat" errtype="agr" orig="nomsg" pos="adj">Nieiddat leat \
<errorort correct="nuorra" errtype="meta" pos="adj">nourra</errorort></errormorphsyn>.
Mun riŋgen {nieidda lusa}¥{x,pph|niidii} ihttin.
Mun <errorsyn pos="x" errtype="pph" corr="riŋgen niidii">riŋgen nieidda lusa</errorsyn> ihttin.
Son lei {ovtta}¥{num,redun| } viesus.
Son lei <errorsyn pos="num" errtype="redun" corr="">ovtta</errorsyn> viesus.
Mun barggan nu {dábálaš}€{adv,adj,der|dábálaččat}.
Mun barggan nu <errorlex pos="adv" origpos="adj" errtype="der" corr="dábálaččat">dábálaš</errorlex>.
Nesting:
{Nieiddat leat {nourra}${adj,meta|nuorra}}£{adj,spred,nompl,nomsg,agr|Nieiddat leat nuorat}.
<errormorphsyn pos="adj" const="spred" cat="nompl" orig="nomsg" errtype="agr" corr="Nieiddat leat nuorat">
Nieiddat leat <errorort pos="adj" errtype="meta" corr="nuorra">nourra</errorort></errormorphsyn>.
Mus leat {guokte {ganddat}§{n,á|gánddat}}£{n,nump,gensg,nompl,case|guokte gándda}.
Mus leat <errormorphsyn cat="gensg" const="nump" correct="guokte gándda" errtype="case" orig="nompl" pos="n">
guokte <error correct="gánddat">ganddat</error></errormorphsyn>.
Mus {leat {okta máná}£{n,spred,nomsg,gensg,case|okta mánná}}£{v,v,sg3prs,pl3prs,agr|lea okta mánná}.
Mus <errormorphsyn cat="sg3prs" const="v" correct="lea okta mánná" errtype="agr" orig="pl3prs" pos="v">
leat <errormorphsyn cat="nomsg" const="spred" correct="okta mánná" errtype="case" orig="gensg" pos="n">
okta máná</errormorphsyn></errormorphsyn>.
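To make the mapping from classification fields to xml attributes concrete, here is a minimal Python sketch covering only the two flat cases shown above, `errorsyn` and `errorlex`. It is not the actual conversion script. The attribute lists are read off those two examples, `ATTRIBUTES` and `to_xml` are made-up names, and the sketch always writes `corr`, although some examples above use `correct` instead.

```python
import xml.etree.ElementTree as ET

# Attribute names per element, in the order of the classification fields.
# Read off the errorsyn and errorlex examples above; other elements are
# deliberately left out of this sketch.
ATTRIBUTES = {
    "errorsyn": ["pos", "errtype"],
    "errorlex": ["pos", "origpos", "errtype"],
}


def to_xml(element, wrong, classification, correct):
    """Build the xml string for one flat annotation."""
    node = ET.Element(element)
    for name, value in zip(ATTRIBUTES[element], classification.split(",")):
        node.set(name, value)
    # A correction consisting only of whitespace becomes an empty string.
    node.set("corr", correct.strip())
    node.text = wrong
    return ET.tostring(node, encoding="unicode")
```

For example, `to_xml("errorlex", "dábálaš", "adv,adj,der", "dábálaččat")` produces an `errorlex` element carrying the attributes `pos`, `origpos`, `errtype` and `corr`, matching the example above.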
How should this be marked up? As an orthographic error (non-word or real word)? As a formatting error (missing space)? “oahppoja” is analyzed as a dynamic compound of “oahppu” and “idja”, so we get an analysis:
Guovdageaidnu lea guovddáš sámi {oahppoja}${x,cmp|oahppo- ja} dutkanbáiki.
We decided on the following markup:
{{oahppoja}${typo,space|oahppo ja} dutkanbáiki}¥{cmp,hyph|oahppo- ja dutkanbáiki}
Should this be a formatting error, because of the space and the hyphen or should this be a syntactic error because sámiid should be a split compound?:
ahte sii gozihit {sámiid - ja eamiálbmogiid beliid}‰{notspace| sámiid- ja eamiálbmogiid beliid}
We decided on the following way
Spelling error or something else?:
{ge-}${space|ge -} dávjá čuhcet sidjiide, dadjá sámedikki {politihkakálaš}${adj,typo|politihkalaš} ráđđeaddi Johan Vasara.
Syntactic error or formatting error?:
{Jus lea mii nu mii váilu kássas- de fertebehtet čálistit unna {girjážii}${noun,conc|girjjážii}}¥{noun,nothyph|Jus lea mii nu mii váilu kássas, de fertebehtet čálistit unna girjjážii} mii lea ruhta-kássas.
The following rules should be followed when marking up texts:
Make a copy of the original doc, txt or html file, name it corr.doc, corr.txt, or corr.html, and add the correction markup in this new file. This creates a “new” original, which is identical to the “real” original except for the additional correction markup. The “new” original is converted to xml by the script convert2xml.pl, which is run automatically every night. Corrections made to the converted xml files will be lost upon the next conversion.
Store the files in $CORPUSHOME/prooftest/orig/$LANG/$GENRE/. The converted xml file(s) will be found in $CORPUSHOME/prooftest/$CONTRACT/$LANG/$GENRE/. It is important that the untouched original is also stored in the prooftest/ hierarchy; otherwise it can easily be included when making new missing lists, which means that the coverage testing will become misleading without us noticing it.
These are marked as spelling errors:
{nuppegežiid}${noun,notcmp|nuppe gežiid}
{albmaláhkai}${adv,notcmp|albma láhkai}
{gosaguvlui}${noun,notcmp|gosa guvlui}
{giinu}${indef,notcmp|gii nu}
{Goalmmátoassi}${noun,notcmp|Goalmmát oassi}
this is wrong (it should be marked as a formatting error):
6{.beaivve}${notcmp|. beaivve}
{2.beaivái}${notcmp|2. beaivái}
these are marked as spelling errors:
{stivračoahkkin}${noun,cmp,gensg,nomsg|stivrračoahkkin}
{meahcivaljiservviiguin}${noun,cmp,gensg,nomsg|meahcivalljiservviiguin}
{risko-lágán}${adj,cmp,nomsg,gensg|riskkulágán}
{giinu}${indef,notcmp|gii nu}
{Soljju-čiŋat}${noun,cmp,gensg,nomsg|Soljočiŋat}
these are marked as spelling errors:
{sámifeasttas}${noun,cmp,svow|sámefeasttas}
{sámiláganat}${noun,cmp,svow|sámeláganat}
{lihkodovdu}${noun,cmp,conc|lihkkodovdu}
{Fylkadikkeáirras}${noun,cmp,mix|Fylkkadiggeáirras}
{árgabeai’eallima}${noun,cmp,notpunkt|árgabeaieallima}
We are not yet sure how to annotate the last one.
these are marked as syntactic errors as the alternative is that the words are syntactically related to each other:
{gulahallan olbmožat}¥{noun,cmp|gulahallanolbmožat}
{1600- logu}¥{noun,cmp|1600-logu}
{Gaska Nuortái}¥{prop,cmp|Gaska-Nuortái}
{guovddáš ulbmilin}¥{noun,cmp|guovddášulbmilin}
{80 jahkásačča}¥{adj,cmp|80-jahkásačča}
here is a nested one (two errors in the same phrase, but with a different scope)
{{blogg}${noun,vow|blogga} čállosa}¥{noun,cmp|bloggačállosa}
these are marked as syntactic errors as the alternative is that the words are syntactically related to each other:
omd {mánáid}¥{noun,hyph|mánáid-} ja {nuoraiddoaimmaguin}${noun,typo|nuoraiddoaimmaiguin}
not like this:
Ossodagat addet maiddái doarjaga dutkamii, {geahččalan ja ovdánahttinbargui}${noun,punct|geahččalan- ja ovdánahttinbargui}, ja servet riikkaidgaskasaš ovttasbargguide sin fágasurggiineaset.
(xml element name after conversion to xml is specified after the symbol used for the actual markup)
By following these guidelines, the resulting files should be readily usable for (speller) testing as soon as they are converted to xml.