Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-sma
Error tag | Explanation |
---|---|
+Err/Orth | Substandard, non-normative form of a word |
+Err/Hyph | Substandard, non-normative |
+Err/SpaceCmp | Substandard, non-normative |
+Err/Attr | Substandard, non-normative Attr form of a word |
+Err/Lex | The lemma and its word forms are outside the norm. There is no normative lemma; the form is grammatically correct. |
+Err/Der | Errors in derivations |
+Err/Spellrelax | Used to tag spellrelaxed typos (tag is inserted via flag diacritics) |
+Err/MissingSpace | in use in smi lexc |
Usage tag | Explanation |
---|---|
+Use/Marg | Marginal: correct, existing forms that are rare. These words can be removed from e.g. the speller, because they are so rare and so little used that they can be confused with closely related lemmas. |
+Use/-Spell | Excluded from speller |
+Use/-PLX | Excluded in PLX speller |
+Use/SpellNoSugg | Recognized but not suggested in speller |
+Use/Circ | Circular path |
+Use/CircN | Circular number path? |
+Use/Ped | Remove from pedagogical speller |
+Use/NG | Do not generate for isme-ped.fst and apertium |
+Use/MT | Generate for apertium only |
+Use/NotDNorm | For (spellings of) words that do not follow the orthographic principles of sma. Divvun suggest that this shouldn’t be normative, even though they are decided upon by GG. Included in speller. |
+Use/DNorm | For words without formal normalization. Divvun suggest that this should be normative. Included in speller. Based on 2010 normative decision & Ove Lorentz’ suggestions for the norm. |
+Use/PMatch | Include only in FSTs made for hfst-pmatch |
+Use/-PMatch | Do not include in FSTs made for hfst-pmatch |
+Use/GC | only retained in the HFST Grammar Checker disambiguation analyser |
+Use/-GC | never retained in the HFST Grammar Checker disambiguation analyser |
+Use/TTS | only retained in the HFST Text-To-Speech disambiguation tokeniser |
+Use/-TTS | never retained in the HFST Text-To-Speech disambiguation tokeniser |
Dialect tag | Explanation |
---|---|
+Dial/-S | Not in the South |
+Dial/S | Only in the South |
+Dial/-N | Not in the North |
+Dial/N | Only in the North |
+Dial/-NOR | Words not in Norway |
+Dial/NOR | Words only in Norway |
+Dial/-SW | Words not in Sweden |
+Dial/SW | Words only in Sweden |
+Dial/SH | Short forms |
+Dial/L | Long forms |
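As a rough illustration of how the area-related dialect tags in the table above could be read by a filtering script, the sketch below checks whether a form is usable in a given area. The function name `allowed_in`, the representation of tags as a Python list, and the assumption that S/N and NOR/SW are two independent axes are all choices made for this example, not something the source specifies.

```python
# Assumption: South/North and Norway/Sweden are treated as independent axes.
AXES = [{"S", "N"}, {"NOR", "SW"}]


def allowed_in(tags: list[str], area: str) -> bool:
    """Return True if a form carrying `tags` may be used in `area`.

    `area` is one of the area codes used in the tags: "S", "N", "NOR", "SW".
    Tags such as +Dial/SH and +Dial/L (short/long forms) are ignored here.
    """
    axis = next((a for a in AXES if area in a), None)
    if axis is None:
        raise ValueError(f"unknown area: {area}")
    for tag in tags:
        if not tag.startswith("+Dial/"):
            continue
        value = tag[len("+Dial/"):]
        if value == "-" + area:
            return False  # explicitly excluded from this area
        if value in axis and value != area:
            return False  # restricted to the other area on the same axis
    return True
```

With this reading, a form tagged `+Dial/S` is rejected for `"N"` but accepted for `"NOR"`, since the tag says nothing about the Norway/Sweden axis.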
Normative compounding tags govern compound behaviour for the speller, i.e. what a compound SHOULD BE.
The default is `+CmpN/SgN`, so when nothing is specified, that will be used. To override it, specify one or more of the following tags. `+CmpN/SgN` must be specified explicitly if other tags are also listed, unless `+CmpN/SgN` should not be used, of course.
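The defaulting rule can be sketched in a few lines. This is only an illustration of the rule as described; the function name `effective_cmpn_tags` and the list-of-strings representation of an entry's tags are assumptions made for the example.

```python
# The four regular normative compounding tags documented in this section.
REGULAR_CMPN = {"+CmpN/Sg", "+CmpN/SgN", "+CmpN/SgG", "+CmpN/PlG"}
DEFAULT_CMPN = "+CmpN/SgN"


def effective_cmpn_tags(entry_tags: list[str]) -> list[str]:
    """Return the normative compounding tags that apply to an entry.

    If the entry lists none of the regular +CmpN tags, the default
    (+CmpN/SgN) applies. As soon as any tag is listed, only the listed
    tags apply, so the default must be repeated explicitly to keep it.
    """
    explicit = [t for t in entry_tags if t in REGULAR_CMPN]
    return explicit if explicit else [DEFAULT_CMPN]
```

For example, an entry tagged only `+CmpN/SgG` no longer allows the nominative singular, whereas an entry with both `+CmpN/SgN` and `+CmpN/SgG` allows both.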
Normative compounding tag | Explanation |
---|---|
+CmpN/Sg | Singular |
+CmpN/SgN | Singular Nominative |
+CmpN/SgG | Singular Genitive |
+CmpN/PlG | Plural Genitive |
These tags overrule the regular tags defined above. One or more can be specified.
Normative left-governing tag | Explanation |
---|---|
+CmpN/SgLeft | Sg to the left |
+CmpN/SgNomLeft | Sg Nom to the left |
+CmpN/SgGenLeft | Sg Gen to the left |
+CmpN/PlGenLeft | Pl Gen to the left |
Normative position tag | Explanation |
---|---|
+CmpNP/All | … be in all positions, default, this tag does not have to be written |
+CmpNP/First | … only be first part in a compound or alone |
+CmpNP/Pref | … only be first part in a compound, NEVER alone |
+CmpNP/Last | … only be last part in a compound or alone |
+CmpNP/Suff | … only be last part in a compound, NEVER alone |
+CmpNP/None | … not take part in compounds |
+CmpNP/Only | … only be part of a compound, i.e. can never be used alone, but can appear in any position |
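The position tags in the table above can be summarised as a lookup from tag to permitted positions. The following is a minimal sketch under that reading; the position names (`"first"`, `"middle"`, `"last"`, `"alone"`) and the function `may_occupy` are invented for the example, and the treatment of the middle position is inferred from the wording of the table.

```python
# Permitted positions per tag, read off the table above.
POSITIONS = {
    "+CmpNP/All":   {"first", "middle", "last", "alone"},
    "+CmpNP/First": {"first", "alone"},
    "+CmpNP/Pref":  {"first"},
    "+CmpNP/Last":  {"last", "alone"},
    "+CmpNP/Suff":  {"last"},
    "+CmpNP/None":  {"alone"},
    "+CmpNP/Only":  {"first", "middle", "last"},
}


def may_occupy(tags: list[str], position: str) -> bool:
    """Return True if a word with `tags` may appear in `position`.

    Words without any +CmpNP tag behave as +CmpNP/All, the default.
    """
    if position not in POSITIONS["+CmpNP/All"]:
        raise ValueError(f"unknown position: {position}")
    tag = next((t for t in tags if t in POSITIONS), "+CmpNP/All")
    return position in POSITIONS[tag]
```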
Tags for compound analysis - this is what a compound actually is. We use this to research compounding patterns in the corpus.
Descriptive compounding tag | Explanation |
---|---|
+Cmp/Sg | Compounding using an unspecified singular stem |
+Cmp/SgNom | Compounding using nominative singular |
+Cmp/SgGen | Compounding using genitive singular |
+Cmp/PlGen | Compounding using genitive plural |
+Cmp/Attr | Compounding using attribute form |
+Cmp/eh | Compound stem in –eh, as in gaameh-gåaroje, from gaamege |
+Cmp/ege | Compound stem in –ege, as in gaamege-gåaroje |
+Cmp/FinEDel | Deletion of final e, as in voelem-gaaroeh, from voeleme |
+Cmp/ShH | Compounding using a short stem + h: –biejjh– (from biejjie), cf reakedsbiejjhvadtese |
+Cmp/Sh | Compounding using a short stem: –biejj– (from biejjie) |
+Cmp/SplitR | This is a split compound with the other part to the right: “Arbeids- og inkluderingsdepartementet” => Arbeids– = +Cmp/SplitR |
+Cmp/SplitL | This is a split compound with the other part to the left; this is the opposite of the previous case |
+Cmp | Dynamic compound - this tag should always be part of a dynamic compound. It is important for Apertium and the speller (to give extra weights to compounds), and useful in other cases as well. |
+Cmp/XForm | All Cmp that do not have a clear classification |
+Cmp/AttrH | All Cmp that have an attr-h |
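Since the descriptive tags exist to study compounding patterns in a corpus, a small counting script is a natural use. The sketch below tallies `+Cmp` and `+Cmp/...` tags in analysis strings. The analysis strings in the usage note are made up to show the general shape (tags joined by `+`, compound parts joined by `#`); real analyser output may order or spell things differently.

```python
import re
from collections import Counter

# One tag = "+" followed by everything up to the next "+", "#" or whitespace.
TAG = re.compile(r"\+[^+#\s]+")


def count_cmp_tags(analyses: list[str]) -> Counter:
    """Count descriptive compounding tags (+Cmp and +Cmp/...) in analyses.

    Normative tags such as +CmpN/SgN or +CmpNP/Only are not counted.
    """
    counts: Counter = Counter()
    for analysis in analyses:
        for tag in TAG.findall(analysis):
            if tag == "+Cmp" or tag.startswith("+Cmp/"):
                counts[tag] += 1
    return counts
```

Called on the two hypothetical analyses `"gaamege+N+Cmp/ege+Cmp#gåaroje+N+Sg+Nom"` and `"gaamege+N+Cmp/eh+Cmp#gåaroje+N+Sg+Nom"`, this counts `+Cmp` twice and `+Cmp/ege` and `+Cmp/eh` once each.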
+Du = Dual
Tense tag | Explanation |
---|---|
+Prs | Present |
+Prt | Preterite |
Person & Number tag | Explanation |
---|---|
+Sg1 | Singular, 1st person |
+Sg2 | Singular, 2nd person |
+Sg3 | Singular, 3rd person |
+Du1 | Dual, 1st person |
+Du2 | Dual, 2nd person |
+Du3 | Dual, 3rd person |
+Pl1 | Plural, 1st person |
+Pl2 | Plural, 2nd person |
+Pl3 | Plural, 3rd person |
Verbal tag | Explanation |
---|---|
+Neg | negation verb ij |
+ConNeg | main verb complement to Neg, form identical to Imp |
+VAbess | Verb Abessive |
+Inf | Infinitive |
+PrfPrc | Perfect participle |
+PrsPrc | Present participle |
+Ger | Gerund |
+VGen | Verb genitive |
+Ind | Indicative |
+Imprt | Imperative |
+ImprtII | Imperative, for Neg: ollem ollh … |
+Cond | Conditional, for one form: lidtjie. To be looked at. + lidtjim, + lidtjih |
+Act | -eme, could be changed to +Actio |
Semantic tags help disambiguation and syntactic analysis. All tags used are defined and listed below.
Multiple semantic tags are written as one tag, with the different semantic values separated by an underscore `_`.
All used combinations must be declared below, and the list must be manually maintained. The tags are ordered alphabetically, both the list and the semantic values within one tag.
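The naming rule can be made concrete with a small helper that builds a combined tag from individual values. This is a sketch only: the `+Sem/` prefix and the values `Ani` and `Hum` are assumptions used for illustration, since the table below does not list them, and the helper does not replace the manually maintained declarations.

```python
def combine_sem(values: list[str], prefix: str = "+Sem/") -> str:
    """Build one combined semantic tag from several semantic values.

    Values are de-duplicated, sorted, and joined with an underscore.
    Note that Python sorts by code point, so uppercase letters sort
    before lowercase ones.
    """
    unique = sorted(set(values))
    if not unique:
        raise ValueError("at least one semantic value is required")
    return prefix + "_".join(unique)


def is_declared(tag: str, declared: set[str]) -> bool:
    """Check a combined tag against the set of declared tags."""
    return tag in declared
```

For instance, `combine_sem(["Hum", "Ani"])` yields `+Sem/Ani_Hum`, which would then still have to appear in the declared list to be usable.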
Tag | Explanation |
---|---|
+MWE | multi word expressions, goes to abbr |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
Flag | Explanation |
---|---|
@P.Px.add@ | Giving possibility for Px-suffixes (all except from Nom 3.p) |
@R.Px.add@ | Requiring P.Px.add-flag for Px-suffixes (all except from Nom 3.p) |
@P.Nom3Px.add@ | Giving possibility for Px-suffixes Nom 3.p |
@R.Nom3Px.add@ | Requiring P.Nom3Px.add flag for Px-suffixes Nom 3.p |
@P.Pmatch.Backtrack@ | Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed) |
@D.ErrOrth.ON@ | asdf |
@C.ErrOrth@ | asdf |
@P.ErrOrth.ON@ | asdf |
Derivations in the same position are mutually exclusive (can not be combined), whereas tags in different positions can be combined, so that position 1 derivations must precede position 2 derivations, and so on.
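The ordering constraint amounts to requiring that the position tags in an analysis are strictly increasing. A minimal sketch of such a check follows; the function names and the sample analysis in the usage note are hypothetical, and the sketch says nothing about which derivation belongs to which position.

```python
import re

# Matches the position tags +Der1, +Der2, ... but not +Der/xxx tags.
DER_POSITION = re.compile(r"\+Der(\d+)")


def derivation_positions(analysis: str) -> list[int]:
    """Extract the derivation positions in the order they occur."""
    return [int(n) for n in DER_POSITION.findall(analysis)]


def positions_are_ordered(positions: list[int]) -> bool:
    """True if positions strictly increase.

    Strict increase covers both rules: no position is used twice
    (mutual exclusivity) and lower positions precede higher ones.
    """
    return all(a < b for a, b in zip(positions, positions[1:]))
```

A made-up analysis such as `"x+V+Der1+Der/ht+Der2+Der/NomAct+N+Sg+Nom"` gives the positions `[1, 2]`, which pass the check, while `[2, 1]` or `[1, 1]` would not.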
Pos1 | Pos2 | Pos3 | POS switches (from-to) | Explanation |
---|---|---|---|---|
+Der1 | Position tag, required | |||
+Der2 | Position tag, required | |||
+Der3 | Position tag, required | |||
+Der/htalle | VV | Passive, frequentative | ||
+Der/lg | VV | Passive | ||
+Der/ijes | NA | Nomen agentis | ||
+Der/ihks | VA | (Agent noun: inclined to perform the action denoted by the base word) | ||
+Der/les | VA | Intensive | ||
+Der/ldihkie | VA | |||
+Der/ldahke | VA | Result noun (?) | ||
+Der/ldh | VA | Attribute | ||
+Der/ht | VV | Causative | ||
+Der/l | VV | Subitive | ||
+Der/st | VV | Diminutive, Subitive | ||
+Der/d | VV | Continuative, Conative, Frequentative, Reflexive, Momentaneous | ||
+Der/Car | -hts, Caritive, was Der/heapmi in sme | |||
+Der/htj | NN | Dim-cont, Frequentative | ||
+Der/Dimin | NN | Diminutive | ||
+Der/Rec | NN | Relational forms | ||
+Der/laakan | AAdv | adverb | ||
+Der/laaketje | AA | adjective | ||
+Der/Comp | AA | adjective | ||
+Der/Superl | AA | adjective | ||
+Der/vuota | AN | Noun | ||
+Der/adte | VV | Frequentative, Continuative | ||
+Der/alla | VV | Frequentative | ||
+Der/eds | NA | Attribute | ||
+Der/PassL | VV | long only | ||
+Der/NomAg | VN | Nomen Agentis | ||
+Der/NomAct | VN | Nomen Actionis | ||
+Der/ahtje | VV | Inchoative | ||
+Der/InchL | VV | Inchoative |
All non-positional derivations should be preceded by the following tag,
to make it possible to target regular expressions in all derivations in a
language-independent way:
just specify
[+Der](+Der1 .. +Der5)
and you are set.
Derivation tag | POS switch | Explanation |
---|---|---|
+Der/PassS | VV | short passive only |
+Der/A | NA | comparison of nouns |
The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics actually), we tag all words that can’t be correctly transcribed using the SMA transcriber with source language codes. Once tagged, it is possible to apply different IPA conversions to each of them. The principle of tagging is that we only tag to the extent needed, and following a priority:
Originating language tag | Originating language |
---|---|
+OLang/SME | North Sámi |
+OLang/SMA | South Sámi |
+OLang/SMJ | Lule Sámi |
+OLang/FIN | Finnish |
+OLang/SWE | Swedish |
+OLang/NOB | Norwegian Bokmål |
+OLang/NNO | Norwegian Nynorsk |
+OLang/ENG | English |
+OLang/RUS | Russian |
+OLang/UND | Undefined |
+OLang/PARA | Parallel names; the name should not be transferred to other Sámi languages |
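To show how these tags could drive the choice of IPA conversion, the sketch below extracts the originating language code from an analysis and falls back to SMA for untagged words. The function name, the example analysis string, and the position of the tag within it are assumptions made for this illustration; the actual conversion is not shown.

```python
import re

OLANG = re.compile(r"\+OLang/([A-Z]+)")


def origin_language(analysis: str, default: str = "SMA") -> str:
    """Return the +OLang code found in `analysis`.

    Untagged words are assumed to be transcribable by the SMA
    transcriber, so they get the default code.
    """
    match = OLANG.search(analysis)
    return match.group(1) if match else default
```

A caller could then use the returned code as a key into a table of per-language converters, for example treating a hypothetical `"Stockholm+N+Prop+OLang/SWE+Sg+Nom"` with Swedish rules.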
A multichar that usually just goes to zero:
|»
Trigger | Explanation |
---|---|
%^DISIMP | diphthong simplification |
%^COMPDISIMP | diphthong simplification in comparatives |
%^COMPDISIMP2 | diphthong simplification in comparatives, type 2 |
%^COMPDISIMP3 | diphthong simplification |
%^PLCDISIMP | diphthong simplification in ACCRA-names |
%^NOMAGieDISIMP | diphthong simplification for NomAg ie stems |
%^1UML | a-uml, like 1sg prs, perf.part of båetedh/V-I, and ill sg of -ie nouns |
%^2UML | dark e, as 3sg prs & perf.part of tjearodh/V-II, and ill sg of -oe nouns |
%^3UML | adj Umlaut oeh:an |
%^3sUML | a-uml in 3sg prs of V-IV (roehtedh - ruahta) |
%^3dUML | ie-uml in 1du & 3pl prs of V-IV (roehtedh - ruehtien) |
%^iæUML | not used |
%^iUML | i-uml in pret of V-I (båetedh - böötim) |
%^PASSUML | Short passive Umlaut Rx->R5 |
%^didhUML | Der/d Umlaut for GUARKEDH-words |
%^htjidhUML | Umlaut for Der/htjidh derivations |
%^adteUML | Umlaut for Der/adte and Der/alla derivations |
%^aLATUS | Latus-Umlaut for -ie stems |
%^uLATUS | Latus-Umlaut for -oe stems |
%^ConsDel | Stem consonant deletion in front of Der/PassL |
%^ILLELA | Stem vowel changes in Illative and Elative |
%^PLGENPLCOM | Stem-final vowel changes from e -> i, and without -j- |
%^COMESS | Stem vowel changes in ACCRA-names |
∑ | Symbol used before # and - in dynamic compounds, and only there. Used to block optional conversion of word boundaries to spaces for error detection in grammar checkers. That is, dynamic compounds are not allowed to be written apart for error detection, only lexicalised ones. This is done to reduce the amount of ambiguity in the raw analyses to an amount we can cope with. |
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.
Flag | Explanation |
---|---|
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
@R.ErrOrth.ON@ |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
Flag | Explanation |
---|---|
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@U.CmpNone.TRUE@ | Combines with the two previous ones to block compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
@U.CmpHyph.FALSE@ | Flag to control hyphenated compounds like proper nouns |
@U.CmpHyph.TRUE@ | Flag to control hyphenated compounds like proper nouns |
@C.CmpHyph@ | Flag to control hyphenated compounds like proper nouns |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
Flag | Explanation |
---|---|
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
The following flag diacritics are used by the grammar checker.
Flag | Explanation |
---|---|
@R.SpellRlx.ON@ | Flag used to tag spell-relax-analysed strings (and only those). |
@D.SpellRlx.ON@ | Flag used to tag spell-relax-analysed strings (and only those). |
@C.SpellRlx@ | Flag used to tag spell-relax-analysed strings (and only those). |
@P.Pmatch.Loc@ | Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split. |
@P.Pmatch.Backtrack@ | Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed) |
Flag diacritic | Explanation |
---|---|
@U.number.one@ | Flag used to give Arabic numerals in smj different cases |
@U.number.two@ | Flag used to give Arabic numerals in smj different cases |
@U.number.three@ | Flag used to give Arabic numerals in smj different cases |
@U.number.four@ | Flag used to give Arabic numerals in smj different cases |
@U.number.five@ | Flag used to give Arabic numerals in smj different cases |
@U.number.six@ | Flag used to give Arabic numerals in smj different cases |
@U.number.seven@ | Flag used to give Arabic numerals in smj different cases |
@U.number.eight@ | Flag used to give Arabic numerals in smj different cases |
@U.number.nine@ | Flag used to give Arabic numerals in smj different cases |
@U.number.zero@ | Flag used to give Arabic numerals in smj different cases |
This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.
Here is the list of top-level lexica in the South Sámi analyser:
Abbreviation ;
Acronym ;
Adjective ;
Adposition ;
Adverb ;
Conjunction ;
Interjection ;
NounRoot ;
Numeral ;
Particle ;
Prefixes ;
Pronoun ;
ProperNoun ;
Punctuation ;
Subjunction ;
Symbols ;
Verb ;
And this is the ENDLEX of everything:
@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;
The `@D.CmpOnly.FALSE@` flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here.
The `@D.NeedNoun.ON@` flag diacritic is used to block illegal compounds.
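To make the effect of the flags before ENDLEX2 easier to follow, here is a small simulator of flag-diacritic evaluation along a single path. It is a sketch, not the HFST implementation: it covers the P, R, D, C and U operators with their usual meaning (set, require, disallow, clear, unify) and leaves out N (negative set). The function name `path_is_valid` and the test paths are invented for the example.

```python
import re

# @OP.Feature@ or @OP.Feature.Value@, with OP one of P, R, D, C, U.
FLAG = re.compile(r"@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(path: str) -> bool:
    """Evaluate the flag diacritics in `path` from left to right."""
    state: dict[str, str] = {}
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":                      # set the feature to value
            state[feature] = value
        elif op == "C":                    # clear the feature
            state.pop(feature, None)
        elif op == "R":                    # require: set (and equal, if a value is given)
            if current is None or (value and current != value):
                return False
        elif op == "D":                    # disallow: fail if set (to value, if given)
            if current is not None and (not value or current == value):
                return False
        elif op == "U":                    # unify: set if unset, else must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# The flag sequence that guards ENDLEX2 in this lexicon.
ENDLEX_FLAGS = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@"
```

Under these semantics, a path that has set `@P.NeedNoun.ON@` cannot reach ENDLEX2 unless `@C.NeedNoun@` occurs in between, and a path on which CmpOnly is still FALSE is blocked, whereas one that has set `@P.CmpOnly.TRUE@` passes.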
This (part of) documentation was generated from src/fst/morphology/root.lexc