South Sámi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sma

South Sámi morphological analyser

Multichar_Symbols definitions

Tags for POS (Part-Of-Speech, Word class)

Tags for sub-POS

Proper nouns

Pronoun subtypes

Numeral subtypes

Error (non-standard language) tags

Error tag Explanation
+Err/Orth Substandard, non-normative form of a word
+Err/Hyph Substandard, non-normative
+Err/SpaceCmp Substandard, non-normative
+Err/Attr Substandard, non-normative Attr form of a word
+Err/Lex The lemma and its word forms are outside the norm. There is no normative lemma, but the form is grammatically correct.
+Err/Der Errors in derivations
+Err/Spellrelax Used to tag spellrelaxed typos (tag is inserted via flag diacritics)
+Err/MissingSpace in use in smi lexc

Usage tags

Usage tag Explanation
+Use/Marg Marginal: correct, existing forms that are rare. These words can be removed from e.g. the speller, because they are so rare and so little used that they may be confused with similar lemmas.
+Use/-Spell Excluded from speller
+Use/-PLX Excluded in PLX speller
+Use/SpellNoSugg Recognized but not suggested in speller
+Use/Circ Circular path
+Use/CircN Circular number path?
+Use/Ped Remove from pedagogical speller
+Use/NG Do not generate; for isme-ped.fst and Apertium
+Use/MT Generate for Apertium only
+Use/NotDNorm For (spellings of) words that do not follow the orthographic principles of sma. Divvun suggests that these should not be normative, even though they were decided upon by GG. Included in speller.
+Use/DNorm For words without formal normalization. Divvun suggests that these should be normative. Included in speller. Based on the 2010 normative decision and Ove Lorentz’ suggestions for the norm.
+Use/PMatch Include only in FSTs built for hfst-pmatch
+Use/-PMatch Do not include in FSTs built for hfst-pmatch
+Use/GC only retained in the HFST Grammar Checker disambiguation analyser
+Use/-GC never retained in the HFST Grammar Checker disambiguation analyser
+Use/TTS only retained in the HFST Text-To-Speech disambiguation tokeniser
+Use/-TTS never retained in the HFST Text-To-Speech disambiguation tokeniser
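
The inclusion and exclusion rules in the table above can be read as a simple filter. The following Python sketch is only an illustration of that reading, not the actual build logic: the target names ("speller", "pmatch", "gc", "tts") and the function name are hypothetical, and only tags whose behaviour is stated explicitly in the table are covered.

```python
# Tags that remove an analysis from one specific target (hypothetical target names).
EXCLUDED_FROM = {
    "speller": {"+Use/-Spell"},
    "pmatch": {"+Use/-PMatch"},
    "gc": {"+Use/-GC"},
    "tts": {"+Use/-TTS"},
}

# Tags that restrict an analysis to exactly one target.
ONLY_IN = {
    "+Use/PMatch": "pmatch",
    "+Use/GC": "gc",
    "+Use/TTS": "tts",
}


def keep_for_target(tags, target):
    """Return True if an analysis carrying `tags` survives in `target`."""
    tags = set(tags)
    # An explicit exclusion tag for this target always wins.
    if tags & EXCLUDED_FROM.get(target, set()):
        return False
    # An "only in X" tag removes the analysis from every other target.
    for tag, only_target in ONLY_IN.items():
        if tag in tags and only_target != target:
            return False
    return True
```

For example, an analysis tagged +Use/GC is kept for "gc" and dropped for "speller", while an untagged analysis is kept everywhere.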

Dialect tags

Dialect tag Explanation
+Dial/-S Not in the South
+Dial/S Only in the South
+Dial/-N Not in the North
+Dial/N Only in the North
+Dial/-NOR Words not in Norway
+Dial/NOR Words only in Norway
+Dial/-SW Words not in Sweden
+Dial/SW Words only in Sweden
+Dial/SH Short forms
+Dial/L Long forms

Normative/prescriptive compounding tags

(to govern compound behaviour for the speller, i.e. what a compound SHOULD BE)

The left part of a compound should be …

The default is +CmpN/SgN, so when nothing is specified, that will be used. To override it, specify one or more of the following tags. +CmpN/SgN must be specified explicitly if other tags are also listed, unless +CmpN/SgN should not be used, of course.

Normative compounding tag Explanation
+CmpN/Sg Singular
+CmpN/SgN Singular Nominative
+CmpN/SgG Singular Genitive
+CmpN/PlG Plural Genitive

The right part of a compound requires to the left …

These tags overrule the regular tags defined above. One or more can be specified.

Normative left-governing tag Explanation
+CmpN/SgLeft Sg to the left
+CmpN/SgNomLeft etc.
+CmpN/SgGenLeft
+CmpN/PlGenLeft
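
The interaction between the default, the left-part tags and the overruling left-governing tags can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name is hypothetical, and the mapping from each `...Left` tag to the corresponding left-part form is inferred from the tag names rather than taken from the lexc source.

```python
LEFT_FORM_TAGS = ("+CmpN/Sg", "+CmpN/SgN", "+CmpN/SgG", "+CmpN/PlG")
DEFAULT_LEFT_FORM = "+CmpN/SgN"

# Assumed correspondence between left-governing tags and left-part forms.
LEFT_GOVERNING = {
    "+CmpN/SgLeft": "+CmpN/Sg",
    "+CmpN/SgNomLeft": "+CmpN/SgN",
    "+CmpN/SgGenLeft": "+CmpN/SgG",
    "+CmpN/PlGenLeft": "+CmpN/PlG",
}


def permitted_left_forms(left_tags, right_tags):
    """Return the set of forms the left part of a compound may take."""
    # Tags on the right part overrule whatever the left part specifies.
    governed = [LEFT_GOVERNING[t] for t in right_tags if t in LEFT_GOVERNING]
    if governed:
        return set(governed)
    # Otherwise use the left part's own tags, falling back to the default.
    own = [t for t in left_tags if t in LEFT_FORM_TAGS]
    return set(own) if own else {DEFAULT_LEFT_FORM}
```

Note that listing +CmpN/SgG alone on the left part yields only the genitive singular, which matches the remark that +CmpN/SgN has to be repeated explicitly when it should remain available.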

This part of the compound can …

Normative position tag Explanation
+CmpNP/All … be in all positions, default, this tag does not have to be written
+CmpNP/First … only be first part in a compound or alone
+CmpNP/Pref … only be first part in a compound, NEVER alone
+CmpNP/Last … only be last part in a compound or alone
+CmpNP/Suff … only be last part in a compound, NEVER alone
+CmpNP/None … not take part in compounds
+CmpNP/Only … only be part of a compound, i.e. can never be used alone, but can appear in any position
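
As an illustration, the position table above can be expressed as a lookup from tag to permitted positions. The sketch below is an interpretation, not the analyser's implementation: the position labels ("alone", "first", "middle", "last") and the function name are hypothetical, and "all positions" is assumed to include a middle position.

```python
# Permitted positions per tag, read off the table above.
ALLOWED_POSITIONS = {
    "+CmpNP/All": {"alone", "first", "middle", "last"},
    "+CmpNP/First": {"alone", "first"},
    "+CmpNP/Pref": {"first"},
    "+CmpNP/Last": {"alone", "last"},
    "+CmpNP/Suff": {"last"},
    "+CmpNP/None": {"alone"},
    "+CmpNP/Only": {"first", "middle", "last"},
}


def may_occupy(tags, position):
    """Return True if a word with `tags` may appear in `position`."""
    position_tags = [t for t in tags if t in ALLOWED_POSITIONS]
    # +CmpNP/All is the default and does not have to be written.
    tag = position_tags[0] if position_tags else "+CmpNP/All"
    return position in ALLOWED_POSITIONS[tag]
```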

Descriptive compounding tags

Tags for compound analysis - this is what a compound actually is. We use this to research compounding patterns in the corpus.

Descriptive compounding tag Explanation
+Cmp/Sg Compounding using an unspecified singular stem
+Cmp/SgNom Compounding using nominative singular
+Cmp/SgGen Compounding using genitive singular
+Cmp/PlGen Compounding using genitive plural
+Cmp/Attr Compounding using attribute form
+Cmp/eh Compound stem in –eh, as in gaameh-gåaroje, from gaamege
+Cmp/ege Compound stem in –ege, as in gaamege-gåaroje
+Cmp/FinEDel Deletion of final e, as in voelem-gaaroeh, from voeleme
+Cmp/ShH Compounding using a short stem + h: –biejjh– (from biejjie), cf reakedsbiejjhvadtese
+Cmp/Sh Compounding using a short stem: –biejj– (from biejjie)
+Cmp/SplitR This is a split compound with the other part to the right: “Arbeids- og inkluderingsdepartementet” => Arbeids– = +Cmp/SplitR
+Cmp/SplitL This is a split compound with the other part to the left; this is the opposite of the previous case
+Cmp Dynamic compound - this tag should always be part of a dynamic compound. It is important for Apertium and the speller (to give extra weights to compounds), and useful in other cases as well.
+Cmp/XForm All compounds without a clear classification
+Cmp/AttrH All compounds that have an attr-h

Tags for Inflection

Tags for Case, Number & Possessive Inflection

Case and number

Possessive

Tense, Person & Number

Tense tag Explanation
+Prs Present
+Prt Preterite

Person & Number tag Explanation
+Sg1 Singular, 1st person
+Sg2 Singular, 2nd person
+Sg3 Singular, 3rd person
+Du1 Dual, 1st person
+Du2 Dual, 2nd person
+Du3 Dual, 3rd person
+Pl1 Plural, 1st person
+Pl2 Plural, 2nd person
+Pl3 Plural, 3rd person

Other verbal tags

Verbal tag Explanation
+Neg negation verb ij
+ConNeg main verb complement to Neg, form identical to Imp
+VAbess Verb Abessive
+Inf Infinitive
+PrfPrc Perfect participle
+PrsPrc Present participle
+Ger Gerund
+VGen Verb genitive
+Ind Indicative
+Imprt Imperative
+ImprtII Imperative, for Neg: ollem ollh …
+Cond Conditional, for one form: lidtjie. To be looked at. Also lidtjim, lidtjih
+Act -eme, could be changed to +Actio

Tags for adjectives

Other tags

Tags for testing the frequency of certain phenomena in our corpora

Tags for punctuation

Different focus particles

Tags for adverbs and compared adjectives

Semantic tags

Semantic tags help disambiguation and syntactic analysis. All tags used are defined and listed below.

Multiple Semantic tags

Multiple semantic tags are written as one tag, with the different semantic values separated by an underscore (_).

All used combinations must be declared below, and the list must be manually maintained. The tags are ordered alphabetically, both the list and the semantic values within one tag.
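
A small helper can make the ordering rule concrete. This Python sketch assumes a `+Sem/` tag prefix and the example values `Hum` and `Ani`, none of which are given in this section; the function names are hypothetical, and Python's default (case-sensitive) sort stands in for "alphabetical".

```python
def combined_semantic_tag(values, prefix="+Sem/"):
    """Join semantic values into one tag, sorted and separated by underscores."""
    # Sorting guarantees one canonical spelling per combination.
    return prefix + "_".join(sorted(set(values)))


def is_declared(tag, declared_tags):
    """Check a combined tag against the manually maintained declaration list."""
    return tag in declared_tags
```

With these assumptions, the values `Hum` and `Ani` always yield the single tag `+Sem/Ani_Hum`, regardless of the order in which they are supplied, and that tag is only usable if it appears in the declared list.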

Tag Explanation
+MWE multi word expressions, goes to abbr


Flag Explanation
@P.Px.add@ Allows Px suffixes (all except Nom 3rd person)
@R.Px.add@ Requires the P.Px.add flag for Px suffixes (all except Nom 3rd person)
@P.Nom3Px.add@ Allows Px suffixes for Nom 3rd person
@R.Nom3Px.add@ Requires the P.Nom3Px.add flag for Px suffixes for Nom 3rd person
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)
@D.ErrOrth.ON@ asdf
@C.ErrOrth@ asdf
@P.ErrOrth.ON@ asdf

Derivation tags and derivation position tags in a derivation row

Derivations in the same position are mutually exclusive (can not be combined), whereas tags in different positions can be combined, so that position 1 derivations must precede position 2 derivations, and so on.

Pos1 Pos2 Pos3 POS switches (from-to) Explanation
+Der1       Position tag, required
  +Der2     Position tag, required
    +Der3   Position tag, required
+Der/htalle     VV Passive, frequentative
+Der/lg     VV Passive
+Der/ijes     NA Nomen agentis
+Der/ihks     VA (Agent nominal: inclined to perform the action denoted by the base word)
+Der/les     VA Intensive
+Der/ldihkie     VA  
+Der/ldahke     VA Result nominal (?)
+Der/ldh     VA Attribute
+Der/ht     VV Causative
+Der/l     VV Subitive
+Der/st     VV Diminutive, Subitive
+Der/d     VV Continuative, Conative, Frequentative, Reflexive, Momentaneous
+Der/Car       -hts, Caritive, was Der/heapmi in sme
+Der/htj     NN Dim-cont, Frequentative
+Der/Dimin     NN Diminutive
+Der/Rec     NN Relational forms
+Der/laakan     AAdv Adverb
+Der/laaketje     AA Adjective
+Der/Comp     AA Adjective
+Der/Superl     AA Adjective
  +Der/vuota   AN Noun
  +Der/adte   VV Frequentative, Continuative
  +Der/alla   VV Frequentative
  +Der/eds   NA Attribute
    +Der/PassL VV long only
    +Der/NomAg VN Nomen Agentis
    +Der/NomAct VN Nomen Actionis
    +Der/ahtje VV Inchoative
    +Der/InchL VV Inchoative
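
The ordering constraint on positional derivations can be checked mechanically. The following Python sketch is an illustration only: it assumes that each derivation tag is immediately preceded by its position tag in the analysis string, it lists just a subset of the table above, and the function name is hypothetical.

```python
import re

# Position of a few derivation tags, copied from the table above (subset).
POSITION_OF = {
    "+Der/ht": 1, "+Der/l": 1, "+Der/st": 1, "+Der/d": 1,
    "+Der/vuota": 2, "+Der/adte": 2, "+Der/alla": 2, "+Der/eds": 2,
    "+Der/PassL": 3, "+Der/NomAg": 3, "+Der/NomAct": 3,
    "+Der/ahtje": 3, "+Der/InchL": 3,
}


def valid_derivation_row(tag_string):
    """Check position tags, tag/position agreement and strictly rising order."""
    tags = re.findall(r"\+[^+]+", tag_string)
    last_position = 0
    i = 0
    while i < len(tags):
        match = re.fullmatch(r"\+Der(\d)", tags[i])
        if match is None or i + 1 >= len(tags):
            return False  # missing position tag or dangling position tag
        position = int(match.group(1))
        if POSITION_OF.get(tags[i + 1]) != position:
            return False  # derivation does not belong to this position
        if position <= last_position:
            return False  # same position twice, or positions out of order
        last_position = position
        i += 2
    return True
```

Under these assumptions, `+Der1+Der/ht+Der2+Der/adte` is accepted, while `+Der1+Der/ht+Der1+Der/l` is rejected because two position 1 derivations are mutually exclusive.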

Other, non-positional derivations

All non-positional derivations should be preceded by the following tag, to make it possible to target regular expressions in all derivations in a language-independent way: just specify [+Der](+Der1 .. +Der5) and you are set.

Derivation tag POS switch Explanation
+Der/PassS VV short passive only
+Der/A NA comparison of nouns

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics actually), we tag all words that can’t be correctly transcribed using the SMA transcriber with source language codes. Once tagged, it is possible to apply different IPA conversions to each of them. The principle of tagging is that we only tag to the extent needed, and following a priority:

  1. any untagged word is pronounced with native orthographic conventions
  2. NNO and NOB have identical pronunciation, NNO is only used if different in spelling from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SMA (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions

All in all, we want to get as much as possible correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct: at least some words will get the same pronunciation whether read as SMA or NOB/NNO/SWE.

Originating language tag Originating language
+OLang/SME North Sámi
+OLang/SMA South Sámi
+OLang/SMJ Lule Sámi
+OLang/FIN Finnish
+OLang/SWE Swedish
+OLang/NOB Norwegian Bokmål
+OLang/NNO Norwegian Nynorsk
+OLang/ENG English
+OLang/RUS Russian
+OLang/UND Undefined
+OLang/PARA Parallel names; the name should not be transferred to other Sámi languages
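
The tagging priority described above suggests a simple way to pick the IPA conversion for a word. The Python sketch below is a hedged illustration, not the actual transcriber: the function name and the idea of returning a language code are hypothetical. It encodes only two stated rules, namely that untagged words use native (SMA) conventions and that NNO is pronounced like NOB.

```python
import re

# NNO and NOB have identical pronunciation, so NNO can reuse the NOB conversion.
SAME_PRONUNCIATION_AS = {"NNO": "NOB"}


def transcription_language(analysis):
    """Return the language code whose IPA conversion should be applied."""
    match = re.search(r"\+OLang/([A-Z]+)", analysis)
    # Any untagged word is pronounced with native orthographic conventions.
    code = match.group(1) if match else "SMA"
    return SAME_PRONUNCIATION_AS.get(code, code)
```

SWE is deliberately kept separate here, since the text only says its pronunciation is mostly the same as NOB.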

Area tags

Triggers for morphophonological rules

Morphophonemes and Sámi letters

Symbols that need to be escaped on the lower side (towards twolc):

Lexeme disambiguation tags

Stem variant tags

The clitic boundary mark

A multichar that usually just goes to zero:

Umlaut and diphthong simplification triggers

Trigger Explanation
%^DISIMP diphthong simplification
%^COMPDISIMP diphthong simplification in comparatives
%^COMPDISIMP2 diphthong simplification in comparatives, type 2
%^COMPDISIMP3 diphthong simplification
%^PLCDISIMP diphthong simplification in ACCRA-names
%^NOMAGieDISIMP diphthong simplification for NomAg ie stems
%^1UML a-uml, like 1sg prs, perf.part of båetedh/V-I, and ill sg of -ie nouns
%^2UML dark e, as 3sg prs & perf.part of tjearodh/V-II, and ill sg of -oe nouns
%^3UML adj Umlaut oeh:an
%^3sUML a-uml in 3sg prs of V-IV (roehtedh - ruahta)
%^3dUML ie-uml in 1du & 3pl prs of V-IV (roehtedh - ruehtien)
%^iæUML not used
%^iUML i-uml in pret of V-I (båetedh - böötim)
%^PASSUML Short passive Umlaut Rx->R5
%^didhUML Der/d Umlaut for GUARKEDH-words
%^htjidhUML Umlaut for Der/htjidh derivations
%^adteUML Umlaut for Der/adte and Der/alla derivations
%^aLATUS Latus-Umlaut for -ie stems
%^uLATUS Latus-Umlaut for -oe stems
%^ConsDel Stem consonant deletion in front of Der/PassL
%^ILLELA Stem vowel changes in Illative and Elative
%^PLGENPLCOM Stem vowel changes in final position from e -> i, and without -j-
%^COMESS Stem vowel changes in ACCRA-names

Symbol used before # and - in dynamic compounds, and only there. Used to block optional conversion of word boundaries to spaces for error detection in grammar checkers. That is, only lexicalised compounds, not dynamic ones, are allowed to be written apart for error detection. This is done to reduce the amount of ambiguity in the raw analyses to a level we can cope with.

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

Flag Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
@R.ErrOrth.ON@  

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@U.CmpNone.TRUE@ Combines with the two previous ones to block compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.
@U.CmpHyph.FALSE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@C.CmpHyph@ Flag to control hyphenated compounds like proper nouns

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

The following flag diacritics are used by the grammar checker.

Flag Explanation
@R.SpellRlx.ON@ Flag used to tag spell-relax-analysed strings (and only those).
@D.SpellRlx.ON@ Flag used to tag spell-relax-analysed strings (and only those).
@C.SpellRlx@ Flag used to tag spell-relax-analysed strings (and only those).
@P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)

Flag diacritic Explanation
@U.number.one@ Flag used to give Arabic numerals in smj different cases
@U.number.two@ Flag used to give Arabic numerals in smj different cases
@U.number.three@ Flag used to give Arabic numerals in smj different cases
@U.number.four@ Flag used to give Arabic numerals in smj different cases
@U.number.five@ Flag used to give Arabic numerals in smj different cases
@U.number.six@ Flag used to give Arabic numerals in smj different cases
@U.number.seven@ Flag used to give Arabic numerals in smj different cases
@U.number.eight@ Flag used to give Arabic numerals in smj different cases
@U.number.nine@ Flag used to give Arabic numerals in smj different cases
@U.number.zero@ Flag used to give Arabic numerals in smj different cases

Lexicon Root

This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.

Here is the list of top-level lexica in the South Sámi analyser:

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
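
To see how the ENDLEX entry filters paths, it can help to simulate the flag diacritics. The Python sketch below is a simplified illustration, not HFST's implementation: it handles the P (set), R (require), D (disallow), C (clear) and U (unify) operators in their basic form and omits N and negated values. The function name and the example paths leading into ENDLEX are hypothetical; where the lexicon actually sets these flags is not shown in this section.

```python
import re

# Matches @OP.Feature@ and @OP.Feature.Value@ for the operators handled here.
FLAG = re.compile(r"@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@")

ENDLEX = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@"


def path_is_valid(path):
    """Apply the flag diacritics in `path` left to right; False if any fails."""
    state = {}
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":            # positive set
            state[feature] = value
        elif op == "C":          # clear
            state.pop(feature, None)
        elif op == "R":          # require a value (or any value)
            if (value and current != value) or (not value and current is None):
                return False
        elif op == "D":          # disallow a value (or any value)
            if (value and current == value) or (not value and current is not None):
                return False
        elif op == "U":          # unify: set if unset, else values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True
```

Under these assumptions, a path that arrives with `CmpOnly` set to `FALSE` is blocked at ENDLEX, whereas one that has since passed `@P.CmpOnly.TRUE@` goes through; likewise a path with `NeedNoun` still `ON` is blocked until `@C.NeedNoun@` clears it.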


This (part of) documentation was generated from src/fst/morphology/root.lexc