North Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sme

Divvun & Giellatekno - open source grammars for North Sámi.

North Sámi morphological analyser

Multicharacter symbols

Tags for POS

Tags for sub-POS

Tags for Inflection

Tags for Case and Number Inflection

Possessive tags

Adjectival tags

Moods

Tenses

Verb person-number

Infinite verb forms

Other tags

Question and Focus particles:

Tags distinguishing different versions of the same lemma (before POS)

Note: These high +v… numbers are in use for one word only: doavttergrádakursa

Escaped chars

Error (non-standard language) tags

Usage tags

Dialect tags:

Tags for indicating the orthography used

+Orth/Strd - Standard orthography
+Orth/IPA - IPA transcription

The above should either be used in pairs, or not at all. That is, if a word doesn’t need an IPA stem (because all of its inflected forms can be converted to IPA by the standard IPA conversion rules), then neither of these tags should be used. On the other hand, if the word has a spelling that doesn’t follow the orthographic rules, and thus needs an exceptional IPA stem to get it right, then the exceptional stem must be marked with the tag +Orth/IPA, and the regular orthography stem must be marked with the tag +Orth/Strd. This is so that we can exclude one or the other from different FSTs, but only when the opposite stem variant is present.
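The pairing rule can be checked mechanically. The following is a minimal sketch, not part of the GiellaLT tooling: the function name and the example entry strings are hypothetical, and it assumes that all stem variants of one word are available as plain strings containing their tags.

```python
ORTH_STRD = "+Orth/Strd"
ORTH_IPA = "+Orth/IPA"


def orth_tags_are_paired(stem_entries):
    """Return True if the stem variants of ONE word use the two
    orthography tags in pairs, or not at all.

    stem_entries: list of strings, one per stem variant of the word
    (hypothetical input format, chosen for illustration only).
    """
    has_strd = any(ORTH_STRD in entry for entry in stem_entries)
    has_ipa = any(ORTH_IPA in entry for entry in stem_entries)
    # Valid only when both tags are present or both are absent.
    return has_strd == has_ipa


# Hypothetical entries: a regular word, a correctly paired word, and an error.
print(orth_tags_are_paired(["regularword"]))
print(orth_tags_are_paired(["oddword+Orth/Strd", "oddword+Orth/IPA"]))
print(orth_tags_are_paired(["oddword+Orth/IPA"]))
```

A check like this could run over the lexicon before building, so that an unpaired tag is caught before it silently removes a word from one of the FSTs.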

Multichars for marking start and end of IPA sequences

Compounding tags

The tags are of the following form:

This entry / word should be in the following position(s):

If unmarked, any position goes.

The tagged part of the compound should make a compound using:

Unmarked = Default, ie +CmpN/SgN for SME.

The second part of the compound may require that the previous (left part) is:

Tags for descriptive compound analysis - this is what a compound actually is:

Compounding tag ordering

To ease writing and maintaining regexes etc for manipulating and enforcing compounding, it is important to keep the tags in a certain order. The order is:

  1. +CmpN/ tags
  2. +CmpNP/ tags
  3. +Cmp/ tags - this is always true since the descriptive tags are always part of the continuation lexicons, and will be located after the POS tag.
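As an illustration of why a fixed order helps, the sketch below checks that the compounding tags in a single tag string appear in the order listed above. It is an assumption-laden example rather than project code: the function name is hypothetical, `+Cmp/X` is a placeholder for a descriptive tag, and the input is assumed to be one entry's tag string.

```python
import re

# Rank of each tag family, following the order given in the list above.
PREFIX_RANK = {"+CmpN/": 0, "+CmpNP/": 1, "+Cmp/": 2}

# The longest alternative comes first so that "+CmpNP/" is never read as "+CmpN".
CMP_PREFIX_RE = re.compile(r"\+(?:CmpNP|CmpN|Cmp)/")


def compounding_tags_in_order(tag_string):
    """Return True if +CmpN/ tags precede +CmpNP/ tags, which precede +Cmp/ tags."""
    ranks = [PREFIX_RANK[m.group(0)] for m in CMP_PREFIX_RE.finditer(tag_string)]
    return ranks == sorted(ranks)


# "+Cmp/X" is a placeholder descriptive tag used only for this example.
print(compounding_tags_in_order("lemma+CmpN/SgN+CmpNP/Only+N+Cmp/X"))
print(compounding_tags_in_order("lemma+CmpNP/Only+CmpN/SgN+N"))
```

Because the order is fixed, a single left-to-right pass is enough; without it, every regex that manipulates compounding tags would have to cover all permutations.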

Semantic tags to help disambiguation & synt. analysis: (before POS)

Multiple Semantic tags:

Tags for derivation

Explanation:

Positional derivational tags

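| +Der1 | +Der2 | +Der3 | +Der4 | POS transition | Comments |
|---|---|---|---|---|---|
| +Der/Dimin | | | | NN | (was: Der/aš & Der/š) |
| +Der/lasj | | | | NA | |
| +Der/meahttun | | | | VA | |
| +Der/d | | | | VV | |
| +Der/h | | | | VV | - -hit/Causative |
| +Der/Caus | | | | VV | - -ahtti/Causative |
| +Der/huhtti | | | | VV | |
| +Der/l | | | | VV | |
| +Der/st | | | | VV | |
| +Der/las | | | | VA | * +Der1+Der2 - can only combine with Der3 |
| +Der/Car | | | | NA | * +Der1+Der2 - can only combine with Der3 |
| +Der/laakan | | | | AA | * +Der1+Der2 - can only combine with Der3 |
| +Der/halla | | | | VV | * +Der1+Der2 - can only combine with Der3 |
| +Der/huvva | | | | VV | * +Der1+Der2 - can only combine with Der3 |
| +Der/stuvva | | | | VV | * +Der1+Der2 - can only combine with Der3 |
| +Der/PassS | | | | VV | - short passive |
| | +Der/t | | | NA | |
| | +Der/ár | | | ACRO>N | |
| | +Der/NomAg | | | VN | |
| | +Der/NomAct | | | VN | Der/NomAct has two realisations, with different restrictions; this is the former Der/eapmi |
| | +Der/sasj | | | NA | |
| | +Der/adda | | | VV | |
| | +Der/alla | | | VV | |
| | +Der/AAdv | | | QA | check this! |
| | +Der/easti | | | VV | |
| | +Der/laagasj | | | QA | |
| | +Der/Comp | | | AA | |
| | +Der/Superl | | | AA | |
| | | +Der/PassL | | VV | long passive |
| | | +Der/vuota | | AN | |
| | | | +Der/InchL | VV | |
| | | | +Der/amoš | VN | |
| | | | +Der/eamoš | VN | |
| | | | +Der/geahtes | VA | |
| | | | +Der/keahtta | VA | |
| | | | +Der/muš | VN | |
| | | | +Der/supmi | VN | |
| | | | +Der/upmi | VN | |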

Non-positional derivations

All non-positional derivations should be preceded by the following tag, to make it possible to target regular expressions at all derivations in a language-independent way: just specify +Der|+Der1 .. +Der4 and you are set.
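A sketch of such a regular expression in Python follows. It assumes that each derivation tag directly follows its marker tag (`+Der` or `+Der1` .. `+Der4`) in the analysis string; the function name and the example strings are illustrative only.

```python
import re

# A marker tag (+Der, +Der1 .. +Der4) immediately followed by a +Der/xxx tag.
DERIVATION_RE = re.compile(r"(\+Der[1-4]?)(\+Der/[^+#]+)")


def find_derivations(analysis):
    """Return (marker, derivation) pairs found in an analysis string."""
    return DERIVATION_RE.findall(analysis)


# Illustrative analysis strings, not taken from the real analyser output.
print(find_derivations("stem+N+Der1+Der/Dimin+N"))
print(find_derivations("stem+N+Der+Der/veara+A"))
```

The pattern never mentions a specific derivation, which is what makes it usable across languages: only the marker tags are fixed.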

Tag POS transition Comment
+Der n/a generic derivation tag used in front of all non-positional derivations.
+Der/veara NA#  
+Der/viđá NA#  
+Der/viđi NA#  
+Der/has ? only one in the code

Miscellaneous list

See lexicons NAMAT and SAS for these:

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics actually), we tag all words that can’t be correctly transcribed using the SME transcriber with source language codes. Once tagged, it is possible to split the lexical transducer into smaller ones according to language, and apply a different IPA conversion to each of them. The principle of tagging is that we only tag to the extent needed, and following a priority:

  1. any untagged word is pronounced with SME orthographic conventions
  2. NNO and NOB have identical pronunciation, NNO is only used if different in spelling from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SME (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions

All in all, we want to get as much correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct - at least some words will get the same pronunciation whether read as SME or NOB/NNO/SWE.
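The splitting step can be pictured as a simple grouping by language code, with untagged words falling back to SME. This is only a sketch on word lists, not on transducers: the function name, the input format and the placeholder words are hypothetical.

```python
from collections import defaultdict

DEFAULT_LANGUAGE = "SME"  # rule 1: untagged words follow SME conventions


def split_by_language(entries):
    """Group words by source language code for separate IPA conversion.

    entries: iterable of (word, language_code_or_None) pairs
    (hypothetical format, for illustration only).
    """
    groups = defaultdict(list)
    for word, language in entries:
        groups[language or DEFAULT_LANGUAGE].append(word)
    return dict(groups)


# Placeholder words; real entries would come from the lexicon.
entries = [("word_a", None), ("word_b", "NOB"), ("word_c", "SWE"), ("word_d", "SME")]
print(split_by_language(entries))
```

Each resulting group could then be passed to the IPA conversion that matches its language.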

Triggers for morphophonological rules

Morphophonemes and Sámi letters

= a symbol used in front of # to block backtracking and MWE (multiword expression) reanalysis in hfst-tokenise (e.g. in dynamic compounds). Makes it possible to distinguish lexical and dynamic compounds in rules. It is converted to zero together with #.

Symbols that need to be escaped on the lower side (towards twolc):

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

Flag Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
@P.Vgen.add@ (Dis)allow VGen
@R.Vgen.add@ (Dis)allow VGen
@P.12p.add@ (Dis)allow 1. and 2. pers forms
@R.12p.add@ (Dis)allow 1. and 2. pers forms
@P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)

Flag Explanation
@D.ErrOrth.ON@  
@C.ErrOrth@  
@P.ErrOrth.ON@  
@R.ErrOrth.ON@  

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@U.CmpNone.TRUE@ Combines with the two previous ones to block compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.
@D.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.FALSE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@C.CmpHyph@ Flag to control hyphenated compounds like proper nouns

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Flag diacritic Explanation
@U.number.one@ Flag used to give Arabic numerals in smj different cases
@U.number.two@ Flag used to give Arabic numerals in smj different cases
@U.number.three@ Flag used to give Arabic numerals in smj different cases
@U.number.four@ Flag used to give Arabic numerals in smj different cases
@U.number.five@ Flag used to give Arabic numerals in smj different cases
@U.number.six@ Flag used to give Arabic numerals in smj different cases
@U.number.seven@ Flag used to give Arabic numerals in smj different cases
@U.number.eight@ Flag used to give Arabic numerals in smj different cases
@U.number.nine@ Flag used to give Arabic numerals in smj different cases
@U.number.zero@ Flag used to give Arabic numerals in smj different cases

Basic lexica, pointing to the other lexicon files

Abbreviation

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
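To make the effect of the ENDLEX flags concrete, here is a small Python sketch of how flag diacritics are evaluated along one path. It follows the standard semantics of the P (set), R (require), D (disallow), C (clear) and U (unify) operations, but leaves out the N operation and is not the HFST implementation. The function name is hypothetical, and where the @P.NeedNoun.ON@ and @C.NeedNoun@ flags sit in the example paths is an assumption made for illustration.

```python
import re

# Matches @OP.Feature@ and @OP.Feature.Value@ for the operations used here.
FLAG_RE = re.compile(r"@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(path):
    """Apply the flag diacritics in `path` left to right.

    Returns False as soon as an R, D or U check fails, True otherwise.
    """
    features = {}
    for op, feature, value in FLAG_RE.findall(path):
        current = features.get(feature)
        if op == "P":    # set the feature to the given value
            features[feature] = value
        elif op == "C":  # clear the feature
            features.pop(feature, None)
        elif op == "U":  # unify: set if unset, else the values must agree
            if current is None:
                features[feature] = value
            elif current != value:
                return False
        elif op == "R":  # require the value (or any value if none is given)
            if (current != value) if value else (current is None):
                return False
        elif op == "D":  # disallow the value (or any value if none is given)
            if (current == value) if value else (current is not None):
                return False
    return True


ENDLEX = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@"

# Assumed flag placement: a verb in a compound sets NeedNoun,
# and a nominalising derivation clears it again.
print(path_is_allowed("stem" + ENDLEX))
print(path_is_allowed("@P.NeedNoun.ON@verb" + ENDLEX))
print(path_is_allowed("@P.NeedNoun.ON@verb@C.NeedNoun@noun" + ENDLEX))
```

Under these assumptions, the second path is rejected at ENDLEX because NeedNoun is still ON, while the third passes because the flag was cleared first.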

ENDLEX2

ENDLEX3

ENDLEX4


This (part of) documentation was generated from src/fst/morphology/root.lexc