Lule Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-smj

Lule Sámi morphological analyser

Definitions for Multichar_Symbols

Tags for POS

Tags for sub-POS

Pronoun subtypes

Error tags

All Err-tags must have a normative form as lemma except Err/Lex

Usage restriction tags

Dialect and Area tags

Compounding tags

The tags are of the following form:

Normative/prescriptive compounding tags

These govern compound behaviour for normative tools like the speller, ie. what a compound SHOULD BE.

The first part of the component may be ..

This part of the component can ..

The second part of the compound may require that the previous (left part) is (and thus overrides the regular CmpN tags):

But these tags can again be overriden by the first word in a compound, if this part of the compound is tagged with a def tag:

Descriptive compounding tags

Tags for compound analysis - this is what a compound actually is. Some of these tags are also used in combination with the above normative tags to actually enforce compound restrictions in the fst.

Inflectional Tags

Tags for Case and Number Inflection

Possessive tags

Adjective specific tags

Verbal inflection

Other tags

Lexeme disambiguation = homonym tags

Stem variant tags

Question and Focus particles:

Other tags

Semantic tags to help disambiguation & syntactic analysis

These tags should always be located just before the POS tag.

Multiple Semantic tags:

Not sure which section this goes in: (before POS)

Derivation tags

The following tags are used to describe the dynamic derivational system in Lule Sámi as encoded in this lexical description. The tags are classified according to a positional system, where each tag can be in one and only one position, and can only combine with tags from an earlier / lower position. This is done to avoid possible overgeneration in the derivational system.

Der#2 tags - tags in second position

There are no such tags in SMJ, but for symmetry and code coherence with SME the class is still kept.

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics actually), we tag all words that can’t be correctly transcribed using the SME transcriber with source language codes. Once tagged, it is possible to split the lexical transducer in smaller ones according to langu- age, and apply different IPA conversion to each of them. The principle of tagging is that we only tag to the extent needed, and following a priority:

  1. any untagged word is pronounced with SME orthographic conventions
  2. NNO and NOB have identical pronunciation, NNO is only used if different in spelling from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SME (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions All in all, we want to get as much correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct - at least some words will get the same pronunciation whether read as SME or NOB/NNO/SWE.
    • +OLang/SME - North Sámi
    • +OLang/SMS - Skolt Sámi
    • +OLang/SMA - South Sámi
    • +OLang/FIN - Finnish
    • +OLang/SWE - Swedish
    • +OLang/NOB - Norw. bokmål
    • +OLang/NNO - Norw. nynorsk
    • +OLang/ENG - English
    • +OLang/RUS - Russian
    • +OLang/UND - Undefined
    • +OLang/PARA - parallelle navn, navnet skal ikke overføres til andre samisk språk

Flag diacritics

Tags from SME, coming to smj by propernouns.

Flag diacritics

We have manually optimised the structure of our lexicon using following flag diacritics to restrict morhpological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

Flag diacritic Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
Flag diacritic Explanation
@P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split. Used e.g. in bijladagi to split bijla from dagi, or after abbreviations with full stops before the full stop, to allow an alternate +CLB analysis of it in case of a sentence final abbreviation. NB! This will give a faulty lemma for the abbreviation, as it will not include the full stop. This can lead to other issues, but presently we have no other solution if we want to keep the full stopp as a separate token. We could leave a full stop at the end of the abbreviation lemma as well (but not on the input side - we only have one full stop in the input). That must be tested, it could work, but then requires special attention when generating suggestions in e.g. grammar checkers - it should not generate two full stops.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)
Flag diacritic Explanation
@D.ErrOrth.ON@ To be written
@R.ErrOrth.ON@ To be written
@C.ErrOrth@ To be written
@P.ErrOrth.ON@ To be written

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag diacritic Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@U.CmpNone.TRUE@ Combines with the two previous ones to block compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.
@U.CmpHyph.FALSE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@C.CmpHyph@ Flag to control hyphenated compounds like proper nouns

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag diacritic Explanation
@U.Cap.Obl@ Disallow downcasing of names when not derived: Deatnu
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
@P.Px.add@ Giving possibility for Px-suffixes (all except from Nom 3.p)
@R.Px.add@ Requiring P.Px.add-flag for Px-suffixes (all except from Nom 3.p)
@P.Nom3Px.add@ Giving possibility for Px-suffixes Nom 3.p
@R.Nom3Px.add@ Requiring P.Nom3Px.add flag for Px-suffixes Nom 3.p
Flag diacritic Explanation
@U.number.one@ Flag used to give arabic numerals in smj different cases ;
@U.number.two@ Flag used to give arabic numerals in smj different cases ;
@U.number.three@ Flag used to give arabic numerals in smj different cases ;
@U.number.four@ Flag used to give arabic numerals in smj different cases ;
@U.number.five@ Flag used to give arabic numerals in smj different cases ;
@U.number.six@ Flag used to give arabic numerals in smj different cases ;
@U.number.seven@ Flag used to give arabic numerals in smj different cases ;
@U.number.eight@ Flag used to give arabic numerals in smj different cases ;
@U.number.nine@ Flag used to give arabic numerals in smj different cases ;
@U.number.zero@ Flag used to give arabic numerals in smj different cases ;

Lexicon Root

The beginning of everything. Every FST defined in LexC must start with the reserved lexicon name Root.

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is ued to disallow words tagged with +CmpNP/Only to end here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.

ENDLEX2

ENDLEX3

ENDLEX4


This (part of) documentation was generated from src/fst/morphology/root.lexc