Skolt Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sms

Skolt Sámi morphological analyser

This file contains all definitions of symbols written by more than one character, and it contains the initial Root lexicon.

Definitions for Multichar_Symbols

Grammatical tags

Tags for POS

Pre-derivational POS tags for CG processing

Tags for sub-POS

Types of adverbs

Number

Case

symbols ?

Possessive suffix

Adjective declension

Verb forms Veʹrbbååʹbleʹǩ

Valence

Person-number

Homonymy

Derivation

All non-positional derivations should be preceded by this tag, to make it possible to target regular expressions at all derivations in a language-independent way: just specify +Der|+Der1 .. +Der5 and you are set.
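The regular expression mentioned above can be tried out on plain analysis strings. The following is a minimal sketch in Python, not part of the analyser itself; the analysis strings and the `+Der/Hypo` tag are hypothetical examples, and the function names are made up for illustration.

```python
import re

# Matches +Der or the positional tags +Der1 .. +Der5 as whole tags.
# The lookahead keeps "+Der" from matching inside a longer tag such as
# "+Der/Hypo" (a hypothetical specific derivation tag).
DER_TAG = re.compile(r"\+Der[1-5]?(?=\+|$)")


def derivation_tags(analysis: str) -> list:
    """Return the generic derivation tags found in an analysis string."""
    return DER_TAG.findall(analysis)


def is_derived(analysis: str) -> bool:
    """True if the analysis contains any generic derivation tag."""
    return DER_TAG.search(analysis) is not None


if __name__ == "__main__":
    # Hypothetical analysis strings, for illustration only.
    print(derivation_tags("lemma+V+Der+Der/Hypo+N+Sg+Nom"))
    print(is_derived("lemma+N+Sg+Nom"))
```

Because the pattern only refers to the generic tags, the same expression should work for any language that follows this tagging convention.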

Verb derivation

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics, actually), we tag all words that can’t be correctly transcribed using the SME transcriber with source language codes. Once tagged, it is possible to split the lexical transducer into smaller ones according to language, and apply a different IPA conversion to each of them.

The principle of tagging is that we only tag to the extent needed, and following a priority:

  1. any untagged word is pronounced with SME orthographic conventions
  2. NNO and NOB have identical pronunciation, NNO is only used if different in spelling from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SME (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions

All in all, we want to get as much correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct - at least some words will get the same pronunciation whether read as SME or NOB/NNO/SWE.
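The tagging priority described above can be sketched as a small lookup. This is only an illustration under stated assumptions: the tag shape `+OLang/XXX` and the analysis strings are assumed here and are not taken from this page, and the function names are hypothetical.

```python
import re

# Assumed tag shape for the originating language: +OLang/XXX.
OLANG = re.compile(r"\+OLang/([A-Z]{3})")

# Untagged words fall back to SME orthographic conventions (rule 1).
DEFAULT_LANGUAGE = "SME"

# NNO and NOB have identical pronunciation (rule 2), so both can share
# one IPA conversion in this sketch.
SAME_PRONUNCIATION = {"NNO": "NOB"}


def ipa_source_language(analysis: str) -> str:
    """Pick the language whose IPA conversion should be applied."""
    match = OLANG.search(analysis)
    if match is None:
        return DEFAULT_LANGUAGE
    code = match.group(1)
    return SAME_PRONUNCIATION.get(code, code)


def split_by_language(analyses):
    """Group analyses by the IPA conversion they need."""
    groups = {}
    for analysis in analyses:
        groups.setdefault(ipa_source_language(analysis), []).append(analysis)
    return groups
```

SWE is kept as its own group here because its pronunciation is described as only mostly the same as NOB; whether it needs a separate conversion is a design choice left open.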

Government tags

Semantic tags

Multiple Semantic tags:

Clitic

Tags distinguishing different versions of the same lemma (before POS)

In the XML, the varid attribute is used in the st element. When varid is a mere numeric value, an extra lemma attribute is inserted in the st element, e.g. lemma=”tõlvvad”.

Other tags

Punctuation

Letters

Skolt Saami letters

These definitions are probably not needed

Archiphonemes

These are for letters with special behaviour. Say that all m-s change to n in a given context, but not this m, because it is m2. In twolc these are then defined m2:m, etc, i.e. the m2 is an m, although it is a different m.
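The idea behind archiphonemes can be illustrated outside twolc. The sketch below is a simplification in Python: the context (m becomes n before t) is a hypothetical rule chosen only for the example, and the two steps are applied one after the other, whereas twolc rules apply in parallel.

```python
def apply_m_to_n(symbols):
    """Hypothetical rule: m becomes n before t. The symbol m2 is exempt."""
    out = []
    for i, sym in enumerate(symbols):
        nxt = symbols[i + 1] if i + 1 < len(symbols) else None
        out.append("n" if sym == "m" and nxt == "t" else sym)
    return out


# Default realisation of the archiphoneme, corresponding to m2:m.
REALISATION = {"m2": "m"}


def surface(symbols):
    """Apply the rule, then realise any archiphonemes as plain letters."""
    return "".join(REALISATION.get(s, s) for s in apply_m_to_n(symbols))


if __name__ == "__main__":
    print(surface(["a", "m", "t", "a"]))   # ordinary m undergoes the rule
    print(surface(["a", "m2", "t", "a"]))  # m2 surfaces as m, unchanged
```

The multicharacter symbol m2 is treated as one unit, which is why the input is a list of symbols and not a plain string.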

Diacritic marks

These symbols govern the way the morphophonological rules treat the affix string.

This project started out using arbitrary names, X1, X2…, but since they were hard to remember, we changed to (a bit) more transparent names (^DIADEL, …). On the TODO-list: Change all X1, X2, … to easy-to-remember names.

Consonant lengthening

Vowel length and height

For vowel height, vowels are low by default.

In Skolt Saami, triggers are used for signaling change in the preceding stem (2025.09.16).

Two or more alternations may occur in the stem simultaneously; they should be addressed separately, not as portmanteau features of length as in sme. Two codas may be simultaneous loci of alternation, so we need to remember where the triggers point to, and we need to know what order they are in.

Here is what may happen in the final coda of a stem when it is the sole locus of alternation:

jiẹˈʹnstem: jeäʹnˈn+N+Sg+Loc+PxSg1

‹iẹ› diphthong alternation: vowel raising, vowel tinting
‹ˈ› allegro marking following the diphthong
‹ʹ› palatal suprasegment marking
‹n› allegro marking of what was a long consonant, but which has been reduced to a single allegro consonant before a consonant cluster that is secondarily following the extra short coda vowel

The ordering of phenomena is as follows, so we should work towards this in trigger ordering (2025-09-16: needs adjustment):

  1. Vowel height
  2. Vowel length, which here is Allegro vowel
  3. Palatal marker (soft sign)
  4. Consonant length, quality (still portmanteau)
  5. Palatal marking in alternations {kǩ}, {gǧ}, {ǥj}

Other words to consider:

jiẹˈʹrjsted: jeäʹrǧǧ+N+Sg+Loc+PxSg2
jieˈʹrjsted: jiârgg+N+Sg+Loc+PxSg2

Words with two codas undergoing variation simultaneously will feature additional trigger work where the triggers are each preceded by a %^PEN trigger, which indicates penultimate. Orderwise, penultimate triggers appear closer to the stem (see Rueter forthcomingXXX)

CHARACTERISTIC BREAKDOWN 2015-02-17

Gradation triggers 2015.01.23

Gradation triggers 2015.02.09 For Consonant Clusters

Diacritic with mnemonic names

Hyphen at compound word boundary

Escaped symbols

Symbols that need to be escaped on the lower side (towards twolc):

Usage extents are marked using the following tags:

Dialect tags:

Compounding

Flag diacritics

Flag Explanation
@P.AssocColl.ON@ Used with Kin terms and Ant
@R.AssocColl.ON@  
@C.AssocColl@  

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics, allowing compounds with verbs only if the verb is further derived into a noun again:

Flag Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
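How such flags filter paths can be shown with a small checker. This is a minimal sketch in Python of the usual flag diacritic operations P (set), R (require), D (disallow), C (clear) and U (unify); the N operation is left out. The flag sequences in the usage lines are hypothetical, since this page does not show where in the lexicon each flag is placed.

```python
import re

# @OP.Feature@ or @OP.Feature.Value@, with OP one of P, R, D, C, U.
FLAG = re.compile(r"@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(path):
    """Check a left-to-right sequence of flag diacritics on one path."""
    state = {}
    for flag in path:
        match = FLAG.fullmatch(flag)
        if match is None:
            raise ValueError("not a supported flag diacritic: " + flag)
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":      # set the feature to the value
            state[feature] = value
        elif op == "C":    # clear the feature
            state.pop(feature, None)
        elif op == "R":    # require the feature (with this value, if given)
            if current is None or (value is not None and current != value):
                return False
        elif op == "D":    # disallow the feature (with this value, if given)
            if current is not None and (value is None or current == value):
                return False
        elif op == "U":    # unify: set if unset, otherwise values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical sequences: the flag is set and then hit by the D flag.
    print(path_is_valid(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
    # The flag is cleared before the D flag is reached.
    print(path_is_valid(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```

In this reading, a path that sets NeedNoun and reaches the D flag without clearing it is rejected, which matches the stated goal of blocking verb compounds unless they are nominalised.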

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
@C.ErrOrth@ tbw
@R.ErrOrth.ON@ tbw
@D.ErrOrth.ON@ tbw
@P.ErrOrth.ON@ tbw
@P.Pmatch.Backtrack@ tbw
@U.NUMORTH.YES@  
@N.NUMORTH.YES@  
@D.NUMORTH.YES@  

Flag diacritic Explanation
@U.number.one@ Flag used to give Arabic numerals in smj different cases
@U.number.two@ Flag used to give Arabic numerals in smj different cases
@U.number.three@ Flag used to give Arabic numerals in smj different cases
@U.number.four@ Flag used to give Arabic numerals in smj different cases
@U.number.five@ Flag used to give Arabic numerals in smj different cases
@U.number.six@ Flag used to give Arabic numerals in smj different cases
@U.number.seven@ Flag used to give Arabic numerals in smj different cases
@U.number.eight@ Flag used to give Arabic numerals in smj different cases
@U.number.nine@ Flag used to give Arabic numerals in smj different cases
@U.number.zero@ Flag used to give Arabic numerals in smj different cases
@P.number.one@ Flag used to give Arabic numerals in smj different cases
@P.number.two@ Flag used to give Arabic numerals in smj different cases
@P.number.three@ Flag used to give Arabic numerals in smj different cases
@P.number.four@ Flag used to give Arabic numerals in smj different cases
@P.number.five@ Flag used to give Arabic numerals in smj different cases
@P.number.six@ Flag used to give Arabic numerals in smj different cases
@P.number.seven@ Flag used to give Arabic numerals in smj different cases
@P.number.eight@ Flag used to give Arabic numerals in smj different cases
@P.number.nine@ Flag used to give Arabic numerals in smj different cases
@P.number.ten@ Flag used to give Arabic numerals in smj different cases
@P.number.zero@ Flag used to give Arabic numerals in smj different cases

Basic lexica, pointing to the other lexicon files

These generate from merged materials:

INCOMING lemma:stem Contlex sets to be distinguished from glossing in progress


NounRoot

VerbRoot

INTERJ_ Interjections

CONJUNCTIONS INTERJ_

CS_ Subjunction

CS-TEMP_ when

NUM_ NUM_VAHTT

NUM_1Y_VYXX NUM_1Y_VUCC NUM_TOLL NUM_1Y_VCC NUM_AELDD NUM_KUEQLL NUM_TAQHTT NUM_KAEAEUQC

NUM_AANAR

NUM_ATOM NUM_JEAQNNN

PCLE_ is here since Pcle_sms2x.xml wants it. It does nothing.

PCLE-NEG_ is here since Pcle_sms2x.xml wants it. It adds +Neg.

Postpositions with government tagging possible

ADP_

PO_tag is the lexicon adding the tag +Po

PO-ILL_ PO-LOC_

PO_ is a dummy lexicon not adding anything

ADP-GOV-LOC_ PO-GOV-GEN_

Prepositions with government tagging possible

PR_tag is the lexicon adding the tag +Pr

PR_ is a dummy lexicon not adding anything

PR-TEMP-GOV-LOC_

PREFIX/A_

SUF/A_


This (part of) documentation was generated from src/fst/morphology/root.lexc
