Skolt Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sms

Skolt Sámi morphological analyser

This file contains all definitions of symbols written by more than one character, and it contains the initial Root lexicon.

Definitions for Multichar_Symbols

Grammatical tags

Tags for POS

Pre-derivational POS tags for CG processing

Tags for sub-POS

Types of adverbs

Number

Case

symbols ?

Possessive suffix

Adjective declension

Verb forms Veʹrbbååʹbleʹǩ

Valence

Person-number

Homonymy

Derivation

All non-positional derivations should be preceded by this tag, to make it possible to target regular expressions at all derivations in a language-independent way: just specify +Der|+Der1 .. +Der5 and you are set.
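As a rough illustration of the pattern described above, the following Python sketch matches `+Der` and `+Der1` to `+Der5` in an analysis string. The function name and the analysis strings in the usage note are invented for illustration; they are not taken from the analyser's output.

```python
import re

# Matches the derivation markers named above: +Der, +Der1, ..., +Der5.
# The negative lookahead stops "+Der" from matching inside a longer,
# unrelated tag (for example a hypothetical "+Derived" or "+Der6").
DER_TAG = re.compile(r"\+Der[1-5]?(?![A-Za-z0-9])")


def has_derivation(analysis: str) -> bool:
    """Return True if the analysis string contains any derivation marker."""
    return DER_TAG.search(analysis) is not None
```

With this sketch, an invented string such as `lemma+V+Der3+N+Sg+Nom` is reported as derived, while `lemma+N+Sg+Nom` is not.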

Verb derivation

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics, actually), we tag all words that cannot be correctly transcribed using the SME transcriber with source language codes. Once tagged, the lexical transducer can be split into smaller ones according to language, and a different IPA conversion can be applied to each of them.

The principle of tagging is that we only tag to the extent needed, and following a priority:

  1. any untagged word is pronounced with SME orthographic conventions
  2. NNO and NOB have identical pronunciation; NNO is only used if the spelling differs from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SME (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions

All in all, we want to get as much correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct - at least some words will get the same pronunciation whether read as SME or NOB/NNO/SWE.
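The priority rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the representation of an entry as a `(word, tag)` pair, the use of `None` for an untagged word, and both function names are hypothetical. Keeping SWE as its own group is also an assumption, based on rule 3 saying its pronunciation is "mostly", not fully, the same as NOB.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

DEFAULT_LANG = "SME"  # rule 1: untagged words use SME conventions
# Rule 2: NNO and NOB are pronounced identically, so they can share a converter.
SAME_PRONUNCIATION = {"NNO": "NOB"}


def conversion_language(source_tag: Optional[str]) -> str:
    """Pick the language whose IPA conversion should handle a word."""
    if source_tag is None:
        return DEFAULT_LANG
    return SAME_PRONUNCIATION.get(source_tag, source_tag)


def split_by_language(
    entries: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Group words by the IPA conversion that applies to them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word, tag in entries:
        groups[conversion_language(tag)].append(word)
    return dict(groups)
```

Each resulting group corresponds to one of the smaller transducers mentioned above, to which a language-specific IPA conversion could then be applied.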

Government tags

Semantic tags

Multiple Semantic tags:

Clitic

Tags distinguishing different versions of the same lemma (before POS)

In the XML, the varid attribute is used in the st element with a mere numeric value; an extra lemma attribute is inserted in the st element, e.g. lemma="tõlvvad".

Other tags

Punctuation

Letters

Skolt Saami letters

These definitions are probably not needed

Archiphonemes

These are for letters with special behaviour. Say that all m-s change to n in a given context, but not this m, because it is m2. In twolc these are then defined m2:m, etc, i.e. the m2 is an m, although it is a different m.
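The idea can be sketched in Python as follows. The word-final context is invented purely for illustration and is not an actual Skolt Sámi rule; the function names are hypothetical, and real twolc rules are compiled into transducers rather than applied symbol by symbol like this.

```python
from typing import Callable, List


def realise(symbols: List[str], context_applies: Callable[[List[str], int], bool]) -> str:
    """Map lexical symbols to a surface string.

    A plain "m" becomes "n" where the context holds; the archiphoneme
    "m2" is always realised as "m" (the m2:m pair) and never alternates.
    """
    surface = []
    for i, sym in enumerate(symbols):
        if sym == "m" and context_applies(symbols, i):
            surface.append("n")
        elif sym == "m2":
            surface.append("m")
        else:
            surface.append(sym)
    return "".join(surface)


def word_final(symbols: List[str], i: int) -> bool:
    """Hypothetical context: the symbol is the last one in the word."""
    return i == len(symbols) - 1
```

Here `realise(["a", "m"], word_final)` undergoes the change, whereas `realise(["a", "m2"], word_final)` keeps its m, because the rule only targets the plain `m` symbol.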

Diacritic marks

These symbols govern the way the morphophonological rules treat the affix string.

This project started out using arbitrary names, X1, X2…, but since they were hard to remember, we changed to (a bit) more transparent names (^DIADEL, …). On the TODO list: change all X1, X2, … to easy-to-remember names.

Special iterations

Consonant lengthening

Vowel length and height

For vowel height, vowels are low by default.

CHARACTERISTIC BREAKDOWN 2015-02-17

Gradation triggers 2015.01.23

Gradation triggers 2015.02.09 For Consonant Clusters

Diacritic with mnemonic names

Hyphen at compound word boundary

Escaped symbols

Symbols that need to be escaped on the lower side (towards twolc):

Usage extents are marked using the following tags:

Dialect tags:

Compounding

Flag diacritics


We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

| Flag | Explanation |
| --- | --- |
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
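To make the interplay of these three flags more concrete, here is a minimal Python sketch of how flag diacritics are commonly interpreted in xfst/HFST-style tools: `P` sets a feature, `C` clears it, `D` blocks a path when the feature has the given value, `R` requires it, and `U` unifies. The function `accepts` and the placement of the flags in the example paths are assumptions for illustration only; the actual lexicon may attach the flags elsewhere, and negative (`N`) settings are left out.

```python
import re

# One flag diacritic: @OP.FEATURE@ or @OP.FEATURE.VALUE@
FLAG = re.compile(r"@([PDUCR])\.([^.@]+)(?:\.([^.@]+))?@")


def accepts(path: str) -> bool:
    """Return True if the flag diacritics along `path` are consistent."""
    features = {}
    for op, feature, value in FLAG.findall(path):
        current = features.get(feature)
        if op == "P":    # positive set: overwrite the feature value
            features[feature] = value
        elif op == "C":  # clear: forget the feature
            features.pop(feature, None)
        elif op == "D":  # disallow: fail if set (to this value, if one is given)
            if current is not None and (value == "" or current == value):
                return False
        elif op == "R":  # require: fail unless set (to this value, if given)
            if current is None or (value != "" and current != value):
                return False
        elif op == "U":  # unify: set if unset, otherwise values must agree
            if current is None:
                features[feature] = value
            elif current != value:
                return False
    return True
```

Under these assumptions, a path where a verb sets the flag and then reaches a compound boundary, `verb@P.NeedNoun.ON@@D.NeedNoun.ON@noun`, is rejected, whereas a path where a nominalising suffix clears the flag first, `verb@P.NeedNoun.ON@nmlz@C.NeedNoun@@D.NeedNoun.ON@noun`, is accepted.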

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| Flag | Explanation |
| --- | --- |
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| Flag | Explanation |
| --- | --- |
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
| @C.ErrOrth@ | tbw |
| @R.ErrOrth.ON@ | tbw |
| @D.ErrOrth.ON@ | tbw |
| @P.ErrOrth.ON@ | tbw |
| @P.Pmatch.Backtrack@ | tbw |

| Flag diacritic | Explanation |
| --- | --- |
| @U.number.one@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.two@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.three@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.four@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.five@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.six@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.seven@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.eight@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.nine@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.zero@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.one@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.two@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.three@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.four@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.five@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.six@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.seven@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.eight@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.nine@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.ten@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.zero@ | Flag used to give Arabic numerals in smj different cases |

Basic lexica, pointing to the other lexicon files

INCOMING lemma:stem Contlex sets to be distinguished from glossing in progress

NounRoot

VerbRoot

INTERJ_ Interjections

CONJUNCTIONS INTERJ_

CS_ Subjunction

CS-TEMP_ when

NUM_ NUM_VAHTT

NUM_ALGG NUM_AUTT NUM_TOLL NUM_PAPP NUM_AELDD NUM_KUEQLL NUM_TAQHTT NUM_KAEAEUQC

NUM_AANAR

NUM_ATOM NUM_JEAQNNN

PCLE_ is here since Pcle_sms2x.xml wants it. It does nothing.

PCLE-NEG_ is here since Pcle_sms2x.xml wants it. It adds +Neg.

Postpositions with government tagging possible

ADP_

PO_tag is the lexicon adding the tag +Po

PO-ILL_ PO-LOC_

PO_ is a dummy lexicon not adding anything

ADP-GOV-LOC_ PO-GOV-GEN_

Prepositions with government tagging possible

PR_tag is the lexicon adding the tag +Pr

PR_ is a dummy lexicon not adding anything

PR-TEMP-GOV-LOC_

PREFIX/A_

SUF/A_


This (part of) documentation was generated from src/fst/morphology/root.lexc