Mansi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-mns

Mansi morphological analyser

This file declares the multicharacter symbols used to analyse Mansi and defines the basic Root lexicon.

Multichar_Symbols definitions

Multicharacter letters in the alphabet

Vowels with a macron

Analysis symbols

The morphological analyses of Mansi word forms are presented in this system in terms of the following symbols. (Following existing standards is strongly recommended when adding new tags.)
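The reason these tags are declared as multicharacter symbols is that the compiler must treat each one as a single unit rather than as a sequence of letters. The Python sketch below illustrates that idea under stated assumptions: the function name `tokenize`, the greedy left-to-right longest-match strategy, and the sample stem are illustrative and not taken from the Mansi sources. Only the tags `+V`, `+Ind`, `+Prs`, `+OcSg` and `+ScSg1` come from this page.

```python
def tokenize(text, multichar_symbols):
    """Split text into symbols, preferring the longest declared symbol.

    Assumption: greedy left-to-right longest match. Anything that is not
    a declared multicharacter symbol falls back to a single character.
    """
    # Try longer symbols first so that a short symbol never shadows a
    # longer one that starts at the same position.
    symbols = sorted(multichar_symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No declared symbol starts here: emit one plain character.
            tokens.append(text[i])
            i += 1
    return tokens


# Tags taken from this page; the stem is only an illustrative string.
MULTICHAR = {"+V", "+Ind", "+Prs", "+OcSg", "+ScSg1"}
tokens = tokenize("мин+V+Ind+Prs+OcSg+ScSg1", MULTICHAR)
print(tokens)
```

If a tag were missing from the declared set, it would be split into individual characters (`+`, `V`, ...), which is why new tags need to be declared before they are used in the lexicon files.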

The parts-of-speech are:

The parts of speech are further split up into:

Usage extents are marked using the following tags:

Dialect tags:

The nominals are inflected for the following cases and numbers:

The comparative forms are:

Number, person and mood

Compounded words

Abbreviated words are classified with:

Special symbols are classified with:

Tags distinguishing different versions of the same lemma (before POS)

Semantics are classified with

Clitic

Derivations are classified by the morphophonetic form of the suffix and by the source and target parts of speech.

Symbols that need to be escaped on the lower side (towards twolc):

Morphophonology

To represent phonological variation in word forms, we use the following symbols (archiphones) in the lexicon files:

%{аяØ%} PxPl3 %{аяØ%}ныл
%{аяLong%} ScPl3+OcSg %{аяLong%}ныл
%{ыиØ%} Loc and Ins
%{тØ%} Ins, PxSg3
%{ЫИ%} +V+Ind+Prs+OcSg+ScSg1
%{ЭЕLong%} +V+Ind+Prs+ScSg1, PxDu3
%{ЭЕ%} +V+Ind+Prs+ScDu2, PxSg3
%{йØ%} ыг

And the following triggers to control variation:

%{VO%} Stem ending in vowel other than и or ы
%{VI%} Stem ending in vowel и or ы
%{SYNCH%} Stem with syncope with и, ы, у hard
%{SYNCS%} Stem with syncope with и, ы, у soft
%{NOSYNCH%} Stem without syncope with и, ы, у hard
%{NOSYNCS%} Stem without syncope with и, ы, у soft
%{VCH%} Stem ending in single hard consonant
%{VCCH%} Stem ending in hard consonant cluster
%{VCS%} Stem ending in single soft consonant
%{VCCS%} Stem ending in soft consonant cluster
%{VA%} -аӈкве verb
%{VU%} -уӈкве verb
%^RmVow stem-final vowel removal

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics. Compounds with verbs are allowed only if the verb is further derived into a noun again:

Flag Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
@D.ErrOrth.ON@ Disallow ErrOrth
@C.ErrOrth@ Clear ErrOrth flag
@P.ErrOrth.ON@ Set positive value for ErrOrth flag
@R.ErrOrth.ON@ Require that the ErrOrth flag is set to ON
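To make the operator letters in these flags concrete, here is a minimal Python sketch of how flag diacritics are commonly evaluated along a path through the lexicon: `P` sets a value, `C` clears a feature, `R` requires a value, `D` disallows a value, and `U` unifies. The function name `accepts`, the omission of the `N` operator, and the example paths (where each flag would sit in the lexicon) are assumptions for illustration. They do not describe how the Mansi lexicon actually places these flags.

```python
import re

# Matches @OP.FEATURE@ and @OP.FEATURE.VALUE@ (OP limited to a subset).
FLAG = re.compile(r"@([PCRDU])\.([^.@]+)(?:\.([^.@]+))?@")


def accepts(path):
    """Return True if every flag diacritic in `path` is satisfied.

    Simplified model: the state maps a feature name to its current value.
    """
    state = {}
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":                      # positive set
            state[feature] = value
        elif op == "C":                    # clear the feature
            state.pop(feature, None)
        elif op == "R":                    # require a value (or any value)
            if current is None or (value and current != value):
                return False
        elif op == "D":                    # disallow a value (or any value)
            if current is not None and (not value or current == value):
                return False
        elif op == "U":                    # unify: set if unset, else must match
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# Hypothetical paths: a verb in a compound sets NeedNoun, a nominalising
# derivation clears it, and the end of the word disallows it.
print(accepts("@P.NeedNoun.ON@@D.NeedNoun.ON@"))              # blocked
print(accepts("@P.NeedNoun.ON@@C.NeedNoun@@D.NeedNoun.ON@"))  # allowed
```

Under this model, a path on which `NeedNoun` is still `ON` when `@D.NeedNoun.ON@` is reached is rejected, while a path that passes `@C.NeedNoun@` first is accepted.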

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Flags used to identify parts of speech

Flags used with serial verbs

problematic

FLAGS USED WITH VERB PREFIXES

FLAGS USED WITH ERRORS, ORTHOGRAPHIC or others

FLAGS USED WITH COLLECTIVE NOUNS

number

Removal

The basic lexica

LEXICON Root

The word forms in the Mansi language start from the lexeme roots of basic word classes, or optionally from prefixes:

Nouns ;
Verbs ;
VPrefixes ;
Adjectives ;
Adverbs ;
Pronouns ;
Numerals ;
Conjunctions ;
Interjections ;
Participles ;
Postpositions ;
PROP_MANSINAMES ; Mansi-specific proper nouns
urj-Cyrl-ProperNouns ; common Cyrillic proper nouns
Punctuation ;
Symbols ;
Abbreviation ;
foreign_words ;


This (part of) documentation was generated from src/fst/morphology/root.lexc