Komi-Zyrian NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-kpv

Multichar_Symbols and Root lexicon for Komi

Check these:

Analysis symbols

The morphological analyses of wordforms for the Komi-Zyrian language are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags.)

The parts-of-speech tags

Subtags

Adverb subtags

Interjections

+Formulaic = expressions such as аттьӧ, ало, …

+Conative = used for calling animals, for example брысь, баль-баль, …

Nouns

Pronouns

Nominals are inflected for Number and Case

Number

Case

A category of case in Komi can be identified as:

Possessive suff

The comparative forms are:

Numeral tags:

Quantifiers (numerals)

Verb tags

Other tags

Question and Focus particles:

Tags distinguishing different versions of the same lemma (before POS)

Usage tags:

Dialect features

Check these: where do these come from (source)?

Semantic tags to help disambiguation and syntactic analysis (placed before POS). Borrowed from main/langs/sme/src/morphology/root.lexc

Semantic tags

Multiple Semantic tags:

Derivation

Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

Dertags

Declaring adjectival derivations

Noun phrase modifiers are generally considered derivational.

More dertags (TODO: sort/group)

Declaring Deverbal derivations of verbs

Tags for etymological origin marking. These were initially used with proper nouns.

Morphophonology

To represent phonologic variations in word forms we use the following symbols in the lexicon files:

Archiphonemes

Triggers to control variation

Valency tags, i.e. tags assigned to verbs to denote their arguments

Symbols that need to be escaped on the lower side (towards twolc):

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are allowed only if the verb is further derived into a noun again.

Flags Explanation
@P.NeedNoun.ON@ Set the flag: a verb has entered a compound and must still be nominalised
@D.NeedNoun.ON@ Disallow the path while the flag is set, i.e. block compounds with verbs that are not nominalised
@C.NeedNoun@ Clear the flag, so that the compound is allowed again
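
How the three NeedNoun flags interact can be sketched with a small Python simulation. It is only a simplified model of the standard P (set), D (disallow) and C (clear) flag diacritic operations, not the code used by the finite-state tools. The function name `path_is_allowed` and the example paths are hypothetical, and the assumption that nominalisation is the step that emits `@C.NeedNoun@` follows from the description above rather than from the lexicon source.

```python
import re

# Matches @OP.FEATURE@ or @OP.FEATURE.VALUE@, e.g. @P.NeedNoun.ON@ or @C.NeedNoun@.
FLAG = re.compile(r"@([PDC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(symbols):
    """Return True if no D (disallow) test fails along the path (hypothetical helper)."""
    state = {}  # feature name -> current value
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbols (letters, tags) do not touch the flags
        op, feature, value = match.groups()
        if op == "P":    # positive set: always succeeds, overwrites the value
            state[feature] = value
        elif op == "C":  # clear: forget the feature
            state.pop(feature, None)
        elif op == "D":  # disallow: fail if the feature currently has this value
            current = state.get(feature)
            if current is not None and (value is None or current == value):
                return False
    return True


# Assumed paths: a verb in a compound sets the flag, the end of the word tests it.
verb_compound = ["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]
nominalised = ["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]

print(path_is_allowed(verb_compound))  # False: the flag is still set at the test
print(path_is_allowed(nominalised))    # True: the flag was cleared first
```

Because the test is a D operation, words that never set the flag pass through unaffected, which is why the flag can sit on a shared final lexicon without harming ordinary words.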

Two flags copied from sme

Flags Explanation
@P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)

Compounding

Tags

Flags

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flags Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flags Explanation
@U.Cap.Obl@ Always capital letter for names: Deatnu.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
Flags Explanation
@U.CONJ-VAL.TV@ Flags used with serial verbs: VAL = Valence
@U.CONJ-VAL.IV@ Flags used with serial verbs: VAL = Valence
@U.CONJ-INF.YES@ INF = Infinitive
@U.CONJ-INF.NO@ INF = Infinitive
@U.CONJ-TX.FUT@ TX = tense
@U.CONJ-TX.PRES@ TX = tense
@U.CONJ-TX.PRET1@ TX = tense
@U.CONJ-TX.PRET2@ TX = tense
@U.CONJ-GER.IG@ GER = gerund
@U.CONJ-GER.VCAR@ GER = VCar тӧг
@U.CONJ-GER.VCARMoz@ GER = VCar тӧгмоз
@U.CONJ-GER.VMON@ GER = VMon мӧн
@U.CONJ-GER.VTER@ GER = VTer тӧдз
@U.CONJ-MX.IND@ MX = mood
@U.CONJ-MX.IMP@ MX = mood
@U.CONJ-CONNEG.YES@ CONNEG = negation
@U.CONJ-CONNEG.NO@ CONNEG = negation
@U.CONJ-NX.PL@ NX = number
@U.CONJ-NX.SG@ NX = number
@U.CONJ-POSS.1@ POSS = possessive, person 1
@U.CONJ-POSS.2@ POSS = possessive 2
@U.CONJ-POSS.3@ POSS = possessive 3
@U.CONJ-POSS.2ACC@ POSS = possessive etc.
@U.CONJ-POSS.3ACC@ POSS = possessive
@U.CONJ-PX.1@ PX = person
@U.CONJ-PX.2@ PX = person
@U.CONJ-PX.3@ PX = person
@C.CONJ-VAL@ Removal
@C.CONJ-INF@ Removal
@C.CONJ-TX@ Removal
@C.CONJ-MX@ Removal
@C.CONJ-GER@ Removal
@C.CONJ-CONNEG@ Removal
@C.CONJ-NX@ Removal
@C.CONJ-PX@ Removal
@C.CONJ-POSS@ Removal
@P.PossPx.Sg1@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Sg2@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Sg3@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Pl1@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Pl2@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Pl3@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Sg1@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Sg2@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Sg3@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Pl1@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Pl2@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Pl3@ FLAGS USED WITH COLLECTIVE NOUNS
@D.PossPx@ FLAGS USED WITH COLLECTIVE NOUNS
@C.PossPx@ FLAGS USED WITH COLLECTIVE NOUNS
@U.DECL-NX.SG@ number
@U.DECL-NX.PL@ number
@R.DECL-NX.PL@ number
@U.DECL-CX.ABE@ unify case
@U.DECL-CX.ABL@ unify case
@U.DECL-CX.ACC@ unify case
@U.DECL-CX.APR@ unify case
@U.DECL-CX.APRINE@ unify case
@U.DECL-CX.APRILL@ unify case
@U.DECL-CX.APRELA@ unify case
@U.DECL-CX.APREGR@ unify case
@U.DECL-CX.APRPRL@ unify case
@U.DECL-CX.APRTRA@ unify case
@U.DECL-CX.APRTER@ unify case
@U.DECL-CX.CAR@ unify case
@U.DECL-CX.CMP@ unify case
@U.DECL-CX.CNS@ unify case
@U.DECL-CX.COM@ unify case
@U.DECL-CX.DAT@ unify case
@U.DECL-CX.EGR@ unify case
@U.DECL-CX.ELA@ unify case
@U.DECL-CX.GEN@ unify case
@U.DECL-CX.ILL@ unify case
@U.DECL-CX.INE@ unify case
@U.DECL-CX.INS@ unify case
@U.DECL-CX.NOM@ unify case
@U.DECL-CX.PRL@ unify case
@U.DECL-CX.TRA@ unify case
@U.DECL-CX.TER@ unify case
@U.DECL-DX.INDEF@ declension type
@U.DECL-DX.PX@ declension type
@C.DECL-NX@ Removal
@C.DECL-DX@ Removal
@C.DECL-CX@ Removal
@U.Cap.Obl@ Always capital letter for names: Deatnu
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj
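
Most flags in the table above are U (unification) flags, with a few R (require) and C (clear) flags. The following Python sketch models how such flags make two parts of a word agree, for example on case. It is a simplified illustration of the standard flag diacritic operations. The function name `flags_agree` and the example sequences are hypothetical and not taken from the lexicon.

```python
import re

# Matches @OP.FEATURE@ or @OP.FEATURE.VALUE@, e.g. @U.DECL-CX.INE@ or @C.DECL-CX@.
FLAG = re.compile(r"@([URC])\.([^.@]+)(?:\.([^.@]+))?@")


def flags_agree(symbols):
    """Return True if all U and R tests along the path succeed (hypothetical helper)."""
    state = {}  # feature name -> current value
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # not one of the modelled flag diacritics
        op, feature, value = match.groups()
        if op == "C":    # clear: the "Removal" rows in the table
            state.pop(feature, None)
        elif op == "U":  # unify: set if unset, otherwise the values must match
            if feature not in state:
                state[feature] = value
            elif state[feature] != value:
                return False
        elif op == "R":  # require: the feature must already be set (to this value)
            if feature not in state:
                return False
            if value is not None and state[feature] != value:
                return False
    return True


print(flags_agree(["@U.DECL-CX.INE@", "@U.DECL-CX.INE@"]))                # True
print(flags_agree(["@U.DECL-CX.INE@", "@U.DECL-CX.ILL@"]))                # False
print(flags_agree(["@U.DECL-CX.INE@", "@C.DECL-CX@", "@U.DECL-CX.ILL@"]))  # True
print(flags_agree(["@U.DECL-NX.SG@", "@R.DECL-NX.PL@"]))                  # False
```

The third call shows why each feature has a matching C flag: clearing the feature lets a later part of the word choose its value freely again.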

Lexicon Root

The word forms in Komi (Zyrian) language start from the lexeme roots of basic word classes, or optionally from prefixes:

Lexica without morphology

Absolute forms ABS_ пу керка выль керка

Compounding

R

Serial-Verbs

Lexica called End, whatever they are

ABBR-IS_ADV

ABBR-IS_N

Clitics

K

WordEnd

WordEnd-2

SPAT-COMPARATIVE

COMPARATIVE

SUBSTANDARDS

Endlex

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
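
The ENDLEX entry acts as a final gate: a word may end only if none of the three disallow tests fires. A minimal Python sketch of that gate follows, assuming the flag values accumulated along a path are available as a plain dictionary. The names `ENDLEX_TESTS` and `can_reach_end` are hypothetical and only serve the illustration.

```python
# The three D (disallow) tests on the ENDLEX entry, as (feature, value) pairs.
ENDLEX_TESTS = [
    ("CmpOnly", "FALSE"),  # @D.CmpOnly.FALSE@
    ("CmpPref", "TRUE"),   # @D.CmpPref.TRUE@
    ("NeedNoun", "ON"),    # @D.NeedNoun.ON@
]


def can_reach_end(flag_state):
    """Return True if no ENDLEX disallow test matches the current flag state.

    flag_state maps feature names to their current values; features that
    were never set (or were cleared) are simply absent from the dictionary.
    """
    return all(flag_state.get(feature) != value for feature, value in ENDLEX_TESTS)


print(can_reach_end({}))                    # True: no flags set, nothing to block
print(can_reach_end({"NeedNoun": "ON"}))    # False: verb compound not nominalised
print(can_reach_end({"CmpOnly": "FALSE"}))  # False: compound-only word ends too early
print(can_reach_end({"CmpOnly": "TRUE"}))   # True: the word has passed R
```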


This (part of) documentation was generated from src/fst/morphology/root.lexc