Komi-Zyrian NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-kpv

Komi-Zyrian language model documentation

All doc-comment documentation in one large file.


src-cg3-disambiguator.cg3.md

Komi disambiguator

Delimiters

Sentence delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent
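
As a rough illustration of what the delimiter list does, the Python sketch below cuts a token stream into sentence windows whenever one of the delimiter wordforms above is seen. The function name `split_sentences` and the plain-list token representation are assumptions made for this example; the real segmentation is done by the CG-3 engine, and the `sent` tag is not modelled here.

```python
# Hypothetical sketch: split a token list at the sentence delimiters listed above.
DELIMITERS = {".", "!", "?", "…", "¶"}

def split_sentences(tokens):
    """Group tokens into windows, closing a window after each delimiter."""
    windows, current = [], []
    for token in tokens:
        current.append(token)
        if token in DELIMITERS:
            windows.append(current)
            current = []
    if current:  # trailing tokens without a final delimiter
        windows.append(current)
    return windows
```

Calling `split_sentences(["выль", "керка", ".", "пу", "?"])` yields two windows, each ending in its delimiter.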

Tags and sets

Beginning and end of sentence

BOS EOS

Miscellaneous

CmpTest Err вӧлі Sg3

Parts of speech tags

N V A Adv CC CS Inter Pron Num Pcle Clt Po Dem Deg Qnt Prop

Derivation tags

Ex/A (former adj) Ex/N Ex/Num Ex/V Ex/WORD VCar DerTag AspDerTag

Verbal categories

Prs Fut Fut1 Imprt Prt1 Prt2 Prf PrfIpf HstPrf PluPrf HstPluPrf Ind Imp Cond Opt

Sg1 Sg2 …

Nominal categories

Sg Pl Nom Gen Abl Dat Com Cns …

Verb sets

VNEG (all Neg verbs)

VFIN

ASKI (tomorrow set)

NOT-PRL (have no homograph Prolative pairs set)


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions-ikpd.cg3.md


This (part of) documentation was generated from src/cg3/functions-ikpd.cg3


src-cg3-functions.cg3.md

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) …, meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
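
The barrier scan described above can be sketched in Python as follows. This is only an illustration: the tag names in `NP_MODIFIERS` are an assumed stand-in for the real premodifier sets, tokens are simplified to `(wordform, tags)` pairs, and the sketch assumes the target is tested before the barrier at each position.

```python
# Hypothetical sketch of the CG context test (*1 N BARRIER NOT-NPMOD).
# Each token is a (wordform, set_of_tags) pair; tag names are illustrative.
NP_MODIFIERS = {"A", "Num", "Dem", "Pron"}  # assumed premodifier tags, not the real set

def is_not_npmod(tags):
    """True when a token cannot be a premodifier inside a noun phrase."""
    return not (tags & NP_MODIFIERS)

def scan_to_np_head(tokens, start):
    """Scan rightwards from start + 1; return the index of the first noun,
    or None if a NOT-NPMOD token (the barrier) is met first."""
    for i in range(start + 1, len(tokens)):
        _, tags = tokens[i]
        if "N" in tags:
            return i          # target found: the NP head
        if is_not_npmod(tags):
            return None       # barrier blocks the scan
    return None
```

With `[("w0", {"V"}), ("выль", {"A"}), ("керка", {"N"})]` and `start=0` the scan skips the adjective and returns index 2; replacing the adjective with an adverb makes the scan fail.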

These were the set types.

HABITIVE MAPPING

sma object

SUBJ MAPPING - leftovers

OBJ MAPPING - leftovers

HNOUN MAPPING

therestX adds @X to everything that is left, often erroneously disambiguated forms


This (part of) documentation was generated from src/cg3/functions.cg3


src-fst-morphology-affixes-adjectives.lexc.md

Adjective inflection


Komi (Zyrian) adjectives inflect for comparison.

Continuation lexicon has been assigned according to content


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-adpositions.lexc.md

Postposition inflection


Komi postpositions inflect for direction.

Prep lexica

Postp lexica

This contlex allows for relational words which are otherwise open to extensive declension

аддза, бӧрті, бокиті, боксянь, дырйи, йитӧдын, кузя, ног, ньылыд, паныдӧн, пӧлӧн, пыдди, пыр, понда, ради, уліті, выліті, вывті, вомас, вомӧн пӧвст


This (part of) documentation was generated from src/fst/morphology/affixes/adpositions.lexc


src-fst-morphology-affixes-adverbs.lexc.md

Adverb inflection


Komi adverbs inflect for direction.

LEXICON ADV-DEG_ deprecate ADV-ADA_ and Ad-ATAG

LEXICON ADV-MANNER_

LEXICON ADV-NEG_

LEXICON GER_


This (part of) documentation was generated from src/fst/morphology/affixes/adverbs.lexc


src-fst-morphology-affixes-conjunctors.lexc.md

Conjunctors


Komi conjunctors

LEXICON CC_

LEXICON CS_

LEXICON CS_DIAL

LEXICON CONJ_


This (part of) documentation was generated from src/fst/morphology/affixes/conjunctors.lexc


src-fst-morphology-affixes-interjections.lexc.md

Interjections


Komi Interjections

LEXICON INTERJ_

LEXICON INTERJ-CONATIVE_

LEXICON INTERJ-FORMULAIC_


This (part of) documentation was generated from src/fst/morphology/affixes/interjections.lexc


src-fst-morphology-affixes-nouns.lexc.md

Noun morphological lexica

Basic nouns.

The lexicon for basic nouns is ` N_ `

This should be phased out 2013-05-07

subsequent Cns vs Vow

Inflectional lexica

All nouns follow one contlex, “N_”, to begin with. Here is simply a list of all variants, with no more variants beyond them:

SG1

SG2

SG3

PL1

PL2

PL3

SG1 SG2 SG3 PL1 PL2 PL3

SG1 SG2 SG3 PL1 PL2 PL3

SG1

SG2

SG3

PL1

PL2

PL3

Case

+Der/а+Adv:%>а K ;


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-particles.lexc.md

Particles


Komi Particles

LEXICON PCLE_

LEXICON PCLE_NEG

LEXICON PcleIntens

LEXICON ONOM_

LEXICON DESCR_


This (part of) documentation was generated from src/fst/morphology/affixes/particles.lexc


src-fst-morphology-affixes-pronouns.lexc.md

Pronominal morphology

Closed class personal pronouns

LEXICON PRONOUN-TYPES

ми мийӧ: The 1st and 2nd persons have oblique case stem strategies that differ from the 3rd person.

ті тійӧ: these are entirely different things.

сы сійӧ: though sometimes.

Tagged in the src/morphology/stems/pronouns.xml file

Word-final cases


This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper noun inflection

Komi proper nouns inflect in the same cases as regular nouns.

Temporary lexica

LEXICON ACRON-F

LEXICON ACRON

LEXICON PROP-RUS_

LEXICON PROP_

Russian type Surnames

Preparing for the template urj-Cyrl. Beginning 2012-11-15

LEXICON CYRL-CONS_SUR

LEXICON CYRL-SIBILANT_SUR

LEXICON CYRL-VOW_SUR

LEXICON CYRL-A_SUR

LEXICON CYRL-K_SUR

LEXICON CYRL-L_SUR

LEXICON CYRL-T_SUR

LEXICON Deriv-RUS-AN_SURMAL

Абдеев:Абдеев

LEXICON Deriv-RUS-V_SURMAL

Багрий:Багр

LEXICON Deriv-RUS-IJ_SURMAL

LEXICON Deriv-RUS-IN_SURMAL

Аморский:Аморск

LEXICON Deriv-RUS-KIJ_SURMAL

LEXICON Deriv-RUS-OJ_SURMAL

LEXICON Deriv-RUS-YJ_SURMAL

PLACE NAMES FROM TEMPLATES

LEXICON PROP-PLC_KAL

LEXICON PROP-PLC_KIT

LEXICON PROP-PLC_KUDO

These are vowel-final stems. They have previously received +Sem/Fem tags.

Male given name for deriving patronyms

Should this be limited to +Sg? 2015-09-06

Вили:Вил

Андрей:Андре

Ending 2012-11-15

FEMALE NAMES FROM TEMPLATE

PLACE NAMES FROM TEMPLATES


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-quantifiers.lexc.md

Numeral morphological lexica

This has to be worked on 2012-01-19

LEXICON NUM-CARD_

LEXICON CARD

LEXICON ORD

LEXICON DET_

LEXICON DET_END

LEXICON NUM-IS_DISTR

LEXICON QNT_

LEXICON NUM-APPR (2011-11-03: this will need work)

LEXICON CARD-APPR

Inflectional lexica

All nouns follow one contlex, “Noun1”, to begin with. Here is simply a list of all variants, with no more variants beyond them:

LEXICON NumCASEPOSSLEX

LEXICON NumMWN

Arabic numerals


This (part of) documentation was generated from src/fst/morphology/affixes/quantifiers.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes

Noun_symbols_possibly_inflected

Noun_symbols_never_inflected

SYMBOL_connector

SYMBOL_NO_suff

SYMBOL_suff (can abbreviations have suffixes? Probably, yes)


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Verbal morphology

Temporary lexicon

V_ temporary lexicon gives +V+WORK

Closed class verbs

VERBNEGATIVE

Open class verbs

Some flag diacritic lines are written with regexes, others with aligned zeros. We want to migrate to regexes < … > for readability reasons (sic!)

IV_ЛОКНЫ

IV_ШУНЫ

IV_АМНЫ TV_АМНЫ

BV_АМНЫ

Verb conjugation

Derivation

This is fed by LEXICON V_ШУНЫ, and therefore certain corrections must be made 2012-01-18

овсьыны пусьыштлывлыны босьтчыштлывлыны

verb-to-noun

вевттьысьыны

бертласьны


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-phonology.twolc.md

Komi Zyrian twol file

This file documents the phonology.twolc file

cf. kpv-phon-old.xfscript cf. Rueter 2000 Хельсинкиса университетын кыв туялысь Ижкарын Перымса кывъяс симпозиум вылын лыддьӧмтор

Alphabet, Sets and Definitions

Letters of the alphabet

Triggers

Boundary symbols

Diacritics

Sets

Vowel

Palatal Vowel Cns-initial vowels

All non-vowels, consonants and hard and soft signs

All non-vowels with exception of soft sign

All but z consonants that can be followed by either і or и

Letters

Dummy

Definitions

No definitions

Rules

Rules connected to L/V alternations

Rule: The famous L/V changes л to в between a vowel and the ^Close symbol

Rule: The famous L/V, Izhva variant, where л changes to its preceding vowel (except a) before ^C2V.

Rule: Vowel lengthening а:о я:ё for the ^C2V context

Rule: The ӧ/V as in унаан

Rules for paragogic consonants

Rule: Paragogic consonant deletion

Rule: Paragogic т deletion and triple т between Cns and ^Close

Other consonant deletion rules

Rule: Paragogic т deletion and triple т

Rule: Paragogic т deletion and triple т

Rule: jDeletion after vowel

Rule: j to hard sign after consonant

Rule: l deletion ALSO triple letter

Rule: d deletion

Vowel Palatalisation rules

Rule: а 2 я, о 2 ё, у 2 ю

Rule: %{иі%} 2 і

Rule: %{иі%} 2 и

Rules for soft and hard sign

Rule: Soft Sign Deletion

Rule: Hard Sign Deletion

Rule: Hard Sign Palatalization

Other rules

To do: Look at a more logical ordering

Rule: No triple letters deletes the middle consonant in Cx Cx > Cx sequences
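
One possible reading of the triple-letter rule is that a run of three identical consonants is shortened by dropping the middle one. The Python sketch below assumes that reading, and `CONSONANTS` is only an illustrative letter list; the actual rule is a two-level rule in phonology.twolc with its own sets and contexts.

```python
import re

# Illustrative list of consonant letters; the real set is defined in phonology.twolc.
CONSONANTS = "бвгджзйклмнпрстфхцчшщ"

# Three identical consonants in a row: keep the first and last, drop the middle one.
TRIPLE = re.compile(r"([%s])\1\1" % CONSONANTS)

def no_triple_letters(form):
    """Reduce any run of three identical consonants to two."""
    return TRIPLE.sub(r"\1\1", form)
```

Forms without such a run pass through unchanged.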

Rule: IClitic

клуб+N+Sg+Err/Dial+Ill club/kerho

Rule: Disallow l to vowel after other than l


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Multichar_Symbols and Root lexicon for Komi

Check these:

Analysis symbols

The morphological analyses of wordforms for the Komi-Zyrian language are presented in this system in terms of the following symbols. (It is highly recommended to follow existing standards when adding new tags.)

The parts-of-speech tags

Subtags

Adverb subtags

Interjections

+Formulaic = expressions such as аттьӧ, ало, …

+Conative = used for calling animals, for example брысь, баль-баль, …

Nouns

Pronouns

Nominals are inflected for Number and Case

Number

Case

A category of case in Komi can be identified as:

Possessive suff

The comparative forms are:

Numeral tags:

Quantifiers (numerals)

Verb tags

Other tags

Question and Focus particles:

Tags distinguishing different versions of the same lemma (before POS)

Usage tags:

Dialect features

Check these: where do these come from (source)?

Semantic tags to help disambiguation & synt. analysis: (before POS) Borrowed from main/langs/sme/src/morphology/root.lexc

Semantic tags

Multiple Semantic tags:

Derivation

Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

Dertags

Declaring adjectival derivations Noun phrase modifiers are generally considered derivational

More dertags (TODO: sort/group)

Declaring Deverbal derivations of verbs

Tags for etymological origin marking. These have initially been used with proper nouns

Morphophonology

To represent phonologic variations in word forms we use the following symbols in the lexicon files:

Archiphonemes

Triggers to control variation

Valency tags, i.e. tags assigned to verbs for denoting their arguments

Symbols that need to be escaped on the lower side (towards twolc):

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: only allow compounds with verbs if the verb is further derived into a noun again:

Flags Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
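
To make the table concrete, here is a minimal Python sketch of how the three operators used above (P = set, D = disallow, C = clear) behave along a single path through the lexicon. The function name `path_is_allowed` is hypothetical, and the scenario in the usage note (a verb stem sets the flag, a nominalising suffix clears it) is an assumption for illustration; the real evaluation is done by the finite-state runtime.

```python
# Hypothetical sketch of how P (set), D (disallow) and C (clear) flag diacritics
# are evaluated along one path through the lexicon.
def path_is_allowed(flags):
    """Return True if the sequence of flag diacritics never fails."""
    state = {}                                   # feature -> current value
    for flag in flags:
        op, feature, *value = flag.strip("@").split(".")
        value = value[0] if value else None
        if op == "P":                            # positive set
            state[feature] = value
        elif op == "C":                          # clear the feature
            state.pop(feature, None)
        elif op == "D":                          # disallow
            if value is None and feature in state:
                return False                     # feature is set to anything
            if value is not None and state.get(feature) == value:
                return False                     # feature has the forbidden value
        else:
            raise ValueError(f"operator not covered by this sketch: {op}")
    return True
```

Under these assumptions, `path_is_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"])` is false (a verb reaches the end without being nominalised), while `path_is_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"])` is true.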

Two flags copied from sme

Flags Explanation
@P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)

Compounding

Tags

Flags

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flags Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flags Explanation
@U.Cap.Obl@ Always capital letter for names: Deatnu.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Flags Explanation
@U.CONJ-VAL.TV@ Flags used with serial verbs: VAL = Valence
@U.CONJ-VAL.IV@ Flags used with serial verbs: VAL = Valence
@U.CONJ-INF.YES@ INF = Infinitive
@U.CONJ-INF.NO@ INF = Infinitive
@U.CONJ-TX.FUT@ TX = tense
@U.CONJ-TX.PRES@ TX = tense
@U.CONJ-TX.PRET1@ TX = tense
@U.CONJ-TX.PRET2@ TX = tense
@U.CONJ-GER.IG@ GER = gerund
@U.CONJ-GER.VCAR@ GER = VCar тӧг
@U.CONJ-GER.VCARMoz@ GER = VCar тӧгмоз
@U.CONJ-GER.VMON@ GER = VMon мӧн
@U.CONJ-GER.VTER@ GER = VTer тӧдз
@U.CONJ-MX.IND@ MX = mood
@U.CONJ-MX.IMP@ MX = mood
@U.CONJ-CONNEG.YES@ CONNEG = negation
@U.CONJ-CONNEG.NO@ CONNEG = negation
@U.CONJ-NX.PL@ NX = number
@U.CONJ-NX.SG@ NX = number
@U.CONJ-POSS.1@ POSS = possessive, person 1
@U.CONJ-POSS.2@ POSS = possessive 2
@U.CONJ-POSS.3@ POSS = possessive 3
@U.CONJ-POSS.2ACC@ POSS = possessive etc.
@U.CONJ-POSS.3ACC@ POSS = possessive
@U.CONJ-PX.1@ PX = person
@U.CONJ-PX.2@ PX = person
@U.CONJ-PX.3@ PX = person
@C.CONJ-VAL@ Removal
@C.CONJ-INF@ Removal
@C.CONJ-TX@ Removal
@C.CONJ-MX@ Removal
@C.CONJ-GER@ Removal
@C.CONJ-CONNEG@ Removal
@C.CONJ-NX@ Removal
@C.CONJ-PX@ Removal
@C.CONJ-POSS@ Removal
@P.PossPx.Sg1@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Sg2@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Sg3@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Pl1@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Pl2@ FLAGS USED WITH COLLECTIVE NOUNS
@P.PossPx.Pl3@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Sg1@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Sg2@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Sg3@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Pl1@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Pl2@ FLAGS USED WITH COLLECTIVE NOUNS
@U.PossPx.Pl3@ FLAGS USED WITH COLLECTIVE NOUNS
@D.PossPx@ FLAGS USED WITH COLLECTIVE NOUNS
@C.PossPx@ FLAGS USED WITH COLLECTIVE NOUNS
@U.DECL-NX.SG@ number
@U.DECL-NX.PL@ number
@R.DECL-NX.PL@ number
@U.DECL-CX.ABE@ unify case
@U.DECL-CX.ABL@ unify case
@U.DECL-CX.ACC@ unify case
@U.DECL-CX.APR@ unify case
@U.DECL-CX.APRINE@ unify case
@U.DECL-CX.APRILL@ unify case
@U.DECL-CX.APRELA@ unify case
@U.DECL-CX.APREGR@ unify case
@U.DECL-CX.APRPRL@ unify case
@U.DECL-CX.APRTRA@ unify case
@U.DECL-CX.APRTER@ unify case
@U.DECL-CX.CAR@ unify case
@U.DECL-CX.CMP@ unify case
@U.DECL-CX.CNS@ unify case
@U.DECL-CX.COM@ unify case
@U.DECL-CX.DAT@ unify case
@U.DECL-CX.EGR@ unify case
@U.DECL-CX.ELA@ unify case
@U.DECL-CX.GEN@ unify case
@U.DECL-CX.ILL@ unify case
@U.DECL-CX.INE@ unify case
@U.DECL-CX.INS@ unify case
@U.DECL-CX.NOM@ unify case
@U.DECL-CX.PRL@ unify case
@U.DECL-CX.TRA@ unify case
@U.DECL-CX.TER@ unify case
@U.DECL-DX.INDEF@ declension type
@U.DECL-DX.PX@ declension type
@C.DECL-NX@ Removal
@C.DECL-DX@ Removal
@C.DECL-CX@ Removal
@U.Cap.Obl@ Always capital letter for names: Deatnu
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj
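
The U flags in the table above work by unification: the first value seen for a feature is stored, and every later U flag for the same feature must carry the same value. The Python sketch below illustrates this, together with the C flags that the table labels “Removal”. The function name `unify_flags` is hypothetical and only the U and C operators are handled.

```python
# Hypothetical sketch of the U (unification) flag operator.
def unify_flags(flags):
    """Return the final feature state, or None if two U flags disagree."""
    state = {}
    for flag in flags:
        op, feature, *value = flag.strip("@").split(".")
        if op == "U":
            wanted = value[0]
            # First occurrence sets the value; later ones must agree with it.
            if state.setdefault(feature, wanted) != wanted:
                return None
        elif op == "C":
            state.pop(feature, None)             # "Removal": forget the feature
        else:
            raise ValueError(f"operator not covered by this sketch: {op}")
    return state
```

For instance, two occurrences of `@U.DECL-CX.GEN@` unify, whereas `@U.DECL-CX.GEN@` followed by `@U.DECL-CX.DAT@` fails unless `@C.DECL-CX@` intervenes.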

Lexicon Root

The word forms in Komi (Zyrian) language start from the lexeme roots of basic word classes, or optionally from prefixes:

Lexica without morphology

Absolute forms ABS_ пу керка выль керка

Compounding

R

Serial-Verbs

Lexica called End, whatever they are

ABBR-IS_ADV

ABBR-IS_N

Clitics

K

WordEnd

WordEnd-2

SPAT-COMPARATIVE

COMPARATIVE

SUBSTANDARDS

Endlex

Lexicon ENDLEX And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-acronyms.lexc.md

Acronym inflection


This (part of) documentation was generated from src/fst/morphology/stems/acronyms.lexc


src-fst-morphology-stems-adjectives-russian-like_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files. важ:важ A_ “(eng) /(fin)/(rus) “ ;

ADD ADJECTIVES BELOW


This (part of) documentation was generated from src/fst/morphology/stems/adjectives-russian-like_newwords.lexc


src-fst-morphology-stems-adjectives_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files. важ+A:важ A_ “(eng) /(fin)/(rus) “ ;

ADD ADJECTIVES BELOW

colors

from Syktyvkar


This (part of) documentation was generated from src/fst/morphology/stems/adjectives_newwords.lexc


src-fst-morphology-stems-adverbs_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files. важын:важын ADV_ “(eng) /(fin)/(rus) “ ;

ADD ADVERBS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/adverbs_newwords.lexc


src-fst-morphology-stems-dialect_lexicon.lexc.md

Hypothetical dialect forms with е/э 2021-03-15


This (part of) documentation was generated from src/fst/morphology/stems/dialect_lexicon.lexc


src-fst-morphology-stems-exceptions.lexc.md

Exceptions are quite strange word-forms, the ones that do not fit anywhere else. This file contains all enumerated word forms that cannot reasonably be created from lexical data by regular inflection. Usually there should be next to no exceptions: it is always better to have a paradigm that covers only one or a few words than an exception, since exceptions will not work nicely with e.g. the compounding scheme or possibly many end applications.

The pair verb овны-вывны conjugates in more forms than are attested for the single verb вывны:

VERBS WITH FIRST PRETERITE THIRD PERSON WITHOUT с IN NORM

SPECIAL VERB FORM FOR VERBAL TERMINATIVE OF ЛОКНЫ

REDUPLICATED ADVERBS

SUPERLATIVE ADVERBS

SUPERLATIVE ADJECTIVES

ADJECTIVES NOT YET ADDED TO DICTIONARY DATABANK

VOCATIVE EXPRESSIONS

PROPER NOUNS NOT YET ADDED TO DICTIONARY DATABANK


This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


src-fst-morphology-stems-nouns_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files. чери+N:чери N_ “(eng) fish/(fin) kala|fisu/(rus) рыба” ;
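
The entry above follows the usual lexc shape: analysis side, colon, surface stem, continuation lexicon, an optional quoted gloss, and a closing semicolon. As a hedged illustration, the Python sketch below splits such a line into its parts. The pattern `ENTRY` and the function `parse_entry` are hypothetical helpers, not part of the build; the sketch assumes straight double quotes around the gloss and does not handle lexc escapes such as `%`.

```python
import re

# upper:lower  CONTLEX  "optional gloss"  ;
ENTRY = re.compile(
    r'^(?P<upper>[^\s:]+):(?P<lower>\S+)\s+'   # analysis side : surface stem
    r'(?P<contlex>\S+)\s*'                      # continuation lexicon
    r'(?:"(?P<gloss>[^"]*)"\s*)?;$'             # optional quoted gloss, then ;
)

def parse_entry(line):
    """Split one lexc entry line into its four parts, or return None."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None

entry = parse_entry('чери+N:чери N_ "(eng) fish/(fin) kala|fisu/(rus) рыба" ;')
print(entry["upper"], entry["lower"], entry["contlex"])
```

A check like this could be run over a newwords file before the entries are moved to the xml source files, to catch lines with a missing colon or semicolon.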

ADD NOUNS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc


src-fst-morphology-stems-propernouns_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files. Абъячой+N+Prop+Sem/Plc:Абъячой PROP_ “(eng) fish/(fin) /(rus)” ;

ADD NOUNS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/propernouns_newwords.lexc


src-fst-morphology-stems-verbs_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files. воны+V:во V_ “(eng) /(fin)/(rus) “ ;

test:test V_ “(eng) /(fin) /(rus) “ ;

ADD VERBS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/verbs_newwords.lexc


src-fst-phonetics-txt2ipa.xfscript.md

retroflex plosive, voiceless t ʈ 0288, 648 ( = ASCII 096)
retroflex plosive, voiced d ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v)
alveolar approximant r\
retroflex approximant r`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L

Clicks

bilabial O\ (O = capital letter)
dental |
(post)alveolar !\
palatoalveolar =\
alveolar lateral ||

Ejectives, implosives

ejective > e.g. ejective p p>
implosive < e.g. implosive b b<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa ə @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F

(NB: Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise
global fall

Diacritics

voiceless 0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=) e.g. n= (or n=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j) e.g. t' (or t_j)
velarized _G
pharyngealized _?\

dental d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
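
The symbol list above pairs an ASCII notation with IPA characters. As a small, hedged illustration of how such a table can be applied, the Python sketch below converts a string by longest match. `TO_IPA` contains only a handful of pairs copied from rows above that show both symbols, and `to_ipa` is a hypothetical helper; it is not a description of what txt2ipa.xfscript itself does.

```python
# A few ASCII-to-IPA pairs copied from the rows above that show both symbols.
TO_IPA = {
    "N\\": "ɴ", "B\\": "ʙ", "R\\": "ʀ", "p\\": "ɸ", "j\\": "ʝ",
    "N": "ŋ", "J": "ɲ", "F": "ɱ", "T": "θ", "D": "ð",
    "S": "ʃ", "Z": "ʒ", "C": "ç", "G": "ɣ", "@": "ə",
}

def to_ipa(text):
    """Convert left to right, always trying the longest ASCII symbol first."""
    symbols = sorted(TO_IPA, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for symbol in symbols:
            if text.startswith(symbol, i):
                out.append(TO_IPA[symbol])
                i += len(symbol)
                break
        else:                      # no table entry: copy the character unchanged
            out.append(text[i])
            i += 1
    return "".join(out)
```

Trying two-character symbols such as `N\` before `N` is what keeps the uvular and velar nasals apart.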


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-phonology-old.xfscript.md

Definition section

Defining Vowel

Defining Palatal Vowel

Defining Consonants

Defining non-soft consonants

Defining consonants before Cyrillic і

Defining letters

Defining flags

Defining boundaries

Defining diacritics

Defining dummy

Rule section

Stopping ы -> 0 (2011-01-26). Let’s remember that this should only affect verb forms. That means the surface vowels я а и і ӧ.

Wrong results: тӧд where тыӧд should be. Wrong: на, should be ныа.

Absence of “ы” vowel

“ы” vowel is present before


This (part of) documentation was generated from src/fst/phonology-old.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in Komi-Zyrian are read out, e.g. for text-to-speech systems.

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

KOMI GRAMMAR CHECKER

DELIMITERS

TAGS AND SETS

Beginning and end of sentence

BOS EOS

Miscellaneous

CmpTest Err

Parts of speech tags

N V A Adv CC CS Inter Pron Num Pcle Clt Po Dem Qnt Prop

Derivation tags

Ex/A (former adj) Ex/N Ex/Num Ex/V Ex/WORD DerTag

Verbal categories

Prs Fut Fut1 Imprt Prt1 Prt2 Prf PrfIpf HstPrf PluPrf HstPluPrf Ind Imp Cond Opt

Sg1 Sg2 …

Nominal categories

Sg Pl Nom Gen Abl Dat Com Cns …

PPUNCT PUNCT ¶

Verb sets

VNEG (all Neg verbs)

VFIN

ASKI (tomorrow set)

Grammarchecker sets


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for kpv

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

Unknowns are made of:
    • lower-case ASCII
    • upper-case ASCII
    • select extended latin symbols
    • extended cyrillic
    • ASCII digits
    • select symbols
    • combining diacritics as individual symbols
    • various symbols from the Private Use Area (probably Microsoft), so far:
    • U+F0B7 for “x in box”
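
The character classes listed above can be approximated with a regular expression. The Python sketch below is only an approximation under stated assumptions: the Unicode ranges are guesses at what “select” and “extended” cover, the “select symbols” class is left out, and `is_unknown_wordlike` is a hypothetical helper. The authoritative definitions are in the pmscript itself.

```python
import re

# Assumed Unicode ranges for the character classes listed above; the real
# definitions live in the pmscript and may differ.
UNKNOWN = re.compile(
    "["
    "a-z"                 # lower-case ASCII
    "A-Z"                 # upper-case ASCII
    "\u00C0-\u024F"       # select extended Latin symbols (assumed range)
    "\u0400-\u052F"       # extended Cyrillic (assumed range)
    "0-9"                 # ASCII digits
    "\u0300-\u036F"       # combining diacritics
    "\uF0B7"              # private-use symbol mentioned above
    "]+"
)

def is_unknown_wordlike(token):
    """True if the whole token consists of characters from the classes above."""
    return UNKNOWN.fullmatch(token) is not None
```

Strings that fail this test correspond to case 2), the unmatched strings.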

Unknown handling

Unknowns are tagged ?? and treated specially by hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for kpv

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for kpv

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript