Komi-Zyrian NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-kpv

Komi-Zyrian language model documentation

    All doc-comment documentation in one large file.


    src-cg3-disambiguator.cg3.md

    Komi disambiguator

    Delimiters

    Sentence delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent
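    As a rough illustration of what the delimiters do, the sketch below cuts a token stream into sentence windows, closing a window after each delimiter token. It is only a Python approximation of the behaviour; the function name and the toy tokens are assumptions, and the real windowing is done by the CG engine.

```python
# Delimiter wordforms taken from the list above (without the CG "<...>" wrapping).
DELIMITERS = {".", "!", "?", "…", "¶"}

def split_windows(tokens):
    """Group tokens into windows; each window ends right after a delimiter."""
    windows, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in DELIMITERS:
            windows.append(current)
            current = []
    if current:  # trailing tokens without a closing delimiter
        windows.append(current)
    return windows

example = split_windows(["a", "b", ".", "c", "?", "d"])
```

    Rules in the disambiguator only see one such window at a time, which is why the choice of delimiters matters.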

    Tags and sets

    Beginning and end of sentence

    BOS EOS

    Miscellaneous

    CmpTest Err вӧлі Sg3

    Parts of speech tags

    N V A Adv CC CS Inter Pron Num Pcle Clt Po Dem Deg Qnt Prop

    Derivation tags

    Ex/A (former adj) Ex/N Ex/Num Ex/V Ex/WORD VCar DerTag AspDerTag

    Verbal categories

    Prs Fut Fut1 Imprt Prt1 Prt2 Prf PrfIpf HstPrf PluPrf HstPluPrf Ind Imp Cond Opt

    Sg1 Sg2 …

    Nominal categories Sg Pl Nom Gen Abl Dat Com Cns …

    Verb sets

    VNEG (all Neg verbs)

    VFIN

    ASKI (tomorrow set)

    NOT-PRL (have no homograph Prolative pairs set)


    This (part of) documentation was generated from src/cg3/disambiguator.cg3


    src-cg3-functions-ikpd.cg3.md


    This (part of) documentation was generated from src/cg3/functions-ikpd.cg3


    src-cg3-functions.cg3.md

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
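    The scan described above can be sketched in Python as follows. This is a simplified model, not the CG-3 implementation: each cohort is reduced to a single set of tags, and the modifier inventory `NP_MODIFIERS` is a hypothetical stand-in for the sets actually defined in functions.cg3.

```python
# Hypothetical inventory of tags that may premodify an NP head.
NP_MODIFIERS = frozenset({"A", "Num", "Dem", "Pron"})

def scan_to_np_head(cohorts, start, target="N", np_modifiers=NP_MODIFIERS):
    """Return the index of the first cohort to the right of `start` that
    carries `target`, or None if a non-modifier (a barrier) comes first."""
    for i in range(start + 1, len(cohorts)):
        tags = cohorts[i]
        if target in tags:
            return i          # found the NP head
        if not (tags & np_modifiers):
            return None       # cohort is in NOT-NPMOD: the scan is blocked
    return None

sentence = [{"V"}, {"Dem"}, {"A"}, {"N"}]   # verb, then "Dem A N"
blocked = [{"V"}, {"Adv"}, {"N"}]           # an adverb intervenes
```

    With these toy inputs, the scan from position 0 reaches the noun in `sentence` but is stopped by the adverb in `blocked`.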

    These were the set types.

    HABITIVE MAPPING

    sma object

    SUBJ MAPPING - leftovers

    OBJ MAPPING - leftovers

    HNOUN MAPPING

    therestX adds @X to everything that is left, often erroneously disambiguated forms


    This (part of) documentation was generated from src/cg3/functions.cg3


    src-fst-morphology-affixes-adjectives.lexc.md

    Adjective inflection


    Komi (Zyrian) adjectives compare.

    Continuation lexicon has been assigned according to content


    This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


    src-fst-morphology-affixes-adpositions.lexc.md

    Postposition inflection


    Komi postpositions inflect for direction.

    Prep lexica

    Postp lexica

    This contlex allows for relational words which are otherwise open to extensive declension

    аддза, бӧрті, бокиті, боксянь, дырйи, йитӧдын, кузя, ног, ньылыд, паныдӧн, пӧлӧн, пыдди, пыр, понда, ради, уліті, выліті, вывті, вомас, вомӧн пӧвст



    This (part of) documentation was generated from src/fst/morphology/affixes/adpositions.lexc


    src-fst-morphology-affixes-adverbs.lexc.md

    Adverb inflection


    Komi adverbs inflect for direction.

    LEXICON ADV-DEG_ (deprecate ADV-ADA_ and Ad-ATAG)

    LEXICON ADV-MANNER_

    LEXICON ADV-NEG_

    LEXICON GER_


    This (part of) documentation was generated from src/fst/morphology/affixes/adverbs.lexc


    src-fst-morphology-affixes-conjunctors.lexc.md

    Conjunctors


    Komi conjunctors

    LEXICON CC_

    LEXICON CS_

    LEXICON CS_DIAL

    LEXICON CONJ_


    This (part of) documentation was generated from src/fst/morphology/affixes/conjunctors.lexc


    src-fst-morphology-affixes-interjections.lexc.md

    Interjections


    Komi Interjections

    LEXICON INTERJ_

    LEXICON INTERJ-CONATIVE_

    LEXICON INTERJ-FORMULAIC_


    This (part of) documentation was generated from src/fst/morphology/affixes/interjections.lexc


    src-fst-morphology-affixes-nouns.lexc.md

    Noun morphological lexica

    Basic nouns.

    The lexicon for basic nouns is ` N_ `

    This should be phased out 2013-05-07

    subsequent Cns vs Vow

    Inflectional lexica

    All nouns follow one contlex, “N_”, to begin with. Here is simply a list of all variants, with no further variants beyond them:

    SG1

    SG2

    SG3

    PL1

    PL2

    PL3

    SG1 SG2 SG3 PL1 PL2 PL3

    SG1 SG2 SG3 PL1 PL2 PL3

    SG1

    SG2

    SG3

    PL1

    PL2

    PL3

    Case

    +Der/а+Adv:%>а K ;


    This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


    src-fst-morphology-affixes-particles.lexc.md

    Particles


    Komi Particles

    LEXICON PCLE_

    LEXICON PCLE_NEG

    LEXICON PcleIntens

    LEXICON ONOM_

    LEXICON DESCR_


    This (part of) documentation was generated from src/fst/morphology/affixes/particles.lexc


    src-fst-morphology-affixes-pronouns.lexc.md

    Pronominal morphology

    Closed class personal pronouns

    LEXICON PRONOUN-TYPES

    ми мийӧ The 1st and 2nd persons have oblique case stem strategies that differ from the 3rd person: ті тійӧ (these are entirely different things) сы сійӧ (though sometimes)

    Tagged in the src/morphology/stems/pronouns.xml file

    Word-final cases


    This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


    src-fst-morphology-affixes-propernouns.lexc.md

    Proper noun inflection

    Komi proper nouns inflect in the same cases as regular nouns.

    Temporary lexica

    LEXICON ACRON-F

    LEXICON ACRON

    LEXICON PROP-RUS_ LEXICON PROP_

    Russian type Surnames

    Preparing for the template urj-Cyrl. Beginning 2012-11-15

    LEXICON CYRL-CONS_SUR

    LEXICON CYRL-SIBILANT_SUR

    LEXICON CYRL-VOW_SUR

    LEXICON CYRL-A_SUR

    LEXICON CYRL-K_SUR

    LEXICON CYRL-L_SUR

    LEXICON CYRL-T_SUR

    LEXICON Deriv-RUS-AN_SURMAL

    Абдеев:Абдеев LEXICON Deriv-RUS-V_SURMAL

    Багрий:Багр LEXICON Deriv-RUS-IJ_SURMAL

    LEXICON Deriv-RUS-IN_SURMAL

    Аморский:Аморск LEXICON Deriv-RUS-KIJ_SURMAL

    LEXICON Deriv-RUS-OJ_SURMAL

    LEXICON Deriv-RUS-YJ_SURMAL

    PLACE NAMES FROM TEMPLATES

    These are vowel-final stems. They have previously received +Sem/Fem tags.

    Male given name for deriving patronyms

    Should this be limited to +Sg? 2015-09-06

    Вили:Вил

    Андрей:Андре

    Ending 2012-11-15

    FEMALE NAMES FROM TEMPLATE

    PLACE NAMES FROM TEMPLATES


    This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


    src-fst-morphology-affixes-quantifiers.lexc.md

    Numeral morphological lexica

    This has to be worked on (2012-01-19)

    LEXICON NUM-CARD_

    LEXICON CARD

    LEXICON ORD

    LEXICON DET_

    LEXICON DET_END

    LEXICON NUM-IS_DISTR

    LEXICON QNT_

    LEXICON NUM-APPR (2011-11-03: this will need work)

    LEXICON CARD-APPR

    Inflectional lexica

    All nouns follow one contlex, “Noun1”, to begin with. Here is simply a list of all variants, with no further variants beyond them:

    LEXICON NumCASEPOSSLEX

    LEXICON NumMWN

    Arabic numerals


    This (part of) documentation was generated from src/fst/morphology/affixes/quantifiers.lexc


    src-fst-morphology-affixes-symbols.lexc.md

    Symbol affixes

    Noun_symbols_possibly_inflected

    Noun_symbols_never_inflected

    SYMBOL_connector

    SYMBOL_NO_suff

    SYMBOL_suff (can abbreviations have suffixes? Probably, yes)


    This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


    src-fst-morphology-affixes-verbs.lexc.md

    Verbal morphology

    Temporary lexicon

    V_ temporary lexicon gives +V+WORK

    Closed class verbs

    VERBNEGATIVE

    Open class verbs

    Some flag diacritic lines are written with regexes, others with aligned zeros. We want to migrate to regexes < … > for readability reasons (sic!)

    IV_ЛОКНЫ

    IV_ШУНЫ

    IV_АМНЫ TV_АМНЫ

    BV_АМНЫ

    Verb conjugation

    Derivation

    This is fed by LEXICON V_ШУНЫ, and therefore certain corrections must be made (2012-01-18)

    овсьыны пусьыштлывлыны босьтчыштлывлыны

    verb-to-noun

    вевттьысьыны

    бертласьны


    This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


    src-fst-morphology-phonology.twolc.md

    Komi Zyrian twol file

    This file documents the phonology.twolc file

    cf. kpv-phon-old.xfscript; cf. Rueter 2000, Хельсинкиса университетын кыв туялысь Ижкарын Перымса кывъяс симпозиум вылын лыддьӧмтор

    Alphabet, Sets and Definitions

    Letters of the alphabet

    Triggers

    Boundary symbols

    Diacritics

    Sets

    Vowel

    Palatal Vowel Cns-initial vowels

    All non-vowels, consonants and hard and soft signs

    All non-vowels with exception of soft sign

    All but z consonants that can be followed by either і or и

    Letters

    Dummy

    Definitions

    No definitions

    Rules

    Rules connected to L/V alternations

    Rule: The famous L/V alternation changes л to в between a vowel and the ^Close symbol

    Rule: The famous L/V alternation in Izhva, where л changes to its preceding vowel (except а) before ^C2V.

    Rule: Vowel lengthening а:о я:ё for the ^C2V context

    Rule: The ӧ/V as in унаан
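    To make the first L/V rule in this group concrete, here is a minimal Python sketch of “л becomes в between a vowel and ^Close”. It is not the twolc rule itself: the vowel inventory, the way the trigger is written into the string, and the stems `вӧл^Close` and `ныл^Close` are assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumed set of Komi vowel letters (NFC-normalised).
VOWELS = unicodedata.normalize("NFC", "аеёиіоӧуыэюя")
CLOSE = "^Close"  # trigger symbol, written out literally in this sketch

# л preceded by a vowel and followed directly by the trigger.
_LV = re.compile(rf"(?<=[{VOWELS}])л(?={re.escape(CLOSE)})")

def apply_lv(lexical):
    """Rewrite л to в in the rule context, then drop the trigger symbol."""
    form = unicodedata.normalize("NFC", lexical)
    form = _LV.sub("в", form)
    return form.replace(CLOSE, "")

horse = apply_lv("вӧл^Close")   # stem closed by the trigger
horse_px = apply_lv("вӧлыс")    # no trigger, so л is kept
```

    In the real grammar the trigger is realised as zero by the two-level rules; the final `replace` call only imitates that step.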

    Rules for paragogic consonants

    Rule: Paragogic consonant deletion

    Rule: Paragogic т deletion and triple т between Cns and ^Close

    Other consonant deletion rules

    Rule: Paragogic т deletion and triple т

    Rule: Paragogic т deletion and triple т

    Rule: jDeletion after vowel

    Rule: j to hard sign after consonant

    Rule: l deletion ALSO triple letter

    Rule: d deletion

    Vowel Palatalisation rules

    Rule: а 2 я, о 2 ё, у 2 ю

    Rule: %{иі%} 2 і

    Rule: %{иі%} 2 и

    Rules for soft and hard sign

    Rule: Soft Sign Deletion

    Rule: Hard Sign Deletion

    Rule: Hard Sign Palatalization

    Other rules

    To do: Look at a more logical ordering

    Rule: No triple letters deletes the middle consonant in Cx Cx > Cx sequences

    Rule: IClitic

    клуб+N+Sg+Err/Dial+Ill club/kerho

    Rule: Disallow l to vowel after other than l


    This (part of) documentation was generated from src/fst/morphology/phonology.twolc


    src-fst-morphology-root.lexc.md

    Multichar_Symbols and Root lexicon for Komi

    Check these:

    Analysis symbols

    The morphological analyses of wordforms for the Komi-Zyrian language are presented in this system in terms of the following symbols. (It is strongly recommended to follow existing standards when adding new tags.)

    The parts-of-speech tags

    Subtags

    Adverb subtags
    Interjections

    +Formulaic = expressions such as аттьӧ, ало, …
    +Conative = used for calling animals, for example брысь, баль-баль, …

    Nouns
    Pronouns
    Nominals are inflected for Number and Case
    Number
    Case

    A category of case in Komi can be identified as:

    Possessive suff
    The comparative forms are:
    Numeral tags:
    Quantifiers (numerals)
    Verb tags
    Other tags
    Question and Focus particles:

    Tags distinguishing different versions of the same lemma (before POS)

    Usage tags:
    Dialect features
    Check these. Where do these come from (source)?

    Semantic tags to help disambiguation & synt. analysis: (before POS) Borrowed from main/langs/sme/src/morphology/root.lexc

    Semantic tags

    Multiple Semantic tags:

    Derivation

    Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

    Dertags

    Declaring adjectival derivations Noun phrase modifiers are generally considered derivational

    More dertags (TODO: sort/group)
    Declaring Deverbal derivations of verbs
    Tags for etymological origin marking. These were initially used with proper nouns

    Morphophonology

    To represent phonologic variations in word forms we use the following symbols in the lexicon files:

    Archiphonemes
    Triggers to control variation
    Valency tags, i.e. tags assigned to verbs for denoting their arguments

    Symbols that need to be escaped on the lower side (towards twolc):

    Flag diacritics

    We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics, i.e. to allow compounds with verbs only if the verb is further derived into a noun again:

    Flags Explanation
    @P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
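    How such flags filter paths can be sketched with a small Python interpreter for the standard flag operations (P = set, D = disallow, C = clear, R = require, U = unify). This is a simplified model that ignores negated values, and the assumption that the verb stem carries the P flag while the nominalising suffix carries the C flag is for illustration only.

```python
import re

FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")

def path_is_allowed(flags):
    """Walk the flags of one path left to right; False means the path is blocked."""
    state = {}
    for flag in flags:
        m = FLAG.fullmatch(flag)
        if m is None:
            raise ValueError(f"not a flag diacritic: {flag!r}")
        op, feature, value = m.groups()
        if op in "PU" and value is None:
            raise ValueError(f"{op} flags need a value: {flag!r}")
        current = state.get(feature)
        if op == "P":                      # set unconditionally
            state[feature] = value
        elif op == "C":                    # clear the feature
            state.pop(feature, None)
        elif op == "D":                    # block if set (to this value)
            if current is not None and (value is None or current == value):
                return False
        elif op == "R":                    # block unless set (to this value)
            if current is None or (value is not None and current != value):
                return False
        elif op == "U":                    # set if unset, else values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True

bare_verb = ["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]
nominalised = ["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]
```

    Under these assumptions the `bare_verb` path is rejected at the D flag, while the `nominalised` path passes because the C flag has cleared the feature first.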

    Two flags copied from sme

    Flags Explanation
    @P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split.
    @P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)

    Compounding

    Tags
    Flags

    For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

    Flags Explanation
    @P.CmpFrst.FALSE@ Require that words tagged as such only appear first
    @D.CmpPref.TRUE@ Block such words from entering ENDLEX
    @P.CmpPref.FALSE@ Block these words from making further compounds
    @D.CmpLast.TRUE@ Block such words from entering R
    @D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
    @U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
    @P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
    @D.CmpOnly.FALSE@ Disallow words coming directly from root.

    Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

    Flags Explanation
    @U.Cap.Obl@ Always capital letter for names: Deatnu.
    @U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
    Flags Explanation
    @U.CONJ-VAL.TV@ Flags used with serial verbs: VAL = Valence
    @U.CONJ-VAL.IV@ Flags used with serial verbs: VAL = Valence
    @U.CONJ-INF.YES@ INF = Infinitive
    @U.CONJ-INF.NO@ INF = Infinitive
    @U.CONJ-TX.FUT@ TX = tense
    @U.CONJ-TX.PRES@ TX = tense
    @U.CONJ-TX.PRET1@ TX = tense
    @U.CONJ-TX.PRET2@ TX = tense
    @U.CONJ-GER.IG@ GER = gerund
    @U.CONJ-GER.VCAR@ GER = VCar тӧг
    @U.CONJ-GER.VCARMoz@ GER = VCar тӧгмоз
    @U.CONJ-GER.VMON@ GER = VMon мӧн
    @U.CONJ-GER.VTER@ GER = VTer тӧдз
    @U.CONJ-MX.IND@ MX = mood
    @U.CONJ-MX.IMP@ MX = mood
    @U.CONJ-CONNEG.YES@ CONNEG = negation
    @U.CONJ-CONNEG.NO@ CONNEG = negation
    @U.CONJ-NX.PL@ NX = number
    @U.CONJ-NX.SG@ NX = number
    @U.CONJ-POSS.1@ POSS = possessive, person 1
    @U.CONJ-POSS.2@ POSS = possessive 2
    @U.CONJ-POSS.3@ POSS = possessive 3
    @U.CONJ-POSS.2ACC@ POSS = possessive etc.
    @U.CONJ-POSS.3ACC@ POSS = possessive
    @U.CONJ-PX.1@ PX = person
    @U.CONJ-PX.2@ PX = person
    @U.CONJ-PX.3@ PX = person
    @C.CONJ-VAL@ Removal
    @C.CONJ-INF@ Removal
    @C.CONJ-TX@ Removal
    @C.CONJ-MX@ Removal
    @C.CONJ-GER@ Removal
    @C.CONJ-CONNEG@ Removal
    @C.CONJ-NX@ Removal
    @C.CONJ-PX@ Removal
    @C.CONJ-POSS@ Removal
    @P.PossPx.Sg1@ FLAGS USED WITH COLLECTIVE NOUNS
    @P.PossPx.Sg2@ FLAGS USED WITH COLLECTIVE NOUNS
    @P.PossPx.Sg3@ FLAGS USED WITH COLLECTIVE NOUNS
    @P.PossPx.Pl1@ FLAGS USED WITH COLLECTIVE NOUNS
    @P.PossPx.Pl2@ FLAGS USED WITH COLLECTIVE NOUNS
    @P.PossPx.Pl3@ FLAGS USED WITH COLLECTIVE NOUNS
    @U.PossPx.Sg1@ FLAGS USED WITH COLLECTIVE NOUNS
    @U.PossPx.Sg2@ FLAGS USED WITH COLLECTIVE NOUNS
    @U.PossPx.Sg3@ FLAGS USED WITH COLLECTIVE NOUNS
    @U.PossPx.Pl1@ FLAGS USED WITH COLLECTIVE NOUNS
    @U.PossPx.Pl2@ FLAGS USED WITH COLLECTIVE NOUNS
    @U.PossPx.Pl3@ FLAGS USED WITH COLLECTIVE NOUNS
    @D.PossPx@ FLAGS USED WITH COLLECTIVE NOUNS
    @C.PossPx@ FLAGS USED WITH COLLECTIVE NOUNS
    @U.DECL-NX.SG@ number
    @U.DECL-NX.PL@ number
    @R.DECL-NX.PL@ number
    @U.DECL-CX.ABE@ unify case
    @U.DECL-CX.ABL@ unify case
    @U.DECL-CX.ACC@ unify case
    @U.DECL-CX.APR@ unify case
    @U.DECL-CX.APRINE@ unify case
    @U.DECL-CX.APRILL@ unify case
    @U.DECL-CX.APRELA@ unify case
    @U.DECL-CX.APREGR@ unify case
    @U.DECL-CX.APRPRL@ unify case
    @U.DECL-CX.APRTRA@ unify case
    @U.DECL-CX.APRTER@ unify case
    @U.DECL-CX.CAR@ unify case
    @U.DECL-CX.CMP@ unify case
    @U.DECL-CX.CNS@ unify case
    @U.DECL-CX.COM@ unify case
    @U.DECL-CX.DAT@ unify case
    @U.DECL-CX.EGR@ unify case
    @U.DECL-CX.ELA@ unify case
    @U.DECL-CX.GEN@ unify case
    @U.DECL-CX.ILL@ unify case
    @U.DECL-CX.INE@ unify case
    @U.DECL-CX.INS@ unify case
    @U.DECL-CX.NOM@ unify case
    @U.DECL-CX.PRL@ unify case
    @U.DECL-CX.TRA@ unify case
    @U.DECL-CX.TER@ unify case
    @U.DECL-DX.INDEF@ declension type
    @U.DECL-DX.PX@ declension type
    @C.DECL-NX@ Removal
    @C.DECL-DX@ Removal
    @C.DECL-CX@ Removal
    @U.Cap.Obl@ Always capital letter for names: Deatnu
    @U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj

    Lexicon Root

    The word forms in the Komi (Zyrian) language start from the lexeme roots of basic word classes, or optionally from prefixes:

    Lexica without morphology

    Absolute forms ABS_ пу керка выль керка

    Compounding

    R

    Serial-Verbs

    Lexica called End, whatever they are

    ABBR-IS_ADV

    ABBR-IS_N

    Clitics

    K

    WordEnd

    WordEnd-2

    SPAT-COMPARATIVE

    COMPARATIVE

    SUBSTANDARDS

    Endlex

    Lexicon ENDLEX

    And this is the ENDLEX of everything:

    @D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;

    The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.


    This (part of) documentation was generated from src/fst/morphology/root.lexc


    src-fst-morphology-stems-acronyms.lexc.md

    Acronym inflection


    This (part of) documentation was generated from src/fst/morphology/stems/acronyms.lexc


    src-fst-morphology-stems-adjectives-russian-like_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    важ:важ A_ "(eng) /(fin)/(rus) " ;

    ADD ADJECTIVES BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives-russian-like_newwords.lexc


    src-fst-morphology-stems-adjectives.lexc.md

    colors from Syktyvkar


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


    src-fst-morphology-stems-adjectives_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    важ+A:важ A_ "(eng) /(fin)/(rus) " ;

    ADD ADJECTIVES BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives_newwords.lexc


    src-fst-morphology-stems-adverbs_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    важын:важын ADV_ "(eng) /(fin)/(rus) " ;

    ADD ADVERBS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/adverbs_newwords.lexc


    src-fst-morphology-stems-dialect_lexicon.lexc.md

    Hypothetical dialect forms with е/э 2021-03-15


    This (part of) documentation was generated from src/fst/morphology/stems/dialect_lexicon.lexc


    src-fst-morphology-stems-exceptions.lexc.md

    Exceptions are quite strange word forms, the ones that do not fit anywhere else. This file contains all enumerated word forms that cannot reasonably be created from lexical data by regular inflection. Usually there should be next to no exceptions: it is always better to have a paradigm that covers only one or a few words than an exception, since exceptions will not work nicely with e.g. the compounding scheme or possibly many end applications.

    The pair verb овны-вывны conjugates in more forms than are attested for the single verb вывны:

    VERBS WITH FIRST PRETERITE THIRD PERSON WITHOUT с IN NORM

    SPECIAL VERB FORM FOR VERBAL TERMINATIVE OF ЛОКНЫ

    REDUPLICATED ADVERBS

    SUPERLATIVE ADVERBS

    SUPERLATIVE ADJECTIVES

    ADJECTIVES NOT YET ADDED TO DICTIONARY DATABANK

    VOCATIVE EXPRESSIONS

    PROPER NOUNS NOT YET ADDED TO DICTIONARY DATABANK


    This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


    src-fst-morphology-stems-nouns_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    чери+N:чери N_ "(eng) fish/(fin) kala|fisu/(rus) рыба" ;

    ADD NOUNS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc


    src-fst-morphology-stems-propernouns_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    Абъячой+N+Prop+Sem/Plc:Абъячой PROP_ "(eng) fish/(fin) /(rus)" ;

    ADD NOUNS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/propernouns_newwords.lexc


    src-fst-morphology-stems-verbs_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    воны+V:во V_ "(eng) /(fin)/(rus) " ;

    test:test V_ "(eng) /(fin) /(rus) " ;

    ADD VERBS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/verbs_newwords.lexc


    src-fst-phonetics-txt2ipa.xfscript.md

    retroflex plosive, voiceless  t`  ʈ  0288, 648  (` = ASCII 096)
    retroflex plosive, voiced  d`  ɖ  0256, 598
    labiodental nasal  F  ɱ  0271, 625
    retroflex nasal  n`  ɳ  0273, 627
    palatal nasal  J  ɲ  0272, 626
    velar nasal  N  ŋ  014B, 331
    uvular nasal  N\  ɴ  0274, 628

    bilabial trill  B\  ʙ  0299, 665
    uvular trill  R\  ʀ  0280, 640
    alveolar tap  4  ɾ  027E, 638
    retroflex flap  r`  ɽ  027D, 637
    bilabial fricative, voiceless  p\  ɸ  0278, 632
    bilabial fricative, voiced  B  β  03B2, 946
    dental fricative, voiceless  T  θ  03B8, 952
    dental fricative, voiced  D  ð  00F0, 240
    postalveolar fricative, voiceless  S  ʃ  0283, 643
    postalveolar fricative, voiced  Z  ʒ  0292, 658
    retroflex fricative, voiceless  s`  ʂ  0282, 642
    retroflex fricative, voiced  z`  ʐ  0290, 656
    palatal fricative, voiceless  C  ç  00E7, 231
    palatal fricative, voiced  j\  ʝ  029D, 669
    velar fricative, voiced  G  ɣ  0263, 611
    uvular fricative, voiceless  X  χ  03C7, 967
    uvular fricative, voiced  R  ʁ  0281, 641
    pharyngeal fricative, voiceless  X\  ħ  0127, 295
    pharyngeal fricative, voiced  ?\  ʕ  0295, 661
    glottal fricative, voiced  h\  ɦ  0266, 614

    alveolar lateral fricative, vl.  K
    alveolar lateral fricative, vd.  K\

    labiodental approximant  P (or v\)
    alveolar approximant  r\
    retroflex approximant  r\`
    velar approximant  M\

    retroflex lateral approximant  l`
    palatal lateral approximant  L
    velar lateral approximant  L\

    Clicks

    bilabial  O\ (O = capital letter)
    dental  |\
    (post)alveolar  !\
    palatoalveolar  =\
    alveolar lateral  |\|\

    Ejectives, implosives

    ejective  _>  e.g. ejective p  p_>
    implosive  _<  e.g. implosive b  b_<

    Vowels

    close back unrounded  M
    close central unrounded  1
    close central rounded  }
    lax i  I
    lax y  Y
    lax u  U

    close-mid front rounded  2
    close-mid central unrounded  @\
    close-mid central rounded  8
    close-mid back unrounded  7

    schwa  ə  @

    open-mid front unrounded  E
    open-mid front rounded  9
    open-mid central unrounded  3
    open-mid central rounded  3\
    open-mid back unrounded  V
    open-mid back rounded  O

    ash (ae digraph)  {
    open schwa (turned a)  6

    open front rounded  &
    open back unrounded  A
    open back rounded  Q

    Other symbols

    voiceless labial-velar fricative  W
    voiced labial-palatal approx.  H
    voiceless epiglottal fricative  H\
    voiced epiglottal fricative  <\
    epiglottal plosive  >\

    alveolo-palatal fricative, vl.  s\
    alveolo-palatal fricative, voiced  z\
    alveolar lateral flap  l\
    simultaneous S and x  x\
    tie bar  _

    Suprasegmentals

    primary stress  "
    secondary stress  %
    long  :
    half-long  :\
    extra-short  _X
    linking mark  -\

    Tones and word accents

    level extra high  _T
    level high  _H
    level mid  _M
    level low  _L
    level extra low  _B
    downstep  !
    upstep  ^ (caret, circumflex)

    contour, rising  _R
    contour, falling  _F
    contour, high rising  _H_T
    contour, low rising  _B_L
    contour, rising-falling  _R_F

    (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

    global rise  <R>
    global fall  <F>

    Diacritics

    voiceless  _0 (0 = figure), e.g. n_0
    voiced  _v
    aspirated  _h
    more rounded  _O (O = letter)
    less rounded  _c
    advanced  _+
    retracted  _-
    centralized  _"
    syllabic  = (or _=) e.g. n= (or n_=)
    non-syllabic  _^
    rhoticity  `

    breathy voiced  _t
    creaky voiced  _k
    linguolabial  _N
    labialized  _w
    palatalized  ' (or _j) e.g. t' (or t_j)
    velarized  _G
    pharyngealized  _?\

    dental  _d
    apical  _a
    laminal  _m
    nasalized  ~ (or _~) e.g. A~ (or A_~)
    nasal release  _n
    lateral release  _l
    no audible release  _}

    velarized or pharyngealized  _e
    velarized l, alternatively  5
    raised  _r
    lowered  _o
    advanced tongue root  _A
    retracted tongue root  _q
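    A symbol table like this is typically applied with longest-match-first lookup, so that two-character codes such as `N\` win over `N`. The Python sketch below shows the idea with a small subset of the pairs listed above; the function name is an assumption and the code is not taken from txt2ipa.xfscript.

```python
# A subset of the X-SAMPA -> IPA pairs from the listing above.
XSAMPA_TO_IPA = {
    "N\\": "ɴ", "B\\": "ʙ", "R\\": "ʀ", "p\\": "ɸ", "j\\": "ʝ",
    "X\\": "ħ", "?\\": "ʕ", "h\\": "ɦ", "z`": "ʐ",
    "F": "ɱ", "J": "ɲ", "N": "ŋ", "4": "ɾ", "B": "β", "T": "θ",
    "D": "ð", "S": "ʃ", "Z": "ʒ", "C": "ç", "G": "ɣ", "X": "χ", "R": "ʁ",
}

# Try longer codes first so that "N\" is not read as "N" plus a stray backslash.
_KEYS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)

def xsampa_to_ipa(text):
    """Replace every known X-SAMPA code in `text`; copy other characters as-is."""
    out, i = [], 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

demo = xsampa_to_ipa("SaN")
```

    Characters without an entry in the table, such as plain `a`, are passed through unchanged.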


    This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


    src-fst-phonology-old.xfscript.md

    Definition section

    Defining Vowel

    Defining Palatal Vowel

    Defining Consonants

    Defining non-soft consonants

    Defining consonants before Cyrillic і

    Defining letters

    Defining flags

    Defining boundaries

    Defining diacritics

    Defining dummy

    Rule section

    stopping ы -> 0 (2011-01-26). Let’s remember that this should only affect verb forms. That means the surface vowels я а и і ӧ. Wrong results: тӧд where тыӧд should be; на where ныа should be. Absence of the “ы” vowel; the “ы” vowel is present before


    This (part of) documentation was generated from src/fst/phonology-old.xfscript


    src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

    We describe here how abbreviations in Komi-Zyrian are read out, e.g. for text-to-speech systems.

    For example:


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


    tools-grammarcheckers-grammarchecker.cg3.md

    KOMI GRAMMAR CHECKER

    DELIMITERS

    TAGS AND SETS

    Beginning and end of sentence

    BOS EOS

    Miscellaneous

    CmpTest Err

    Parts of speech tags

    N V A Adv CC CS Inter Pron Num Pcle Clt Po Dem Qnt Prop

    Derivation tags

    Ex/A (former adj) Ex/N Ex/Num Ex/V Ex/WORD DerTag

    Verbal categories

    Prs Fut Fut1 Imprt Prt1 Prt2 Prf PrfIpf HstPrf PluPrf HstPluPrf Ind Imp Cond Opt

    Sg1 Sg2 …

    Nominal categories Sg Pl Nom Gen Abl Dat Com Cns …

    PPUNCT PUNCT ¶

    Verb sets

    VNEG (all Neg verbs)

    VFIN

    ASKI (tomorrow set)

    Grammarchecker sets


    This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


    tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

    Tokeniser for kpv

    Usage:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are

    1. unknown word-like forms, and
    2. unmatched strings

    We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    Unknowns are made of:
      • lower-case ASCII
      • upper-case ASCII
      • select extended latin symbols
      • extended cyrillic
      • ASCII digits
      • select symbols
      • combining diacritics as individual symbols
      • various symbols from Private area (probably Microsoft), so far:
      • U+F0B7 for “x in box”
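    The distinction between unknown word-like forms and unmatched strings can be sketched in Python as a character-class test. The code ranges below are assumptions standing in for the “select” sets, which are defined in the pmscript itself; the function names are hypothetical as well.

```python
import unicodedata

def is_wordlike_char(ch):
    """True if `ch` falls into one of the (assumed) unknown-word character classes."""
    cp = ord(ch)
    return (
        "a" <= ch <= "z"                               # lower-case ASCII
        or "A" <= ch <= "Z"                            # upper-case ASCII
        or "0" <= ch <= "9"                            # ASCII digits
        or (0x00C0 <= cp <= 0x024F and ch.isalpha())   # assumed extended Latin range
        or (0x0400 <= cp <= 0x052F and ch.isalpha())   # Cyrillic and its supplement
        or unicodedata.combining(ch) != 0              # combining diacritics
        or cp == 0xF0B7                                # private-use "x in box"
    )

def classify(token):
    """Label a string that has no analysis in the morphology."""
    if token and all(is_wordlike_char(ch) for ch in token):
        return "unknown"      # word-like: receives a match
    return "unmatched"        # left to hfst-tokenise's special handling

labels = [classify(t) for t in ["abc123", "ӧшинь", "→"]]
```

    Only strings made up entirely of word-like characters are treated as unknown words; anything else stays unmatched.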

    Unknown handling

    Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


    tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

    Grammar checker tokenisation for kpv

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


    tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

    TTS tokenisation for smj

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    make
    echo "ja, ja" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
    boasttu olmmoš, man mielde lahtuid." \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "márffibiillagáffe" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Needs hfst-tokenise to output things differently depending on the tag they get


    This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript
