Erzya NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-myv

Page Content

  • src-fst-morphology-affixes-symbols.lexc.md
  • Symbol affixes
  • src-fst-morphology-affixes-verbs.lexc.md
  • Verb inflection
  • AUXILIARY VERBS
  • DERIVATION
  • VERBS AFTER TRANSITIVITY Tags OBJECT FLAGS
  • src-fst-morphology-clitics.lexc.md
  • Clitics
  • src-fst-morphology-phonology.twolc.md
  • The Erzya morphophonological/twolc rules file
  • src-fst-morphology-root.lexc.md
  • Morphology
  • Dialect tags
  • Orthography tags
  • Abbreviated words are classified with:
  • Semantic tags
  • Other tags
  • Morphophonology
  • MISC
  • FLAGS USED WITH COLLECTIVE NOUNS
  • src-fst-morphology-stems-adjectives-russian-like_newwords.lexc.md
  • src-fst-morphology-stems-adjectives_newwords.lexc.md
  • src-fst-morphology-stems-adverbs_newwords.lexc.md
  • src-fst-morphology-stems-exceptions.lexc.md
  • src-fst-morphology-stems-genitive_attributes.lexc.md
  • src-fst-morphology-stems-hyphenated-nouns.lexc.md
  • src-fst-morphology-stems-hyphenated-verbs.lexc.md
  • src-fst-morphology-stems-myv-propernouns.lexc.md
  • src-fst-morphology-stems-nouns_newwords.lexc.md
  • src-fst-morphology-stems-nouns_russian_100_newwords.lexc.md
  • src-fst-morphology-stems-propernouns_newwords.lexc.md
  • src-fst-morphology-stems-rusMaleNameDer.lexc.md
  • src-fst-morphology-stems-verbs_newwords.lexc.md
  • src-fst-phonetics-txt2ipa.xfscript.md
  • src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md
  • tools-grammarcheckers-grammarchecker.cg3.md
  • DELIMITERS
  • TAGS AND SETS
  • tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md
  • Tokeniser for myv
  • tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md
  • Grammar checker tokenisation for myv
  • tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md
  • TTS tokenisation for smj
  • Erzya language model documentation

    All doc-comment documentation in one large file.


    src-cg3-disambiguator.cg3.md

    DELIMITERS

    TAGS AND SETS

    Tags

    This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

    Beginning and end of sentence

    BOS EOS

    Parts of speech tags

    Dialect homonyms of Sg Gen Def

    foreign

    motion verbs with supine loc verb form

    Semantic tags

    noun phrase heads

    Syntactic tags

    Upper and lower case

    Sets containing sets of lists and tags

    This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

    Sets for Single-word sets

    the set INITIAL for initial letters INITIAL

    Sets for word or not

    Derivational affixes

    Case sets

    ADLVCASE

    Verb sets

    NOT-V

    Sets for finiteness and mood

    MOOD-V

    Homonymy for subject conjugation and subject-object conjugation with Pl3 object

    VFIN

    VFIN-POS

    Sets for person

    Pronoun sets

    кортамс мезде

    words that go with эрьва

    for кизэ homonymy PxSg2

    for кизэ homonymy PxSg1

    This will be expanded for homonymy at first

    This will be expanded for homonymy at first, i.e., diminutives

    verbs elative, illative, lative

    these have homonyms

    used with Dat PxSg1

    Derivation tags

    2VDerTag 2NDerTag

    DerTag

    Pl Nom Def is homonymous with the verb stem in тне-мс. This is relevant for Clt/Cop with ScPl1 and ScPl2

    in SP Gen Indef the next word can be кель

    2023_03_15 important part of regular inflection


    This (part of) documentation was generated from src/cg3/disambiguator.cg3


    src-cg3-functions.cg3.md

    negation marker for fits between negation and conneg

    MOOD-V

    Erzya and Moksha

    this needs Moksha, too.

    finite auxiliary verbs with

    макссь чарькодемс, Deal with DATAUX separately; they also take MS

    finite auxiliary taking supine MO/ME

    finite supaux 2023_03_13


    This (part of) documentation was generated from src/cg3/functions.cg3


    src-fst-morphology-affixes-adjectives.lexc.md

    Adjective inflection

    Adjectives and other parts of speech in ERZYA are compared by means of either a particle or ablative case marking on the standard of comparison

    ordinals in -це

    истямо:истя

    кондямо:кондя

    кодамо:кода кодамо:кода кодатнэ кодатне


    This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


    src-fst-morphology-affixes-adpositions.lexc.md

    The Erzya language postpositions can be broken into many subgroups according to morphological and semantic criteria

    Some of the nouns have defective paradigms: € кудыкельганть

    ало:ал alo-SPAT-1Arg

    This allows for possessor indices, word end or focus e.g. вельде, вельдеяк, вельдензэ ?вельдензэль, вельдензэтне

    This allows for word end, possessor indices, predication

    postposition that is in ablative case алдо:алдо

    postposition that is in elative case потсто:потсто

    postposition that is in illative case эземс:эзем

    postposition that is in illative case эйс:э

    postposition that is in inessive case эйсэ:эйсэ, кисэ

    postposition that is in lative case ютков:ютков

    postposition that is in locative case ало:ало

    postposition that has no continuation пачк

    postposition that is in ablative case алдо:алдо

    postposition that is in elative case потсто:потсто

    postposition that is in illative case малас:мала

    postposition that is in illative case малас:мала

    postposition that is in illative case потс:пот

    postposition that is in illative case эйс:э

    postposition that is in inessive case потсо:потсо

    postposition that is in lative case алов:ало

    postposition that is in locative case ало:ало

    postposition that is in prolative case перька:перька

    +Temp: K ; перть

    +Ela+Temp: PO_POSS_OR_END_FOC ; пингстэ


    This (part of) documentation was generated from src/fst/morphology/affixes/adpositions.lexc


    src-fst-morphology-affixes-adverbs.lexc.md

    Adverb inflection

    The Erzya language adverbs do not compare.

    Not a real particle; it can take a clitic седеяк

    LEXICON ADV-SPAT_ пачк

    LEXICON ADV_IS_LAT алов

    LEXICON ADV_IS_LOC ало

    LEXICON ADV/PO/PRON-SPAT_ALO ало:ал

    LEXICON ADV-SPAT_ALO ало:ал

    “стядо”

    spatial adverbs dependent and independent case marking

    This marking would indicate a word form that may be


    This (part of) documentation was generated from src/fst/morphology/affixes/adverbs.lexc


    src-fst-morphology-affixes-interjections.lexc.md

    Interjections

    The Erzya language interjections…


    This (part of) documentation was generated from src/fst/morphology/affixes/interjections.lexc


    src-fst-morphology-affixes-nonverbalConjugation.lexc.md

    Non-Verbal conjugation

    In the Erzya language nominals and adverbs also conjugate

    Used with deverbals

    This is where adjectives get their plural T.

    used with infinitives

    Conjugation

    NON-VERB CONJUGATION

    Conjugation

    _KAL-NomSg-Conjugation-only

    This allows Clt/Cop+Prs Sg1|Sg2|Pl1|Pl2 Clt/Cop+Prt2 Sg1|Sg2|Sg3|Pl1|Pl2|Pl3 K 2019-01-26
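    The comment above licenses four person tags in the present and six in the second preterite. As a rough illustration only, the Python sketch below enumerates those ten combinations. The `+` separator and the exact tag order are assumptions made for the example; the tag strings actually emitted by the analyser (for instance with an `Sc` prefix on the person tag) may differ.

```python
# Person/number tags allowed per tense, copied from the comment above.
ALLOWED = {
    "Prs": ["Sg1", "Sg2", "Pl1", "Pl2"],
    "Prt2": ["Sg1", "Sg2", "Sg3", "Pl1", "Pl2", "Pl3"],
}


def expand_copula_tags(prefix="+Clt/Cop"):
    """Return every tag string licensed by the ALLOWED table.

    The '+'-joined layout is a hypothetical rendering, not the
    analyser's confirmed output format.
    """
    return [
        f"{prefix}+{tense}+{person}"
        for tense, persons in ALLOWED.items()
        for person in persons
    ]


combos = expand_copula_tags()
print(len(combos))   # 10
print(combos[0])     # +Clt/Cop+Prs+Sg1
```

    Note that the present tense has no third-person entries in this table, which matches the comment.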

    _KUDO-NomPl-Conjugation-only

    _KUDO-NomPl-Conjugation-only-mutual

    Are there copula verb combinations? 2024-08-06


    This (part of) documentation was generated from src/fst/morphology/affixes/nonverbalConjugation.lexc


    src-fst-morphology-affixes-nouns.lexc.md

    Noun inflection

    Nouns in ERZYA inflect for number, case and declension (definite, indefinite and possessive).

    LEXICON N_PELE пеле:пель, ало:ал

    KINSHIP

    HUMAN

    PLACE

    LATIVE

    VOCATIVE

    NAMES OF MONTHS

    COMMON NOUNS

    кардаз:карда

    панго:панг

    потмо:пот

    Front vowel, non-palatal consonant before vowel

    Front vowel, palatal consonant before vowel

    Front vowel, non-palatal consonant before vowel

    Does this need a diminutive?

    NMN

    harmony: front

    DERIVATION

    pango:pang

    N_KUDO-Def-Declension

    N_KUDO-Def-Declension

    N_KUDO-Def-Declension

    Plurale tantum

    DEFINITE SINGULAR TAGS

    INDEFINITE DECLENSION

    SG-NOM-INDEF_LAK ;

    SG-NOM-INDEF_KAL ;

    SG-NOM-INDEF_OSH ;

    INDEFINITE TAGS

    POSSESSIVE DECLENSION

    CASES BEFORE POSSESSIVE TAGS

    DEFINITE PLURAL

    Cases for тнэ

    NP head ellipsis declension, Modifiers without nouns = MWN

    Nouns1S_A

    POSSESSIVE marking followed by clitics

    Possessor indices

    The Erzya language possessor indices or possessive suffixes may be followed by a number of morpheme types

    These are possessor indices that can be followed by predicate marking. In the present there is no distinction between ScSg3 and ScPl3. Possessor indices allowing (1) #, (2) Foc, (3) Der/Pr ()

    This appears with kinship terminology

    Is “_KAL” necessary ?

    DAT-PXPL1 ;

    POSSESSIVE TAGS

    These are possessor Indices for non-nominative singular NonNomSg

    word boundary or focus


    This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


    src-fst-morphology-affixes-pronouns.lexc.md

    Pronoun inflection

    Erzya pronouns inflect in many of the same cases as regular nouns.

    Closed class personal pronouns

    +Sem/Hum+Sg+Nom:е ENDLEX ; кие:ки

    +Sem/Obj: CLT/COP_SG ; singular

    мон:мо

    тон:то

    сон:со

    минь:

    тынь:ты

    сынь:сы

    Obligatory Possessor Index

    Demonstrative

    Interrogative

    What should be done

    кона:кона This is not the same as indefinite PronRel-kona

    What should be done

    LEXICON PRON-IS-INTERR-SPAT-INE косо

    What should be done

    Relative pronouns

    ки:ки

    ки

    мезе+Pron:мезИ2 Misc_Pronouns1 ;
    мезе+Pron+Rel+Gen:мень K ;
    ки+Pron:ки Misc_Pronouns1 ;

    Some pronoun continuations have been moved here out of TestLexc-noun.txt


    This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


    src-fst-morphology-affixes-propernouns.lexc.md

    Proper noun inflection

    Erzya proper nouns inflect in the same cases as regular nouns.

    Андрей:Андре

    Вили:Вил

    Russian type Surnames Абдеев:Абдеев

    Багрий:Багр

    Аморский:Аморск

    Front-vowel stem

    DECLENSION LIMITATIONS


    This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


    src-fst-morphology-affixes-quantifiers.lexc.md

    Quantifier inflection

    Erzya quantifiers inflect in many of the same cases as regular nouns.

    extra numerals

    Now regular

    кавонст

    омбонст

    кавонест is a pronoun like the Finnish molemmat. This means a radical increase in the Erzya pronoun inventory: 6 x for each numeral 2 and above

    кавксоненек

    once, twice; весть, кавксть, аламоксть

    twofold, threefold; веенькирда, кавонькирда, колмонькирда

    васенцеде advmod:multimprf > advmod:ordimprf

    васняяк ‘first of all’

    Numeral with a range limitation to adnominal phrase

    2012-08-09


    This (part of) documentation was generated from src/fst/morphology/affixes/quantifiers.lexc


    src-fst-morphology-affixes-symbols.lexc.md

    Symbol affixes


    This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


    src-fst-morphology-affixes-verbs.lexc.md

    Verb inflection

    Erzya language verbs inflect for person, subject and object.

    OBJECT FLAGS AND +V tags а+V:а

    LEXICON V-AUX-NEG-PRT1 а+V:эзь

    LEXICON TV_KADOMS

    LEXICON TV_NEVTEMS_SUB

    LEXICON TV_SAVTOMS_SUB

    LEXICON TV_SAVTOMS

    LEXICON TV_SAVTOMS-SG3_SUBJ/ZERO

    LEXICON TV_CHACHTOMS

    LEXICON TV_KUNDAMS_SUB

    LEXICON TV_KUNDAMS

    LEXICON TV_SATOMS

    LEXICON TV_TUEMS

    LEXICON TV_TEEMS

    VERBS WITH THIRD PERSON OBJECTS @U.CONJ-PX.13@

    VERBS WITH INTRANSITIVE TAGS +V

    AUXILIARY VERBS

    DERIVATION

    VERBS AFTER TRANSITIVITY Tags OBJECT FLAGS

    теемс:тей теемс:тей

    no deverbals

    no deverbals

    no deverbals

    DERIVATION

    LEXICON TV_NEKSHNEMS Alternates with TRA

    This is fed by actors and participles in N_myv, A_myv and Prc_myv

    CONJUGATION

    Indicative Preterite I

    INDICATIVE

    Indicative NonPast

    INDICATIVE PRETERITE 2

    DESIDERATIVE

    CONJUNCTIVE

    redo conj 2012-11-07 begin

    redo conj 2012-11-07 end

    begin

    end

    OPTATIVE

    IMPERATIVE

    PRECATIVE

    OPTATIVE

    2012-11-09

    Given in Grammar 2000

    Used with deverbals

    ваномс+V+Imprt+ScPl2+Clt/Ga: look/katsoa


    This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


    src-fst-morphology-clitics.lexc.md

    Clitics

    The Erzya language clitics…

    END


    This (part of) documentation was generated from src/fst/morphology/clitics.lexc


    src-fst-morphology-phonology.twolc.md

    The Erzya morphophonological/twolc rules file

    This file documents the phonology.twolc file

    Alphabet

    ӓ Ӓ ҥ Ҥ і І ѳ Ѳ Pre-Soviet 1930s letters

    Special letters in the root that might be useful in dialect research and etymology later

    идиса, идима ашоян disallow о:0

    вт%{оеэ%}мО1

    %{frontHard%}:0 - front harmony hard
    %{frontSoft%}:0 - front harmony soft
    %{back%}:0 - back harmony
    %{backHard%}:0 - back harmony

    %^OldAE:0 - This allows Ӓ4 and Ӓ3 to be realized as я
    %^NoLinkVow:0 - No linking vowel is used only after consonants for error

    verbStemVowStrong:0

    Ӓ3 Ӓ4 as я

    A1:o

    Y2:yi

    %{оеэ%}:е неемс+V+Ger+Ill+PxPl1: –see/nähdä–

    %{оеэ%}:о псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa

    %{оеэ%}:э

    %{оеэØ%}:0 %{оеэØ%}:е панемс+V+Ind+ConNeg: drive/ajaa

    вадемс+V+Der/Ovt+Prc/Telic+Sg+Nom+Def: the greased one/

    %{оеэØ%}:э кев+N+SP+Ill+PxSg2: rock/kivi

    %{оеэØ%}:о ков+N+SP+Ill+PxSg2: moon/kuu

    %{уиыØ%}:и панемс+V+Inf+Dial/NW: drive/ajaa

    %{уиыØ%}:ы кев+N+SP+Ill+PxSg2: rock/kivi

    %{уиыØ%}:у ков+N+SP+Ill+PxSg2: moon/kuu

    O1:e

    O1:o

    %{оэØ%}:e

    тев+N+Sg+Nom+PxSg3+Err/Orth-no-linking-vowel: thing/juttu

    %{оэØ%}:o

    псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa

    %{оэØ%}:0

    O1:0

    %{ое%}:е

    %{ое%}:о

    A2:a
    путомс+V+Prec+ScSg2: put/laittaa

    и:ы

    j:0

    Е3:э always %> т н _ 2013-02-23

    Е3:э sometimes %> т н _ 2013-02-23

    ye:e always
    сыр

    Н1:н
    Н1:к

    а: и Dimin

    о: ы Dimin

    у: и Dimin

    о regressive raising у озномс+V+Ind+Prs+ScSg1+OcSg3+Dial/NW: bless/siunata

    э: и Dimin

    а: и Dimin

    о: и Dimin

    у: и Dimin

    я: и Dimin

    ё: и Dimin

    ю: и Dimin

    е: и Dimin

    a:ya

    n loss with plural ведун+N+Pl+Indef: knower/tietäjä

    v:0

    G1:0

    G1:g

    G1:k

    G2:g

    G2:k

    G4:0
    саемс+V+Ind+Prs+ConNeg+Clt/Ga:

    G4:k

    потмо+N+Relator+SP+Ela+Indef: inside/sisäosa

    imperative suffix K1:t

    лыказевемс+V+Imprt+ScSg2: have taken

    K1:к
    ливтемс+V+Prec+ScSg2: set out/laittaa esille

    U4:y
    кал+N+Sg+Nom+Def: fish/kala

    пильге+N+Pl+Nom+Indef leg; foot/jalka

    U4:0

    вадемс+V+Der/Ovt+Prc/Telic+Sg+Nom+Def: the greased one/

    валдо+N+Pl+Nom+Indef light/valo

    t:d
    ловомс+V+Ind+Prs+ScSg1+OcSg2: regard/pitää jonain

    s:0

    d:t

    d:d

    y:y

    y:0

    меремс+V+Ind+Prt1+ScSg3: say/sanoa

    Disallow TLoss after non-t

    Disallow ^H before t and subsequent {ЬØ}

    Disallow RegrRaise after A

    Disallow vow loss before break

    Disallow OldAE when no Ä

    Disallow KLoss after non-k

    Disallow SLoss after non-s

    Disallow %^WLoss after non-v

    Disallow Н1:н after Letters

    р н :Vows (HarmDummies:)] (ь:) %> _ %> %{оеэØ%}: ;

    Disallow soft loss

    Disallow SoftRetain

    Disallow SoftRetain чувто+N+Pl+Nom+Def: tree/puu

    веле+N+SP+Tra+PxSg2

    псака+N+SP+Abe+PxSg2+Clt/Cop+Prt2+ScPl3+Clt/Gak

    псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa

    веле+N+SP+Tra+PxSg2+Clt/Cop+Prt2+ScPl3: village/kylä

    Disallow %^NoLinkVow after vowel

    Disallow s for control of stems with inessive…

    Disallow dano after non-voiced

    Disallow dano after non-voiced

    Disallow k for control of comparative with stem types
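    Many of the rule descriptions above are followed by a test example of the form `lemma+Tag+Tag: english/finnish`, e.g. `кев+N+SP+Ill+PxSg2: rock/kivi`. The following Python sketch shows one way such a line could be split into its parts. The function name `parse_example` and the returned dictionary keys are hypothetical, and the layout is inferred from the examples in this section rather than from any documented specification.

```python
def parse_example(line):
    """Split 'lemma+Tag+Tag: english/finnish' into its parts.

    Tags may themselves contain '/', as in Clt/Cop, so the analysis
    is split on '+' only, and the gloss is split on the first '/'.
    Missing glosses yield empty strings.
    """
    analysis, _, gloss = line.strip().partition(":")
    lemma, *tags = analysis.strip().split("+")
    english, _, finnish = gloss.strip().partition("/")
    return {
        "lemma": lemma,
        "tags": tags,
        "english": english.strip(),
        "finnish": finnish.strip(),
    }


result = parse_example("кев+N+SP+Ill+PxSg2: rock/kivi")
print(result["lemma"], result["tags"], result["english"], result["finnish"])
```

    Examples without a gloss, such as `веле+N+SP+Tra+PxSg2`, come back with empty `english` and `finnish` fields.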


    This (part of) documentation was generated from src/fst/morphology/phonology.twolc


    src-fst-morphology-root.lexc.md

    Morphology

    INTRODUCTION TO MORPHOLOGICAL ANALYSER OF ERZYA.

    Analysis symbols

    The morphological analyses of wordforms of ERZYA are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags).

    The parts-of-speech are:

    Parts of speech are further split up into:

    Adjectives

    Adverbs

    Interjections

    Nouns

    Particles

    Postpositions + Spat, + Temp

    Pronouns

    Quantifiers (numerals)

    Quantifiers and Numerals are classified under:

    Nominals are inflected for Number and Case

    Number

    Case

    Possession and other declension types are marked with:

    The comparative forms are:

    Verb moods are:

    Infinitive moods

    Tenses in the indicative and infrequently in the conditional

    Verb personal forms are:

    Other verb forms are

    The usage extents are marked using the following tags:

    Dialect tags

    Orthography tags

    Abbreviated words are classified with:

    Special symbols

    Delimiter marks are classified with:

    The verbs are syntactically split according to transitivity:

    Auxiliary verbs

    Special multiword units are analysed with:

    Non-dictionary words can be recognised with:

    Question and Focus particles:

    Semantic tags

    Semantic tags to help disambiguation & synt. analysis: (before POS) Borrowed from main/langs/sme/src/morphology/root.lexc

    Simplex tags

    Multiple Semantic tags:

    Semantics are classified with

    Semantic Fields

    Other tags

    Verbal arguments

    Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

    Homonymy

    Der begin

    Declaring noun derivations

    Modifier without noun

    Declaring Indefinite Pronoun derivations

    DECLARING NOUN DERIVATIONS

    DECLARING NUMERAL DERIVATIONS

    DECLARING DEVERBAL DERIVATIONS OF VERBS

    Morphophonology

    To represent phonologic variations in word forms we use the following symbols in the lexicon files:

    And the following triggers to control variation:

    Special letters in the root that might be useful in dialect research and etymology later

    вт%{оеэ%}мО1 suffix-internal archivowel

    %^OldAE — This allows Ӓ4 and Ӓ3 to be realized as я

    MISC

    Development tag

    Compounding

    Tags

    Imperative clitics

    Tags distinguishing different versions of the same lemma (before POS)

    Symbols that need to be escaped on the lower side (towards twolc):

    Flag diacritics

    We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

    | @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
    | @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
    | @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |

    For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

    | @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
    | @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
    | @P.CmpPref.FALSE@ | Block these words from making further compounds |
    | @D.CmpLast.TRUE@ | Block such words from entering R |
    | @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
    | @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
    | @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
    | @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

    Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

    | @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
    | @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
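    To make the role of the NeedNoun flags above more concrete, here is a minimal Python sketch of how flag diacritics constrain a path through the lexicon. It assumes the usual xfst/HFST reading of the operators (P sets a value, D disallows a value, U unifies, R requires, C clears) and leaves out the N operator. The symbol sequences `VERB` and `NOMINALISER` are hypothetical placeholders; where the flags actually sit in the lang-myv lexicon is not stated here.

```python
import re

# @OP.FEATURE.VALUE@ or @OP.FEATURE@ (value is optional for C, D and R).
FLAG = re.compile(r"@([PURDC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols):
    """Return True if the flag diacritics along a path are consistent."""
    state = {}
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if m is None:
            continue  # ordinary (non-flag) symbol
        op, feat, val = m.groups()
        cur = state.get(feat)
        if op == "P":            # positive set: always succeeds
            state[feat] = val
        elif op == "C":          # clear the feature
            state.pop(feat, None)
        elif op == "U":          # unify: fail on a conflicting value
            if cur is not None and cur != val:
                return False
            state[feat] = val
        elif op == "R":          # require: feature set (to val, if given)
            if cur is None or (val is not None and cur != val):
                return False
        elif op == "D":          # disallow: fail if set (to val, if given)
            if (val is None and cur is not None) or (val is not None and cur == val):
                return False
    return True


# Hypothetical paths: a verb inside a compound, with and without nominalisation.
blocked = ["@P.NeedNoun.ON@", "VERB", "@D.NeedNoun.ON@"]
allowed = ["@P.NeedNoun.ON@", "VERB", "NOMINALISER", "@C.NeedNoun@", "@D.NeedNoun.ON@"]
print(path_is_valid(blocked))  # False
print(path_is_valid(allowed))  # True
```

    In this reading, the P flag marks that a noun is still needed, the C flag clears that requirement once the verb has been nominalised, and the D flag rejects any path that reaches the end with the requirement still set.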

    Flags used to identify parts of speech

    Flags used with +Clt/Cop nonverbal predication

    Flags used with transitivity

    problematic

    This allows or disallows combining with hyphen through loop especially for acronyms 2012-11-04

    This disallows secondary compounding

    Linking vowel for use with Translative

    FLAGS USED WITH COLLECTIVE NOUNS

    number

    Removal

    Flag diacritic Explanation
    @U.number.one@ Flag used to give arabic numerals in smj different cases ;
    @U.number.two@ Flag used to give arabic numerals in smj different cases ;
    @U.number.three@ Flag used to give arabic numerals in smj different cases ;
    @U.number.four@ Flag used to give arabic numerals in smj different cases ;
    @U.number.five@ Flag used to give arabic numerals in smj different cases ;
    @U.number.six@ Flag used to give arabic numerals in smj different cases ;
    @U.number.seven@ Flag used to give arabic numerals in smj different cases ;
    @U.number.eight@ Flag used to give arabic numerals in smj different cases ;
    @U.number.nine@ Flag used to give arabic numerals in smj different cases ;
    @U.number.zero@ Flag used to give arabic numerals in smj different cases ;

    The word forms in ERZYA start from the lexeme roots of basic word classes, or optionally from prefixes: Here follow all contlexes, appr 20.

    CyrillicFemaleName ; HUNSPELL Type name derivation RussianMalenamesDerive ; ! RussianSurnamesDerive ;

    увол-авол

    alo-SPAT-1Arg ; >PO_KAL-LOC


    This (part of) documentation was generated from src/fst/morphology/root.lexc


    src-fst-morphology-stems-adjectives-russian-like_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    од:од A_KAL "(eng) /(fin)/(rus) " ;
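    The sample entry above follows the common lexc layout `upper:lower CONTLEX "gloss" ;`, where the gloss is optional. As an illustration, the Python sketch below pulls those four fields out of a single entry line. The regular expression and the function name `parse_entry` are assumptions made for this example; it does not handle escaped colons (`%:`) or multi-line entries, and it is not part of the GiellaLT tooling.

```python
import re

# upper:lower  CONTLEX  optional "gloss"  ;
ENTRY = re.compile(
    r'^\s*(?P<upper>[^:\s]+):(?P<lower>\S+)\s+'
    r'(?P<contlex>\S+)\s*'
    r'(?:"(?P<gloss>[^"]*)"\s*)?;\s*$'
)


def parse_entry(line):
    """Return the fields of one lexc entry line, or None if it does not match."""
    m = ENTRY.match(line)
    return m.groupdict() if m else None


print(parse_entry('од:од A_KAL "(eng) /(fin)/(rus) " ;'))
print(parse_entry('автор:автор N_KAL ;'))
```

    An entry without a gloss, such as `автор:автор N_KAL ;`, is returned with `gloss` set to `None`.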

    ADD ADJECTIVES BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives-russian-like_newwords.lexc


    src-fst-morphology-stems-adjectives_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    эрзя-мокшонь:эрзя-мокшонь A_IS_GEN "(eng) /(fin) /(rus) " ;

    ADD ADJECTIVES BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives_newwords.lexc


    src-fst-morphology-stems-adverbs_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    лембстэ:лембстэ ADV_ "(eng) /(fin) /(rus) " ;

    ADD ADVERBS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/adverbs_newwords.lexc


    src-fst-morphology-stems-exceptions.lexc.md

    Exceptions are quite strange word forms, the ones that do not fit anywhere else. This file contains all enumerated word forms that cannot reasonably be created from lexical data by regular inflection. Usually there should be next to no exceptions: it is always better to have a paradigm that covers only one or a few words than an exception, since exceptions will not work nicely with e.g. the compounding scheme or possibly many end applications.

    verbs of negation have partial inflection: € аволь € иля € эзь

    The verb ярсамс has additional irregular forms: € ярстано € ярстадо

    The verb сеземс

    Some of the nouns have archaic consonant stem forms left: € ийть

    Peripheral

    Some random Russian elements:

    Some of the nouns have special forms for Gen PxSg1 and PxSg2:

    Reciprocal pronouns These might be done with flags

    These two stems have м loss, but its presence can be observed in the choice of “тнэ” over “тне”.

    This has special hard after lost consonant

    1930s Phonetic transcription

    дс » ц

    гт » к

    мекевлангт+Adv+Use/NG+Err/Orth:мекевланг K ;

    Half way between morphology and phonetics with a Russian twist

    ADPOSITIONS

    IDEOPHONES

    are dealt with as adverbs

    PRONOUNS

    QUANTIFIERS

    сисем+Num+Ord:сисеме NUMORD_KUDO ; This is irregularly formed, cf. сисемце

    NOUNS

    NOUNS WRITTEN APART

    PLACE NAMES

    GEO

    ANIMAL NAMES

    FIRST NAMES

    100 % homographs of Russian words

    adjectives in ой Adj-od » A_RU-OJ with +Use/SpellNoSugg

    +SP+Gen+Indef attributes as adjectives

    Russian language words found in Erzya texts

    Old Bible Names and words

    RUSSIAN VERBS

    unrecognized

    Problems with synchronization missing lemmas

    COLLECTIVE NOUNS


    This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


    src-fst-morphology-stems-genitive_attributes.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    Ботужале+N+Prop+SP+Gen+Indef:ботужале A_IS_PROP_GEN ;

    ADD ADJECTIVES BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/genitive_attributes.lexc


    src-fst-morphology-stems-hyphenated-nouns.lexc.md

    These are nouns with parallel declension

    ават%-тейтерть аванзо-тетянзо ават%-цёрат атявтт%-ававтт атят%-ават атят%-бабат атят%-сэрдят бабат%-нуцькат барант%-каткат боярт%-азорт боярт%-боярават

    вирть%-лугат вирть%-паксят вирть%-укшторт ворт%-грабительть ворт%-розбойникть эрзят%-мокшот


    This (part of) documentation was generated from src/fst/morphology/stems/hyphenated-nouns.lexc


    src-fst-morphology-stems-hyphenated-verbs.lexc.md

    These are verbs with parallel conjugation

    REDUPLICATION

    авардемс%-авардемс ардомс%-ардомс ардтневтемс%-ардтневтемс арсемс%-арсемс аштемс%-аштемс ванномс%-ванномс ваномс%-ваномс вешнемс%-вешнемс

    %-And such

    авардемс%-теемс арсемс%-теемс аштемс%-теемс ванномс%-теемс ваномс%-теемс

    андомс%-симдемс аштемс%-учомс велямс%-чарамс вастомс%-дёлямс васькамс%-оймамс витнемс%-петнемс ёмавтомс%-аравтомс ярсамс%-симемс

    SERIAL

    витнемс%-ютавтомс


    This (part of) documentation was generated from src/fst/morphology/stems/hyphenated-verbs.lexc


    src-fst-morphology-stems-myv-propernouns.lexc.md

    -kal

    -osh

    -kudo

    -kal

    -osh

    -kudo

    Place names, Settlements

    Rivers


    This (part of) documentation was generated from src/fst/morphology/stems/myv-propernouns.lexc


    src-fst-morphology-stems-nouns_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    автор:автор N_KAL ;

    ADD NOUNS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc


    src-fst-morphology-stems-nouns_russian_100_newwords.lexc.md

    This is where new Russian-equivalent nouns are added as lexc entries. This makes for a shared list in Mordvin analyser development.

    автор:автор N_KAL_rus100 ;

    ADD NOUNS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/nouns_russian_100_newwords.lexc


    src-fst-morphology-stems-propernouns_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    автор:автор N_KAL "(eng) /(fin) /(rus) " ;

    ADD NOUNS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/propernouns_newwords.lexc


    src-fst-morphology-stems-rusMaleNameDer.lexc.md

    The derivable male given names have been moved to the template urj-Cyrl-propernouns.lexc.


    This (part of) documentation was generated from src/fst/morphology/stems/rusMaleNameDer.lexc


    src-fst-morphology-stems-verbs_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files.

    ливтевкшнемс+V:ливтевкшне TV_KUNDAMS "(eng) /(fin) /(rus) " ;

    ADD VERBS BELOW

    These verbs just need Finnish translations A-M

    N-End


    This (part of) documentation was generated from src/fst/morphology/stems/verbs_newwords.lexc


    src-fst-phonetics-txt2ipa.xfscript.md

    | Description | X-SAMPA | IPA | Unicode hex, decimal |
    | retroflex plosive, voiceless | t` | ʈ | 0288, 648 (` = ASCII 096) |
    | retroflex plosive, voiced | d` | ɖ | 0256, 598 |
    | labiodental nasal | F | ɱ | 0271, 625 |
    | retroflex nasal | n` | ɳ | 0273, 627 |
    | palatal nasal | J | ɲ | 0272, 626 |
    | velar nasal | N | ŋ | 014B, 331 |
    | uvular nasal | N\ | ɴ | 0274, 628 |
    | bilabial trill | B\ | ʙ | 0299, 665 |
    | uvular trill | R\ | ʀ | 0280, 640 |
    | alveolar tap | 4 | ɾ | 027E, 638 |
    | retroflex flap | r` | ɽ | 027D, 637 |
    | bilabial fricative, voiceless | p\ | ɸ | 0278, 632 |
    | bilabial fricative, voiced | B | β | 03B2, 946 |
    | dental fricative, voiceless | T | θ | 03B8, 952 |
    | dental fricative, voiced | D | ð | 00F0, 240 |
    | postalveolar fricative, voiceless | S | ʃ | 0283, 643 |
    | postalveolar fricative, voiced | Z | ʒ | 0292, 658 |
    | retroflex fricative, voiceless | s` | ʂ | 0282, 642 |
    | retroflex fricative, voiced | z` | ʐ | 0290, 656 |
    | palatal fricative, voiceless | C | ç | 00E7, 231 |
    | palatal fricative, voiced | j\ | ʝ | 029D, 669 |
    | velar fricative, voiced | G | ɣ | 0263, 611 |
    | uvular fricative, voiceless | X | χ | 03C7, 967 |
    | uvular fricative, voiced | R | ʁ | 0281, 641 |
    | pharyngeal fricative, voiceless | X\ | ħ | 0127, 295 |
    | pharyngeal fricative, voiced | ?\ | ʕ | 0295, 661 |
    | glottal fricative, voiced | h\ | ɦ | 0266, 614 |

    alveolar lateral fricative, vl. K alveolar lateral fricative, vd. K\

    labiodental approximant P (or v) alveolar approximant r\ retroflex approximant r` velar approximant M\

    retroflex lateral approximant l` palatal lateral approximant L velar lateral approximant L
    Clicks

    bilabial O\ (O = capital letter) dental |
    (post)alveolar !\ palatoalveolar =\ alveolar lateral ||
    Ejectives, implosives

    ejective > e.g. ejective p p> implosive < e.g. implosive b b< Vowels

    close back unrounded M close central unrounded 1 close central rounded } lax i I lax y Y lax u U

    close-mid front rounded 2 close-mid central unrounded @\ close-mid central rounded 8 close-mid back unrounded 7

    schwa ə @

    open-mid front unrounded E open-mid front rounded 9 open-mid central unrounded 3 open-mid central rounded 3\ open-mid back unrounded V open-mid back rounded O

    ash (ae digraph) { open schwa (turned a) 6

    open front rounded & open back unrounded A open back rounded Q Other symbols

    voiceless labial-velar fricative W voiced labial-palatal approx. H voiceless epiglottal fricative H\ voiced epiglottal fricative <\ epiglottal plosive >\

    alveolo-palatal fricative, vl. s\ alveolo-palatal fricative, voiced z\ alveolar lateral flap l\ simultaneous S and x x\ tie bar _ Suprasegmentals

    primary stress “ secondary stress % long : half-long :\ extra-short _X linking mark -
    Tones and word accents

    level extra high _T level high _H level mid _M level low _L level extra low _B downstep ! upstep ^ (caret, circumflex)

    contour, rising contour, falling _F contour, high rising _H_T contour, low rising _B_L

    contour, rising-falling _R_F (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.) global rise global fall Diacritics

    voiceless _0 (0 = figure), e.g. n_0
    voiced _v
    aspirated _h
    more rounded _O (O = letter)
    less rounded _c
    advanced _+
    retracted _-
    centralized _"
    syllabic = (or _=) e.g. n= (or n_=)
    non-syllabic _^
    rhoticity `

    breathy voiced _t
    creaky voiced _k
    linguolabial _N
    labialized _w
    palatalized ' (or _j) e.g. t' (or t_j)
    velarized _G
    pharyngealized _?\

    dental _d
    apical _a
    laminal _m
    nasalized ~ (or _~) e.g. A~ (or A_~)
    nasal release _n
    lateral release _l
    no audible release _}

    velarized or pharyngealized _e
    velarized l, alternatively 5
    raised _r
    lowered _o
    advanced tongue root _A
    retracted tongue root _q
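
A table like the one above is typically applied with longest-match-first lookup, because many X-SAMPA symbols share a prefix (for example B and B\). The following is a minimal Python sketch of that idea, not the logic of txt2ipa.xfscript itself. The dictionary holds only a subset of the pairs listed above, and the names XSAMPA_TO_IPA and xsampa_to_ipa are hypothetical.

```python
# Hypothetical subset of the X-SAMPA table above (symbol -> IPA character).
# This is an illustration only, not the content of txt2ipa.xfscript.
XSAMPA_TO_IPA = {
    "B\\": "ʙ",   # bilabial trill
    "R\\": "ʀ",   # uvular trill
    "4": "ɾ",     # alveolar tap
    "p\\": "ɸ",   # bilabial fricative, voiceless
    "B": "β",     # bilabial fricative, voiced
    "T": "θ",     # dental fricative, voiceless
    "D": "ð",     # dental fricative, voiced
    "S": "ʃ",     # postalveolar fricative, voiceless
    "Z": "ʒ",     # postalveolar fricative, voiced
    "z`": "ʐ",    # retroflex fricative, voiced
    "X": "χ",     # uvular fricative, voiceless
    "R": "ʁ",     # uvular fricative, voiced
    "X\\": "ħ",   # pharyngeal fricative, voiceless
    "@": "ə",     # schwa
}

# Try longer symbols first so that "B\" is not read as "B" followed by "\".
_KEYS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text):
    """Convert the X-SAMPA symbols known to the table; copy everything else."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            # No table entry starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Characters without a table entry pass through unchanged, so plain letters such as a or t survive the conversion.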


    This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


    src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

    We describe here how abbreviations in Erzya are read out, e.g. for text-to-speech systems.

    For example:


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


    tools-grammarcheckers-grammarchecker.cg3.md

    E R Z Y A   G R A M M A R   C H E C K E R

    DELIMITERS

    TAGS AND SETS

    Upper and lower case

    This will be expanded for homonymy at first

    This will be expanded for homonymy at first, i.e., diminutives

    used with Dat PxSg1

    Derivation tags

    2VDerTag 2NDerTag

    DerTag

    Grammarchecker sets


    This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


    tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

    Tokeniser for myv

    Usage:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are

    1. unknown word-like forms, and
    2. unmatched strings

    We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    Unknowns are made of:
      • lower-case ASCII
      • upper-case ASCII
      • ASCII digits
      • select symbols
      • combining diacritics as individual symbols
      • various symbols from the Private Use Area (probably Microsoft), so far:
      • U+F0B7 for “x in box”
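
The character classes listed above can be pictured as a simple membership test. The Python sketch below is only an illustration of that list, not the pmscript definition. The text does not enumerate the "select symbols", so SELECT_SYMBOLS is a hypothetical placeholder, and the function names are invented for this example.

```python
import unicodedata

# Assumption: the source does not enumerate the "select symbols",
# so this small set is a hypothetical placeholder.
SELECT_SYMBOLS = "-_'"

# Private Use Area symbols named in the text ("x in box").
PRIVATE_USE_SYMBOLS = "\uf0b7"


def is_unknown_char(ch):
    """True if ch belongs to one of the character classes listed above."""
    if ch.isascii() and ch.isalnum():
        return True  # lower-case ASCII, upper-case ASCII, ASCII digits
    if ch in SELECT_SYMBOLS or ch in PRIVATE_USE_SYMBOLS:
        return True
    # Combining diacritics have a non-zero canonical combining class.
    return unicodedata.combining(ch) != 0


def is_unknown_wordlike(s):
    """True if s is non-empty and consists only of 'unknown' characters."""
    return bool(s) and all(is_unknown_char(ch) for ch in s)
```

Under these assumptions a string such as ukjend counts as an unknown word-like form, while a Cyrillic letter such as ц does not, since it falls outside every listed class.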

    Unknown handling

    Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
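
To make "a default baseform equal to the wordform, but no tag to check" concrete, here is a minimal Python sketch that reads CG-style cohorts and lists the wordforms whose readings carry no tags. It assumes the usual CG stream layout (a "<wordform>" line followed by indented "baseform" TAG lines). SAMPLE is hand-written for illustration and is not captured hfst-tokenise output; the tag CC and all function names are hypothetical.

```python
import re

# Assumed CG stream layout: a cohort line "<wordform>" followed by
# indented reading lines of the form "baseform" TAG TAG ...
COHORT = re.compile(r'^"<(.*)>"$')
READING = re.compile(r'^\s+"([^"]*)"\s*(.*)$')

# Hand-written illustration, NOT real tool output: the second cohort
# has a reading with a baseform but no tags.
SAMPLE = '"<ja>"\n\t"ja" CC\n"<xyzzy>"\n\t"xyzzy"\n'


def parse_cohorts(text):
    """Parse CG-style text into a list of {wordform, readings} dicts."""
    cohorts = []
    for line in text.splitlines():
        m = COHORT.match(line)
        if m:
            cohorts.append({"wordform": m.group(1), "readings": []})
            continue
        m = READING.match(line)
        if m and cohorts:
            tags = m.group(2).split()
            cohorts[-1]["readings"].append((m.group(1), tags))
    return cohorts


def untagged_wordforms(cohorts):
    """Wordforms with no readings, or whose readings all lack tags."""
    return [c["wordform"] for c in cohorts
            if all(not tags for _, tags in c["readings"])]
```

A CG rule has nothing to test on such untagged readings, which is the reason the text gives for leaving unknowns to hfst-tokenise.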

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


    tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

    Grammar checker tokenisation for myv

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


    tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

    TTS tokenisation for smj

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    make
    echo "ja, ja" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
    boasttu olmmoš, man mielde lahtuid." \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "márffibiillagáffe" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Needs hfst-tokenise to output things differently depending on the tag they get


    This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript
