Western Mari NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-mrj

Page Content

  • Vowel harmony rules
  • Consonant loss or transformation rules
  • Vowel addition rules
  • Vowel loss rules
  • Devoicing
  • Semantic tags
  • Multiple Semantic tags:
  • Derivation
  • Morphophonology
  • Lexeme disambiguation tags
  • Flag diacritics
  • Root lexicon
  • The initial lexica
  • Tags
  • Western Mari language model documentation

    All doc-comment documentation in one large file.


    src-cg3-disambiguator.cg3.md

    This is the Hill Mari disambiguation file. It chooses the correct morphological analyses in any given sentence context.

    It was copied from the Eastern Mari cg3 file 18.11.21. tt.

    The file first defines sentence delimiters, tags and sets. The rules come after that; each rule is listed below.

    Sentence delimiters

    The delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent

    The Tags section lists all the tags inherited from the fst, and defines them for use in the syntactic analysis. The tags are documented in the root.lexc file, and here only listed for reference.

    The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

    Tags

    Beginning and end of sentence

    BOS EOS

    Clause boundary

    Parts of speech tags

    N V A Adv CC CC CS Interj Pron Num Pcle Clt Po

    WORD is the set of all POS

    Verbal tense and mood tags

    Prs Prt1 Prt2 Fut Imprt Ind Cond Des

    Other verbal tags

    Act ConNeg FutPrc Ger Inf Nec Neg NegPrc Pass Prc PrfPrc

    Verbal person-number tags Sg1 Sg2 Sg3 Pl1 Pl2 Pl3

    Numeral tags

    Sg Pl

    Case tags

    Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)

    Other nominal tags

    Pers Refl Rel Interr Recipr Dem ABBR

    Adjective comparison tags

    Pos (?) Superl Comp

    Possessive suffix tags

    PxSg1 PxSg2 PxSg3 PxPl1 PxPl2 PxPl3

    Suffix ordering

    Numeral tags

    Card Coll Ord Temp (?)

    Particles

    Qst Foc

    Punctuation marks

    CLB PUCT LEFT RIGHT COMMA COLON

    Derivation tags

    Der/MWN modifier without [head] noun Der/sa Der/Pur Der/Caus Der/Nom

    Tags for internal testing

    CmpTest Err

    Sets

    Der/Date Der/Year Der/Hum Der/Lang Der/Domain Der/Feat-phys Der/Clth Der/Body Der/Act

    Sem/Ani Sem/Fem Sem/Group Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

    Rule section

    Early, word-internal rules

    CC or Pcle

    Specific words

    да

    и

    Interjection

    Predicative

    AifVövny selects A if вӧвны somewhere to the left

    Conjunctions

    Particles

    *InterrQ if question mark anywhere to the right

    *Interr removes Rel if question mark to the right somewhere

    Verbs

    Existential ulo

    Infinitives

    Adjectives

    *RemAdjBeforeProp removes A if Prop to the left

    *AdjBeforeMo selects A if Interr to the right

    *AdjBeforeConjAdj selects A if conjunction and A to the right

    *AdjNotN removes N if Pron Pers anywhere to the left

    *RemAdj2 removes A if no N or Pron in a clause

    Nouns

    *RemNomIfPronLeft removes Nom if Pron Nom anywhere to the left

    *RemNomIfPronRight removes Nom if Pron Nom anywhere to the right

    *NomBeforeConjNom selects N Nom if conjoined with N Nom

    *NafterDem selects N if Dem to the left (demonstratives tend to be sole modifiers)

    *NotANoun

    *NafterAbeforeEOS

    *RemNafterAdv removes N if adverb to the left
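The rule descriptions above follow the usual Constraint Grammar pattern: a rule either removes or selects readings of a word, depending on tags found in the context. The following is a minimal Python sketch of that idea, not the CG-3 engine and not the actual rules in disambiguator.cg3. The data structures and function names are hypothetical, and "to the left" is assumed here to mean the immediately preceding word.

```python
from typing import List, Set

Reading = Set[str]      # one morphological reading, as a set of tags
Cohort = List[Reading]  # all competing readings of one word form


def has_tag(cohort: Cohort, tag: str) -> bool:
    """True if any reading of the cohort carries the tag."""
    return any(tag in reading for reading in cohort)


def remove_if_left(sentence: List[Cohort], target: str, left_tag: str) -> None:
    """Remove readings with `target` when the previous word has `left_tag`.

    As in Constraint Grammar, the last remaining reading is never removed.
    """
    for i in range(1, len(sentence)):
        if has_tag(sentence[i - 1], left_tag):
            kept = [r for r in sentence[i] if target not in r]
            if kept:
                sentence[i] = kept


def select_if_left(sentence: List[Cohort], target: str, left_tag: str) -> None:
    """Keep only readings with `target` when the previous word has `left_tag`."""
    for i in range(1, len(sentence)):
        if has_tag(sentence[i - 1], left_tag):
            chosen = [r for r in sentence[i] if target in r]
            if chosen:
                sentence[i] = chosen


# Hypothetical input: an adverb followed by a word that is ambiguous
# between a noun reading and a verb reading.
example: List[Cohort] = [
    [{"Adv"}],
    [{"N", "Sg", "Nom"}, {"V", "Imprt", "Sg2"}],
]
remove_if_left(example, target="N", left_tag="Adv")  # in the spirit of *RemNafterAdv
```

After the call, the second word keeps only its verb reading. A rule in the spirit of *NafterDem would use `select_if_left(sentence, "N", "Dem")` instead.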

    Derivations

    Cases

    Proper nouns

    Numerals

    Pronouns

    Conjunctions

    Postpositions

    ConjNotPcle supposes we have found the particles earlier on, and now go for Conj.

    Adverbs

    Verbs

    Finite verb or Gerundium

    *RemGer removes Ger Gen if there is no verb to the right

    First or third person

    ConNeg or not


    This (part of) documentation was generated from src/cg3/disambiguator.cg3


    src-cg3-functions.cg3.md

    Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) …, meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
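The barrier scan described above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: each word is reduced to a single set of tags, the `PREMODIFIERS` set is a hypothetical stand-in for the real premodifier sets in functions.cg3, and the function name is invented for this example.

```python
from typing import List, Optional, Set

# Hypothetical premodifier tags; the real sets are defined in functions.cg3.
PREMODIFIERS: Set[str] = {"A", "Num", "Dem"}


def is_not_npmod(tags: Set[str]) -> bool:
    """Models WORD - premodifiers: any word that cannot premodify a noun."""
    return not (tags & PREMODIFIERS)


def scan_to_np_head(sentence: List[Set[str]], start: int) -> Optional[int]:
    """Scan rightwards from `start` for the first noun.

    The scan gives up (returns None) as soon as it meets a word that is
    neither a noun nor a possible premodifier, i.e. a barrier.
    """
    for i in range(start + 1, len(sentence)):
        tags = sentence[i]
        if "N" in tags:
            return i
        if is_not_npmod(tags):
            return None
    return None


with_np = [{"V"}, {"Dem"}, {"A"}, {"N"}]
blocked = [{"V"}, {"Adv"}, {"N"}]
head = scan_to_np_head(with_np, 0)      # finds the noun at index 3
no_head = scan_to_np_head(blocked, 0)   # the adverb acts as a barrier
```

The noun test is applied before the barrier test at each position, so a noun always ends the scan successfully.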

    These were the set types.

    NP attributives

    Postposition

    HABITIVE MAPPING

    Mari rules start here

    Mari rules stop here

    SUBJ MAPPING - leftovers

    OBJ MAPPING - leftovers

    HNOUN MAPPING


    This (part of) documentation was generated from src/cg3/functions.cg3


    src-fst-morphology-affixes-adjectives.lexc.md

    Hill Mari adjective inflection

    This file contains a handful of lexica, each with 3 subentries. The first two give +Pos+Attr and +Comp:рак, respectively, whereas the third entry gives a +Der/N tag and redirects to the relevant noun lexica for case inflection.

    Temporary lexicon

    Ordinary lexica


    This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


    src-fst-morphology-affixes-adverbs.lexc.md

    Compounding

    Does Hill Mari have anything like LEXICON R and %- N ;

    Interj lexica

    Postp lexica

    ADV_

    Temporal tags with cases


    This (part of) documentation was generated from src/fst/morphology/affixes/adverbs.lexc


    src-fst-morphology-affixes-clitics.lexc.md

    Clitics


    Hill Mari (Western Mari) adverbial clitics are not bound by POS.

    K

    WORDEND


    This (part of) documentation was generated from src/fst/morphology/affixes/clitics.lexc


    src-fst-morphology-affixes-nouns.lexc.md

    Noun inflection


    Ad hoc lexica

    Substandard and other lexica, i.e., hunspell

    Standard lexica

    LEXICON N_KOL кол:кол Back harmony

    LEXICON N_MOER мӧр:мӧр Front harmony

    LEXICON N_POCHTA почта:почта

    LEXICON N_OLMA олма:олма Back harmony

    LEXICON N_AEZAE ӓзӓ:ӓзӓ Front harmony

    LEXICON N_PECHEN1E печенье:печенье Front harmony

    LEXICON NMN_KOL кол:кол Back harmony

    LEXICON NMN_MOER мӧр:мӧр Front harmony

    LEXICON NMN_OLMA олма:олма Back harmony

    LEXICON NMN_AEZAE ӓзӓ:ӓзӓ Front harmony

    LEXICON N_AEVAE ӓвӓ:ӓвӓ

    PxSg1+NB+CASE singular possessa

    PxSg1+NB+CASE singular possessa

    PXSG3


    This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


    src-fst-morphology-affixes-pronouns.lexc.md

    Pronouns


    Hill Mari (Western Mari) pronouns…

    PRON_

    PERS-SG1

    PERS-SG2

    PERS-SG3

    PERS-PL1

    PERS-PL2

    PERS-PL3 нӹнӹ:нӹнӹ

    REFL ӹшке:ӹшк

    LEXICON PRON_MA ма+Pron:ма

    LEXICON DEM-SG тидӹ:ти

    LEXICON DEM-PL нинӹ:ни Plural pronoun with additional plural marking

    DemTag What are these тӹдӹмӓт, тӹдӹлӓнӓт

    Dem-Cx


    This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


    src-fst-morphology-affixes-propernouns.lexc.md

    Proper noun inflection

    Hill Mari proper nouns inflect in the same cases as regular nouns

    PROP_

    PROP_KENKSH

    PROP-PLC_

    PROP_KOL_PLC

    PROP_KOL_FEM

    PROP_OLMA_FEM

    PROP_OLMA_MAL

    PROP_KOL_MAL

    LEXICON PROP_KOL кол:кол PROP_KOL

    PROP-PLC_MOER

    PROP_MOER_MAL

    PROP-PLC_TYERVYE

    PROP-PLC_OLMA

    LEXICON PROP_OLMA кол:кол

    Male given name for deriving patronyms

    Вили:Вил

    Female Given names

    … etc.

    Russian type Surnames Абдеев:Абдеев

    Багрий:Багр

    Аморский:Аморск


    This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


    src-fst-morphology-affixes-quantifiers.lexc.md

    $ Quantifiers

    Hill Mari (Western Mari) numerals…

    NUM_


    This (part of) documentation was generated from src/fst/morphology/affixes/quantifiers.lexc


    src-fst-morphology-affixes-symbols.lexc.md

    Symbol affixes

    Noun_symbols_possibly_inflected

    Noun_symbols_never_inflected

    SYMBOL_connector

    SYMBOL_NO_suff

    SYMBOL_suff


    This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


    src-fst-morphology-affixes-verbs.lexc.md

    Hill Mari (Western Mari) verb inflection

    Irregular lexica

    Also lexica for +Hom tags

    Regular verbs

    Both -am and -em verbs and their subgroups.

    V_AM verbs

    LEXICON V_IAEM Approximately 619 2014-12-21 Front Vowel harmony for V_AM verbs LEXICON V_IAEM-SG3 Approximately 4 2014-12-21 Front Vowel harmony for V_AM-SG3 verbs LEXICON V_IAEM-3 Approximately 2 2014-12-21 Front Vowel harmony for V_AM-3 verbs

    LEXICON V_MOAM Approximately 465 2014-12-21 Back Vowel harmony for V_AM verbs LEXICON V_MOAM-SG3 Approximately 2 2014-12-21 Back Vowel harmony for V_AM-SG3 verbs Approximately 1 2014-12-21 Back Vowel harmony for V_AM-3 verbs

    STEM CHANGE

    LEXICON V_PYECHKAEM Approximately 12 2014-12-21 Back Vowel harmony for V_AM verbs

    LEXICON V_KACHKAM Approximately 11 2014-12-21 Back Vowel harmony for V_AM verbs

    V_EM verbs

    LEXICON V_KACHKAM Approximately 1052 2014-12-21 Front Vowel harmony for V_EM verbs LEXICON V_KACHKAM Approximately 2 2014-12-21 Front Vowel harmony for V_EM-SG3 verbs

    LEXICON V_KACHKAM Approximately 1315 2014-12-21 Back Vowel harmony for V_EM verbs LEXICON V_KACHKAM Approximately 4 2014-12-21 Back Vowel harmony for V_EM verbs LEXICON V_KACHKAM Approximately 1 2014-12-21 Back Vowel harmony for V_EM verbs

    Lexica pointing to suffix lexica

    Intermediate AM lexica

    LEXICON V_AM ӹштӓш:ӹшт

    пелӓш: онг пелеш What else is needed 2014-05-08

    Intermediate EM lexica

    хӹдӹртӓш: хӹдӹртӹ What else is needed 2014-05-08

    Suffix lexica

    NONPAST

    am verbs

    INDPRSSG1-am

    INDPRSSG2-am

    INDPRSSG3-am

    INDPRSPL1-am

    INDPRSPL2-am

    INDPRSPL3-am

    INDPRSCONNEG-am

    INDPRSPL3CONNEG-am

    em verbs

    INDPRSSG3-em

    INDPRSPL3-em

    PRETERIT 1

    am

    INDPRT1SG1-am

    INDPRT1SG2-am

    INDPRT1SG3-am

    INDPRT1PL1-am

    INDPRT1PL2-am

    INDPRT1PL3-am

    INDPRT1CONNEG-am

    INDPRT1PL3CONNEG-am

    em

    INDPRT1SG1-em

    INDPRT1SG2-em

    INDPRT1SG3-em

    INDPRT1PL1-em

    INDPRT1PL2-em

    PRETERIT 2

    am

    INDPRT2SG1-am

    INDPRT2SG2-am

    INDPRT2SG3-am

    INDPRT2PL1-am

    INDPRT2PL2-am

    INDPRT2PL3-am

    INDPRT2NEG-am INDPRT2NEG-am INDPRT2NEG-am INDPRT2NEG-am INDPRT2NEG-am INDPRT2NEG-am INDPRT2NEG-am

    em

    IMPERATIVE

    IMPRTSG2-am

    IMPRTSG3-am

    IMPRTPL2-am

    IMPRTPL3-am

    IMPRTIISG2-am

    IMPRTIIPL2-am

    DESIDERATIVE

    DES-am DESSG1-am DESSG2-am DESSG3-am DESPL1-am DESPL2-am DESPL3-am

    DES-em DESSG1-em DESSG2-em DESSG3-em DESPL1-em DESPL2-em DESPL3-em

    INFINITIVE

    INF_BACK

    NEG-PRC_BACK

    PASS-PRC_BACK

    ACT-PRC_BACK

    INF_FRONT

    NEG-PRC_FRONT

    PASS-PRC_FRONT

    ACT-PRC_FRONT


    This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


    src-fst-morphology-phonology.twolc.md

    Morphophonological rules for Hill Mari

    This file documents the phonology.twolc file

    Alphabet, Sets and Definitions

    Alphabet

    Letters of the alphabet

    Archiphonemes for vowels

    Triggers

    Boundary symbols

    Sets

    Definitions

    Back Harmony BHARM

    Front Harmony FHARM

    LFH

    LBH

    Rules

    Palatalisation rules

    Palatal mark loss before vowel rule

    й Deletion in front of я Suffix and others rule

    Tests:

    Ь2:ь Palatal mark for V АМ +Prt1+Sg1 rule

    толаш+V+Ind+Prt1+Sg1

    Tests:

    Ь2:0 Palatal mark for V АМ +Prt1+Sg1 rule

    Ь2:ш Palatal mark for V ЕМ +Prt1+Sg1 rule

    Vowel rules

    Onset vowel in а rule

    Onset vowel in ӓ rule

    Onset vowel in я rule

    Onset vowel in е rule

    Onset vowel loss in suffix ыӹ0 rule

    Onset vowel loss in suffix Е3 rule

    Onset vowel loss in suffix Е3 rule

    Onset vowel Е2 realized in suffix :е rule

    Onset vowel Е2:э after retained vowel rule

    Onset vowel Е3:э after retained vowel rule

    Onset vowel е:э after retained vowel rule

    Stem final ы loss before Е2 rule

    Tests:

    Stem final ӹ loss before Е2 rule

    Tests:

    Vowel harmony rules

    Onset vowel ыӹ0 realized in suffix %{ыӹØ%}:ы rule

    Onset vowel %{ыӹØ%} realized in suffix %{ыӹØ%}:ӹ rule

    Onset vowel %{ыӹе%} realized in suffix %{ыӹе%}:ӹ rule

    Onset vowel %{ыӹе%} realized in suffix %{ыӹе%}:ы rule

    Onset vowel %{ыӹе%} realized in suffix %{ыӹе%}:е rule

    Onset vowel %{ыӹе%} realized in suffix %{ыӹе%}:э rule

    Onset vowel %{ыӹэ%} realized in Ine and Ill suffixes %{ыӹэ%}:0 rule

    Onset vowel %{ыӹэ%} realized in Ine and Ill suffixes %{ыӹэ%}:0 rule

    Tests:

    Affix mid or final front %{аӓ%}:ӓ rule

    Tests:

    ӹштӓш+Hom2+V+Ind+Prs+Sg3: do/tehdä

    Tests:

    Affix mid or final back %{аӓ%}:а rule

    толаш+V+Ind+Prs+Pl1: come/tulla

    Tests:
    Tests:

    Affix mid or final back %{аӓ%}:я rule

    Tests:

    Affix initial back а:я rule

    Not SgNom а:ы rule

    Not SgNom а:ӹ rule

    suffix-final vowel backed %{ыӹ%}ы rule

    suffix-final vowel fronted %{ыӹ%}:ӹ rule
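The front/back rules for the archiphoneme %{аӓ%} listed above can be sketched in Python as follows. This is a simplified illustration, not the two-level rules in phonology.twolc: the vowel classes, the "last stem vowel decides" heuristic and the function names are assumptions, and the я realisation is left out. The test forms лӓкнӓ and качна are the ones given with the consonant rules in this file.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalise to NFC so that letters such as ӓ are single code points."""
    return unicodedata.normalize("NFC", text)


# Assumed vowel classes (not taken from phonology.twolc).
FRONT_VOWELS = set(nfc("ӓӧӱӹеи"))
BACK_VOWELS = set(nfc("аоуы"))

# Stands for the archiphoneme written %{аӓ%} in the lexc files.
ARCHIPHONEME = nfc("{аӓ}")


def is_front(stem: str) -> bool:
    """Assumption: the last vowel of the stem decides the harmony class."""
    for char in reversed(nfc(stem)):
        if char in FRONT_VOWELS:
            return True
        if char in BACK_VOWELS:
            return False
    return False  # stems without vowels default to back harmony


def realize(stem: str, suffix: str) -> str:
    """Attach a suffix, spelling out {аӓ} as ӓ (front) or а (back)."""
    vowel = nfc("ӓ") if is_front(stem) else nfc("а")
    return nfc(stem) + nfc(suffix).replace(ARCHIPHONEME, vowel)


front_form = realize("лӓк", "н{аӓ}")
back_form = realize("кач", "н{аӓ}")
```

Under these assumptions the front stem receives ӓ and the back stem receives а, in the same way the lexicon names pair up as front and back harmony variants.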

    Consonant loss or transformation rules

    т:0 in am type verbs rule лӓктӓш:лӓкнӓ

    к:0 in am type verbs rule качкаш:качна

    з:ц in am type verbs rule вазаш:вацна

    н:0 before з:ц in am type verbs rule негӹнзӓш:негӹц
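The four stem alternations above can be mimicked with a small Python function. This is a sketch only: the real rules are two-level rules with contexts that are not shown in this documentation, so the conditions below (final т after к, final к after ч, final з, final нз) are assumptions read off the four example pairs, and `reduce_stem` is a hypothetical name.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalise to NFC so that letters such as ӓ and ӹ are single code points."""
    return unicodedata.normalize("NFC", text)


def reduce_stem(stem: str) -> str:
    """Apply the consonant alternations to the end of an am-verb stem."""
    stem = nfc(stem)
    if stem.endswith("нз"):          # н:0 before з:ц
        return stem[:-2] + "ц"
    if stem.endswith("з"):           # з:ц
        return stem[:-1] + "ц"
    if stem.endswith(("кт", "чк")):  # т:0 and к:0 in a final cluster
        return stem[:-1]
    return stem


# Stems obtained by stripping the infinitive ending from the examples above.
forms = {
    "лӓктӓш": reduce_stem("лӓкт") + nfc("нӓ"),
    "качкаш": reduce_stem("качк") + "на",
    "вазаш": reduce_stem("ваз") + "на",
    "негӹнзӓш": reduce_stem("негӹнз"),
}
```

With these inputs the function reproduces the four documented forms лӓкнӓ, качна, вацна and негӹц.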

    Vowel addition rules

    Vowel gain

    0:ы between ш _ ж rule йиш:йишӹжӹ

    0:ӹ between ш _ ж rule йиш:йишӹжӹ
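The vowel gain between ш and ж can be sketched in Python as below. This is an illustration under assumptions, not the twolc rule itself: the vowel classes and the choice of ы or ӹ by the last stem vowel are guesses, and the function names are hypothetical. Only the pair йиш:йишӹжӹ comes from this documentation.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalise to NFC so that ӹ and й are single code points."""
    return unicodedata.normalize("NFC", text)


# Assumed vowel classes (not taken from phonology.twolc).
FRONT_VOWELS = set(nfc("ӓӧӱӹеи"))
BACK_VOWELS = set(nfc("аоуы"))


def is_front(stem: str) -> bool:
    """Assumption: the last vowel of the stem decides the harmony class."""
    for char in reversed(nfc(stem)):
        if char in FRONT_VOWELS:
            return True
        if char in BACK_VOWELS:
            return False
    return False


def join(stem: str, suffix: str) -> str:
    """Concatenate stem and suffix, inserting ы or ӹ between ш and ж."""
    stem, suffix = nfc(stem), nfc(suffix)
    if stem.endswith("ш") and suffix.startswith("ж"):
        epenthetic = nfc("ӹ") if is_front(stem) else nfc("ы")
        return stem + epenthetic + suffix
    return stem + suffix


example = join("йиш", "жӹ")
```

For the documented pair, the stem йиш is treated as front because of и, so the inserted vowel is ӹ.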

    Vowel loss rules

    suffix-final vowel loss before subsequent suffix-initial vowel %{ыӹ%}:0 rule

    Not SgNom for lat а:0 rule

    Not SgNom for lat е:0 rule

    Tests:

    Devoicing

    Onset consonant devoicing rule


    This (part of) documentation was generated from src/fst/morphology/phonology.twolc


    src-fst-morphology-root.lexc.md

    Western Mari morphological analyser

    The file declares the multicharacter symbols of Western Mari, and gives the Root lexicon.

    Analysis symbols

    The morphological analyses of wordforms for the Western Mari language are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags).

    The parts-of-speech tags:

    Subtags for pronouns

    Tags for nominals

    Suffix ordering tags:

    Tags for numerals

    Tags for verbs

    Usage tags:

    Tag from older orthographic norms

    Other tags words

    Question and Focus particles:

    Tags to be checked and harmonised.

    Semantic tags

    Multiple Semantic tags:

    Semantics are classified with

    Derivation

    Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

    Morphophonology

    To represent phonologic variations in word forms we use the following symbols in the lexicon files:

    Lexeme disambiguation tags

    Flag diacritics

    We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

    Flag Comment
    @P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised

    For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

    Flag Comment
    @P.CmpFrst.FALSE@ Require that words tagged as such only appear first
    @D.CmpPref.TRUE@ Block such words from entering ENDLEX
    @P.CmpPref.FALSE@ Block these words from making further compounds
    @D.CmpLast.TRUE@ Block such words from entering R
    @D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
    @U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
    @P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
    @D.CmpOnly.FALSE@ Disallow words coming directly from root.
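To make the tables above easier to read, here is a small Python sketch of how the flag diacritic operators P (set), D (disallow), C (clear), U (unify) and R (require) constrain a path through the lexicon. It models only the operator semantics as they are generally documented for xfst/HFST; it is not how HFST implements them, and `path_is_valid` is a hypothetical helper name.

```python
import re
from typing import Dict, Iterable

# Matches @OP.FEATURE@ and @OP.FEATURE.VALUE@ for the operators modelled here.
FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols: Iterable[str]) -> bool:
    """Return True if the flag diacritics along a path are all satisfied."""
    state: Dict[str, str] = {}
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol, no effect on the flag state
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":    # positive set
            state[feature] = value
        elif op == "C":  # clear
            state.pop(feature, None)
        elif op == "D":  # disallow: fail if the feature has this value
            if current is not None and (value is None or current == value):
                return False
        elif op == "R":  # require: fail unless the feature has this value
            if current is None or (value is not None and current != value):
                return False
        elif op == "U":  # unify: set if unset, otherwise values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# A verb sets NeedNoun; the path is blocked unless a nominaliser clears it.
blocked = path_is_valid(["@P.NeedNoun.ON@", "verb", "@D.NeedNoun.ON@"])
allowed = path_is_valid(
    ["@P.NeedNoun.ON@", "verb", "@C.NeedNoun@", "nominaliser", "@D.NeedNoun.ON@"]
)
```

Here `blocked` is False and `allowed` is True, which corresponds to the restriction that compounds with verbs are only permitted once the verb has been nominalised.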

    Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

    Flag Comment
    @U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
    @U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

    Root lexicon

    The word forms in Hill Mari start from the lexeme roots of basic word classes, or optionally from prefixes. The assumption is that xml files with names pos.xml will provide the source material for the initial pos.lexc LEXICON Pos entries.

    ENDLEX goes to # for now.


    This (part of) documentation was generated from src/fst/morphology/root.lexc


    src-fst-morphology-stems-adjectives_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files. йолтишкӓ:йолтишкӓ A_OLMA “(eng) /(fin) /(rus) “ ;

    ADD ADJECTIVES BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives_newwords.lexc


    src-fst-morphology-stems-exceptions.lexc.md

    PROPER GIVEN NAMES

    PROPER PATRONYMS

    PROPER PLACE NAMES


    This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


    src-fst-morphology-stems-mrj-propernouns.lexc.md

    This may contain Meadow & Eastern Mari (mhr) place names. The letter ‹ҥ› is only used in (mhr).

    йолтишкӓ+N+Prop:йолтишкӓ PROP_OLMA “(eng) /(fin) /(rus) “ ;


    This (part of) documentation was generated from src/fst/morphology/stems/mrj-propernouns.lexc


    src-fst-morphology-stems-nouns_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files. йолтишкӓ+N:йолтишкӓ A_OLMA “(eng) /(fin) /(rus) “ ;

    ADD NOUNS BELOW


    This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc


    src-fst-morphology-stems-numerals.lexc.md

    Meadow & Eastern Mari numerals

    The initial lexica

    The Roman numerals


    This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


    src-fst-morphology-stems-prefixes.lexc.md

    Prefixes

    Prefixes in the Western Mari language are bound to the beginning of other words.


    This (part of) documentation was generated from src/fst/morphology/stems/prefixes.lexc


    src-fst-morphology-stems-propernouns_newwords.lexc.md

    This is where new words are added as lexc entries before they are added to the xml source files. йолтишкӓ+N+Prop:йолтишкӓ PROP_OLMA “(eng) /(fin) /(rus) “ ;

    ADD NOUNS BELOW

    MARI-LIKE NAMES

    PLACE NAMES


    This (part of) documentation was generated from src/fst/morphology/stems/propernouns_newwords.lexc


    src-fst-phonetics-txt2ipa.xfscript.md

    retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
    retroflex plosive, voiced d` ɖ 0256, 598
    labiodental nasal F ɱ 0271, 625
    retroflex nasal n` ɳ 0273, 627
    palatal nasal J ɲ 0272, 626
    velar nasal N ŋ 014B, 331
    uvular nasal N\ ɴ 0274, 628

    bilabial trill B\ ʙ 0299, 665
    uvular trill R\ ʀ 0280, 640
    alveolar tap 4 ɾ 027E, 638
    retroflex flap r` ɽ 027D, 637
    bilabial fricative, voiceless p\ ɸ 0278, 632
    bilabial fricative, voiced B β 03B2, 946
    dental fricative, voiceless T θ 03B8, 952
    dental fricative, voiced D ð 00F0, 240
    postalveolar fricative, voiceless S ʃ 0283, 643
    postalveolar fricative, voiced Z ʒ 0292, 658
    retroflex fricative, voiceless s` ʂ 0282, 642
    retroflex fricative, voiced z` ʐ 0290, 656
    palatal fricative, voiceless C ç 00E7, 231
    palatal fricative, voiced j\ ʝ 029D, 669
    velar fricative, voiced G ɣ 0263, 611
    uvular fricative, voiceless X χ 03C7, 967
    uvular fricative, voiced R ʁ 0281, 641
    pharyngeal fricative, voiceless X\ ħ 0127, 295
    pharyngeal fricative, voiced ?\ ʕ 0295, 661
    glottal fricative, voiced h\ ɦ 0266, 614

    alveolar lateral fricative, vl. K
    alveolar lateral fricative, vd. K\

    labiodental approximant P (or v)
    alveolar approximant r\
    retroflex approximant r`
    velar approximant M\

    retroflex lateral approximant l`
    palatal lateral approximant L
    velar lateral approximant L

    Clicks

    bilabial O\ (O = capital letter)
    dental |
    (post)alveolar !\
    palatoalveolar =\
    alveolar lateral ||

    Ejectives, implosives

    ejective > e.g. ejective p p>
    implosive < e.g. implosive b b<

    Vowels

    close back unrounded M
    close central unrounded 1
    close central rounded }
    lax i I
    lax y Y
    lax u U

    close-mid front rounded 2
    close-mid central unrounded @\
    close-mid central rounded 8
    close-mid back unrounded 7

    schwa ə @

    open-mid front unrounded E
    open-mid front rounded 9
    open-mid central unrounded 3
    open-mid central rounded 3\
    open-mid back unrounded V
    open-mid back rounded O

    ash (ae digraph) {
    open schwa (turned a) 6

    open front rounded &
    open back unrounded A
    open back rounded Q

    Other symbols

    voiceless labial-velar fricative W
    voiced labial-palatal approx. H
    voiceless epiglottal fricative H\
    voiced epiglottal fricative <\
    epiglottal plosive >\

    alveolo-palatal fricative, vl. s\
    alveolo-palatal fricative, voiced z\
    alveolar lateral flap l\
    simultaneous S and x x\
    tie bar _

    Suprasegmentals

    primary stress “
    secondary stress %
    long :
    half-long :\
    extra-short _X
    linking mark -

    Tones and word accents

    level extra high _T
    level high _H
    level mid _M
    level low _L
    level extra low _B
    downstep !
    upstep ^ (caret, circumflex)

    contour, rising
    contour, falling _F
    contour, high rising _H_T
    contour, low rising _B_L

    contour, rising-falling _R_F
    (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
    global rise
    global fall

    Diacritics

    voiceless 0 (0 = figure), e.g. n_0
    voiced _v
    aspirated _h
    more rounded _O (O = letter)
    less rounded _c
    advanced _+
    retracted _-
    centralized _”
    syllabic = (or _=) e.g. n= (or n=)
    non-syllabic _^
    rhoticity `

    breathy voiced _t
    creaky voiced _k
    linguolabial _N
    labialized _w
    palatalized ‘ (or _j) e.g. t’ (or t_j)
    velarized _G
    pharyngealized _?\

    dental d
    apical _a
    laminal _m
    nasalized ~ (or _~) e.g. A~ (or A~)
    nasal release _n
    lateral release _l
    no audible release _}

    velarized or pharyngealized _e
    velarized l, alternatively 5
    raised _r
    lowered _o
    advanced tongue root _A
    retracted tongue root _q
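The symbol table above pairs ASCII notations with IPA characters and their code points. A converter built on such a table could look like the Python sketch below. It is not the txt2ipa.xfscript transducer: the dictionary holds only a few rows for which the table gives code points, the retroflex symbols are assumed to carry a trailing backtick (as in X-SAMPA, which this table appears to follow), and the names `ASCII_TO_IPA` and `to_ipa` are hypothetical.

```python
from typing import Dict

# A few rows of the table, written with the hexadecimal code points it lists.
ASCII_TO_IPA: Dict[str, str] = {
    "t`": "\u0288",   # retroflex plosive, voiceless
    "d`": "\u0256",   # retroflex plosive, voiced
    "F": "\u0271",    # labiodental nasal
    "J": "\u0272",    # palatal nasal
    "N": "\u014b",    # velar nasal
    "N\\": "\u0274",  # uvular nasal (N followed by a backslash)
    "S": "\u0283",    # postalveolar fricative, voiceless
    "Z": "\u0292",    # postalveolar fricative, voiced
    "T": "\u03b8",    # dental fricative, voiceless
    "D": "\u00f0",    # dental fricative, voiced
}

# Longer notations first, so that "N\" wins over "N".
_KEYS = sorted(ASCII_TO_IPA, key=len, reverse=True)


def to_ipa(text: str) -> str:
    """Convert ASCII notation to IPA by greedy longest-match replacement."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(ASCII_TO_IPA[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # characters outside the table pass through
            i += 1
    return "".join(out)


velar = to_ipa("aNa")
uvular = to_ipa("aN\\a")
```

Matching the longest notation first matters because several symbols share a prefix, for example N and N\.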


    This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


    src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

    We describe here how abbreviations in Western Mari are read out, e.g. for text-to-speech systems.

    For example:


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


    tools-grammarcheckers-grammarchecker.cg3.md

    [ L A N G U A G E ] G R A M M A R C H E C K E R

    DELIMITERS

    TAGS AND SETS

    Tags

    This section lists all the tags inherited from the fst, and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

    Beginning and end of sentence

    BOS EOS

    Parts of speech tags

    N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT PUNCT

    COMMA ¶

    Tags for POS sub-categories

    Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

    Tags for morphosyntactic properties

    Nom Acc Gen Ill Loc Com Ess Ess Sg Du Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

    Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

    Err/Orth

    Semantic tags

    Sem/Act Sem/Ani Sem/Atr Sem/Body Sem/Clth Sem/Domain Sem/Feat-phys Sem/Fem Sem/Group Sem/Lang Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

    HUMAN

    PROP-ATTR PROP-SUR

    TIME-N-SET

    Syntactic tags

    @+FAUXV @+FMAINV @-FAUXV @-FMAINV @-FSUBJ> @-F<OBJ @-FOBJ> @-FSPRED<OBJ @-F<ADVL @-FADVL> @-F<SPRED @-F<OPRED @-FSPRED> @-FOPRED> @>ADVL @ADVL< @<ADVL @ADVL> @ADVL @HAB> @<HAB @>N @Interj @N< @>A @P< @>P @HNOUN @INTERJ @>Num @Pron< @>Pron @Num< @OBJ @<OBJ @OBJ> @OPRED @<OPRED @OPRED> @PCLE @COMP-CS< @SPRED @<SPRED @SPRED> @SUBJ @<SUBJ @SUBJ> SUBJ SPRED OPRED @PPRED @APP @APP-N< @APP-Pron< @APP>Pron @APP-Num< @APP-ADVL< @VOC @CVP @CNP OBJ

    -OTHERS SYN-V @X

## Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the sourcefile itself to inspect the sets, what follows here is an overview of the set types.

### Sets for Single-word sets

INITIAL

### Sets for word or not

WORD NOT-COMMA

### Case sets

ADLVCASE CASE-AGREEMENT CASE NOT-NOM NOT-GEN NOT-ACC

### Verb sets

NOT-V

### Sets for finiteness and mood

REAL-NEG MOOD-V NOT-PRFPRC

### Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

### Pronoun sets

### Adjectival sets and their complements

### Adverbial sets and their complements

### Sets of elements with common syntactic behaviour

### NP sets defined according to their morphosyntactic features

### The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression **WORD - premodifiers**.

### Border sets and their complements ### Grammarchecker sets * * * This (part of) documentation was generated from [tools/grammarcheckers/grammarchecker.cg3](https://github.com/giellalt/lang-mrj/blob/main/tools/grammarcheckers/grammarchecker.cg3) --- # tools-tokenisers-tokeniser-disamb-gt-desc.1938.pmscript.md Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just: $ make $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst Issues: - [X] Ambiguous input - Seems to work fine - [X] Ambiguous multiword expessions with ambiguous tokenisation - Seems to work – represented within lexc now; hfst-tokenise also supports forms on the analyses now - [X] Ambiguous multiword expessions need reorganising after CG - The module cg-mwesplit takes wordforms from readings and turns them into new cohorts - [X] Unknown words - The set-difference method only works for words without flag diacritics (even though we should be working only on the form-side?) and leads to binary blow-up: With only lower unknowns, we get 45M; lower+upper gives 67M, while no unknowns gives 27M - Fixed instead by treating empty analyses as unknown-tokens in hfst-tokenise, and outputting unmatched strings with a prefix - [ ] Treat input that's within superblanks as unmatched - probably requires a change in hfst-tokenise itself - [X] Try >1 space for ambiguous MWE's? – represented within lexc now - [ ] Try set-difference-unknowns method with regular hfst commands? More usage examples: $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst $ echo "(gáfe) 'ja' ja 3. ja? 
ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a TODO: Could use something like this, but built-in's don't include šžđčŋ: Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them. Needs hfst-tokenise to output things differently depending on the tag they get * * * This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.1938.pmscript](https://github.com/giellalt/lang-mrj/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.1938.pmscript) --- # tools-tokenisers-tokeniser-disamb-gt-desc.eighties.pmscript.md Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just: $ make $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst Issues: - [X] Ambiguous input - Seems to work fine - [X] Ambiguous multiword expessions with ambiguous tokenisation - Seems to work – represented within lexc now; hfst-tokenise also supports forms on the analyses now - [X] Ambiguous multiword expessions need reorganising after CG - The module cg-mwesplit takes wordforms from readings and turns them into new cohorts - [X] Unknown words - The set-difference method only works for words without flag diacritics (even though we should be working only on the form-side?) 
and leads to binary blow-up: With only lower unknowns, we get 45M; lower+upper gives 67M, while no unknowns gives 27M - Fixed instead by treating empty analyses as unknown-tokens in hfst-tokenise, and outputting unmatched strings with a prefix - [ ] Treat input that's within superblanks as unmatched - probably requires a change in hfst-tokenise itself - [X] Try >1 space for ambiguous MWE's? – represented within lexc now - [ ] Try set-difference-unknowns method with regular hfst commands? More usage examples: $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a TODO: Could use something like this, but built-in's don't include šžđčŋ: Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them. 
Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.eighties.pmscript](https://github.com/giellalt/lang-mrj/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.eighties.pmscript)

---

# tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

# Tokeniser for mrj

Usage:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some Unicode white space characters:

* En Quad `U+2000` to Zero-Width Joiner `U+200D`
* Narrow No-Break Space `U+202F`
* Medium Mathematical Space `U+205F`
* Word joiner `U+2060`

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`.

Unknowns are made of:

* lower-case ASCII
* upper-case ASCII
* some cyrillics
* select extended latin symbols
* extended cyrillics
* ASCII digits
* select symbols
* Combining diacritics as individual symbols,
* various symbols from Private area (probably Microsoft), so far:
  * `U+F0B7` for "x in box"

## Unknown handling

Unknowns are tagged `??` and treated specially with `hfst-tokenise`: `hfst-tokenise --giella-cg` will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.pmscript](https://github.com/giellalt/lang-mrj/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-disamb-gt-desc.thirties.pmscript.md

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Issues:

- [X] Ambiguous input
  - Seems to work fine
- [X] Ambiguous multiword expressions with ambiguous tokenisation
  - Seems to work
  - represented within lexc now; hfst-tokenise also supports forms on the analyses now
- [X] Ambiguous multiword expressions need reorganising after CG
  - The module cg-mwesplit takes wordforms from readings and turns them into new cohorts
- [X] Unknown words
  - The set-difference method only works for words without flag diacritics (even though we should be working only on the form-side?) and leads to binary blow-up: with only lower-case unknowns, we get 45M; lower- plus upper-case gives 67M, while no unknowns gives 27M
  - Fixed instead by treating empty analyses as unknown tokens in hfst-tokenise, and outputting unmatched strings with a prefix
- [ ] Treat input that's within superblanks as unmatched
  - probably requires a change in hfst-tokenise itself
- [X] Try >1 space for ambiguous MWEs?
  - represented within lexc now
- [ ] Try set-difference-unknowns method with regular hfst commands?
More usage examples:

    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: <https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch>

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`.

TODO: Could use something like this, but built-ins don't include šžđčŋ:

Simply give an empty reading when something is unknown: `hfst-tokenise --giella-cg` will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.thirties.pmscript](https://github.com/giellalt/lang-mrj/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.thirties.pmscript)

---

# tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

# Grammar checker tokenisation for mrj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some Unicode white space characters:

* En Quad `U+2000` to Zero-Width Joiner `U+200D`
* Narrow No-Break Space `U+202F`
* Medium Mathematical Space `U+205F`
* Word joiner `U+2060`

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`.

Unknowns include:

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * `U+F0B7` for "x in box"

TODO: Could use something like this, but built-ins don't include šžđčŋ:

Simply give an empty reading when something is unknown: `hfst-tokenise --giella-cg` will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.
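The empty-reading rule described above can be illustrated with a small sketch. It only mirrors the behaviour stated in the prose (a token whose analyses are all empty counts as unknown; otherwise the empty analyses are dropped) and is not how hfst-tokenise is implemented. The function name `resolve_readings` and the analysis strings such as `ja+CC` are hypothetical, chosen for illustration only.

```python
from typing import List, Tuple


def resolve_readings(readings: List[str]) -> Tuple[bool, List[str]]:
    """Return (is_unknown, kept_readings) for the analyses of one token.

    An empty (or whitespace-only) string stands for an empty analysis.
    This is an illustrative assumption about the data shape, not the
    internal representation used by hfst-tokenise.
    """
    non_empty = [r for r in readings if r.strip()]
    if not non_empty:
        # Every analysis was empty: the token is treated as unknown.
        return True, []
    # At least one real analysis exists: empty analyses are removed.
    return False, non_empty


if __name__ == "__main__":
    # Hypothetical inputs: one token with only an empty analysis,
    # one token with a mix of real and empty analyses.
    print(resolve_readings([""]))
    print(resolve_readings(["ja+CC", "", "ja+Adv"]))
```

Keeping this decision in the tokeniser rather than in CG matches the reasoning in the text: an empty reading carries no tag for a CG rule to test.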
Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript](https://github.com/giellalt/lang-mrj/blob/main/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

# TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.
Whitespace contains ASCII white space and the List contains some Unicode white space characters:

* En Quad `U+2000` to Zero-Width Joiner `U+200D`
* Narrow No-Break Space `U+202F`
* Medium Mathematical Space `U+205F`
* Word joiner `U+2060`

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`.

Unknowns include:

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * `U+F0B7` for "x in box"

TODO: Could use something like this, but built-ins don't include šžđčŋ:

Simply give an empty reading when something is unknown: `hfst-tokenise --giella-cg` will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-tts-cggt-desc.pmscript](https://github.com/giellalt/lang-mrj/blob/main/tools/tokenisers/tokeniser-tts-cggt-desc.pmscript)
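The whitespace inventory listed for these tokenisers can be written out as an explicit character set. The following is a minimal sketch for checking which code points the prose covers; it is not the pmscript definition. The names `ASCII_WHITESPACE`, `UNICODE_WHITESPACE` and `is_tokeniser_whitespace` are hypothetical, and the exact members of "ASCII white space" are an assumption.

```python
# ASCII white space: space, tab, newline, carriage return, vertical tab,
# form feed. Which of these the pmscript actually includes is an assumption.
ASCII_WHITESPACE = set(" \t\n\r\x0b\x0c")

# Unicode code points named in the documentation:
# En Quad U+2000 through Zero-Width Joiner U+200D (inclusive), plus
# Narrow No-Break Space, Medium Mathematical Space and Word Joiner.
UNICODE_WHITESPACE = {chr(cp) for cp in range(0x2000, 0x200D + 1)} | {
    "\u202F",  # Narrow No-Break Space
    "\u205F",  # Medium Mathematical Space
    "\u2060",  # Word Joiner
}

TOKENISER_WHITESPACE = ASCII_WHITESPACE | UNICODE_WHITESPACE


def is_tokeniser_whitespace(ch: str) -> bool:
    """Return True if the single character `ch` is in the sketched set."""
    return ch in TOKENISER_WHITESPACE


if __name__ == "__main__":
    for ch in [" ", "\u2009", "\u2060", "a"]:
        print(f"U+{ord(ch):04X}", is_tokeniser_whitespace(ch))
```

Spelling the range out this way makes it easy to confirm that the `U+2000` to `U+200D` span is inclusive at both ends, 14 code points in total.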