Erzya NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-myv


Erzya language model documentation

All doc-comment documentation in one large file.


src-cg3-disambiguator.cg3.md

DELIMITERS

TAGS AND SETS

Tags

This section lists all the tags inherited from the FST and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

Dialect homonyms of Sg Gen Def

foreign

Semantic tags

noun phrase heads

Syntactic tags

Upper and lower case

Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

Single-word sets

The set INITIAL for initial letters

Sets for word or not

Derivational affixes

Case sets

ADLVCASE

Verb sets

NOT-V

Sets for finiteness and mood

MOOD-V

Homonymy for subject conjugation and subject-object conjugation with Pl3 object

VFIN

VFIN-POS

Sets for person

Pronoun sets

кортамс мезде

words that go with эрьва

for кизэ homonymy PxSg2

for кизэ homonymy PxSg1

This will be expanded for homonymy at first

This will be expanded for homonymy at first, i.e., diminutives

these have homonyms

used with Dat PxSg1

Derivation tags

2VDerTag 2NDerTag

DerTag

Pl Nom Def is a homonym of the verb stem in тне-мс. This is relevant for Clt/Cop with ScPl1 and ScPl2

in SP Gen Indef the next word can be кель

2023_03_15 important part of regular inflection


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions.cg3.md

negation marker for fits between negation and conneg

MOOD-V

Erzya and Moksha

this needs Moksha, too.

finite auxiliary verbs with

макссь чарькодемс, Deal with DATAUX separately; they also take MS

finite auxiliary taking supine MO/ME

finite supaux 2023_03_13


This (part of) documentation was generated from src/cg3/functions.cg3


src-fst-morphology-affixes-adjectives.lexc.md

Adjective inflection

Adjectives and other parts of speech in Erzya are compared by means of either a particle or ablative case marking on the standard of comparison.

ordinals in -це

истямо:истя

кондямо:кондя

кодамо:кода кодамо:кода кодатнэ кодатне


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-adpositions.lexc.md

The Erzya language postpositions can be broken into many subgroups according to morphological and semantic criteria

Some of the nouns have defective paradigms, e.g. кудыкельганть

ало:ал alo-SPAT-1Arg

This allows for possessor indices, word end or focus e.g. вельде, вельдеяк, вельдензэ ?вельдензэль, вельдензэтне

This allows for word end, possessor indices, predication

postposition that is in ablative case алдо:алдо

postposition that is in elative case потсто:потсто

postposition that is in illative case эземс:эзем

postposition that is in illative case эйс:э

postposition that is in inessive case эйсэ:эйсэ, кисэ

postposition that is in lative case ютков:ютков

postposition that is in locative case ало:ало

postposition that has no continuation пачк

postposition that is in ablative case алдо:алдо

postposition that is in elative case потсто:потсто

postposition that is in illative case малас:мала

postposition that is in illative case малас:мала

postposition that is in illative case потс:пот

postposition that is in illative case эйс:э

postposition that is in inessive case потсо:потсо

postposition that is in lative case алов:ало

postposition that is in locative case ало:ало

postposition that is in prolative case перька:перька

+Temp: K ; перть

+Ela+Temp: PO_POSS_OR_END_FOC ; пингстэ


This (part of) documentation was generated from src/fst/morphology/affixes/adpositions.lexc


src-fst-morphology-affixes-adverbs.lexc.md

Adverb inflection

The Erzya language adverbs do not compare.

Not a real particle; it can take a clitic седеяк

LEXICON ADV-SPAT_ пачк

LEXICON ADV_IS_LAT алов

LEXICON ADV_IS_LOC ало

LEXICON ADV/PO/PRON-SPAT_ALO ало:ал

LEXICON ADV-SPAT_ALO ало:ал

“стядо”

spatial adverbs dependent and independent case marking

This marking would indicate a word form that may be


This (part of) documentation was generated from src/fst/morphology/affixes/adverbs.lexc


src-fst-morphology-affixes-interjections.lexc.md

Interjections

The Erzya language interjections…


This (part of) documentation was generated from src/fst/morphology/affixes/interjections.lexc


src-fst-morphology-affixes-nonverbalConjugation.lexc.md

Non-Verbal conjugation

In the Erzya language, nominals and adverbs also conjugate.

Used with deverbals

This is where adjectives get their plural T.

used with infinitives

Conjugation

NON-VERB CONJUGATION

Conjugation

_KAL-NomSg-Conjugation-only

This allows Clt/Cop+Prs Sg1|Sg2|Pl1|Pl2 Clt/Cop+Prt2 Sg1|Sg2|Sg3|Pl1|Pl2|Pl3 K 2019-01-26
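The combinations listed in the line above can be enumerated mechanically. The following is a minimal sketch, not part of the lexc source: it assumes that tags are joined with `+` and that subject person tags carry the `Sc` prefix seen in analyses elsewhere in this documentation (e.g. `+Clt/Cop+Prt2+ScPl3`). The names `ALLOWED` and `copula_tag_strings` are hypothetical.

```python
# Hypothetical helper: enumerate the tense/person pairs named in the note above.
ALLOWED = {
    "Prs": ["Sg1", "Sg2", "Pl1", "Pl2"],                 # no third person listed
    "Prt2": ["Sg1", "Sg2", "Sg3", "Pl1", "Pl2", "Pl3"],  # all six persons listed
}


def copula_tag_strings(allowed=ALLOWED):
    """Return strings such as 'Clt/Cop+Prs+ScSg1' for every listed pair."""
    return [
        f"Clt/Cop+{tense}+Sc{person}"
        for tense, persons in allowed.items()
        for person in persons
    ]


for tags in copula_tag_strings():
    print(tags)
```

The sketch only spells out the inventory; whether each combination is actually reachable depends on the continuation lexica in the source file.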

_KUDO-NomPl-Conjugation-only

_KUDO-NomPl-Conjugation-only-mutual

Are there copula verb combinations? 2024-08-06


This (part of) documentation was generated from src/fst/morphology/affixes/nonverbalConjugation.lexc


src-fst-morphology-affixes-nouns.lexc.md

Noun inflection

Nouns in Erzya inflect for number, case and declension (definite, indefinite and possessive).

LEXICON N_PELE пеле:пель, ало:ал

KINSHIP

HUMAN

PLACE

LATIVE

VOCATIVE

NAMES OF MONTHS

COMMON NOUNS

кардаз:карда

панго:панг

потмо:пот

Front vowel, non-palatal consonant before vowel

Front vowel, palatal consonant before vowel

Front vowel, non-palatal consonant before vowel

Does this need a diminutive?

NMN

harmony: front

DERIVATION

pango:pang

N_KUDO-Def-Declension

N_KUDO-Def-Declension

N_KUDO-Def-Declension

Plurale tantum

DEFINITE SINGULAR TAGS

INDEFINITE DECLENSION

SG-NOM-INDEF_LAK ;

SG-NOM-INDEF_KAL ;

SG-NOM-INDEF_OSH ;

INDEFINITE TAGS

POSSESSIVE DECLENSION

CASES BEFORE POSSESSIVE TAGS

DEFINITE PLURAL

Cases for тнэ

NP head ellipsis declension, Modifiers without nouns = MWN

Nouns1S_A

POSSESSIVE marking followed by clitics

Possessor indices

The Erzya language possessor indices or possessive suffixes may be followed by a number of morpheme types

These are possessor indices that can be followed by predicate marking. In the present there is no distinction between ScSg3 and ScPl3. Possessor indices allowing (1) #, (2) Foc, (3) Der/Pr ()

This appears with kinship terminology

Is “_KAL” necessary ?

DAT-PXPL1 ;

POSSESSIVE TAGS

These are possessor Indices for non-nominative singular NonNomSg

word boundary or focus


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-pronouns.lexc.md

Pronoun inflection

Erzya pronouns inflect in many of the same cases as regular nouns.

Closed class personal pronouns

+Sem/Hum+Sg+Nom:е ENDLEX ; кие:ки

+Sem/Obj: CLT/COP_SG ; singular

мон:мо

тон:то

сон:со

минь:

тынь:ты

сынь:сы

Obligatory Possessor Index

Demonstrative

Interrogative

What should be done

кона:кона This is not the same as indefinite PronRel-kona

What should be done

LEXICON PRON-IS-INTERR-SPAT-INE косо

What should be done

Relative pronouns

ки:ки

ки

мезе+Pron:мезИ2 Misc_Pronouns1 ;

мезе+Pron+Rel+Gen:мень K ;

ки+Pron:ки Misc_Pronouns1 ;

Some pronoun continuations have been moved here out of TestLexc-noun.txt


This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper noun inflection

Erzya proper nouns inflect in the same cases as regular nouns.

Андрей:Андре

Вили:Вил

Russian type Surnames Абдеев:Абдеев

Багрий:Багр

Аморский:Аморск

Front-vowel stem

DECLENSION LIMITATIONS


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-quantifiers.lexc.md

Quantifier inflection

Erzya quantifiers inflect in many of the same cases as regular nouns.

Now regular

кавонст

омбонст

кавонест is a pronoun like the Finnish molemmat. This means a radical increase in the Erzya pronoun inventory: 6 x for each numeral 2 and above

кавксоненек

once, twice; весть, кавксть, аламоксть

twofold, threefold; веенькирда, кавонькирда, колмонькирда

васенцеде advmod:multimprf > advmod:ordimprf

васняяк ‘first of all’

Numeral with a range limitation to adnominal phrase

2012-08-09


This (part of) documentation was generated from src/fst/morphology/affixes/quantifiers.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Verb inflection

Erzya language verbs inflect for person, subject and object.

OBJECT FLAGS AND +V tags а+V:а

LEXICON V-AUX-NEG-PRT1 а+V:эзь

LEXICON TV_KADOMS

LEXICON TV_NEVTEMS_SUB

LEXICON TV_SAVTOMS_SUB

LEXICON TV_SAVTOMS

LEXICON TV_SAVTOMS-SG3_SUBJ/ZERO

LEXICON TV_CHACHTOMS

LEXICON TV_KUNDAMS_SUB

LEXICON TV_KUNDAMS

LEXICON TV_SATOMS

LEXICON TV_TUEMS

LEXICON TV_TEEMS

VERBS WITH THIRD PERSON OBJECTS @U.CONJ-PX.13@

VERBS WITH INTRANSITIVE TAGS +V

AUXILIARY VERBS

DERIVATION

VERBS AFTER TRANSITIVITY Tags OBJECT FLAGS

теемс:тей теемс:тей

no deverbals

no deverbals

no deverbals

DERIVATION

LEXICON TV_NEKSHNEMS Alternates with TRA

This is fed by actors and participles in N_myv, A_myv and Prc_myv

CONJUGATION

Indicative Preterite I

INDICATIVE

Indicative NonPast

INDICATIVE PRETERITE 2

DESIDERATIVE

CONJUNCTIVE

redo conj 2012-11-07 begin

redo conj 2012-11-07 end

begin

end

OPTATIVE

IMPERATIVE

PRECATIVE

OPTATIVE

2012-11-09

Given in Grammar 2000

Used with deverbals

ваномс+V+Imprt+ScPl2+Clt/Ga: look/katsoa


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-clitics.lexc.md

Clitics

The Erzya language clitics…

END


This (part of) documentation was generated from src/fst/morphology/clitics.lexc


src-fst-morphology-phonology.twolc.md

The Erzya morphophonological/twolc rules file

This file documents the phonology.twolc file

Alphabet

ӓ Ӓ ҥ Ҥ і І ѳ Ѳ Pre-Soviet 1930s letters

Special letters in the root that might be useful in dialect research and etymology later

идиса, идима ашоян disallow о:0

вт%{оеэ%}мО1

%{frontHard%}:0 (front harmony, hard)
%{frontSoft%}:0 (front harmony, soft)
%{back%}:0 (back harmony)
%{backHard%}:0 (back harmony)

%^OldAE:0 (this allows Ӓ4 and Ӓ3 to be realized as я)
%^NoLinkVow:0 (no linking vowel is used only after consonants for error)
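Entries in this file are written as twolc symbol pairs, `lexical:surface`, where a surface side of `0` means that the lexical symbol leaves nothing in the written form. As a minimal sketch (the helper names `parse_pair` and `realisations` are hypothetical, and the contexts that select each realisation are not modelled), the pairs can be grouped by lexical symbol like this:

```python
from collections import defaultdict


def parse_pair(pair):
    """Split 'lexical:surface'; a surface side of '0' becomes the empty string."""
    lexical, _, surface = pair.rpartition(":")
    return lexical, "" if surface == "0" else surface


def realisations(pairs):
    """Group the possible surface forms by lexical symbol, in input order."""
    table = defaultdict(list)
    for pair in pairs:
        lexical, surface = parse_pair(pair)
        table[lexical].append(surface)
    return dict(table)


# Pairs copied from this documentation.
pairs = ["%{оеэ%}:е", "%{оеэ%}:о", "%{оеэ%}:э", "%{back%}:0", "%^OldAE:0"]
for lexical, surfaces in realisations(pairs).items():
    print(lexical, "->", [s or "(nothing)" for s in surfaces])
```

Harmony and trigger symbols such as `%{back%}` and `%^OldAE` always map to zero; they exist only to steer the rules that choose among the realisations of symbols like `%{оеэ%}`.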

verbStemVowStrong:0

Ӓ3 Ӓ4 as я

A1:o

Y2:yi

%{оеэ%}:е неемс+V+Ger+Ill+PxPl1: –see/nähdä–

%{оеэ%}:о псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa

%{оеэ%}:э

%{оеэØ%}:0 %{оеэØ%}:е панемс+V+Ind+ConNeg: drive/ajaa

вадемс+V+Der/Ovt+Prc/Telic+Sg+Nom+Def: the greased one/

%{оеэØ%}:э кев+N+SP+Ill+PxSg2: rock/kivi

%{оеэØ%}:о ков+N+SP+Ill+PxSg2: moon/kuu

%{уиыØ%}:и панемс+V+Inf+Dial/NW: drive/ajaa

%{уиыØ%}:ы кев+N+SP+Ill+PxSg2: rock/kivi

%{уиыØ%}:у ков+N+SP+Ill+PxSg2: moon/kuu

O1:e

O1:o

%{оэØ%}:e

тев+N+Sg+Nom+PxSg3+Err/Orth-no-linking-vowel: thing/juttu

%{оэØ%}:o

псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa

%{оэØ%}:0

O1:0

%{ое%}:е

%{ое%}:о

A2:a
путомс+V+Prec+ScSg2: put/laittaa

и:ы

j:0

**Е3:э always ** %> т н _ 2013-02-23

**Е3:э sometimes ** %> т н _ 2013-02-23

**ye:e always **
сыр

Н1:н
Н1:к

а: и Dimin

о: ы Dimin

у: и Dimin

о regressive raising у озномс+V+Ind+Prs+ScSg1+OcSg3+Dial/NW: bless/siunata

э: и Dimin

а: и Dimin

о: и Dimin

у: и Dimin

я: и Dimin

ё: и Dimin

ю: и Dimin

е: и Dimin

a:ya

n loss with plural ведун+N+Pl+Indef: knower/tietäjä

v:0

G1:0

G1:g

G1:k

G2:g

G2:k

G4:0
саемс+V+Ind+Prs+ConNeg+Clt/Ga:

G4:k

потмо+N+Relator+SP+Ela+Indef: inside/sisäosa

imperative suffix K1:t

лыказевемс+V+Imprt+ScSg2: have taken

K1:к
ливтемс+V+Prec+ScSg2: set out/laittaa esille

U4:y
кал+N+Sg+Nom+Def: fish/kala

пильге+N+Pl+Nom+Indef leg; foot/jalka

U4:0

вадемс+V+Der/Ovt+Prc/Telic+Sg+Nom+Def: the greased one/

валдо+N+Pl+Nom+Indef light/valo

t:d
ловомс+V+Ind+Prs+ScSg1+OcSg2: regard/pitää jonain

s:0

d:t

d:d

y:y

y:0

меремс+V+Ind+Prt1+ScSg3: say/sanoa

Disallow TLoss after non-t

Disallow RegrRaise after A

Disallow vow loss before break

Disallow OldAE when no Ä

Disallow KLoss after non-k

Disallow SLoss after non-s

Disallow %^WLoss after non-v

Disallow Н1:н after Letters

р н :Vows (HarmDummies:)] (ь:) %> _ %> %{оеэØ%}: ;

Disallow soft loss

Disallow soft loss чувто+N+Pl+Nom+Def: tree/puu

веле+N+SP+Tra+PxSg2

псака+N+SP+Abe+PxSg2+Clt/Cop+Prt2+ScPl3+Clt/Gak

псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa

веле+N+SP+Tra+PxSg2+Clt/Cop+Prt2+ScPl3: village/kylä

Disallow %^NoLinkVow after vowel

Disallow s for control of stems with inessive…

Disallow dano after non-voiced

Disallow dano after non-voiced

Disallow k for control of comparative with stem types


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Morphology

INTRODUCTION TO MORPHOLOGICAL ANALYSER OF ERZYA.

Analysis symbols

The morphological analyses of wordforms of Erzya are presented in this system in terms of the following symbols. (It is strongly recommended to follow existing standards when adding new tags.)
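Analyses in this documentation are written as a lemma followed by `+`-separated tags, optionally followed by a colon and a gloss (for example `кев+N+SP+Ill+PxSg2: rock/kivi`). The following is a minimal sketch of splitting such a string; the function name `split_analysis` is hypothetical, and it assumes that neither the lemma nor the tags contain `+` or `:`.

```python
def split_analysis(line):
    """Split 'lemma+Tag+Tag: gloss' into (lemma, tags, gloss)."""
    analysis, _, gloss = line.partition(":")
    lemma, *tags = analysis.strip().split("+")
    return lemma, tags, gloss.strip()


lemma, tags, gloss = split_analysis("кев+N+SP+Ill+PxSg2: rock/kivi")
print(lemma, tags, gloss)
```

Tags that contain a slash, such as `Clt/Ga` or `Dial/NW`, are kept whole because only `+` is treated as a separator.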

The parts-of-speech are:

Parts of speech are further split up into:

Adjectives

Adverbs

Interjections

Nouns

Particles

Postpositions + Spat, + Temp

Pronouns

Quantifiers (numerals)

Quantifiers and Numerals are classified under:

Nominals are inflected for Number and Case

Number

Case

Possession and other declension types are marked with:

The comparative forms are:

Verb moods are:

Infinitive moods

Tenses in the indicative and infrequently in the conditional

Verb personal forms are:

Other verb forms are

The Usage extents are marked using following tags:

Dialect tags

Orthography tags

Abbreviated words are classified with:

Special symbols

Delimiter marks are classified with:

The verbs are syntactically split according to transitivity:

Auxiliary verbs

Special multiword units are analysed with:

Non-dictionary words can be recognised with:

Question and Focus particles:

Semantic tags

Semantic tags to help disambiguation & synt. analysis: (before POS) Borrowed from main/langs/sme/src/morphology/root.lexc

Simplex tags

Multiple Semantic tags:

Semantics are classified with

Semantic Fields

Other tags

Verbal arguments

Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

Homonymy

Der begin

Declaring noun derivations

Modifier without noun

Declaring Indefinite Pronoun derivations

DECLARING NOUN DERIVATIONS

DECLARING NUMERAL DERIVATIONS

DECLARING DEVERBAL DERIVATIONS OF VERBS

Morphophonology

To represent phonologic variations in word forms we use the following symbols in the lexicon files:

And following triggers to control variation

Special letters in the root that might be useful in dialect research and etymology later

вт%{оеэ%}мО1 suffix-internal archivowel

%^OldAE — This allows Ӓ4 and Ӓ3 to be realized as я

MISC

Development tag

Compounding

Tags

Imperative clitics

Tags distinguishing different versions of the same lemma (before POS)

Symbols that need to be escaped on the lower side (towards twolc):

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: only allow compounds with verbs if the verb is further derived into a noun again:

| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
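The general behaviour of the flag operators used above (P = set, D = disallow, C = clear, U = unify) can be illustrated with a small interpreter. This is a sketch under stated assumptions, not the HFST implementation: it covers only the P, D, C, U and R operators, and the flag sequences in the usage lines are hypothetical paths chosen for illustration, not paths taken from the Erzya lexicon.

```python
import re

FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def apply_flag(flag, state):
    """Apply one flag diacritic to `state` (a feature -> value dict).

    Returns the new state, or None if the flag blocks the path.
    """
    match = FLAG.fullmatch(flag)
    if match is None:
        raise ValueError(f"unsupported flag diacritic: {flag}")
    op, feature, value = match.groups()
    if op in "PU" and value is None:
        raise ValueError(f"{flag} needs a value")
    current = state.get(feature)
    new_state = dict(state)
    if op == "P":    # positive set: overwrite the feature
        new_state[feature] = value
    elif op == "C":  # clear: forget the feature
        new_state.pop(feature, None)
    elif op == "D":  # disallow: block if the feature has this (or any) value
        if current is not None and (value is None or current == value):
            return None
    elif op == "R":  # require: block unless the feature has this (or any) value
        if current is None or (value is not None and current != value):
            return None
    elif op == "U":  # unify: block on a conflicting value, otherwise set
        if current is not None and current != value:
            return None
        new_state[feature] = value
    return new_state


def path_is_allowed(flags):
    """Check whether a sequence of flag diacritics can be traversed."""
    state = {}
    for flag in flags:
        state = apply_flag(flag, state)
        if state is None:
            return False
    return True


# Hypothetical paths: a verb that is not nominalised, then one that is.
print(path_is_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
print(path_is_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```

In the first path the D flag meets the value set by the P flag and blocks the word; in the second the C flag clears the feature first, so the path passes.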

Flags used to identify parts of speech

Flags used with +Clt/Cop nonverbal predication

Flags used with transitivity

problematic

This allows or disallows combining with hyphen through loop especially for acronyms 2012-11-04

This disallows secondary compounding

Linking vowel for use with Translative

FLAGS USED WITH COLLECTIVE NOUNS

number

Removal

Flag diacritic Explanation
@U.number.one@ Flag used to give arabic numerals in smj different cases ;
@U.number.two@ Flag used to give arabic numerals in smj different cases ;
@U.number.three@ Flag used to give arabic numerals in smj different cases ;
@U.number.four@ Flag used to give arabic numerals in smj different cases ;
@U.number.five@ Flag used to give arabic numerals in smj different cases ;
@U.number.six@ Flag used to give arabic numerals in smj different cases ;
@U.number.seven@ Flag used to give arabic numerals in smj different cases ;
@U.number.eight@ Flag used to give arabic numerals in smj different cases ;
@U.number.nine@ Flag used to give arabic numerals in smj different cases ;
@U.number.zero@ Flag used to give arabic numerals in smj different cases ;

The word forms in Erzya start from the lexeme roots of basic word classes, or optionally from prefixes. Here follow all continuation lexica (contlexes), approximately 20.

CyrillicFemaleName ; HUNSPELL Type name derivation RussianMalenamesDerive ; ! RussianSurnamesDerive ;

увол-авол

alo-SPAT-1Arg ; >PO_KAL-LOC


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-adjectives-russian-like_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

од:од A_KAL "(eng) /(fin)/(rus) " ;
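A new-word entry has the shape `upper:lower CONTLEX "gloss" ;`, where the gloss is optional. The following is a minimal sketch of reading such a line; the name `parse_entry` is hypothetical, and the pattern assumes the straight double quotes used in lexc source files, no escaped colons, and no whitespace inside the upper or lower side.

```python
import re

ENTRY = re.compile(
    r'^(?P<upper>[^:\s]+):(?P<lower>\S+)\s+'  # upper:lower
    r'(?P<contlex>\S+)\s*'                     # continuation lexicon
    r'(?:"(?P<gloss>[^"]*)"\s*)?;\s*$'         # optional gloss, then ';'
)


def parse_entry(line):
    """Return the parts of one lexc entry line as a dict."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple lexc entry: {line!r}")
    return match.groupdict()


print(parse_entry('од:од A_KAL "(eng) /(fin)/(rus) " ;'))
print(parse_entry("Ботужале+N+Prop+SP+Gen+Indef:ботужале A_IS_PROP_GEN ;"))
```

Entries without a gloss simply report `None` for that field, which makes it easy to list the words that still need translations.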

ADD ADJECTIVES BELOW


This (part of) documentation was generated from src/fst/morphology/stems/adjectives-russian-like_newwords.lexc


src-fst-morphology-stems-adjectives_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

эрзя-мокшонь:эрзя-мокшонь A_IS_GEN "(eng) /(fin) /(rus) " ;

ADD ADJECTIVES BELOW


This (part of) documentation was generated from src/fst/morphology/stems/adjectives_newwords.lexc


src-fst-morphology-stems-adverbs_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

лембстэ:лембстэ ADV_ "(eng) /(fin) /(rus) " ;

ADD ADVERBS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/adverbs_newwords.lexc


src-fst-morphology-stems-exceptions.lexc.md

Exceptions are quite strange word forms: the ones that do not fit anywhere else. This file contains all enumerated word forms that cannot reasonably be created from lexical data by regular inflection. Usually there should be next to no exceptions; it is always better to have a paradigm that covers only one or a few words than an exception, since exceptions will not work nicely with e.g. the compounding scheme or possibly many end applications.

Verbs of negation have partial inflection: аволь, иля, эзь

The verb ярсамс has additional irregular forms: ярстано, ярстадо

The verb сеземс

Some of the nouns have archaic consonant stem forms left: ийть

Peripheral

Some random Russian elements:

Some of the nouns have special forms for Gen PxSg1 and PxSg2:

Reciprocal pronouns These might be done with flags

These two stems have м loss, but its presence can be observed in the choice of “тнэ” over “тне”. This has special hard after lost consonant

1930s phonetic transcription: дс » ц, гт » к

мекевлангт+Adv+Use/NG+Err/Orth:мекевланг K ;

Halfway between morphology and phonetics, with a Russian twist

ADPOSITIONS

IDEOPHONES

are dealt with as adverbs

PRONOUNS

QUANTIFIERS

сисем+Num+Ord:сисеме NUMORD_KUDO ; This is irregularly formed, cf. сисемце

NOUNS

NOUNS WRITTEN APART

PLACE NAMES

GEO

ANIMAL NAMES

FIRST NAMES

100 % homographs of Russian words

adjectives in ой Adj-od » A_RU-OJ with +Use/SpellNoSugg

+SP+Gen+Indef attributes as adjectives

Russian language words found in Erzya texts

Old Bible Names and words

RUSSIAN VERBS

unrecognized

Problems with synchronization missing lemmas

COLLECTIVE NOUNS


This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


src-fst-morphology-stems-genitive_attributes.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

Ботужале+N+Prop+SP+Gen+Indef:ботужале A_IS_PROP_GEN ;

ADD ADJECTIVES BELOW


This (part of) documentation was generated from src/fst/morphology/stems/genitive_attributes.lexc


src-fst-morphology-stems-hyphenated-nouns.lexc.md

These are nouns with parallel declension

ават%-тейтерть аванзо-тетянзо ават%-цёрат атявтт%-ававтт атят%-ават атят%-бабат атят%-сэрдят бабат%-нуцькат барант%-каткат боярт%-азорт боярт%-боярават

вирть%-лугат вирть%-паксят вирть%-укшторт ворт%-грабительть ворт%-розбойникть эрзят%-мокшот
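In lexc, `%` escapes the character that follows it, so `%-` in the entries above stands for a literal hyphen. As a minimal sketch (the helper names `unescape_lexc` and `parallel_parts` are hypothetical), the written form and the two parallel-declined parts can be recovered like this:

```python
import re


def unescape_lexc(text):
    """Drop lexc '%' escapes, so that '%-' becomes '-'."""
    return re.sub(r"%(.)", r"\1", text)


def parallel_parts(entry):
    """Split a hyphenated entry on its escaped hyphens ('%-')."""
    return [unescape_lexc(part) for part in entry.split("%-")]


print(unescape_lexc("ават%-тейтерть"))
print(parallel_parts("эрзят%-мокшот"))
```

Splitting on `%-` rather than on a bare hyphen keeps the sketch tied to the escaped form used in the stem file.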


This (part of) documentation was generated from src/fst/morphology/stems/hyphenated-nouns.lexc


src-fst-morphology-stems-hyphenated-verbs.lexc.md

These are verbs with parallel conjugation

REDUPLICATION

авардемс%-авардемс ардомс%-ардомс ардтневтемс%-ардтневтемс арсемс%-арсемс аштемс%-аштемс ванномс%-ванномс ваномс%-ваномс вешнемс%-вешнемс

%-And such

авардемс%-теемс арсемс%-теемс аштемс%-теемс ванномс%-теемс ваномс%-теемс

андомс%-симдемс аштемс%-учомс велямс%-чарамс вастомс%-дёлямс васькамс%-оймамс витнемс%-петнемс ёмавтомс%-аравтомс ярсамс%-симемс

SERIAL

витнемс%-ютавтомс


This (part of) documentation was generated from src/fst/morphology/stems/hyphenated-verbs.lexc


src-fst-morphology-stems-myv-propernouns.lexc.md

-kal

-osh

-kudo

-kal

-osh

-kudo

Place names, Settlements

Rivers


This (part of) documentation was generated from src/fst/morphology/stems/myv-propernouns.lexc


src-fst-morphology-stems-nouns_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

автор:автор N_KAL "(eng) /(fin) /(rus) " ;

ADD NOUNS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc


src-fst-morphology-stems-propernouns_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

автор:автор N_KAL "(eng) /(fin) /(rus) " ;

ADD NOUNS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/propernouns_newwords.lexc


src-fst-morphology-stems-rusMaleNameDer.lexc.md

The derivable male given names have been moved to the template urj-Cyrl-propernouns.lexc.


This (part of) documentation was generated from src/fst/morphology/stems/rusMaleNameDer.lexc


src-fst-morphology-stems-verbs_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

ливтевкшнемс+V:ливтевкшне TV_KUNDAMS "(eng) /(fin) /(rus) " ;

ADD VERBS BELOW

These verbs just need Finnish translations A-M

N-End


This (part of) documentation was generated from src/fst/morphology/stems/verbs_newwords.lexc


src-fst-phonetics-txt2ipa.xfscript.md

retroflex plosive, voiceless  t`  ʈ  0288, 648 (` = ASCII 096)
retroflex plosive, voiced  d`  ɖ  0256, 598
labiodental nasal  F  ɱ  0271, 625
retroflex nasal  n`  ɳ  0273, 627
palatal nasal  J  ɲ  0272, 626
velar nasal  N  ŋ  014B, 331
uvular nasal  N\  ɴ  0274, 628

bilabial trill  B\  ʙ  0299, 665
uvular trill  R\  ʀ  0280, 640
alveolar tap  4  ɾ  027E, 638
retroflex flap  r`  ɽ  027D, 637
bilabial fricative, voiceless  p\  ɸ  0278, 632
bilabial fricative, voiced  B  β  03B2, 946
dental fricative, voiceless  T  θ  03B8, 952
dental fricative, voiced  D  ð  00F0, 240
postalveolar fricative, voiceless  S  ʃ  0283, 643
postalveolar fricative, voiced  Z  ʒ  0292, 658
retroflex fricative, voiceless  s`  ʂ  0282, 642
retroflex fricative, voiced  z`  ʐ  0290, 656
palatal fricative, voiceless  C  ç  00E7, 231
palatal fricative, voiced  j\  ʝ  029D, 669
velar fricative, voiced  G  ɣ  0263, 611
uvular fricative, voiceless  X  χ  03C7, 967
uvular fricative, voiced  R  ʁ  0281, 641
pharyngeal fricative, voiceless  X\  ħ  0127, 295
pharyngeal fricative, voiced  ?\  ʕ  0295, 661
glottal fricative, voiced  h\  ɦ  0266, 614

alveolar lateral fricative, vl.  K
alveolar lateral fricative, vd.  K\

labiodental approximant  P (or v\)
alveolar approximant  r\
retroflex approximant  r\`
velar approximant  M\

retroflex lateral approximant  l`
palatal lateral approximant  L
velar lateral approximant  L\
Clicks

bilabial  O\ (O = capital letter)
dental  |\
(post)alveolar  !\
palatoalveolar  =\
alveolar lateral  |\|\

Ejectives, implosives

ejective  _>  e.g. ejective p  p_>
implosive  _<  e.g. implosive b  b_<

Vowels

close back unrounded  M
close central unrounded  1
close central rounded  }
lax i  I
lax y  Y
lax u  U

close-mid front rounded  2
close-mid central unrounded  @\
close-mid central rounded  8
close-mid back unrounded  7

schwa  ə  @

open-mid front unrounded  E
open-mid front rounded  9
open-mid central unrounded  3
open-mid central rounded  3\
open-mid back unrounded  V
open-mid back rounded  O

ash (ae digraph)  {
open schwa (turned a)  6

open front rounded  &
open back unrounded  A
open back rounded  Q

Other symbols

voiceless labial-velar fricative  W
voiced labial-palatal approx.  H
voiceless epiglottal fricative  H\
voiced epiglottal fricative  <\
epiglottal plosive  >\

alveolo-palatal fricative, vl.  s\
alveolo-palatal fricative, voiced  z\
alveolar lateral flap  l\
simultaneous S and x  x\
tie bar  _

Suprasegmentals

primary stress  "
secondary stress  %
long  :
half-long  :\
extra-short  _X
linking mark  -\

Tones and word accents

level extra high  _T
level high  _H
level mid  _M
level low  _L
level extra low  _B
downstep  !
upstep  ^ (caret, circumflex)

contour, rising  _R
contour, falling  _F
contour, high rising  _H_T
contour, low rising  _B_L
contour, rising-falling  _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise  <R>
global fall  <F>

Diacritics

voiceless  _0 (0 = figure), e.g. n_0
voiced  _v
aspirated  _h
more rounded  _O (O = letter)
less rounded  _c
advanced  _+
retracted  _-
centralized  _"
syllabic  = (or _=), e.g. n= (or n_=)
non-syllabic  _^
rhoticity  `

breathy voiced  _t
creaky voiced  _k
linguolabial  _N
labialized  _w
palatalized  ' (or _j), e.g. t' (or t_j)
velarized  _G
pharyngealized  _?\

dental  _d
apical  _a
laminal  _m
nasalized  ~ (or _~), e.g. A~ (or A_~)
nasal release  _n
lateral release  _l
no audible release  _}

velarized or pharyngealized  _e
velarized l, alternatively  5
raised  _r
lowered  _o
advanced tongue root  _A
retracted tongue root  _q
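The consonant rows at the start of this table give, for each X-SAMPA symbol, the IPA character and its Unicode code point in hexadecimal and decimal. A few of those rows can be turned into a lookup table as follows. This is a minimal sketch with hypothetical names (`XSAMPA_TO_IPA`, `to_ipa`); it is not the conversion implemented in txt2ipa.xfscript and it covers only single-character symbols.

```python
# A few rows from the table: X-SAMPA symbol -> IPA character.
XSAMPA_TO_IPA = {
    "F": "\u0271",  # labiodental nasal
    "J": "\u0272",  # palatal nasal
    "N": "\u014b",  # velar nasal
    "T": "\u03b8",  # dental fricative, voiceless
    "D": "\u00f0",  # dental fricative, voiced
    "S": "\u0283",  # postalveolar fricative, voiceless
    "Z": "\u0292",  # postalveolar fricative, voiced
    "G": "\u0263",  # velar fricative, voiced
}


def to_ipa(symbols):
    """Convert a list of X-SAMPA symbols; unlisted symbols pass through."""
    return "".join(XSAMPA_TO_IPA.get(symbol, symbol) for symbol in symbols)


for symbol, ipa in XSAMPA_TO_IPA.items():
    # Print the code point in the same "hex, decimal" form as the table.
    print(symbol, ipa, f"{ord(ipa):04X}, {ord(ipa)}")
```

Multi-character symbols such as `N\` would need a tokenisation step before lookup, which this sketch leaves out.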


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in Erzya are read out, e.g. for text-to-speech systems.

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

E R Z Y A   G R A M M A R   C H E C K E R

DELIMITERS

TAGS AND SETS

Upper and lower case

This will be expanded for homonymy at first

This will be expanded for homonymy at first, i.e., diminutives

used with Dat PxSg1

Derivation tags

2VDerTag 2NDerTag

DerTag

Grammarchecker sets


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for myv

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
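The same pipeline can be driven from a script. The sketch below is an assumption-laden illustration: it presumes that `hfst-tokenise` is on the PATH and that `make` has already produced `tokeniser-disamb-gt-desc.pmhfst` in the current directory. The function names `tokenise_command` and `tokenise` are hypothetical.

```python
import subprocess


def tokenise_command(pmhfst="tokeniser-disamb-gt-desc.pmhfst"):
    """Build the command line used in the usage examples above."""
    return ["hfst-tokenise", "--giella-cg", pmhfst]


def tokenise(text, pmhfst="tokeniser-disamb-gt-desc.pmhfst"):
    """Pipe `text` through hfst-tokenise and return its standard output."""
    result = subprocess.run(
        tokenise_command(pmhfst),
        input=text + "\n",  # echo also appends a newline
        capture_output=True,
        text=True,
        check=True,         # raise if the tokeniser exits with an error
    )
    return result.stdout


print(" ".join(tokenise_command()))
```

Calling `tokenise("ja, ja")` corresponds to the first `echo` example; it fails with an exception if the tool or the compiled tokeniser is missing.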

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

Unknowns are made of:

    • lower-case ASCII
    • upper-case ASCII
    • ASCII digits
    • select symbols
    • Combining diacritics as individual symbols,
    • various symbols from Private area (probably Microsoft), so far:
    • U+F0B7 for “x in box”

Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
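The distinction between unknown word-like forms and unmatched strings can be approximated outside pmatch. The following is only a sketch of the character inventory described for unknowns: the "select symbols" are not specified here and are left out, the combining diacritics are assumed to be the Unicode block U+0300 to U+036F, and the names `UNKNOWN_WORD` and `classify` are hypothetical. It applies only to strings that the morphology itself does not analyse.

```python
import re

# Assumption: an approximation of the inventory described for unknowns.
UNKNOWN_WORD = re.compile(r"[A-Za-z0-9\u0300-\u036F\uF0B7]+")


def classify(token):
    """Return 'unknown-word' for word-like tokens, otherwise 'unmatched'."""
    if UNKNOWN_WORD.fullmatch(token):
        return "unknown-word"
    return "unmatched"


for token in ["ukjend", "3", "\uf0b7", "ц"]:
    print(repr(token), classify(token))
```

A token made only of characters outside this inventory falls into the unmatched group, which is the case hfst-tokenise is left to handle.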

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for myv

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for myv

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript