Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-myv
All doc-comment documentation in one large file.
This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.
BOS EOS
Sets for parts of speech
LEFT RIGHT because of apertium
CLBfinal
Dialect homonyms of Sg Gen Def
foreign
noun phrase heads
Upper and lower case
This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.
the set INITIAL for initial letters INITIAL
ADLVCASE
NOT-V
MOOD-V
Homonymy for subject conjugation and subject-object conjugation with Pl3 object
VFIN
VFIN-POS
кортамс мезде
words that go with эрьва
for кизэ homonymy PxSg2
for кизэ homonymy PxSg1
This will be expanded for homonymy at first
This will be expanded for homonymy at first, i.e., diminutives
these have homonyms
used with Dat PxSg1
2VDerTag 2NDerTag
DerTag
Pl Nom Def is homonymous with verb stems in тне-мс. This is relevant for Clt/Cop with ScPl1 and ScPl2
in SP Gen Indef the next word can be кель
2023_03_15 important part of regular inflection
This (part of) documentation was generated from src/cg3/disambiguator.cg3
Sets for POS sub-categories
Sets for Semantic tags
Sets for Morphosyntactic properties
negation marker for fits between negation and conneg
MOOD-V
Erzya and Moksha
this needs Moksha, too.
finite auxiliary verbs with
макссь чарькодемс, Deal with DATAUX separately; they also take MS
finite auxiliary taking supine MO/ME
@+FAUXV : finite auxiliary verbs
@-FAUXV : non-finite auxiliary verbs
finite supaux 2023_03_13
This (part of) documentation was generated from src/cg3/functions.cg3
Adjectives and other parts of speech in Erzya are compared by means of either a particle or ablative case marking on the standard of comparison.
ordinals in -це
истямо:истя
кондямо:кондя
кодамо:кода кодамо:кода кодатнэ кодатне
This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc
The Erzya language postpositions can be broken into many subgroups according to morphological and semantic criteria
Some of the nouns have defective paradigms: € кудыкельганть
ало:ал alo-SPAT-1Arg
This allows for possessor indices, word end or focus e.g. вельде, вельдеяк, вельдензэ ?вельдензэль, вельдензэтне
This allows for word end, possessor indices, predication
postposition that is in ablative case алдо:алдо
postposition that is in elative case потсто:потсто
postposition that is in illative case эземс:эзем
postposition that is in illative case эйс:э
postposition that is in inessive case эйсэ:эйсэ, кисэ
postposition that is in lative case ютков:ютков
postposition that is in locative case ало:ало
postposition that has no continuation пачк
postposition that is in ablative case алдо:алдо
postposition that is in elative case потсто:потсто
postposition that is in illative case малас:мала
postposition that is in illative case малас:мала
postposition that is in illative case потс:пот
postposition that is in illative case эйс:э
postposition that is in inessive case потсо:потсо
postposition that is in lative case алов:ало
postposition that is in locative case ало:ало
postposition that is in prolative case перька:перька
+Temp: K ; перть
+Ela+Temp: PO_POSS_OR_END_FOC ; пингстэ
This (part of) documentation was generated from src/fst/morphology/affixes/adpositions.lexc
The Erzya language adverbs do not compare.
Not a real particle; it can take a clitic седеяк
LEXICON ADV-SPAT_ пачк
LEXICON ADV_IS_LAT алов
LEXICON ADV_IS_LOC ало
LEXICON ADV/PO/PRON-SPAT_ALO ало:ал
LEXICON ADV-SPAT_ALO ало:ал
“стядо”
spatial adverbs dependent and independent case marking
This marking would indicate a word form that may be
This (part of) documentation was generated from src/fst/morphology/affixes/adverbs.lexc
The Erzya language interjections…
This (part of) documentation was generated from src/fst/morphology/affixes/interjections.lexc
Non-Verbal conjugation
In the Erzya language nominals and adverbs also conjugate
Used with deverbals
This is where adjectives get their plural T.
used with infinitives
Conjugation
Conjugation
_KAL-NomSg-Conjugation-only
This allows Clt/Cop+Prs Sg1|Sg2|Pl1|Pl2 Clt/Cop+Prt2 Sg1|Sg2|Sg3|Pl1|Pl2|Pl3 K 2019-01-26
_KUDO-NomPl-Conjugation-only
_KUDO-NomPl-Conjugation-only-mutual
Are there copula verb combinations? 2024-08-06
This (part of) documentation was generated from src/fst/morphology/affixes/nonverbalConjugation.lexc
Nouns in Erzya inflect for number, case and declension (definite, indefinite and possessive).
LEXICON N_PELE пеле:пель, ало:ал
LEXICON N_T1 кель:кель %^Ь2ZERO
LEXICON N_KEL1 кель:кель %^Ь2ZERO
LEXICON N_LOMAN1 ломань:ломань %^Ь2ZERO
LEXICON N_OZIM1 озимь:озимь %^Ь2ZERO
LEXICON N_RUF1 озимь:озимь %^Ь2ZERO
LEXICON N_RECH1 озимь:озимь %^Ь2ZERO
LEXICON N_VESHCH1 озимь:озимь %^Ь2ZERO
LEXICON N_PEJ кель:кель %^Ь2ZERO
LEXICON N_SODYJ сода%>%{иы%}й, содый
кардаз:карда
панго:панг
потмо:пот
Front vowel, non-palatal consonant before vowel
Front vowel, palatal consonant before vowel
Front vowel, non-palatal consonant before vowel
Does this need a diminutive?
NMN
LEXICON NMN_SAN сан:сан
LEXICON NMN_KEL1 кель:кель %^Ь2ZERO
LEXICON NMN_LOMAN1 ломань:ломань
LEXICON NMN_PEJ кель:кель %^Ь2ZERO
** TMP-INDEF ; ** Check this
**LEXICON NMN_KUDO-PL ** This needs checking 2013-03-27
harmony: front
DERIVATION
**+SP+Gen+Indef:%>%{оеэØ%}нь%> N2Dem-SE ; ** ь retention through double %>%>
**+Pl+Gen+Def:%>тнЕ3%>нь%> N2Dem-SE ; ** ь retention through double %>%>
**+SP+Gen+Indef:%^Ь2ZERO%>ень%> N2Dem-SE ; ** ь retention through double %>%>
**+Pl+Gen+Def:%>тне%>нь%> N2Dem-SE ; ** ь retention through double %>%>
**+SLoss+Sg+Ine+Def+Use/NG+Err/Orth+Dial/NW:%>SLossс%{оэØ%}%>сть%> N2Dem-SE ; ** ь retention through double %>%>
**+Sg+Ine+Def+Use/NG+Err/Orth+Dial/NW:%>%{оеэØ%}%>с%{оэØ%}%>сть%> N2Dem-SE ; ** ь retention through double %>%>
pango:pang
N_KUDO-Def-Declension
N_KUDO-Def-Declension
N_KUDO-Def-Declension
кал+N+Sg+Nom+Def
кал+N+Sg+Nom+Def+Foc/Гак
★калосьгак: кал+N+Sg+Nom+Def+Foc/Гак
(is not standard language)
пакся+N+SP+Ine+Indef+Der+Der/MWN+N+Sg+Nom+Def
пакся+N+SP+Ine+Indef+Der+Der/MWN+N+Sg+Nom+Def+Foc/Add
★паксясосьгак: пакся+N+SP+Ine+Indef+Der+Der/MWN+N+Sg+Nom+Def+Foc/Add
(is not standard language)
кал+N+Sg+Gen+Def
калонтькак: кал+N+Sg+Gen+Def+Foc/Гак
SG-NOM-INDEF_LAK ;
SG-NOM-INDEF_KAL ;
SG-NOM-INDEF_OSH ;
кал+N+SP+Abl+Indef
кал+N+SP+Abl+Indef+Foc/Гак
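The example analyses above all follow the same surface shape: a lemma followed by `+`-separated tags, sometimes preceded by a word form and a colon. As a rough illustration only, the following Python sketch splits such strings. The function names `split_analysis` and `parse_example` are hypothetical and not part of the lang-myv sources, and treating a leading ★ as a simple boolean flag is an assumption about how the examples are marked.

```python
def split_analysis(analysis: str):
    """Split 'lemma+Tag+Tag...' into (lemma, [tags])."""
    lemma, *tags = analysis.strip().split("+")
    return lemma, tags


def parse_example(line: str):
    """Parse 'wordform: analysis' into a dict.

    Assumption: a leading ★ marks the example in some way (the source
    follows such lines with '(is not standard language)'); it is only
    recorded here as a flag, not interpreted.
    """
    wordform, _, analysis = line.partition(":")
    wordform = wordform.strip()
    starred = wordform.startswith("★")
    lemma, tags = split_analysis(analysis)
    return {
        "wordform": wordform.lstrip("★").strip(),
        "starred": starred,
        "lemma": lemma,
        "tags": tags,
    }


if __name__ == "__main__":
    example = parse_example("калонтькак: кал+N+Sg+Gen+Def+Foc/Гак")
    print(example["lemma"], example["tags"])
```

Tags such as `Foc/Гак` or `Der/MWN` stay intact because only `+` is used as a separator. Lexc-side entries such as `мезе+Pron:мезИ2`, where the colon separates the upper and lower sides, are not handled by this sketch.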
NP head ellipsis declension, Modifiers without nouns = MWN
Nouns1S_A
Possessor indices
The Erzya language possessor indices or possessive suffixes may be followed by a number of morpheme types
These are possessor indices that can be followed by predicate marking; in the present there is no distinction between ScSg3 and ScPl3.
Possessor indices allowing (1) #, (2) Foc, (3) Der/Pr ()
This appears with kindred terminology
Is “_KAL” necessary ?
DAT-PXPL1 ;
These are possessor Indices for non-nominative singular NonNomSg
word boundary or focus
This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc
Erzya pronouns inflect in many of the same cases as regular nouns.
+Sem/Hum+Sg+Nom:е ENDLEX ; кие:ки
+Sem/Obj: CLT/COP_SG ; singular
мон:мо
тон:то
сон:со
минь:
тынь:ты
сынь:сы
Obligatory Possessor Index
Demonstrative
Interrogative
What should be done
кона:кона This is not the same as indefinite PronRel-kona
What should be done
LEXICON PRON-IS-INTERR-SPAT-INE косо
What should be done
Relative pronouns
ки:ки
ки
мезе+Pron:мезИ2 Misc_Pronouns1 ; мезе+Pron+Rel+Gen:мень K ; ки+Pron:ки Misc_Pronouns1 ;
Some pronoun continuations have been moved here out of TestLexc-noun.txt
This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc
Proper noun inflection
Erzya proper nouns inflect in the same cases as regular nouns.
Андрей:Андре
Вили:Вил
Russian type Surnames Абдеев:Абдеев
Багрий:Багр
Аморский:Аморск
Front-vowel stem
DECLENSION LIMITATIONS
This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc
Quantifier inflection
Erzya quantifiers inflect in many of the same cases as regular nouns.
кавонст
омбонст
кавонест is a pronoun like the Finnish molemmat. This means a radical increase in the Erzya pronoun inventory: 6 x for each numeral 2 and above
кавксоненек
once, twice; весть, кавксть, аламоксть
twofold, threefold; веенькирда, кавонькирда, колмонькирда
васенцеде advmod:multimprf > advmod:ordimprf
васняяк ‘first of all’
Numeral with a range limitation to adnominal phrase
2012-08-09
This (part of) documentation was generated from src/fst/morphology/affixes/quantifiers.lexc
This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc
Erzya language verbs inflect for person, subject and object.
OBJECT FLAGS AND +V tags а+V:а
**LEXICON V-AUX-NEG-PRT1 ** а+V:эзь
**LEXICON TV_KADOMS **
**LEXICON TV_NEVTEMS_SUB **
**LEXICON TV_SAVTOMS_SUB **
**LEXICON TV_SAVTOMS **
**LEXICON TV_SAVTOMS-SG3_SUBJ/ZERO **
**LEXICON TV_CHACHTOMS **
**LEXICON TV_KUNDAMS_SUB **
**LEXICON TV_KUNDAMS **
**LEXICON TV_SATOMS **
**LEXICON TV_TUEMS **
**LEXICON TV_TEEMS **
VERBS WITH THIRD PERSON OBJECTS @U.CONJ-PX.13@
VERBS WITH INTRANSITIVE TAGS +V
теемс:тей теемс:тей
no deverbals
no deverbals
no deverbals
LEXICON TV_NEKSHNEMS Alternates with TRA
This is fed by actors and participles in N_myv, A_myv and Prc_myv
Indicative Preterite I
Indicative NonPast
redo conj 2012-11-07 begin
redo conj 2012-11-07 end
begin
end
2012-11-09
Given in Grammar 2000
Used with deverbals
ваномс+V+Imprt+ScPl2+Clt/Ga: look/katsoa
This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc
The Erzya language clitics…
END
This (part of) documentation was generated from src/fst/morphology/clitics.lexc
This file documents the phonology.twolc file
ӓ Ӓ ҥ Ҥ і І ѳ Ѳ Pre-Soviet 1930s letters
%^ӓ4:е пелемс:п^ӓ4ль
идиса, идима ашоян disallow о:0
вт%{оеэ%}мО1
%{ЕØ%}:0 Stem-final archiphoneme тинге
%^H:0 used with stems in ч, ш, ж for hard plurals
%{frontHard%}:0 - front harmony hard
%{frontSoft%}:0 - front harmony soft
%{back%}:0 - back harmony
%{backHard%}:0 - back harmony
%{ichPat%}:0 - for triggering colloquial patronymic forms
%^OldAE:0 - This allows Ӓ4 and Ӓ3 to be realized as я
%^NoLinkVow:0 - No linking vowel is used only after consonants for error
verbStemVowStrong:0
Ӓ3 Ӓ4 as я
A1:o
Y2:yi
%{оеэ%}:е неемс+V+Ger+Ill+PxPl1: –see/nähdä–
%{оеэ%}:о псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa
%{оеэ%}:э
%{оеэØ%}:0 %{оеэØ%}:е панемс+V+Ind+ConNeg: drive/ajaa
вадемс+V+Der/Ovt+Prc/Telic+Sg+Nom+Def: the greased one/
%{оеэØ%}:э кев+N+SP+Ill+PxSg2: rock/kivi
%{оеэØ%}:о ков+N+SP+Ill+PxSg2: moon/kuu
%{уиыØ%}:и панемс+V+Inf+Dial/NW: drive/ajaa
%{уиыØ%}:ы кев+N+SP+Ill+PxSg2: rock/kivi
%{уиыØ%}:у ков+N+SP+Ill+PxSg2: moon/kuu
ков0%>уз%>ут
O1:e
O1:o
%{оэØ%}:e
тев+N+Sg+Nom+PxSg3+Err/Orth-no-linking-vowel: thing/juttu
%{оэØ%}:o
псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa
%{оэØ%}:0
O1:0
%{ое%}:е
%{ое%}:о
A2:a
путомс+V+Prec+ScSg2: put/laittaa
и:ы
j:0
**Е3:э always ** %> т н _ 2013-02-23
**Е3:э sometimes ** %> т н _ 2013-02-23
**ye:e always **
сыр
Н1:н
Н1:к
а: и Dimin
о: ы Dimin
у: и Dimin
о regressive raising у озномс+V+Ind+Prs+ScSg1+OcSg3+Dial/NW: bless/siunata
э: и Dimin
а: и Dimin
о: и Dimin
у: и Dimin
я: и Dimin
ё: и Dimin
ю: и Dimin
е: и Dimin
a:ya
n loss with plural ведун+N+Pl+Indef: knower/tietäjä
v:0
G1:0
G1:g
G1:k
G2:g
G2:k
G4:0
саемс+V+Ind+Prs+ConNeg+Clt/Ga:
G4:k
потмо+N+Relator+SP+Ela+Indef: inside/sisäosa
imperative suffix K1:t
лыказевемс+V+Imprt+ScSg2: have taken
K1:к
ливтемс+V+Prec+ScSg2: set out/laittaa esille
U4:y
кал+N+Sg+Nom+Def: fish/kala
пильге+N+Pl+Nom+Indef: leg; foot/jalka
U4:0
вадемс+V+Der/Ovt+Prc/Telic+Sg+Nom+Def: the greased one/
валдо+N+Pl+Nom+Indef: light/valo
t:d
ловомс+V+Ind+Prs+ScSg1+OcSg2: regard/pitää jonain
s:0
d:t
d:d
y:y
y:0
меремс+V+Ind+Prt1+ScSg3: say/sanoa
Disallow TLoss after non-t
Disallow RegrRaise after A
Disallow vow loss before break
Disallow OldAE when no Ä
Disallow KLoss after non-k
Disallow SLoss after non-s
Disallow %^WLoss after non-v
Disallow Н1:н after Letters
[л | р | н | :Vows (HarmDummies:)] (ь:) %> _ %> %{оеэØ%}: ; |
Disallow soft loss
Disallow soft loss чувто+N+Pl+Nom+Def: tree/puu
веле+N+SP+Tra+PxSg2
псака+N+SP+Abe+PxSg2+Clt/Cop+Prt2+ScPl3+Clt/Gak
псака+N+SP+Abe+PxSg3+Der+Der/MWN+N+SP+Tra+Indef: cat/kissa
веле+N+SP+Tra+PxSg2+Clt/Cop+Prt2+ScPl3: village/kylä
Disallow %^NoLinkVow after vowel
Disallow s for control of stems with inessive…
Disallow dano after non-voiced
Disallow dano after non-voiced
Disallow k for control of comparative with stem types
This (part of) documentation was generated from src/fst/morphology/phonology.twolc
INTRODUCTION TO MORPHOLOGICAL ANALYSER OF ERZYA.
Analysis symbols
The morphological analyses of wordforms of Erzya are presented in this system in terms of the following symbols. (It is highly recommended to follow existing standards when adding new tags.)
+TYÄ WORK HAS TO BE DONE
%
Adjectives
Adverbs
кавксть
, kpv: кыкысь
кавонькирда
Interjections
Nouns
Particles
Postpositions + Spat, + Temp
+Long монень, тонеть; монстень
Quantifiers and Numerals are classified under:
'Don't cry' (Proh);
Аволь мелявтт, кецяк!
'Don't worry, be happy!' (Neg + Imprt)
Other verb forms are
+Subst * deverbal nouns retaining verb arguments/gov
+Err/Orth-cons-stem * пачт | емс 2012 пачтямс |
+Err/Orth-stem-je-should-be-je0 * чудемс+V:чуде чуд | емс (->)чуде | мс |
+Err/Orth-v-loss-before-lab * ольной
+Use/-TTS – never retained in the HFST Text-To-Speech disambiguation tokeniser
+Err/Lex * The lemma is not an Erzya word (Deprecating –+Src/F–)
Delimiter marks are classified with:
The verbs are syntactically split according to transitivity:
Auxiliary verbs
Special multiword units are analysed with:
Non-dictionary words can be recognised with:
Question and Focus particles:
Semantic tags to help disambiguation & synt. analysis: (before POS) Borrowed from main/langs/sme/src/morphology/root.lexc
Multiple Semantic tags:
Semantics are classified with
Semantic Fields
Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.
+Der/PatrFem Female patronymic
Modifier without noun
+Indef
in indefinite pronouns+Indef
in indefinite pronouns+Indef
in indefinite pronouns+Indef
in indefinite pronouns+Indef
in indefinite pronouns+Indef
in indefinite pronouns ковия, зярыя
+Der/njems verb2verb derivation
+Der/Oncje old orth кудонцесь
+Der/ks Adv›N
+OLang/UND - Undefined
To represent phonologic variations in word forms we use the following symbols in the lexicon files:
And the following triggers to control variation:
%{ichPat%} - for triggering colloquial patronymic forms
%^CnsRM — Remove consonant
Ӓ4 пелемс:пӒ4ль
%^ӓ4 пелемс:п^ӓ4ль
%^Ь2ZERO removes stem-final soft sign
%{дт%} in ablative
вт%{оеэ%}мО1 suffix-internal archivowel
%^OldAE — This allows Ӓ4 and Ӓ3 to be realized as я
Development tag
+WORK
+NoVowX
%0
%-
Compounding
+Intensifier уш
Focus clitics
+v24
+ACC +DAT +COM This marks a function not a morpheme
(written with square brackets, see the root.lexc file)
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:
Flag diacritic | Explanation |
---|---|
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
Flag diacritic | Explanation |
---|---|
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
Flag diacritic | Explanation |
---|---|
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
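To make the P, D, C, U and R prefixes used in this file more concrete, here is a minimal Python sketch of how such flag diacritics are commonly interpreted along a path through a transducer (positive set, disallow, clear, unify, require). It is an illustration under stated assumptions, not the HFST implementation: the names `apply_flag` and `path_allowed` are hypothetical, the N (negative set) operator is left out, and the flag sequences in the usage note are invented for the example rather than taken from the lexicon.

```python
import re

# Matches e.g. @P.NeedNoun.ON@ (with a value) or @C.NeedNoun@ (without one).
FLAG_RE = re.compile(r"@([PDRCU])\.([^.@]+)(?:\.([^.@]+))?@")


def apply_flag(state, flag):
    """Apply one flag diacritic; return the new state, or None if the path is blocked."""
    match = FLAG_RE.fullmatch(flag)
    if match is None:
        raise ValueError(f"not a supported flag diacritic: {flag!r}")
    op, feature, value = match.groups()
    current = state.get(feature)  # None means the feature is unset
    new_state = dict(state)
    if op == "P":  # positive set: overwrite the feature value
        new_state[feature] = value
    elif op == "C":  # clear: forget the feature value
        new_state.pop(feature, None)
    elif op == "D":  # disallow: block if the feature has this value (or any value)
        blocked = current is not None if value is None else current == value
        if blocked:
            return None
    elif op == "R":  # require: block unless the feature has this value (or any value)
        satisfied = current is not None if value is None else current == value
        if not satisfied:
            return None
    elif op == "U":  # unify: set if unset, otherwise the values must agree
        if current is None:
            new_state[feature] = value
        elif current != value:
            return None
    return new_state


def path_allowed(flags):
    """Return True if the sequence of flags can be traversed without being blocked."""
    state = {}
    for flag in flags:
        state = apply_flag(state, flag)
        if state is None:
            return False
    return True


if __name__ == "__main__":
    print(path_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
    print(path_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```

In this reading, a path that sets `NeedNoun` and later meets `@D.NeedNoun.ON@` is blocked unless `@C.NeedNoun@` clears the feature in between, which matches the "unless nominalised" wording in the table. Where exactly the lexicon places these flags is not shown here.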
Flags used to identify parts of speech
@U.POS.A@
Flags used with +Clt/Cop nonverbal predication
Flags used with transitivity
@U.CONJ-INF.NO@
@U.CONJ-TX.PRT2@
@U.CONJ-MX.COND@
@U.CONJ-CONNEG.NO@
@U.CONJ-NX.SG@
@U.CONJ-POSS.3ACC@
@U.CONJ-PX.36@
@U.CONJ-PX.46@
@U.CONJ-PX.56@
@U.CONJ-PX.66@
@R.CONJ-PX.36@
@R.CONJ-PX.46@
@R.CONJ-PX.56@
@R.CONJ-PX.66@
@R.TLOSS.ON@
@P.PossPx.Pl3@
@U.PossPx.SP3@
@U.PossPx.Pl3@
problematic
@C.TPERS@
@U.CX.TEMP@
@C.CX@
@C.DNUM@
@C.NUM@
This allows or disallows combining with hyphen through loop especially for acronyms 2012-11-04
This disallows secondary compounding
Linking vowel for use with Translative
@C.LV@
@C.CONJ-POSS@
@U.DECL-CX.CMP@
Removal
Flag diacritic | Explanation |
---|---|
@U.number.one@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.two@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.three@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.four@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.five@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.six@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.seven@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.eight@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.nine@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.zero@ | Flag used to give arabic numerals in smj different cases ; |
The word forms in Erzya start from the lexeme roots of basic word classes, or optionally from prefixes.
Here follow all contlexes, approx. 20.
CyrillicFemaleName ; HUNSPELL Type name derivation RussianMalenamesDerive ; ! RussianSurnamesDerive ;
увол-авол
alo-SPAT-1Arg ; >PO_KAL-LOC
This (part of) documentation was generated from src/fst/morphology/root.lexc
This is where new words are added as lexc entries before they are added to the xml source files.
од:од A_KAL "(eng) /(fin)/(rus) " ;
ADD ADJECTIVES BELOW
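The new-word files all use the same one-line entry shape: an upper side (lemma, optionally with tags), a colon, the stem, a continuation lexicon, an optional quoted gloss, and a closing semicolon. The Python sketch below shows one way to pull those fields apart for inspection. The name `parse_entry` and the regular expression are assumptions made for this illustration and are not part of the lang-myv tooling. Lexc escapes such as `%-` or an escaped colon are not handled.

```python
import re

ENTRY_RE = re.compile(
    r'^(?P<upper>[^\s:]+)'         # lemma plus any tags, e.g. ливтевкшнемс+V
    r'(?::(?P<lower>\S*))?'        # optional stem after the colon
    r'\s+(?P<contlex>[^\s;"]+)'    # continuation lexicon, e.g. A_KAL
    r'(?:\s+"(?P<gloss>[^"]*)")?'  # optional quoted gloss
    r'\s*;\s*$'                    # entry terminator
)


def parse_entry(line):
    """Return the fields of a simple lexc stem entry as a dict, or None if no match."""
    match = ENTRY_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_entry('од:од A_KAL "(eng) /(fin)/(rus) " ;'))
```

Lines that are not entries, such as the "ADD ADJECTIVES BELOW" marker, do not end in a semicolon and therefore return `None`.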
This (part of) documentation was generated from src/fst/morphology/stems/adjectives-russian-like_newwords.lexc
This is where new words are added as lexc entries before they are added to the xml source files.
эрзя-мокшонь:эрзя-мокшонь A_IS_GEN "(eng) /(fin) /(rus) " ;
ADD ADJECTIVES BELOW
This (part of) documentation was generated from src/fst/morphology/stems/adjectives_newwords.lexc
This is where new words are added as lexc entries before they are added to the xml source files.
лембстэ:лембстэ ADV_ "(eng) /(fin) /(rus) " ;
ADD ADVERBS BELOW
This (part of) documentation was generated from src/fst/morphology/stems/adverbs_newwords.lexc
Exceptions are quite strange word forms: the ones that do not fit anywhere else. This file contains all enumerated word forms that cannot reasonably be created from lexical data by regular inflection. Usually there should be next to no exceptions; it is always better to have a paradigm that covers only one or a few words than an exception, since exceptions will not work nicely with e.g. the compounding scheme or possibly many end applications.
verbs of negation have partial inflection: € аволь € иля € эзь
The verb ярсамс has additional irregular forms: € ярстано € ярстадо
The verb сеземс
Some of the nouns have archaic consonant stem forms left: € ийть
Peripheral
Some random Russian elements:
Some of the nouns have special forms for Gen PxSg1 and PxSg2:
Reciprocal pronouns These might be done with flags
These two stems have м loss, but its presence can be observed in the choice of “тнэ” over “тне”.
This has special hard after lost consonant
1930s Phonetic transcription дс » ц гт » к
мекевлангт+Adv+Use/NG+Err/Orth:мекевланг K ;
Half way between morphology and phonetics with a Russian twist
are dealt with as adverbs
сисем+Num+Ord:сисеме NUMORD_KUDO ; This is irregularly formed, cf. сисемце
100 % homographs of Russian words
adjectives in ой Adj-od » A_RU-OJ with +Use/SpellNoSugg
+SP+Gen+Indef attributes as adjectives
Russian language words found in Erzya texts
Old Bible Names and words
RUSSIAN VERBS
unrecognized
Problems with synchronization missing lemmas
COLLECTIVE NOUNS
This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc
This is where new words are added as lexc entries before they are added to the xml source files.
Ботужале+N+Prop+SP+Gen+Indef:ботужале A_IS_PROP_GEN ;
ADD ADJECTIVES BELOW
This (part of) documentation was generated from src/fst/morphology/stems/genitive_attributes.lexc
These are nouns with parallel declension
ават%-тейтерть аванзо-тетянзо ават%-цёрат атявтт%-ававтт атят%-ават атят%-бабат атят%-сэрдят бабат%-нуцькат барант%-каткат боярт%-азорт боярт%-боярават
вирть%-лугат вирть%-паксят вирть%-укшторт ворт%-грабительть ворт%-розбойникть эрзят%-мокшот
This (part of) documentation was generated from src/fst/morphology/stems/hyphenated-nouns.lexc
These are verbs with parallel conjugation
авардемс%-авардемс ардомс%-ардомс ардтневтемс%-ардтневтемс арсемс%-арсемс аштемс%-аштемс ванномс%-ванномс ваномс%-ваномс вешнемс%-вешнемс
авардемс%-теемс арсемс%-теемс аштемс%-теемс ванномс%-теемс ваномс%-теемс
андомс%-симдемс аштемс%-учомс велямс%-чарамс вастомс%-дёлямс васькамс%-оймамс витнемс%-петнемс ёмавтомс%-аравтомс ярсамс%-симемс
витнемс%-ютавтомс
This (part of) documentation was generated from src/fst/morphology/stems/hyphenated-verbs.lexc
-kal
-osh
-kudo
-kal
-osh
-kudo
Place names, Settlements
Rivers
This (part of) documentation was generated from src/fst/morphology/stems/myv-propernouns.lexc
This is where new words are added as lexc entries before they are added to the xml source files.
автор:автор N_KAL "(eng) /(fin) /(rus) " ;
ADD NOUNS BELOW
This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc
This is where new words are added as lexc entries before they are added to the xml source files.
автор:автор N_KAL "(eng) /(fin) /(rus) " ;
ADD NOUNS BELOW
This (part of) documentation was generated from src/fst/morphology/stems/propernouns_newwords.lexc
The derivable male given names have been moved to the template urj-Cyrl-propernouns.lexc.
This (part of) documentation was generated from src/fst/morphology/stems/rusMaleNameDer.lexc
This is where new words are added as lexc entries before they are added to the xml source files.
ливтевкшнемс+V:ливтевкшне TV_KUNDAMS "(eng) /(fin) /(rus) " ;
ADD VERBS BELOW
These verbs just need Finnish translations A-M
N-End
This (part of) documentation was generated from src/fst/morphology/stems/verbs_newwords.lexc
retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628
bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614
alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\
labiodental approximant P (or v\)
alveolar approximant r\
retroflex approximant r\`
velar approximant M\
retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L\
Clicks
bilabial O\ (O = capital letter)
dental |\
(post)alveolar !\
palatoalveolar =\
alveolar lateral |\|\
Ejectives, implosives
ejective _> e.g. ejective p p_>
implosive _< e.g. implosive b b_<
Vowels
close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U
close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7
schwa @ ə
open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O
ash (ae digraph) {
open schwa (turned a) 6
open front rounded &
open back unrounded A
open back rounded Q
Other symbols
voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\
alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _
Suprasegmentals
primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -\
Tones and word accents
level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)
contour, rising _R
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
global rise <R>
global fall <F>
voiceless _0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=) e.g. n= (or n_=)
non-syllabic _^
rhoticity `
breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j) e.g. t' (or t_j)
velarized _G
pharyngealized _?\
dental _d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A_~)
nasal release _n
lateral release _l
no audible release _}
velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
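Several rows of the X-SAMPA table above give the IPA character together with its hexadecimal and decimal code point, so those rows can be checked mechanically. The Python sketch below copies a handful of single-character rows from the table, verifies that the three values agree, and uses them for a naive character-by-character transliteration. The names `ROWS`, `row_is_consistent` and `transliterate` are hypothetical, and the sketch says nothing about how `txt2ipa.xfscript` itself is implemented. Multi-character symbols such as `N\` are deliberately left out.

```python
# (description, X-SAMPA, IPA, hex code point, decimal code point), copied from the table.
ROWS = [
    ("palatal nasal", "J", "ɲ", "0272", 626),
    ("velar nasal", "N", "ŋ", "014B", 331),
    ("bilabial fricative, voiced", "B", "β", "03B2", 946),
    ("dental fricative, voiceless", "T", "θ", "03B8", 952),
    ("dental fricative, voiced", "D", "ð", "00F0", 240),
    ("postalveolar fricative, voiceless", "S", "ʃ", "0283", 643),
    ("postalveolar fricative, voiced", "Z", "ʒ", "0292", 658),
]


def row_is_consistent(row):
    """Check that the IPA character, hex value and decimal value all agree."""
    _, _, ipa, hex_cp, dec_cp = row
    return ord(ipa) == int(hex_cp, 16) == dec_cp


# Lookup table restricted to the single-character symbols listed above.
XSAMPA_TO_IPA = {xsampa: ipa for _, xsampa, ipa, _, _ in ROWS}


def transliterate(text):
    """Replace each listed X-SAMPA character with its IPA counterpart; keep the rest."""
    return "".join(XSAMPA_TO_IPA.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], row_is_consistent(row))
    print(transliterate("SaN"))
```

A real converter would need longest-match handling for symbols written with a backslash, backtick or underscore, which a per-character lookup cannot provide.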
This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript
We describe here how abbreviations in Erzya are read out, e.g. for text-to-speech systems.
For example:
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc
ERZYA GRAMMAR CHECKER
Upper and lower case
Sets for parts of speech
Sets for POS sub-categories
Sets for Semantic tags
Sets for Morphosyntactic properties
Sets for Derivation
This will be expanded for homonymy at first
This will be expanded for homonymy at first, i.e., diminutives
used with Dat PxSg1
2VDerTag 2NDerTag
DerTag
This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3
Usage:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
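Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a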
Unknowns are made of:
Unknowns are tagged ?? and treated specially with hfst-tokenise
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG, they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
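To illustrate what "empty readings" means for downstream processing, the Python sketch below reads a CG-style stream (a `"<wordform>"` line followed by indented reading lines) and lists the cohorts that carry no tagged reading. The sample stream is hand-written for this example and is not real tokeniser output. The names `Cohort`, `parse_cg_stream` and `untagged_wordforms` are hypothetical, and the exact shape of unknown-word output from `hfst-tokenise --giella-cg` may differ from what is assumed here.

```python
import re
from dataclasses import dataclass, field

WORDFORM_RE = re.compile(r'^"<(.*)>"\s*$')         # cohort line: "<wordform>"
READING_RE = re.compile(r'^\s+"([^"]*)"\s*(.*)$')  # reading line: indented "lemma" tags


@dataclass
class Cohort:
    wordform: str
    readings: list = field(default_factory=list)  # list of (lemma, [tags])


def parse_cg_stream(text):
    """Group reading lines under the wordform line that precedes them."""
    cohorts = []
    for line in text.splitlines():
        match = WORDFORM_RE.match(line)
        if match:
            cohorts.append(Cohort(match.group(1)))
            continue
        match = READING_RE.match(line)
        if match and cohorts:
            lemma, tags = match.group(1), match.group(2).split()
            cohorts[-1].readings.append((lemma, tags))
    return cohorts


def untagged_wordforms(cohorts):
    """Assumption: a cohort without any tagged reading is treated as unknown."""
    return [c.wordform for c in cohorts if not any(tags for _, tags in c.readings)]


# Hand-written sample in CG stream format (not actual hfst-tokenise output).
SAMPLE = '"<ja>"\n\t"ja" CC\n"<ukjend>"\n'

if __name__ == "__main__":
    print(untagged_wordforms(parse_cg_stream(SAMPLE)))
```

Handling unknowns in the tokeniser, as the text recommends, means a consumer like this rarely has to guess. The sketch only shows why a reading with a baseform but no tag gives a rule nothing to test.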
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
TODO: Could use something like this, but built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
TODO: Could use something like this, but built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get
This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript