Eastern Mari language model documentation
All doc-comment documentation in one large file.
src-cg3-dependency.cg3.md
!!!C O M M O N S Á M I D E P E N D E N C Y G R A M M A R
This dep file is for sma, sme, smj, sje.
!!!DELIMITERS
Sentence delimiters are the following: <.> <!> <?> <…> <¶>
!!!TAGS AND SETS
N V A Adv CC CS Inf Sup Neg Num Po Pr
Pcle Prop
Pron IV TV COMMA DASH CITATION to keep colouring we add a “ HYPHEN QMARK PUNCT LEFT RIGHT CLB Ind Pot Impr ImprtII Cond ConNeg Caus causative eus VGen Interj ABBR ACR Prs Prt Cmpnd RCmpnd PrfPrc PrsPrc Actor Actio Ger Indef Nom Acc Ill Com Gen Ess
IM For fao
!!POS sub-categories
!!Syntactic tags and sets
!Syntactic tags in input to this file
!Syntactic tags added in this file
- @FMV : finite main verb ** oaidná: Son oaidná ollislaš gova. - She sees the whole picture
- infinite main verb
- @FAUX : finite auxiliary verb ** ferte: Son ferte oaidnit ollislaš gova. - She must see the whole picture.
- @FMVdic : finite main verb introducing direct speech
- @IMVdic : infinite main verb introducing direct speech
- @FS-IMV : infinite main verb of subclause
- @FS-IAUX : infinite auxiliary verb in subclause
- @FS-N<IAUX : infinite auxiliary verb of a relative subclause
- @FS-N<IMV : infinite main verb of a relative subclause
- @FS-OBJ : finite verb in subclause functioning as object
- @FS-OBJ> : finite verb in subclause functioning as object
- @FS-<OBJ : finite verb in subclause functioning as object
- @FS-SUBJ : finite verb in subclause functioning as subject
- @FS-SUBJ> : finite verb in subclause functioning as subject
- @FS-<SUBJ : finite verb in subclause functioning as subject
- @FS-ADVL> : finite verb in subclause functioning as adverbial to the left of the main clause
- @FS-<ADVL : finite verb in subclause functioning as adverbial to the right of the main clause
- @S< : a clause modifying a sentence to the right of it
- @FS-ADVL : finite verb in subclause …
- @-FS-<ADVL : infinite subclause - eus
- @-FS-ADVL> : infinite subclause - eus
- @FS-N< : relative clause to N
- @FS->N : relative clause to N to the left side of it - eus
- @FS-VFIN< : finite verb in sentence, statement ** eai: Idja ii leat šat, eai ge sii dárbbaš lámppá dahje beaivváža čuovgga, dasgo Hearrá Ipmil lea sin čuovga. - The night is not anymore, they do not need the lamp- or day- light either, because God the Lord is their light.
- @FS-<APP : finite subclause functioning as an apposition
- @ICL-ADVL : non-finite subclause …
- @ICL-AUX< : “right” argument of auxiliary (?)
- @ICL-OBJ : infinitival clause object
- @ICL-SUBJ : infinitival clause subject
- @ICL-P< : infinitival clause complement of preprosition
- @IAUX : non-finite auxiliary
-
: main verb. A temporarily tag omitted in the end of the file. -
: auxilary verb. A temporarily tag omitted in the end of the file.
!fao syntags
- @>V
!kal syntags
- @INS :
- @<INS :
- @INS> :
!eus syntags
- @FS-SPRED : finite verb in subclause functioning as a subject predicate - eus, but not sure if in use
!Syntactic set definitions
!!!Dep grammar
Correction rules
-
muitalit
-
XX
-
XX
-
XX
-
faoSumId=Rel
!!The finite verb
!!!Mapping rules
!! lgRemove removes the language tags
This (part of) documentation was generated from src/cg3/dependency.cg3
src-cg3-disambiguator.cg3.md
This is the Eastern Mari disambiguation file. It chooses the correct morphological analyses in any given sentence context.
The file first defines sentence delimiters and tags and sets. Thereafter come the rules, each rule is listed below.
Sentence delimiters
The delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent
The Tags section lists all the tags inherited from the fst, and defines them for use in the syntactic analysis. The tags are documented in the root.lexc file, and here only listed for reference.
The next section, Sets, contains sets defined on the basis of the tags listed here, those set names are not visible in the output.
Tags
Beginning and end of sentence
BOS EOS
Clause boundary
Parts of speech tags
N V A Adv CC CS Interj Pron Num Pcle Clt Po
WORD is the set of all POS
Verbal tense and mood tags
Prs Prt1 Prt2 Fut Imprt Ind Cond Des
Other verbal tags
Act ConNeg FutPrc Ger Inf Nec Neg NegPrc Pass Prc PrfPrc
Verbal person-number tags Sg1 Sg2 Sg3 Pl1 Pl2 Pl3
Numeral tags
Sg Pl
Case tags
Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)
Other nominal tags
Pers Refl Rel Interr Recipr Dem ABBR ACR
Adjective comparison tags
Pos (?) Superl Comp
Possessive suffix tags
PxSg1 PxSg2 PxSg3 PxPl1 PxPl2 PxPl3
Numeral tags
Card Coll Ord Temp (?)
Particles
Qst Foc
Punctuation marks
CLB PUCT LEFT RIGHT COMMA
Derivation tags
Der/MWN Der/sa Der/Pur Der/Caus Der/Nom
Tags for internal testing
CmpTest Err
Sets
- CASE = all cases
- OBLCASE = All cases except Nom
- VFIN = All moods
Der/Date Der/Year Der/Hum Der/Lang Der/Domain Der/Feat-phys Der/Clth Der/Body Der/Act
Sem/Ani Sem/Fem Sem/Group Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt
Rule section
Early, word-internal rules
- CmpTst remove CmpTst if not 1 N
- CmpTst2 Select CmpTst in all other cases *NoFocPossNoun remove Foc/Poss if PxSg3
CC or Pcle
- teveteve1 gives CC if two теве
- teveteve1 gives CC if two теве
-
onlyteve gives Pcle if two теве
-
PcleNotCC Lauseen alussa on Pcle
-
CCnotInterj
-
Posna деч посна
- ikNum ик is never A
Particles
*InterrQ if question mark anywhere to the right
*Interr removes Rel if question mark to the right somewhere
Verbs
Existential ulo
Infinitives
-
Ind selects Ind if no Ind to the right or to the left
-
1SgAgr selects (Ind Sg1) if Pron1Sg to the right or to the left
-
1SgAgr selects (Ind Sg1) if (Pron1Sg Nom) to the right or to the left
-
2SgAgr selects (Ind Sg2) if Pron2Sg to the right or to the left
-
IndAfterInf selects Ind if Inf to the left
-
NotImpWhenInd
-
NotImpWhenWords1
-
NotImpWhenWords2
Adjectives
*RemAdjBeforeProp removes A if Prop to the left
*AdjBeforeMo selects A if Interr to the right
-
AdjBeforeAN selects A if N or A to the right
-
RemN removes N if N to the right
*AdjBeforeConjAdj selects A if conjuction and A to the right ;
-
AdjNotAdv removes Adv if N to the right
-
AdjNotPron removes Pron Pers if N to the right
*AdjNotN removes N if Pron Pers anywhere to the left
- RemAdj1 removes A if no N or A follows
*RemAdj2 removes A if no N or Pron in a clause
Nouns
- lym nalash “to take a name” = “to be given a title”
*RemNomIfPronLeft removes Nom if Pron Nom anywhere to the left
*RemNomIfPronRight removes Nom if Pron Nom anywhere to the right
*NomBeforeConjNom selects N Nom if conjoined with N Nom
*NafterDem selects N if Dem to the left (demonstratives tend to be sole modifiers)
*NotANoun
*NafterAbeforeEOS
*RemNafterAdv removes N if adverb to the left
Derivations
-
RemDerMWN1 removes Der/MWN if N is an option
-
RemDerMWN2 removes Der/MWN if N to the right
-
Dersa if noun follows
-
SelDerMWN select Der/MWN if no noun follows
Cases
-
RemNomNif12left removes Nom with N if there is a verb with 1st or 2nd agreement to the lef
-
RemNomNif12right removes Nom with N if there is a verb with 1st or 2nd agreement to the right
-
AccNeedsVerb prefers Nom (TODO: does this make sense? SASHA: it does but there was a typo, -1* instead of 1* in the third clause of the condition)
Proper nouns
Numerals
- IkNumN ik is num before N Sg
-
IkNumAN ik is num before A N Sg
- KumNumAN ik is num before A N Sg
Pronouns
-
NotImp in most тиде cases
-
NotInterr if Rel
-
Dem if noun follows
-
уке
Conjunctions
Postpositions
- PoNeedsGen removes postposition if the word to the left is not Gen or Nom
Adverbs
- molan awaiting rules for dative verbs subcategorising for mo Dat
Phrases
Verbs
Finite verb or Gerundium
*RemGer removes Ger Gen if there is no verb to the right
-
FinNotGer removes Ger if there is a Ind Prt2 Sg3 in the clause
-
GerNotFin Ger if there is a Ind next
-
GerNotFin Ger if there is a Ger da Ger VFin
First or third person
-
Sg1NotSg3 removes Prt1 Sg3 when Pers Sg1 Nom in same clause
- Sg3NotSg1 removes Prt1 Sg1 when there is no Pers Sg1 Nom in same clause
- This definitely is too strong, it precludes zero Sg1 subjects
ConNeg or not
- NoConNeg1 No ConNeg if no Neg to the left
- NoConNeg2 No ConNeg if another ConNeg to the left
да
-
da1 Adv initially
-
da2 CC elsewhere
и
- iNotAbbr
Interjection
- NoExclNoInterj
Predicative
AifVövny selects A if вӧвны somewhere to the left
Conjunctions
-
NotPcle
-
NoErrOrth
This (part of) documentation was generated from src/cg3/disambiguator.cg3
src-cg3-functions.cg3.md
S Y N T A C T I C F U N C T I O N S F O R S Á M I
Sámi language technology project 2003-2025, University of Tromsø #
This file adds syntactic functions. It is common for all the Saami
LEFT RIGHT because of apertium
-
Sets for POS sub-categories
-
Sets for Semantic tags
-
Sets for Morphosyntactic properties
!!Syntactic tags
- @+FAUXV : finite auxiliary verb ** ferte: Son ferte oaidnit ollislaš gova. - She must see the whole picture.
- @+FMAINV : finite main verb ** oaidná: Son oaidná ollislaš gova. - She sees the whole picture
- @-FAUXV : infinite auxiliary verb ** sáhte: In sáhte gáhku borrat. - I cannot eat cake.
- @-FMAINV : infinite main verb ** oaidnit: Son ferte oaidnit ollislaš gova. - She must see the whole picture.
- @-FSUBJ> : Subject of infinite verb outside the verbal. ** mu: Diet dáhpáhuvai mu dieđikeahttá. - It happened without me knowing about it.
- @-F<OBJ : Subject of infinite verb outside the verbal. ** nuppi: Ulbmil lea oažžut nuppi boagustit. - The goal is to get the other one to laugh.
- @-FOBJ> : Object of infinite verb outside the verbal. ** váldovuoittuid: Sii vurde váldovuoittuid fasket. - They waited to grab the main prizes.
- @SPRED<OBJ : Object of an subsject predicative. (some adjectives are transitive) ** guliid: Mánát leat oažžulat guliid.
- @-FADVL : Adverbial complement of infinite verb outside the verbal. ** várrogasat: Dihkkadeaddji rávve skohtervuddjiid várrogasat mátkkoštit. - The roadman warns snowscooter drivers to drive carefully.
- @-F<PRED : Predicative complement of infinite verb outside the verbal. ** ággan: Jáhkken kulturmáhtu leat oktan ággan.
- @>ADVL : Modifier of an adverbial to the right. ** vaikko: doppe leat vaikko man ollu studeanttat.
- @ADVL< : Komplement for adverbial. ** vahkus: Son málesta guktii vahkus.
- @<ADVL : Adverbial after the main verb. ** dás: Eanet dieđuid gávnnat dás.
- @ADVL> : Adverbial to the left of the main verb ** viimmat: Dál de viimmat asttan lohkat reivve.
- @ADVL>CS : Adverbial modifying subjunction. ** ‘beare’ pointing at ‘danin go’: Muhto dus ii leat riekti dearpat su beare danin go sáhtát.
-
: Habitive, specifying an adverbial, e.g. @ADVL> ** Máhtes: Máhtes lea beana. -
: Extencial, specifying an subject, e.g. @<SUBJ ** beana: Máhtes lea beana. -
: logoforic pronouns, e.g. @>N (for MT) -
: - @>N : Modifier of a noun to the right. ** geavatlaš: Ráđđehussii lea geavatlaš politihkka deaŧalaš. - For the government, practical politics is important.
- @N< : Complement of noun to the left. ** vihtta: Mun boađán diibmu vihtta.
- @>A : Modifier of an adjective to the right. ** juohke: Seminára lágiduvvo juohke nuppi jagi.
- @P< : Complement of preposition. ** soađi: Dat dáhpáhuvai maŋŋel soađi.
- @>P : Complement of postposition. ** riegádeami: Seta riegádeami maŋŋel Áttán elii vel 800 jagi.
- @HNOUN : Stray noun in sentence fragment. ** muittut: Fidnokurssa muittut.
- @INTERJ : Interjection. ** Hei: Hei, boađe!
- @>Num : Attribute of numeral to the right. ** dušše: Mun ledjen dušše guokte mánu doppe.
- @Pron< : Complement of pronoun to the left. ** Birehiin: Moai Birehiin leimme doppe.
- @>Pron : Modifyer of pronoun to the right. ** vaikko: Olmmoš sáhttá bargat vaikko maid.
- @Num< : Complement of numeral to the left. ** girjjiin: Dat lea okta min buoremus girjjiin.
- @OBJ : Object, the verb is not in the sentence (ellipse)
- @<OBJ : Object, the verb is to the left. ** gávtti: Son goarru gávtti.
- @OBJ> : Object, the verb is to the right. ** filmma: Dán filmma leat Kárášjoga nuorat oaidnán.
- @OPRED : Object predicative, the verb is not in the sentence (ellipse).
- @<OPRED : Object predicative, the verb is to the left. ** buriid: Son ráhkada gáhkuid hui buriid.
- @OPRED> : Object predicative, the verb is to the right. ** dohkkemeahttumin: Son oinnii dohkkemeahttumin bargat ášši nu.
- @PCLE : Particle. ** Amma: Amma mii eat leat máksán? - We have not paid, have we?
- @COMP-CS< : Complement of subjunction. ** vejolaš: Dat šaddá nu buorre go vejolaš.
- @SPRED : Subject predicative, the verb is not in the sentence (ellipse).
- @<SPRED : Subject predicative, the verb is to the left. ** árgabivttas: Ovdal lei gákti árgabivttas.
- @SPRED> : Subject predicative, the verb is to the left. ** álbmogin: Sápmelaččaid historjá álbmogin lea duháhiid jagiid boaris.
- @SUBJ : Subject, the finite verb is not in the sentence (ellipse).
- @<SUBJ : Subject, the finite verb is to the left. ** gákti: Ovdal lei gákti árgabivttas.
- @SUBJ> : Subject, the finite verb is to the right. ** Son: Son lea mu oabbá. - Sheis my sister.
- @PPRED : Predicative for predicative.
- @APP : Apposition
- @APP-N< : Apposition to noun to the left. ** oahpaheaddji: Oidnen Ánne, min oahpaheaddji.
- @APP-Pron< : Apposition to pronoun to the left. ** boazodoalloáirasat: Ja moai boazodoalloáirasat áigguime vaikko guovttá joatkit barggu.
- @APP>Pron : Apposition to noun to the right.
- @APP-Num< : Apposition to numeral to the left.
- @APP-ADVL< : Apposition to adverbial to the left. ** bearjadaga: Mun vuolggán ihttin, bearjadaga.
- @VOC : Vocative ** Miss Turner : Bures boahtin deike, Miss Turner! - Welcome her, Miss Turner!
- @CVP : Conjunction or subjunction that conjoins finite verb phrases. ** go : Leago guhkes áigi dassá go Máreha oidnet? - Is it a long time since you saw Máret?
- @CNP : Local conjunction or subjunction. ** vai : Leago nieida vai bárdni? - Is it a girl or a boy?
- @CMPND
- @X : The function is unknown, e.g. because of that the word is unknown
!!Tag sets
- Sets for verbs
** V is all readings with a V tag in them, REAL-V should be the ones without an N tag following the V. The REAL-V set thus awaits a fix to the preprocess V … N bug.
-
The set COPULAS is for predicative constructions
-
NP sets defined according to their morphosyntactic features
-
The PRE-NP-HEAD family of sets
These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NPT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”)
-
Miscellaneous sets
-
Border sets and their complements
ADLVCASE
- Syntactic sets
These were the set types.
!!Numeral outside the sentence
!!HABITIVE MAPPING
-
hab1 hab aux leat
-
hab_numo1 hab copula comma comma N+Nom
-
hab_numo2 copula nu mo/go hab
-
leahab copula nu mo/go hab
-
hab2 hab auxv adv leat
-
hab3 (
@ADVL>) for asdf hab-actor and hab-case; if leat to the right, and Nom to the right of leat. Lots of restrictions. -
hab3 (
@ADVL>) for asdf hab-actor and hab-case; if leat to the right, and Nom to the right of leat. Lots of restrictions. -
hab3 (
@ADVL>) for asdf hab-actor and hab-case; if leat to the right, and Nom to the right of leat. Lots of restrictions. -
hab3 (
@ADVL>) for hab-actor and hab-case; if leat to the right, and Nom to the right of leat. Lots of restrictions. -
hab_main (
@ADVL>) for hab-actor and hab-case; if leat to the right, and Nom to the right of leat. Lots of restrictions. -
habInf hab lea inf
-
habNomLeft Nom or Num + gen hab lea
-
habAdvl Ii han ovttasge du sogas leat dat namma.
-
hab4 hab cc hab leat
-
hab6 lea go hab – leago hab
-
hab7 lea go hab
- hab8 This is not HAB Ellii šattai hoahppu.
-
hab5 This is not HAB Mánás gollot gieđat.
-
hab9 prop ord-hab leat
-
hab10 prop ord-hab leat
- habDain (
@ADVL>) for (Pron Dem Pl Loc) if leat followed by Nom to the right -
habDain2
-
habRel # before relative clause
-
habEllipse Buot gánddain lea dreassa, nieiddain fas gákti.
-
habGen (
@<ADVL) hab for Gen; if Gen is located in the end of the sentence and Nom is sentence initial -
habGenQst (
@<ADVL) hab for Gen; in a question sentence. Gen is located sentence initially and SUBJ is found to the right. To the right of SUBJ is copulas -
habRefl # with inf
-
n<titel1 (@N<) for (“jr”) or (“sr”); if first one to the left is Prop
-
n<titel2 (@N<) for INITIAL; if first one to the left is a noun, or if to the left of you is a single letter which is part of a noun conjunction ‘‘bustávas e ja f gáibiduvvo’’
-
n<:com (@N<) for (Sg Com); if first one to the left is Coll
-
>nAttr (@>N) for Attr; if there is a noun to your right
-
n>Indef (Pron Indef Attr); if eará is to the right
-
n>Indef (Pron Indef Com); if eará is to the right
-
>nNum (@>N) for numerals if; there is a noun to your right. You are not allowed to be (Sg Nom), (Sg Acc) or (Sem/Date)
-
noun>n (@>N) for Gen; if there is a noun to your right. Restrictions: Not if you are: a time related word. Not if you are OKTA with Pl Loc to your right. Not if CC is to your right followed by another Gen and then Po. Not if you are HUMAN and to your right is Actio Nom folloed by a noun.
-
>nTime (@>N) for Gen TIME-N; if timenoun to your right. Restrictions: Not if you are a OKTA Nom with Pl Loc to your right. Not if CC followed by Gen, followed by Po to your right. Not if COMMA to your right
-
>ntittel (@>N) for (Sg Nom TIME-N) or (Nom Der/NomAg); if to your right is Sem/Mal, Sem/Fem, Sem/Sur
-
>nplc (@>N) for (Sg Nom Prop Sem/Plc), if to your right is Sem/Plc
-
>nALU (@>N) for Sg Acc numerals; when a measure-noun to the right
-
>NTime (@>N) for Gen; if you are TIME-N with BOC to your left, and PREGEN to your right
-
n<:Refl (@N<) for (Refl Nom); if to the left is (N Nom), or if first one to the left is a finite mainverb with a (N Nom) to the left
-
>pron1 (@>Pron) for GRADE-ADV, DUSSE, BUOT if; first one to the right is Pron
-
>pron2 (@>Pron) for (Refl Nom) if; first one to the right is Refl
-
>pron3 (@>Pron) for (Pron Recipr) if; first one to the right is (Pron Recipr)
-
vaikko (@>Pron) for vaikko if; first one to the right is Indef
-
vaikkoman (@>ADVL) for vaikko if; first one to the right is man
-
dasmaŋŋel (@>ADVL) for vaikko if; first one to the right is man
-
adv>advl (@>ADVL)
-
adv>advl (@>ADVL)
-
BOSvoc (@VOC) for HUMAN Nom; if sentence initial. To the right is comma. No nom-cased HUMAN followed by comma or CC is allowed to the right. There should not be a relative clause to the right, because then you are likely to be SUBJ
-
voc (@VOC) for Nom HUMAN; if comma to the left and an second person verb or pronoun to the left. To the right is the end of the sentence
-
__Particle<subj __ (@PCLE)
-
spred<obj (@SPRED<OBJ) for Acc; the object of an SPRPED. Not to be mistaken with OPRED. If SPRED is to the left, and copulas is to the left of it. Nom or Hab are found sentence initially.
-
Hab<subj (
@<SUBJ) for Nom; if copulas, goallut or jápmit is FMAINV and habitive or human Loc is found to the left. OR: if Ill or @Pron< followed by HAB are found to the left. -
Hab<subj (
@<SUBJ) with relative clause in between -
Hab>Advlcase<subj (
@<SUBJ) for Nom; it allows adverbials with Ill/Loc/Com/Ess to be found inbetween HAB and . -
Nom>Advlcase<subj (
@<SUBJ) for Nom; it allows adverbials with Ill/Loc/Com/Ess to be found inbetween Nom and @<SUBJ. -
<extSubj (
@<SUBJ) for Nom; if copulas to the left, and some kind of adverb, N Loc, time related word or Po to the left of it. OR: if Ill or @Pron< to the left, followed by copulas and the before mentioned to the left of copulas. -
<extSubj (
@<SUBJ) for sma Nom; if some kind of adverb to the left, N Loc, time related word or Po to the left of it. -
<extSubjA (
@<SUBJ) for A - TEST WITHOUT THIS ONE -
<extSubj (
@<SUBJ) for Nom; if leat to the left and sentenceboundary -
<extSubj (
@<SUBJ) for Nom, but not for Pers. To the left boahtit or heaŋgát as MAINV, and futher to the left is some kind of place related word, or time related word -
loc<extSubj (
@<SUBJ) for Nom -
<spred (@<SPRED) for Nom; if Nom to the left, copulas to the left of Nom, and a time related word to the left of it.
-
<extQst1 (
@<SUBJ) for Nom; in an existential sentence. To your left is hab, some kind of place or time-word or Po. This is a Qst-sentence so the qst-pcle is attached to leat or following leat -
<extQst2 (
@<SUBJ) for Nom; in an existential sentence. To your left is leat and it is sentence initial. No attributes or other words are allowed inbetween (because then you are SPRED), except the attribute muhtun, muhtin -
extQst3> (
@SUBJ>) for Nom; if habitive first one to the left, followed by copulas. -
extQst3> (
@SUBJ>) for Nom; if habitive first one to the left, followed by copulas. -
<extsubjcoor (
@<SUBJ) for Nom. Coordination -
Sem/Year
-
<spredQst (@<SPRED) for Nom; in a typically question sentence; You are not allowed to be Pers or human. The special part is that Nom is not allowed to your right
-
<spredQst2 (@<SPRED) for (A Nom); in a typically question sentence; You are SPRED if (N Nom) is to your left and leat + qst is to the left
-
<spredQst3 (@<SPRED) for (A Nom); you are SPRED when you are (A Nom) and to your right is (N Nom). This is a Qst-sentence, so copulas is found to your left
-
<spredQst4 (@<SPRED) for Nom; but only in a qst-sentence where there is no chance of you beeing the subj
-
<NomBeforeSpred (@<SPRED) for (A Nom) if; Nom to the left, and copulas is to the left of Nom. There is no Nom allowed to the right of copulas! To avoid messing with coordination: ja, dahje and comma are not allowed to your left. Comma is not allowed to your right; if so then you are likely to be coordinated
-
<spred (@<SPRED) for A Nom or N Nom if; the subject Nom is on the same side of copulas as you: on the right side of copulas
-
<spredVeara (@<SPRED) for veara + Nom; if genitive immediately to the right, and intransitive mainverb to the right of genitive
-
leftCop<spred (@<SPRED) for Nom; if copulas is the main verb to the left, and there is no Ess found to the left of cop (note that Loc is allowed between target and cop). OR: if you are Coll or Sem/Group with copulas to your left.
-
<spredLocEXPERIMENT (@<SPRED) for material Loc; if you are to the right of copulas, and the Nom to the left of copulas is not a hab-actor
-
NumTime (@<SPRED) for A Nom
-
<spredSg (@<SPRED) for Sg Nom
-
<spredPg (@<SPRED) for Pl Nom
-
<spred (@<SPRED) for Nom; if copulas to the left, and Nom or sentence boundary to the left of copulas. First one to the right is EOS.
-
COP<spredEss (@<SPRED) for N Ess
-
spredEss> (@SPRED>) for N Ess; if copulas to the right of you, and if an NP with nom-case first one to your left.
-
GalleSpred> (@SPRED>) for Num Nom; if sentence initial
-
spredSgMII> (@SPRED>)
-
spredšaddat> (@SPRED>)
-
spredsjidtedh> (@SPRED>)
-
r492> (@SPRED>) for Interr Gen; consisting only of negations. You are not allowed to be MII. You are not allowed to have an adjective or noun to yor right. You are not allowed to have a verb to your right; the exception beeing an aux.
-
AdjSpredSg> (@SPRED>) for A Sg Nom; if copulas to the right, but not if A or @<SPRED are found to the right of copulas
-
Spred>SubjInf (@SPRED>) for Nom; if copulas to the right, and the subject of copulas is an Inf to the right
-
spredCoord (@<SPRED) coordination for Nom; only if there already is a SPRED to the left of CNP. Not if there is some kind of comparison involved.
-
subj>Sgnr1 (@SUBJ>) for Nom Sg, including Indef Nom if; VFIN + Sg3 or Pl3 to the right (VFIN not allowed to the left)
- subj>Du (@SUBJ>) for dual nominatives, including Coll Nom. VFIN + Du3 to the right.
-
subj>Pl (@SUBJ>) for plural nominatives, including Coll and Sem/Group. VFIN + Pl3 to the right.
-
subj>Pl (@SUBJ>) for plural nominatives
-
subj>Sg (@SUBJ>) for Nom Sg; if VFIN + Sg3 to the right.
-
Sg<subj (@<SUBJ) for Nom Sg; if VFIN Sg3 or Du2 to the left (no HAB allowed to the left).
-
Du<subj (@<SUBJ) for Nom Coll if; a dual third person verb is found to the left
-
PlDu<subj (@<SUBJ) for (N Nom Pl), (Sem/Group Nom), (Coll Nom), (Pron Nom Pl) if; a verb is Pl3 or Du3 to your left. The verb is not allowed to be copulas with a place, Loc or time noun to its left
-
copPl3<subj (@<SUBJ) for Nom Pl; you don’t to be a noun, only Nom Pl. To the left is copulas and first one to the right is @<SPRED
-
-fsubj> (@-FSUBJ>) for HUMAN Gen; in a NP-clause. To your right is Actio Nom followed by a noun
-
f<advl (@-F<ADVL) for infinite adverbials
-
f<advl (@-F<ADVL) for infinite adverbials
-
s-boundary=advl> (@ADVL>) for ADVL that resemble s-boundaries. Mainverb to the right.
-
diibmuadvl> (@ADVL>) for (diibmu Nom) if first one to the right is Num
-
-fsubj (@-FSUBJ>) for HUMAN Acc after DADJAT verbs
-
-fobj> (@-FOBJ>) for Acc if front of abessive, gerundium, actio locative, perfectum participle or infinitive. First one to the right not allowed to be Acc though
-
-fobj> (@-FOBJ>) for Acc if human with ADVL-case to the left and transitive infinitive OBJ to the right. First one to the right not allowed to be Acc though
-
advl>mainV (@ADVL>) if; finite mainverb not found to the left, but the finite mainverb is found to the right.
-
V<advl (@<ADVL) if; finite mainverb found to the left. Not if a comma is found immediately to the left and a finite mainverb is located somewhere to the right of this comma.
-
advl>v (@ADVL>) if; you are ADVL, time-noun or Sem/Route and there is a finite verb to the right in the clause, or if to your right is: de followed by a finite verb. OR: if you are a time-nound and to your right is: go or sentenceboundary followed by a finite verb
- <advlPoPr (@<ADVL) for Po or Pr; if mainverb to the left.
-
advlPoPr> (@<ADVL) for Po or Pr; if mainverb to the right.
-
BOSPo> (@ADVL>) for Po; if trapped between BOS to the right and S-BOUNDARY OR COMMA to the left, because the main verb will then automatically be on your right side.
-
<advlComIll (@<ADVL) only if; you are Com OR Ill. To your left is a mainverb, and to your right a sentenceboundary, because we don’t want there to be another mainverb you potentially could belong to
-
<advlEOS (@<ADVL) for Po or Pr or Loc; if you are found at the very end of a sentence. A mainverb is needed to the left though.
-
<advlGen (@<ADVL) for (N Gen) if mainverb to the left and no noun to the right
-
<opredgohcodit (@<OPRED) for Ess
-
advlEss> (@<ADVL) for weather and time Ess, if FMAINV to the left.
-
comma<advlEOS (@<ADVL) for Adv if; mainverb is to the left. Comma to the left and mainverb to the right in the same clause is not allowed
-
advl>inbetween (@ADVL>) for Adv; if inbetween two sentenceboundaries where no mainverb is present.
-
comma<advlEOS (@<ADVL) for Adv if; comma found to the left and the finite mainverb to the left of comma. To the right is the end of the sentence.
-
BOSadvl> (@ADVL>) if; you are N Loc or N Ill and found sentence initially and there is a main verb somewhere to the right. No barrier for the mainverb; based on the thought that first one to your right is probably a sentenceboundary.
-
cleanupILL<advl (@<ADVL) for N Ill if; there are no boundarysymbols to your left, if you arent already @N< OR @APP-N<, and no mainverb is to yor left.
-
cleanupPo (@ADVL) for Po: This rule tags all Po:s as ADVL if they haven’t gotten a tag somewhere along the way.
-
cleanupPr (@ADVL) for Po: This rule tags all Pr:s as ADVL if they haven’t gotten a tag somewhere along the way.
-
-fsubj>asAcc (@-FSUBJ>) for HUMAN Acc; if there is a verb @-F<OBJ to your left
-
-f<obj (@-F<OBJ) for Acc if there is a transitive verb + SYN-V to your left
-
-fsubj>IV (@-FSUBJ>) for Acc; if there is an IV-verb acting as a @-F<OBJ to your right
-
-fsubj>IV (@-FSUBJ>) for Acc; if there is an TV-verb acting as a @-F<OBJ to your right followed by an Acc
-
-fsubj>asGen (@-FSUBJ>) for Gen;
-
f<subj (@-F<SUBJ) for Nom if; (V @-F<OBJ) to the left.
-
<opredAAcc (@<OPRED) for A Acc; if an other accusative to the left, and a transtive verb to the left of it. OR: if a transitive verb to the left, and an accusative to the left of it.
- TV<obj (@<OBJ) for Acc; if there is a transitive mainverb to the left in the clause. Not for Rel. Not if you are a numeral followed by a measure-noun
!sma object
-
<advlMeasr (@<ADVL) for (Num Acc); if finite IV-mainverb to the left, measure-noun to the right
-
<objMeasr (@<OBJ) for Num Acc; if finite TV-mainverb to the left, measure-noun to the right
-
<advlMeasr2 (@<ADVL) for MEASR-N + Acc; if (Num Pl) to the left and mainverb to the left of it
-
advlMeasr> (@ADVL>) for Num Acc;
-
Obj> (@OBJ>) for Acc; if there is a finite mainverb to the right in the clause. A really simple rule with no other restrictions..
-
s-boun<obj (@<OBJ) for Acc; if sentenceboundary to your left and a transitive mainverb to the left futher to the left
-
<objIV (@<OBJ) for Acc; if there is an intransitive mainverb in the clause. Not for Rel or Num. Not if you are a numeral followed by a measure-noun
-
<advlEss (@<ADVL) for ESS-ADVL if; FMAINV to the left
-
IV<spredEss (@<SPRED) for N Ess if; FMAINV to the left is intransitive or bargat
-
<opredEss (@<OPRED) for (N Ess), (A Ess) if; transitive mainverb to the left in the clause. If accusative to the left or to the right, or if Inf or ahte to the right, or if there is a noun to the right followed by an Inf
-
Acc<opredEss (@<OPRED) for (N Ess), (A Ess) if; transitive mainverb to the left in the clause, and an accusative cased Rel left to the verb
-
onlyV<opred (@<OPRED) for (N Ess) if; there is a transitive mainverb to the left. Usually there needs to be an Acc to the left, but here it is not needed
-
onlyV<opred2 (@<OPRED) for (N Ess) if;
!!SUBJ MAPPING - leftovers
-
subj>ifV (@SUBJ>) for NP-HEAD-NOM, DUPRON or Num + Nom if; a finite mainverb is found to the right. This is a cleanup rule for subjects
-
hnoun>ifV (@SUBJ>) for NP-HEAD-NOM, DUPRON if. The counterpart of subj>ifV. You are HNOUN if there is a finite verb to your right, but NOT if there is a finite verb after a relative clause
!!OBJ MAPPING - leftovers
!!
!!HNOUN MAPPING
- @<ADVLcoor (@<ADVL) for ADVLCASEAdv if @CNP to the left and ADVL to the left of it
! missingX adds @X to all missings
! therestX adds @X to all what is left, often errouneus disambiguated forms
!!For Apertium: The analysis give double analysis because of optional semtags. We go for the one with semtag.
!!For Korp:
Here we remove special tags for MT
! smeRemove removes the language tags
Here we remove semantic tags for all other words than proper nouns.
This (part of) documentation was generated from src/cg3/functions.cg3
src-fst-morphology-affixes-adjectives.lexc.md
Adjective inflection
Meadow Mari adjectives
LEXICON A underscore
-
LEXICON A-a/e
-
LEXICON A/S-a/e redirect to A underscore
-
LEXICON A/S-VS redirect to A underscore
-
LEXICON A-VS redirect to A underscore
-
LEXICON A/S redirect to A underscore
This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc
src-fst-morphology-affixes-clitics.lexc.md
Eastern Mari Clitics
-
LEXICON K
-
LEXICON K-imprt
This (part of) documentation was generated from src/fst/morphology/affixes/clitics.lexc
src-fst-morphology-affixes-nouns.lexc.md
Noun inflection
Meadow Mari noun inflection
a final lexica
Some Postpositions in Mari take possessive suffixes. For now, am allowing all an all, but we should revisit this in the lexicon eventually - classifying postpositions into those that take Px and those that do not.
Also here: some adverbs that take possessive suffixes, like ӱстембалне on the table > ӱстембалнем on my table
-
LEXICON N_ redirects to N-ava_01
-
LEXICON N-continuation comes from Proper nouns
-
LEXICON N-ava_01 obl because of pronouns
- LEXICON N-ava_01_obl to obl only
-
LEXICON N-ava_01_obl_without-hyphens to obl only
-
LEXICON N-ava_01_obl_with-hyphens to obl only, also ООО-влак
- LEXICON N-OLD-ORTH-SG-NOM_
DECLENSION
Case suffixes
Each case-number-person has its own lexicon.
-
LEXICON N-SG-NOM
-
LEXICON N-SG-GEN
-
LEXICON N-SG-DAT
-
LEXICON N-SG-ACC
-
LEXICON N-SG-CMPR
-
LEXICON N-SG-COM
-
LEXICON N-SG-INE
-
LEXICON N-SG-ILL
-
LEXICON N-SG-LAT
-
LEXICON N-SG-ABE
-
LEXICON N-LOCPL-ILL
-
LEXICON N-LOCPL-INE
-
LEXICON N-LOCPL-LAT
-
LEXICON N-LOCPL-NOM
-
LEXICON N-PL-NOM
-
LEXICON N-PL-GEN
-
LEXICON N-PL-DAT
-
LEXICON N-PL-ACC
-
LEXICON N-PL-CMPR
-
LEXICON N-PL-COM
-
LEXICON N-PL-INE
-
LEXICON N-PL-ILL
-
LEXICON N-PL-LAT
-
LEXICON N-PL-ABE
Sg Sg1
Here starts the Px stuff
-
LEXICON N-SG-PxSg1-NOM
-
LEXICON N-SG-PxSg1-GEN
-
LEXICON N-SG-PxSg1-DAT
-
LEXICON N-SG-PxSg1-ACC
-
LEXICON N-SG-PxSg1-CMPR
-
LEXICON N-SG-PxSg1-COM
-
LEXICON N-SG-PxSg1-INE
-
LEXICON N-SG-PxSg1-ILL
-
LEXICON N-SG-PxSg1-LAT
-
LEXICON N-SG-PxSg1-ABE
Pl Sg1
-
LEXICON N-PL-PxSg1-NOM
-
LEXICON N-PL-PxSg1-NOM_NB-first
-
LEXICON N-PL-PxSg1-GEN
-
LEXICON N-PL-PxSg1-GEN_NB-first
-
LEXICON N-PL-PxSg1-DAT
-
LEXICON N-PL-PxSg1-DAT_NB-first
-
LEXICON N-PL-PxSg1-ACC
-
LEXICON N-PL-PxSg1-ACC_NB-first
-
LEXICON N-PL-PxSg1-CMPR
-
LEXICON N-PL-PxSg1-CMPR_NB-first
-
LEXICON N-PL-PxSg1-COM
-
LEXICON N-PL-PxSg1-COM_NB-first
-
LEXICON N-PL-PxSg1-INE
-
LEXICON N-PL-PxSg1-INE_NB-first
-
LEXICON N-PL-PxSg1-ILL
-
LEXICON N-PL-PxSg1-ILL_NB-first
-
LEXICON N-PL-PxSg1-LAT
-
LEXICON N-PL-PxSg1-LAT_NB-first
-
LEXICON N-PL-PxSg1-ABE
-
LEXICON N-PL-PxSg1-ABE_NB-first
Sg Sg2
-
LEXICON N-SG-PxSg2-NOM
-
LEXICON N-SG-PxSg2-GEN
-
LEXICON N-SG-PxSg2-DAT
-
LEXICON N-SG-PxSg2-ACC
-
LEXICON N-SG-PxSg2-CMPR
-
LEXICON N-SG-PxSg2-COM
-
LEXICON N-SG-PxSg2-INE
-
LEXICON N-SG-PxSg2-ILL
-
LEXICON N-SG-PxSg2-LAT
-
LEXICON N-SG-PxSg2-ABE
Pl Sg2
- LEXICON N-PL-PxSg2-NOM
-
LEXICON N-PL-PxSg2-NOM_NB-first
-
LEXICON N-PL-PxSg2-GEN
-
LEXICON N-PL-PxSg2-GEN_NB-first
-
LEXICON N-PL-PxSg2-DAT
-
LEXICON N-PL-PxSg2-DAT_NB-first
-
LEXICON N-PL-PxSg2-ACC
-
LEXICON N-PL-PxSg2-ACC_NB-first
-
LEXICON N-PL-PxSg2-CMPR
-
LEXICON N-PL-PxSg2-CMPR_NB-first
-
LEXICON N-PL-PxSg2-COM
-
LEXICON N-PL-PxSg2-COM_NB-first
-
LEXICON N-PL-PxSg2-INE
-
LEXICON N-PL-PxSg2-INE_NB-first
-
LEXICON N-PL-PxSg2-ILL
- LEXICON N-PL-PxSg2-ILL_NB-first
Sg Sg3
-
LEXICON N-SG-PxSg3-NOM
-
LEXICON N-SG-PxSg3-GEN
-
LEXICON N-SG-PxSg3-DAT
-
LEXICON N-SG-PxSg3-ACC
-
LEXICON N-SG-PxSg3-CMPR
-
LEXICON N-SG-PxSg3-COM
-
LEXICON N-SG-PxSg3-INE
-
LEXICON N-SG-PxSg3-ILL
-
LEXICON N-SG-PxSg3-LAT
-
LEXICON N-SG-PxSg3-ABE
Pl Sg3
-
LEXICON N-PL-PxSg3-NOM
-
LEXICON N-PL-PxSg3-NOM_NB-first
-
LEXICON N-PL-PxSg3-GEN
-
LEXICON N-PL-PxSg3-GEN_NB-first
-
LEXICON N-PL-PxSg3-DAT
-
LEXICON N-PL-PxSg3-DAT_NB-first
-
LEXICON N-PL-PxSg3-ACC
-
LEXICON N-PL-PxSg3-ACC_NB-first
-
LEXICON N-PL-PxSg3-CMPR
-
LEXICON N-PL-PxSg3-CMPR_NB-first
-
LEXICON N-PL-PxSg3-COM
-
LEXICON N-PL-PxSg3-COM_NB-first
-
LEXICON N-PL-PxSg3-INE
-
LEXICON N-PL-PxSg3-INE_NB-first
-
LEXICON N-PL-PxSg3-ILL
-
LEXICON N-PL-PxSg3-ILL_NB-first
-
LEXICON N-PL-PxSg3-LAT
-
LEXICON N-PL-PxSg3-LAT_NB-first
-
LEXICON N-PL-PxSg3-ABE
-
LEXICON N-PL-PxSg3-ABE_NB-first
Sg Pl1
-
LEXICON N-SG-PxPl1-NOM
-
LEXICON N-SG-PxPl1-GEN
-
LEXICON N-SG-PxPl1-DAT
-
LEXICON N-SG-PxPl1-ACC
-
LEXICON N-SG-PxPl1-CMPR
-
LEXICON N-SG-PxPl1-COM
-
LEXICON N-SG-PxPl1-INE
-
LEXICON N-SG-PxPl1-ILL
-
LEXICON N-SG-PxPl1-LAT
-
LEXICON N-SG-PxPl1-ABE
Pl Pl1
-
LEXICON N-PL-PxPl1-NOM
-
LEXICON N-PL-PxPl1-NOM_NB-first
-
LEXICON N-PL-PxPl1-GEN
-
LEXICON N-PL-PxPl1-GEN_NB-first
-
LEXICON N-PL-PxPl1-DAT
-
LEXICON N-PL-PxPl1-DAT_NB-first
-
LEXICON N-PL-PxPl1-ACC
-
LEXICON N-PL-PxPl1-ACC_NB-first
-
LEXICON N-PL-PxPl1-CMPR
-
LEXICON N-PL-PxPl1-CMPR_NB-first
-
LEXICON N-PL-PxPl1-COM
-
LEXICON N-PL-PxPl1-COM_NB-first
-
LEXICON N-PL-PxPl1-INE
-
LEXICON N-PL-PxPl1-INE_NB-first
-
LEXICON N-PL-PxPl1-ILL
-
LEXICON N-PL-PxPl1-ILL_NB-first
-
LEXICON N-PL-PxPl1-LAT
-
LEXICON N-PL-PxPl1-LAT_NB-first
-
LEXICON N-PL-PxPl1-ABE
-
LEXICON N-PL-PxPl1-ABE_NB-first
Sg Pl2
-
LEXICON N-SG-PxPl2-NOM
-
LEXICON N-SG-PxPl2-GEN
-
LEXICON N-SG-PxPl2-DAT
-
LEXICON N-SG-PxPl2-ACC
-
LEXICON N-SG-PxPl2-CMPR
-
LEXICON N-SG-PxPl2-COM
-
LEXICON N-SG-PxPl2-INE
-
LEXICON N-SG-PxPl2-LAT
-
LEXICON N-SG-PxPl2-ABE
-
LEXICON N-PL-PxPl2-NOM
-
LEXICON N-PL-PxPl2-NOM_NB-first
-
LEXICON N-PL-PxPl2-GEN
-
LEXICON N-PL-PxPl2-GEN_NB-first
-
LEXICON N-PL-PxPl2-DAT
-
LEXICON N-PL-PxPl2-DAT_NB-first
-
LEXICON N-PL-PxPl2-ACC
-
LEXICON N-PL-PxPl2-ACC_NB-first
-
LEXICON N-PL-PxPl2-CMPR
-
LEXICON N-PL-PxPl2-CMPR_NB-first
-
LEXICON N-PL-PxPl2-COM
-
LEXICON N-PL-PxPl2-COM_NB-first
-
LEXICON N-PL-PxPl2-INE
-
LEXICON N-PL-PxPl2-INE_NB-first
-
LEXICON N-PL-PxPl2-ILL
-
LEXICON N-PL-PxPl2-ILL_NB-first
-
LEXICON N-PL-PxPl2-LAT
-
LEXICON N-PL-PxPl2-LAT_NB-first
-
LEXICON N-PL-PxPl2-ABE
-
LEXICON N-PL-PxPl2-ABE_NB-first
Sg Pl3
-
LEXICON N-SG-PxPl3-NOM
-
LEXICON N-SG-PxPl3-GEN
-
LEXICON N-SG-PxPl3-DAT
-
LEXICON N-SG-PxPl3-ACC
-
LEXICON N-SG-PxPl3-CMPR
-
LEXICON N-SG-PxPl3-COM
-
LEXICON N-SG-PxPl3-INE
-
LEXICON N-SG-PxPl3-ILL
-
LEXICON N-SG-PxPl3-LAT
-
LEXICON N-SG-PxPl3-ABE
-
LEXICON N-PL-PxPl3-NOM
-
LEXICON N-PL-PxPl3-NOM_NB-first
-
LEXICON N-PL-PxPl3-GEN
-
LEXICON N-PL-PxPl3-GEN_NB-first
-
LEXICON N-PL-PxPl3-DAT
-
LEXICON N-PL-PxPl3-DAT_NB-first
-
LEXICON N-PL-PxPl3-ACC
-
LEXICON N-PL-PxPl3-ACC_NB-first
-
LEXICON N-PL-PxPl3-CMPR
-
LEXICON N-PL-PxPl3-CMPR_NB-first
-
LEXICON N-PL-PxPl3-COM
-
LEXICON N-PL-PxPl3-COM_NB-first
-
LEXICON N-PL-PxPl3-INE
-
LEXICON N-PL-PxPl3-INE_NB-first
-
LEXICON N-PL-PxPl3-ILL
-
LEXICON N-PL-PxPl3-ILL_NB-first
-
LEXICON N-PL-PxPl3-LAT
-
LEXICON N-PL-PxPl3-LAT_NB-first
-
LEXICON N-PL-PxPl3-ABE
-
LEXICON N-PL-PxPl3-ABE_NB-first
This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc
src-fst-morphology-affixes-numbers.lexc.md
East Mari Numeral inflection
-
LEXICON QNT_
-
LEXICON KvMurt
-
LEXICON KvK cardinal numerals
-
LEXICON KvK_ATTR cardinal numerals in noun phrase scope
-
LEXICON KvKoll
-
LEXICON NUM-COLL_
-
LEXICON KvInd
-
LEXICON Kv-a/e
This (part of) documentation was generated from src/fst/morphology/affixes/numbers.lexc
src-fst-morphology-affixes-pronouns.lexc.md
Eastern Mari pronoun inflection
Lexica directed from root.lexc
-
LEXICON pronouns_not_from_xml
-
LEXICON MYJ
-
LEXICON TYJ
-
LEXICON TUDO
-
LEXICON TIDE
-
LEXICON SHKE
Pronoun lexica from xml
-
LEXICON Pimp
-
LEXICON Pmuu
-
LEXICON PRON-IR_
-
LEXICON PRON_
-
LEXICON PronRes
-
LEXICON PronI
-
LEXICON PronIR
-
LEXICON PronInd
-
LEXICON PRON-INDEF
-
LEXICON KAZHNE
-
LEXICON PronDem
-
LEXICON PronRef
This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc
src-fst-morphology-affixes-propernouns.lexc.md
Proper noun inflection
Meadow Mari proper nouns inflect in the same cases as regular nouns, but with a colon (‘:’) as separator. (???)
-
LEXICON PROP-OLD-ORTH-SG-NOM_
- LEXICON PROP_
- : N-ava_01 ; decline like common nouns
-
: ENDLEX ; in attributive position ?SHOULD THIS HAVE an +Attr tag?
- LEXICON PROP-PLC_
-
: N-ava_01 ; decline like common nouns
- LEXICON LEXC_PROP-PLC_
- +N+Prop+Sem/Plc: N-ava_01 ; decline like common nouns
Male given name for deriving patronyms
Check whether +Orth/Colloq is orthographically wrong
-
LEXICON PropNameMaleDer-J-0Evich
-
LEXICON PropNameMaleDer-IJ-Y0Evich
-
LEXICON PropNameMaleDer-IJ-I0Evich
-
LEXICON PropNameMaleDer-Y-0Evich
Вили:Вил
-
LEXICON PropNameMaleDer-I-YEvich
-
LEXICON PropNameMaleDer-Ovich
Russian type Surnames
-
LEXICON Deriv-RUS-V_SURMAL Абдеев:Абдеев
-
LEXICON Deriv-RUS-IJ_SURMAL Багрий:Багр
-
LEXICON Deriv-RUS-KIJ_SURMAL Аморский:Аморск
-
LEXICON Deriv-RUS-OJ_SURMAL
-
LEXICON Deriv-RUS-YJ_SURMAL
-
LEXICON Deriv-RUS-AN_SURMAL
-
LEXICON Deriv-RUS-IN_SURMAL
-
LEXICON PROP_KAL_SURMAL
-
LEXICON PROP_KUDO_SURFEM
-
LEXICON CYRL-CONS_SUR
-
LEXICON PropSur-kal
-
LEXICON CYRL-T_SUR
-
LEXICON PropSur-kit
-
LEXICON CYRL-L_SUR
-
LEXICON CYRL-K_SUR
-
LEXICON PropSur-lak
-
LEXICON CYRL-SIBILANT_SUR
-
LEXICON PropSur-osh
-
LEXICON CYRL-VOW_SUR
-
LEXICON CYRL-A_SUR
-
LEXICON PROP_KAL_SUR
-
LEXICON PROP_KUDO_SUR
-
LEXICON PROP_KAL_MAL
-
LEXICON PROP_LAK_MAL
-
LEXICON PROP_OSH_MAL
-
LEXICON PROP_KUDO_MAL
-
LEXICON LEXC_PROP_KUDO_MAL
-
LEXICON PROP_OSH_PATRMAL
-
LEXICON PROP_KUDO_PATRFEM
Female Given names
-
LEXICON PROP_KAL_FEM
-
LEXICON PROP_OSH_FEM
-
LEXICON PROP_KUDO_FEM
-
LEXICON LEXC_PROP_KUDO_FEM
PLACE NAMES FROM TEMPLATE
This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc
src-fst-morphology-affixes-symbols.lexc.md
Symbol affixes
This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc
src-fst-morphology-affixes-verbs.lexc.md
Meadow Mari verb inflection.
Verbal continuation lexica
Auxiliary verbs
Some of these are directed directly from root.lexc
LEXICON verbs_not_from_xml
LEXICON negverb TODO: fix
Regular verbs
We divide the verbs in two, -am and -em
The -am class
LEXICON V_am-N divides V_am in Mood and infinites
LEXICON V_am divides V_am in Mood and infinites
LEXICON Vam-Mood divides in Ind, Imprt, Des
LEXICON Vam-Ind gives all the Ind tenses
LEXICON Vam-Imp for imperative, Повелительное наклонение:
LEXICON Vam-Des for desiderative, Желательное наклонение:
The -em class
First four lexica: V_em with Gerund, the rest without, all going to V_em_ALL to get derivation affixes.
LEXICON V_em divides V_em in Mood and infinites
LEXICON V_em-1SYLL-j allow for literary norm until 1970 (Alhoniemi 1985: 105-106) кайше, кайшаш +Err/Orth: non-finites ; until 1972 reform
LEXICON V_em-1SYLL single syll V_em verbs, do not include bare-stem gerunds in their paradigms
Optional derivation: All verbs going to V_em_INFL
LEXICON Vem-Mood divides in Ind, Imprt, Des
LEXICON Vem-Ind gives all the Ind tenses
-
LEXICON Vem-Imp for imperative, Повелительное наклонение:
-
LEXICON Vem-Des for desiderative, Желательное наклонение:
LEXICON non-finites contains Mutual endings
Special verbs
V_am, возаш : воч
These need work 2012-09-21
This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc
src-fst-morphology-compounding.lexc.md
Divvun & Giellatekno - open source grammars for Sámi and other languages
A special lexicon for handling proper noun compounding without hyphens as that would allow compounding with words explicitly coded to disallow such compounds)
This (part of) documentation was generated from src/fst/morphology/compounding.lexc
src-fst-morphology-phonology.twolc.md
Eastern Mari twol file
This file documents the phonology.twolc file
This file contains rules for morphophonological alternations, such as vowel harmony, stem vowel changes, palatalisation, etc.
We define our symbols (Alphabet), some Sets, and then the Rules
Letters of the alphabet
- а б в г д е ё ж з и й к л м н ҥ о ӧ п р с т у ӱ ф х ц ч ш щ ъ ы ь э ю я ӓ ӹ
- А Б В Г Д Е Ё Ж З И Й К Л М Н Ҥ О Ӧ П Р С Т У Ӱ Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я Ӓ Ӹ
other symbols
- %-
Archiphonemes for vowels, Giellatekno style
- Е1:е а1:а и1:и у1:у ӱ1:ӱ я1:я
- Ы2:е Variation in еоыӧØ
- А1:а
- Е2:е А2:а
Archiphonemes for vowels, Apertium style
- %{аы%}:а Stem-final vowel variation when stress falls on non-final vowel
- %{аы%}:ы Stem-final vowel variation when stress falls on non-final vowel
- %{еы%}:е Stem-final vowel variation when stress falls on non-final vowel
- %{еы%}:ы Stem-final vowel variation when stress falls on non-final vowel
- %{оы%}:о Stem-final vowel variation when stress falls on non-final vowel
- %{оы%}:ы Stem-final vowel variation when stress falls on non-final vowel
- %{ӧы%}:ӧ Stem-final vowel variation when stress falls on non-final vowel
-
%{ӧы%}:ы Stem-final vowel variation when stress falls on non-final vowel
- %{яы%}:я Stem-final vowel variation when stress falls on non-final vowel
-
%{яы%}:ы Stem-final vowel variation when stress falls on non-final vowel
- %{еоыӧØ%}:е PxSg3 final
- %{еоыӧØ%}:о PxSg3 final
- %{еоыӧØ%}:ӧ PxSg3 final
- %{еоыӧØ%}:0 PxSg3 final
- %{еоыӧØ%}:ы PxSg3 final
- %{ыØ%}:ы PxSg3 onset
-
%{ыØ%}:0 PxSg3 onset
-
%{ьØ%}:0 for -ам verbs Prt1 Sg1, Sg2, Sg3, Pl3 л н
- Е1:е for lative
Arcihphonemes for consonants
- з2:з for з:ч, thus возаш - воч but колхоз (*колхоч)
- з2:ч for native stems
- к2:0 кочк- коч# “eat/есть” мушк- муш “wash/мыть”
- к2:к кочк- коч# “eat/есть” мушк- муш “wash/мыть”
- н2:0 шинч- шич# “sit down/сесть”
- н2:н шинч- шич# “sit down/сесть”
- т2:0 лект- лек# “leave/ уходить”
-
т2:т лект- лек# “leave/ уходить”
- %^V2IMPRT:0 for -ем verbs in й
-
%^END:0 for -ам verb final, i.e. Imprf
- %^VoTrigger:0 for use with acronyms after hyphen о у ё ю О У Ё Ю
- %^VeTrigger:0 for use with acronyms after hyphen а е и э я А Е И Э Я
- %^VOTrigger:0 for use with acronyms after hyphen ӧ ӱ Ӧ Ӱ
-
%^Sonorant:0 for use with acronyms after hyphen Л М Н Р Ҥ -
%^Obstruent:0 for use with acronyms after hyphen С Ф Ъ Ь -
%^FrontObstr:0 for use with acronyms after hyphen
- %>
Sets
- Vo = о у ё ю О У Ё Ю ;
- VO = ӧ ӱ Ӧ Ӱ;
- Ve = а е и э я А Е И Э Я ;
-
Vow = Vo VO Ve ы Ы ;
-
Cns = б в г д ж з й к л м н ҥ п р с т ф х ц ч ш щ
з2 к2 н2 т2 ; -
CnsAll = б в г д ж з й к л м н ҥ п р с т ф х ц ч ш щ
з2 к2 н2 т2 ; -
CnsNoj = б в г д ж з к л м н ҥ п р с т ф х ц ч ш щ
Б В Г Д Ж З К Л М Н Ҥ П Р С Т Ф Х Ц Ч Ш Щ; -
Cst = б в г д ж з к п с т ф х ц ч ш щ
Б В Г Д Ж З К П С Т Ф Х Ц Ч Ш Щ; -
Ltrs = Vow Cns ъ ь Ъ Ь ;
- all = Ltrs %- ;
Rules
Punctuation bullet as such This rule prevents deleting of BULLET when it forms a token. BULLET as stress mark is deleted as before.
Palatal mark loss before vowel имне+N+Sg+Nom+Foc/Ат
- имнь%{еы%}%>А2т
- имн00%>ят витнь>Ы2^END
Onset vowel loss in suffix after stem vowel
Onset vowel Е2 realized in suffix е
Onset vowel Е2 realized in suffix э
Onset vowel Е2 realized in suffix ZERO
Onset vowel Ы1 realized in suffix
suffix-final vowel loss after stem-final vowel
пуаш+V+Imprt+Sg2
- пу>Ы2%^END
- пу>00
кияш+V+Imprt+Sg2
- кий%>Ы2%^END
- ки0%>е0
suffix-final vowel loss after stem-final vowel
**suffix-final vowel realized as -Round in word-final position е **
шылаш+V+Imprt+Sg3 шыл%>жЫ2%^END шыл%>же0
**suffix-final vowel realized as +Back +Round in word-final position о **
- фрукт>Ы1штЫ2^END
- фрукт>ышто0
**suffix-final vowel realized as +Front +Round in word-final position ӧ **
шӱртняш+V+ConNeg:
- шӱртнь%>Ы2%^END
- шӱртнь%>ӧ0
remove ʼ mod let apostrophe
%{ьØ%}:ь толам+V+Ind+Prt1+Sg1
- тол%>%{ьØ%}ым
- тол%>ьым
suffix-final vowel realized after stem-final consonant
stem-final vowel realized as -Round in word-final position
stem-final vowel realized as +Back +Round in word-final position
stem-final vowel realized as +Front +Round in word-final position
**suffix-final vowel realized %{аы%}:ы **
stem-final vowel realized %{аы%}:а
stem-final vowel realized %{аы%}:а
Stem-final non-stressed vowel loss
- а•льф{аы}>А2т
- а0льф0>ат
Stem-final non-stressed %{еы%} loss
- киска•лʼь%{еы%}>А2т
- киска•л000>ят
- пятибо•рь%{еы%}>А2т
- пятибо•рье>ат
**suffix-final vowel realized %{еы%}:ы **
имне+N+Sg+PxSg3+Nom horse/hevonen
- имнʼь%{еы%}%>жЫ2
- имн0ьы%>же
**suffix-final vowel realized Ы2:ы **
пӧрт+N+Sg+Ine+Foc/ys
пӧрт%>Ы1штЫ2%>Ы1с%^END
пӧрт%>ышты%>0с0
stem-final vowel realized %{еы%}:е
**suffix-final vowel realized %{ӧы%}:ы **
stem-final vowel realized %{ӧы%}:ӧ
**suffix-final vowel realized %{оы%}:ы **
stem-final vowel realized %{оы%}:о
**suffix-final vowel realized %{яы%}:ы **
stem-final vowel realized %{яы%}:я
- башнь%{яы%}
- башн0я
**stem-internal glide realized in 0:й %{яы%}:ы **
- а0%{яы%}%>Ы1м
- айы%>0м
Clitics in At and Ak take onset glide = a
Clitics in At and Ak take onset glide = ja
когыльо+N+Sg+Nom+Foc/Ат
- когыль%{оы%}%>А2т
- когыл00%>ят толаш+V+Ind+Prt1+Sg3+Foc/Ат
- тол>{ьØ}Ы2>А2т
- тол>00>ят имне+N+Sg+Nom+Foc/Ат
- и•мнʼь%{еы%}>А2т
- и0мн000>ят
Clitics in At and Ak take ZERO
й Deletion in front of я Suffix and others
- кий>А2ш
- ки0>яш
й Deletion in front of я Suffix and others
й Deletion in front of я Suffix and others
- кай>А2
-
ка0>я
- кутыр>А2
- кутыр>а
**Onset consonant devoicing ж:ш **
- авалтымаш%>жы%>ла
- авалтымаш%>шы%>ла
**Onset consonant devoicing з:с **
Stem-final consonant loss т
Stem-final consonant loss к
Stem-final consonant loss н
- колхоз
-
колхоз
- воз2^END
-
воч0
- воз2>аш
-
воз>аш
- камвоз2>аш
-
камвоз>аш
- воз2>са
- воч>са
Stem-final consonant variation з2:з
Stem-final consonant variation з2:з
**Disallow Sg+Ine in тЫ2 everywhere except after stem-final ш ** йӧратымаш+N+Sg+Ine
- йӧратымаш%>тЫ2
- йӧратымаш%>те
**Disallow Sg+Ill in кЫ2 everywhere except after stem-final ш ** авалтымаш+N+Sg+Ine
- авалтымаш%>кЫ2
- авалтымаш%>ке
**Disallow PxSg3 in ыж no where except after ш **
- йолташ%>ыж#
- йолташ%>ыж#
- ★олма%>ыж# (is not standard language)
- ★олма%>ыж# (is not standard language)
**Disallow PxSg3 in ыж no where except after ш **
**Disallow %^V2IMPRT й-final Imprt+Sg2 single-syllable -em verbs **
- и•мнʼь%{еы%}>A2т
-
и0мн000>ят
- и
•(Eng. м н ʼ ь %{еы%} > A2 т) -
и
0(Eng. м н 0 0 0 > я т) - а
- б
This (part of) documentation was generated from src/fst/morphology/phonology.twolc
src-fst-morphology-root.lexc.md
Morphology
This file consists of three parts:
- Multichar Symbols declaration
- The Root lexicon
- A set of lexica for minor parts of speech
- A set of unfinished lexica, to be either deleted or expanded.
Declaration of Multichar_Symbols
Analysis symbols
The morphological analyses of the wordforms of Eastern Mari language are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags).
-
+WORK = nouns
- %^VoTrigger for use with acronyms after hyphen
- %^VeTrigger for use with acronyms after hyphen
- %^VOTrigger for use with acronyms after hyphen
-
%^Sonorant for use with acronyms after hyphen Л М Н Р Ҥ -
%^Obstruent for use with acronyms after hyphen С Ф Ъ Ь -
%^FrontObstr for use with acronyms after hyphen С Ф Ъ Ь - %^END = twolc tag to mark end of word
The parts-of-speech are:
- +N = nouns
- +A = adjectives
- +Adp = adpositions
- +Adv = adverbs
- +V = verbs
- +Pron = pronouns
- +CS = subjunctions
- +CC = conjunctions
- +Interj = interjections
- +Pcle = particles
- +Num = numerals
- +Descr = descriptive ideophones
POS subtags
The parts of speech are further split up into:
- +Po = postpositions
-
+Pr = prepositons
- +Prop = Proper noun
- +Pers = Personal pronoun
- +Dem = Demonstrative pronoun
- +Interr = Interrogative pronoun
- +Refl = Reflexive pronoun
- +Recipr = Reciprocal pronoun
- +Rel = Relative pronoun
- +Indef = Indefinite pronoun
- +Coll = Collective numerals -ын-
-
+AssocColl = Collective associative numerals with obligatory possessive suffixes -нь-
- +Patr = patronym, look at this in other cyr fsts.
- +Aux = Auxiliary verb
- +Dep = ( pair verbs that do not occur independently get this marker.) /was +Depend, but +Dep used in fst.
Have a look at these:
- +Foc/Poss =
- +Prf = perfective
- +Arab = arabic numerals
- +Qnt = quantifiers
- +Rom = roman numerals
- +Weak = weak (?) form
The nominals are inflected in the following numbers
- +Sg =
- +Pl =
- +AssocPl =
- +LocPl = location, better witho LocusPl to avoid Loc case?
The nominals are inflected in the following Case and Number
- +Nom = nominative
- +Gen = genitive
- +Acc = accusative
- +Com = comitative
- +Ill = illative
- +Ine = inessive
- +Lat = lative
- +Dat = dative
- +Cmpr = comparative case
- +Abe = abessive
- +Voc = vocative
- +Attr = attributive form
- +Instr =
The possession is marked as such:
- +PxSg1 =
- +PxSg2 =
- +PxSg3 =
- +PxPl1 =
- +PxPl2 =
- +PxPl3 =
Suffix ordering tags:
- +So/CP = Suffix ordering: Case + Possessive Person marking
- +So/PC = Suffix ordering: Possessive Person + Case marking
- +So/NCP = Suffix ordering: Number + Case + Possessive Person marking
- +So/NPC = Suffix ordering: Number + Possessive Person + Case marking
- +So/NP = Suffix ordering: Number + Possessive Person marking
- +So/PN = Suffix ordering: Possessive Person + Number marking
- +So/PNC = Suffix ordering: Possessive Person + Number + Case marking
The comparative forms are:
- +Comp = comparative (not: not Cmp)
- +Superl = superlative
Numerals are classified under:
- +Card = (hmm, skip+Card?)
- +Ord =
Note the attributive tag, in defferent contexts
- +Attr =
Verb moods are:
- +Ind = indicative
- +Cond = conditional
- +Imprt = imperative
- +Des = desiderative
Verb tenses are:
- +Prs = present
- +Prt1 = 1st preterite, direct observation
- +Prt2 = 2nd preterite, indirect narrative, conclusion
Verb personal forms are: (also used with personal pronouns)
- +Sg1 =
- +Sg2 =
- +Sg3 =
- +Pl1 =
- +Pl2 =
-
+Pl3 =
- +Ext = form уло
- +Indep = forms огым, огыт, ите
Other verb forms are
- +Inf = Infinitive
- +Ger = Gerund
- +Neg = Negation verb
- +ConNeg = Invariant main verb in negation expression
- +Prc = Participle
- +Nec = Necessive infinitive
- +Fut = Future participle
- +Neg = Negative participle
- +Imprf = Imperfective (?) – XXX check this
- +Act = Active
- +Pass = Passive
Question and Focus particles:
- +Qst =
-
+Foc =
- +Foc/at = -at focus particla
- +Foc/ak = -ak focus particle
- +Foc/ys = -ys focus particle
- +Foc/jan = -jan focus particle
- +Foc/ja = -ja focus particle
Tags distinguishing different versions of the same lemma (before POS)
- +v1
- +v2
- +v3
- +v4
- +v5
- +v6
- +v7
- +v8
- +v9
- +v10
- +v11
- +v12
- +v13
- +v14
- +v15
- +v16
- +v17
- +v18
- +v19
- +v20
Derivations
- +Ex/N = for derivation from N to anoter POS
- +Ex/V = for derivation from V to anoter POS
- +Ex/A = for derivation from A to anoter POS
- +Ex/TV = change to other transitivity
-
+EX/IV = change to other transitivity
- +Der = This allows for Ex/… forms
- +Der/Nom = Derivation V > N: Nominalization
- +Der/NomNeg = Derivation V > N: Negative nominalization
- +Der/Priv = Derivation N > A: Privative adjective
- +Der/Poss = Derivation N > A: Possessive adjective, orig. genitive form without a head
- +Der/Pur = Derivation N > A:
- +Der/Rel = Derivation N > A: Relational adjective
- +Der/Caus = Derivation V > V: Causative
- +Der/Refl = Derivation V > V: Reflexive
- +Der/MWN = Modifier without noun (better: +A+Sg+Nom etc.)
All non-positional derivations should be preceded by this tag, to make it possible to target regular expressions at all derivations in a language-independent way: just specify +Der|+Der1 .. +Der5 and you are set.
- +Der
Abbreviated words are classified with:
- +ABBR = for abbreviations that (may) contain period
- +Symbol = independent symbols in the text stream, like £, €, ©
- +ACR = acronyms
Special symbols are classified with:
- +CLB = clause and sentence boundary symbols
- +PUNCT = other punctuation marks
- +LEFT = paired symbols
- +RIGHT +MIDDLE = paired symbols
The verbs are syntactically split according to transitivity:
- +TV =
- +IV =
Special multiword units are analysed with:
- +Multi =
Non-dictionary words can be recognised with:
- +Guess =
Homony tags
These are especially for verbs. Note that this is not a semantic distinction, we talk about paradigms deviating here and there in the inflection pattern.
- +Hom1 = First pattern (let us say -ам)
- +Hom2 = Second pattern (let us say -ем)
- +Hom3 = Third pattern (if it should exist + even more?)
- +Hom4 =
- +Hom5 =
- +Hom6 =
Usage tags
The Usage extents are marked using following tags:
- +Use/Marg marginal
- +Use/-PLX Excluded in PLX-speller
- +Use/SpellNoSugg recognized but not suggested in speller
- +Use/Circ circular paths (old ^C^)
- +Use/CircN circular paths for the numerals (old ^N^)
- +Use/NG not-generate, for ped generation isme-ped.fst
- +Use/MT Generate for MT only, for restricting analyses needed for MT generation not to pop up elsewhere
- +Use/NGminip Not for miniparadigm in VD dicts
- +Use/Disamb means that the following is only used in the analyser feeding the disambiguator
- +Use/GC only retained in the HFST Grammar Checker disambiguation analyser
- +Use/-PMatch Do not include in fsts made for hfst-pmatch
- +Use/TTS – only retained in the HFST Text-To-Speech disambiguation tokeniser
- +Use/-TTS – never retained in the HFST Text-To-Speech disambiguation tokeniser
-
+MWESplit Split point for MWE
- +Err/Orth = orthographical error (analysed, not accepted in speller)
- +Use/-Spell = accepted in normative FST but not in speller
- +Use/Test = Dealing with lative form 2012-10-27 аваеш, пашаш
Semantic tags
- +Sem/Act = Activity
- +Sem/Amount = Amount
- +Sem/Ani = Animate
- +Sem/Aniprod = Animal Product
- +Sem/Body = Bodypart
- +Sem/Body-abstr = siellu, vuoig?a, jierbmi
- +Sem/Build = Building
- +Sem/Build-part = Part of Bulding, like the closet
- +Sem/Cat = Category
- +Sem/Clth = Clothes
- +Sem/Clth-jewl = Jewelery
- +Sem/Clth-part = part of clothes, boallu, sávdnji…
- +Sem/Ctain = Container
- +Sem/Ctain-abstr = Abstract container like bank account
- +Sem/Ctain-clth
- +Sem/Curr = Currency like dollár, Not Money
- +Sem/Dance = Dance
- +Sem/Dir = Direction like GPS-kursa
- +Sem/Domain = Domain like politics, reindeerherding (a system of actions)
- +Sem/Drink = Drink
- +Sem/Dummytag = Dummytag
- +Sem/Edu = Educational event
- +Sem/Event = Event
- +Sem/Feat = Feature, like Árvu
- +Sem/Feat-phys = Physiological feature, ivdni, fárda
- +Sem/Feat-psych = Psychological feauture
- +Sem/Feat-measr = Psychological feauture
- +Sem/Fem = Female name
- +Sem/Food = Food
- +Sem/Food-med = Medicine
- +Sem/Furn = Furniture
- +Sem/Game = Game
- +Sem/Geom = Geometrical object
- +Sem/Group = Animal or Human Group
- +Sem/Hum = Human
- +Sem/Hum-abstr = Human abstract
- +Sem/Ideol = Ideology
- +Sem/Lang = Language
- +Sem/Mal = Male name
- +Sem/Mat = Material for producing things
- +Sem/Measr = Measure
- +Sem/Money = Has to do with money, like wages, not Curr(ency)
- +Sem/Obj = Object
- +Sem/Obj-clo = Cloth
- +Sem/Obj-cogn = Cloth
- +Sem/Obj-el = (Electrical) machine or apparatus
- +Sem/Obj-ling = Object with something written on it
- +Sem/Obj-rope = flexible ropelike object
- +Sem/Obj-surfc = Surface object
- +Sem/Org = Organisation
- +Sem/Part = Feature, oassi, bealli
- +Sem/Perc-cogn = Cognative perception
- +Sem/Perc-emo = Emotional perception
- +Sem/Perc-phys = Physical perception
- +Sem/Perc-psych = Physical perception
- +Sem/Plant = Plant
- +Sem/Plant-part = Plant part
- +Sem/Plc = Place
- +Sem/Plc-abstr = Abstract place
- +Sem/Plc-elevate = Place
- +Sem/Plc-line = Place
- +Sem/Plc-water = Place
- +Sem/Pos = Position (as in social position job)
- +Sem/Process = Process
- +Sem/Prod = Product
- +Sem/Prod-audio = Audio product
- +Sem/Prod-cogn = Cognition product
- +Sem/Prod-ling = Linguistic product
- +Sem/Prod-vis = Visual product
- +Sem/Rel = Relation
- +Sem/Route = Name of a Route
- +Sem/Rule = Rule or convention
- +Sem/Semcon = Semantic concept
- +Sem/Sign = Sign (e.g. numbers, punctuation)
- +Sem/Sport = Sport
- +Sem/State =
- +Sem/State-sick = Illness
- +Sem/Substnc = Substance, like Air and Water
- +Sem/Sur = Surname
- +Sem/Symbol = Symbol
- +Sem/Time = Time
- +Sem/Tool = Prototypical tool for repairing things
- +Sem/Tool-catch = Tool used for catching (e.g. fish)
- +Sem/Tool-clean = Tool used for cleaning
- +Sem/Tool-it = Tool used in IT
- +Sem/Tool-measr = Tool used for measuring
- +Sem/Tool-music = Music instrument
- +Sem/Tool-write = Writing tool
- +Sem/Txt = Text (girji, lávlla…)
- +Sem/Veh = Vehicle
- +Sem/Wpn = Weapon
- +Sem/Wthr = The Weather or the state of ground
Multiple Semantic tags:
- +Sem/Act_Group =
- +Sem/Act_Plc =
- +Sem/Act_Route =
- +Sem/Amount_Build =
- +Sem/Amount_Semcon =
- +Sem/Ani_Body-abstr_Hum =
- +Sem/Ani_Build =
- +Sem/Ani_Build-part =
- +Sem/Ani_Build_Hum_Txt =
- +Sem/Ani_Group =
- +Sem/Ani_Group_Hum =
- +Sem/Ani_Hum =
- +Sem/Ani_Hum_Plc =
- +Sem/Ani_Hum_Time =
- +Sem/Ani_Plc =
- +Sem/Ani_Plc_Txt =
- +Sem/Ani_Time =
- +Sem/Ani_Veh =
- +Sem/Aniprod_Hum =
- +Sem/Aniprod_Obj-clo =
- +Sem/Aniprod_Perc-phys =
- +Sem/Aniprod_Plc =
- +Sem/Body-abstr_Prod-audio_Semcon =
- +Sem/Body_Body-abstr =
- +Sem/Body_Clth =
- +Sem/Body_Food =
- +Sem/Body_Group_Hum =
- +Sem/Body_Hum =
- +Sem/Body_Mat =
- +Sem/Body_Measr =
- +Sem/Body_Obj_Tool-catch =
- +Sem/Body_Plc =
- +Sem/Body_Time =
- +Sem/Build-part_Plc =
- +Sem/Build_Build-part =
- +Sem/Build_Clth-part =
- +Sem/Build_Edu_Org =
- +Sem/Build_Event_Org =
- +Sem/Build_Org =
- +Sem/Build_Route =
- +Sem/Clth-jewl_Curr =
- +Sem/Clth-jewl_Money =
- +Sem/Clth-jewl_Plant =
- +Sem/Clth_Hum =
- +Sem/Ctain-abstr_Org =
- +Sem/Ctain-clth_Plant =
- +Sem/Ctain-clth_Veh =
- +Sem/Ctain_Feat-phys =
- +Sem/Ctain_Furn =
- +Sem/Ctain_Tool =
- +Sem/Ctain_Tool-measr =
- +Sem/Curr_Org =
- +Sem/Dance_Org =
- +Sem/Dance_Prod-audio =
- +Sem/Domain_Food-med =
- +Sem/Domain_Prod-audio =
- +Sem/Edu_Event =
- +Sem/Edu_Group_Hum =
- +Sem/Edu_Mat =
- +Sem/Edu_Org =
- +Sem/Event_Food =
- +Sem/Event_Hum =
- +Sem/Event_Plc =
- +Sem/Event_Time =
- +Sem/Feat-phys_Tool-write =
- +Sem/Feat-phys_Veh =
- +Sem/Feat-phys_Wthr =
- +Sem/Feat-psych_Hum =
- +Sem/Feat_Plant =
- +Sem/Food_Perc-phys =
- +Sem/Food_Plant =
- +Sem/Game_Obj-play =
- +Sem/Geom_Obj =
- +Sem/Group_Hum =
- +Sem/Group_Hum_Org =
- +Sem/Group_Hum_Plc =
- +Sem/Group_Hum_Prod-vis =
- +Sem/Group_Org =
- +Sem/Group_Sign =
- +Sem/Group_Txt =
- +Sem/Hum_Lang =
- +Sem/Hum_Lang_Plc =
- +Sem/Hum_Lang_Time =
- +Sem/Hum_Obj =
- +Sem/Hum_Org =
- +Sem/Hum_Plant =
- +Sem/Hum_Plc =
- +Sem/Hum_Tool =
- +Sem/Hum_Veh =
- +Sem/Hum_Wthr =
- +Sem/Lang_Tool =
- +Sem/Mat_Plant =
- +Sem/Mat_Txt =
- +Sem/Measr_Time =
- +Sem/Money_Obj =
- +Sem/Money_Txt =
- +Sem/Obj-play =
- +Sem/Obj-play_Sport =
- +Sem/Obj_Semcon =
- +Sem/Clth-jewl_Org =
- +Sem/Org_Rule =
- +Sem/Org_Txt =
- +Sem/Org_Veh =
- +Sem/Part_Prod-cogn =
- +Sem/Perc-emo_Wthr =
- +Sem/Plant_Plant-part =
- +Sem/Plant_Tool =
- +Sem/Plant_Tool-measr =
- +Sem/Plc-abstr_Rel_State =
- +Sem/Plc-abstr_Route =
- +Sem/Plc_Pos =
- +Sem/Plc_Route =
- +Sem/Plc_Substnc =
- +Sem/Plc_Substnc_Wthr =
- +Sem/Plc_Time =
- +Sem/Plc_Tool-catch =
- +Sem/Plc_Wthr =
- +Sem/Prod-audio_Txt =
- +Sem/Prod-cogn_Txt =
- +Sem/Semcon_Txt =
- +Sem/Obj_State =
- +Sem/Substnc_Wthr =
- +Sem/Time_Wthr =
Semantics are classified with
Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.
- +V→N =
- +V→V =
- +V→A =
- +N→A =
- +Der/xxx =
- +Der/mO =
Morphophonology To represent phonologic variations in word forms we use the following symbols in the lexicon files:
- %{аы%} Stem-final vowel variation when stress falls on non-final vowel word-final е and presuffix ы
- %{еы%} Stem-final vowel variation when stress falls on non-final vowel word-final е and presuffix ы
- %{оы%} Stem-final vowel variation when stress falls on non-final vowel
- %{ӧы%} Stem-final vowel variation when stress falls on non-final vowel
- %{яы%} Stem-final vowel variation when stress falls on non-final vowel preceded by ь
- %{еоыӧØ%} PxSg3 final
-
%{ыØ%} PxSg3 onset
-
%{ьØ%} for -ам verbs Prt1 Sg1, Sg2, Sg3, Pl3 л н
- {aä} for vowel harmony
- {oö} for vowel harmony
- {uü} for vowel harmony
- е1 =
- а1 =
- и1 =
- у1 =
- ӱ1 =
-
я1 =
- Е1 = lative
- Е2 =
- А2 =
- Ы1 = stem-onset archi-vowel
- Ы2 =
- з2 = for возаш : воч
- к2 кочк- коч# “eat/есть” мушк- муш “wash/мыть”
- н2 шинч- шич# “sit down/сесть”
- т2 лект- лек# “leave/ уходить”
- %> =
- +TEST =
And following triggers to control variation
- %^V2IMPRT for -ем verbs in й
-
%^END for -ам verb final, i.e. Imprf
- {front}
- {back}
- X1 =
- X2 =
- X3 =
- X4 =
- X5 =
- X6 =
- X7 =
- X8 =
- X9 =
- Z1 =
- Z2 =
-
%-
- %^VoTrigger for use with acronyms after hyphen о у ё ю О У Ё Ю
- %^VeTrigger for use with acronyms after hyphen а е и э я А Е И Э Я
- %^VOTrigger for use with acronyms after hyphen ӧ ӱ Ӧ Ӱ
-
%^Sonorant for use with acronyms after hyphen Л М Н Р Ҥ -
%^Obstruent for use with acronyms after hyphen С Ф Ъ Ь
Symbols that need to be escaped on the lower side (towards twolc):
- »
- «
-
(escaped with square brackets, to avoid collision with > as morpheme boundary)
-
< (escaped with square brackets, to avoid collision with < as morpheme boundary)
- +Cmp = nouns
- +Cmp/Hyph = nouns
- +Cmp/SoftHyph = nouns
- +Cmp/SplitR = nouns
Flag diacritics
We have manually optimised the structure of our lexicon using following flag diacritics to restrict morhpological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
| Flag diacritic | Explanation |
|---|---|
| @U.number.one@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.two@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.three@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.four@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.five@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.six@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.seven@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.eight@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.nine@ | Flag used to give arabic numerals in smj different cases ; |
| @U.number.zero@ | Flag used to give arabic numerals in smj different cases ; |
The Root lexicon
@U.number.zero@ Here it all starts
The word forms in Meadow Mari language start from the lexeme roots of
the following basic word classes:
- adjectives ;
adverbs ;
conjunctions ;
dependents ;
interjections ;
nouns ;
numbers ;
particles ;
postpositions ;
pronouns ;
pronouns_not_from_xml ;
propernouns ;
propernouns-toponyms ;
verbs ;
verbs_not_from_xml ;
Abbreviation ;
Acronym ;
Numeral ;
Punctuation ;
Symbols ; - Exceptions ;
NUM-COLL_ ;
urj-Cyrl-ProperNouns ; s ProperNoun-mhr ; specifically Mari names N_NEWWORDS ; new nouns to be added
Continuation lexica
Here comes a set of ragbag continuation lexica.
- LEXICON ADP_ TODO: why +WORK?
-
LEXICON CONJ_ TODO: why +WORK? All CONJ_ should be identified as either CC or CS or both, work in progress
-
LEXICON CC_ conjunctinos
-
LEXICON CS_ subjunctions
-
LEXICON DESCR_ = descriptive something
-
LEXICON DESCR-AUD_ these are audible, others may be visible or otherwise sensed, but for now just calling them Interj+Descr should suffice
-
LEXICON AD-A also adverbs
-
LEXICON INTERJ_ interjections
-
LEXICON Puh-a/e XXX do not know
-
LEXICON Puh XXX do not know
-
LEXICON PCLE_ particles, check these
-
LEXICON X for N attributes
- LEXICON ENDLEX = and here it ends with the ^END symbol.
This (part of) documentation was generated from src/fst/morphology/root.lexc
src-fst-morphology-stems-acronyms.lexc.md
Eastern Mari acronym file
Here is the list of lexicalised Sem/Org acronym proper nouns These are also generated by the Acrogenerator
This (part of) documentation was generated from src/fst/morphology/stems/acronyms.lexc
src-fst-morphology-stems-exceptions.lexc.md
NOUNS
KIN TERMS
Single-syllable nouns in У Ӱ Ю
VERBS
This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc
src-fst-morphology-stems-mhr-propernouns.lexc.md
The Meadow and Eastern Mari proper noun lexicon
MARI-LIKE NAMES
PLACE NAMES
This (part of) documentation was generated from src/fst/morphology/stems/mhr-propernouns.lexc
src-fst-morphology-stems-nouns_newwords.lexc.md
This is where new words are added as lexc entries before they are added to the xml source files. автор:автор N_ “(eng) /(fin) /(rus) “ ;
ADD NOUNS BELOW
PROPER NAMES
This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc
src-fst-morphology-stems-numerals.lexc.md
Meadow & Eastern Mari numerals
The initial lexica
-
LEXICON Numeral initial lexica
-
LEXICON ARABIC arabic numerals
-
LEXICON ARABICLOOP
-
LEXICON ARABICLOOPORD_Back ordinals
-
LEXICON ARABICLOOPORD_Front ordinals
-
LEXICON ARABICLOOPORD_FrontRound ordinals
-
LEXICON ARABICDELIMITERORD_Back ordinals
-
LEXICON ARABICDELIMITERORD_Front ordinals
-
LEXICON ARABICDELIMITERORD_FrontRound ordinals
The Roman numerals ! —————— !
-
LEXICON ROMAN roman numerals
-
LEXICON ROM-THOUSAND
-
LEXICON ROM-THOUSAND-TAG
-
LEXICON ROM-HUNDRED
-
LEXICON ROM-HUNDRED-TAG
-
LEXICON ROM-TEN
-
LEXICON ROM-TEN-TAG
-
LEXICON ROM-ONE
-
LEXICON ROM-ONE-TAG
-
LEXICON ROM-SPLIT
-
LEXICON 2ROMAN
-
LEXICON 2ROM-THOUSAND
-
LEXICON 2ROM-THOUSAND-TAG
-
LEXICON 2ROM-HUNDRED
-
LEXICON 2ROM-HUNDRED-TAG
-
LEXICON 2ROM-TEN
-
LEXICON 2ROM-TEN-TAG
-
LEXICON 2ROM-ONE
-
LEXICON 2ROM-ONE-TAG
-
LEXICON ROMNUMTAG
- LEXICON ARABICCASEORD_Back ordinals Is this then becoming +Ex/A?
- LEXICON ARABICCASEORD_Front ordinals
- LEXICON ARABICCASEORD_FrontRound ordinals
This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc
src-fst-phonetics-txt2ipa.xfscript.md
retroflex plosive, voiceless t ʈ 0288, 648 ( = ASCII 096)
retroflex plosive, voiced d ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628
bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614
alveolar lateral fricative, vl. K alveolar lateral fricative, vd. K\
labiodental approximant P (or v) alveolar approximant r\ retroflex approximant r` velar approximant M\
retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L
Clicks
bilabial O\ (O = capital letter)
dental |
(post)alveolar !\
palatoalveolar =\
alveolar lateral ||
Ejectives, implosives
ejective > e.g. ejective p p> implosive < e.g. implosive b b< Vowels
close back unrounded M close central unrounded 1 close central rounded } lax i I lax y Y lax u U
close-mid front rounded 2 close-mid central unrounded @\ close-mid central rounded 8 close-mid back unrounded 7
schwa ə @
open-mid front unrounded E open-mid front rounded 9 open-mid central unrounded 3 open-mid central rounded 3\ open-mid back unrounded V open-mid back rounded O
ash (ae digraph) { open schwa (turned a) 6
open front rounded & open back unrounded A open back rounded Q Other symbols
voiceless labial-velar fricative W voiced labial-palatal approx. H voiceless epiglottal fricative H\ voiced epiglottal fricative <\ epiglottal plosive >\
alveolo-palatal fricative, vl. s\ alveolo-palatal fricative, voiced z\ alveolar lateral flap l\ simultaneous S and x x\ tie bar _ Suprasegmentals
primary stress “
secondary stress %
long :
half-long :\
extra-short _X
linking mark -
Tones and word accents
level extra high _T level high _H level mid _M level low _L level extra low _B downstep ! upstep ^ (caret, circumflex)
contour, rising contour, falling _F contour, high rising _H_T contour, low rising _B_L
contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic
marks can alternatively be placed in a separate tier, set off
by < >, as recommended for the next two symbols.)
global rise
voiceless 0 (0 = figure), e.g. n_0 voiced _v aspirated _h more rounded _O (O = letter) less rounded _c advanced _+ retracted _- centralized _” syllabic = (or _=) e.g. n= (or n=) non-syllabic _^ rhoticity `
breathy voiced _t creaky voiced _k linguolabial _N labialized _w palatalized ‘ (or _j) e.g. t’ (or t_j) velarized _G pharyngealized _?\
dental d apical _a laminal _m nasalized ~ (or _~) e.g. A~ (or A~) nasal release _n lateral release _l no audible release _}
velarized or pharyngealized _e velarized l, alternatively 5 raised _r lowered _o advanced tongue root _A retracted tongue root _q
This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript
src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md
We describe here how abbreviations are in Eastern Mari are read out, e.g. for text-to-speech systems.
For example:
- s.:syntynyt # ;
- os.:omaa% sukua # ;
- v.:vuosi # ;
- v.:vuonna # ;
- esim.:esimerkki # ;
- esim.:esimerkiksi # ;
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc
src-fst-transcriptions-transcriptor-symbols2text.lexc.md
This file contains mappings from abbreviations and some acronyms to full forms for text-to-speech purposes. This is a supplement to the analyser; the analyser must tag the strings as +ABBR or similar for the transcriptions to work. The resulting full form must be lemmas known to the analyser, for further processing.
We describe here how abbreviations in Eastern Mari are read out, for text-to-speech systems.
The file contains:
-
miscellaneous symbols
-
smileys
-
Clause boundary symbols
-
Single punctuation marks
-
Paired punctuation marks
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-symbols2text.lexc
tools-grammarcheckers-grammarchecker.cg3.md
M E A D O W M A R I G R A M M A R C H E C K E R
DELIMITERS
The delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent
The Tags section lists all the tags inherited from the fst, and defines them for use in the syntactic analysis. The tags are documented in the root.lexc file, and here only listed for reference.
The next section, Sets, contains sets defined on the basis of the tags listed here, those set names are not visible in the output.
Tags
Beginning and end of sentence
BOS EOS
Clause boundary
Parts of speech tags
N V A Adv CC CS Interj Pron Num Pcle Clt Po
ABBR ACR
Punctuation marks
CLB LEFT RIGHT WEB LEFT RIGHT because of apertium
WORD is the set of all POS
Verbal tense and mood tags
Prs Prt1 Prt2 Fut Imprt Ind Cond Des
Other verbal tags
Act ConNeg FutPrc Ger Inf Nec Neg NegPrc Pass Prc PrfPrc
Verbal person-number tags Sg1 Sg2 Sg3 Pl1 Pl2 Pl3
Numeral tags
Sg Pl
Case tags
Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)
Other nominal tags
Pers Refl Rel Interr Recipr Dem ABBR
Adjective comparison tags
Pos (?) Superl Comp
Attr
Possessive suffix tags
PxSg1 PxSg2 PxSg3 PxPl1 PxPl2 PxPl3
Numeral tags
Card Coll Ord Temp (?)
Derivation tags
Der/MWN Der/sa
Particles
Qst Foc
Tags for internal testing
CmpTest Err
Sets
- CASE = all cases
- OBLCASE = All cases except Nom
- VFIN = All moods
Grammarchecker rules begin here
Grammarchecker sets
Grammarchecker rules
Speller rules
Agreement rules
Negation verb rules
Postposition rules
NP internal rules
Punctuation rules
This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3
tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md
Tokeniser for mhr
Usage:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
- Punct contains ASCII punctuation marks
- The symbol after m-dash is soft-hyphen
U+00AD - The symbol following {•} is byte-order-mark / zero-width no-break space
U+FEFF.
Whitespace contains ASCII white space and the List contains some unicode white space characters
- En Quad U+2000 to Zero-Width Joiner U+200d’
- Narrow No-Break Space U+202F
- Medium Mathematical Space U+205F
- Word joiner U+2060
Apart from what’s in our morphology, there are
- unknown word-like forms, and
- unmatched strings
We want to give 1) a match, but let 2) be treated specially by
hfst-tokenise -aUnknowns are made of:- lower-case ASCII
- upper-case ASCII
- some cyrillic
- select extended latin symbols
- mhr specific alphabest ASCII digits
- select symbols
- Combining diacritics as individual symbols,
- various symbols from Private area (probably Microsoft), so far:
- U+F0B7 for “x in box”
Unknown handling
Unknowns are tagged ?? and treated specially with hfst-tokenise
hfst-tokenise –giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG, they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:
- known word in context
- unknown (OOV) token in context
- sequence of word and punctuation
- URL in context
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript
tools-tokenisers-tokeniser-disamb-gt-desc.thirties.pmscript.md
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just: $ make $ echo “ja, ja” | hfst-tokenise –giella-cg tokeniser-disamb-gt-desc.pmhfst
Issues:
- Ambiguous input
- Seems to work fine
- Ambiguous multiword expessions with ambiguous tokenisation
- Seems to work – represented within lexc now; hfst-tokenise also supports forms on the analyses now
- Ambiguous multiword expessions need reorganising after CG
- The module cg-mwesplit takes wordforms from readings and turns them into new cohorts
- Unknown words
- The set-difference method only works for words without flag diacritics (even though we should be working only on the form-side?) and leads to binary blow-up: With only lower unknowns, we get 45M; lower+upper gives 67M, while no unknowns gives 27M
- Fixed instead by treating empty analyses as unknown-tokens in hfst-tokenise, and outputting unmatched strings with a prefix
- Treat input that’s within superblanks as unmatched
- probably requires a change in hfst-tokenise itself
- Try >1 space for ambiguous MWE’s? – represented within lexc now
- Try set-difference-unknowns method with regular hfst commands?
More usage examples: $ echo “Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid.” | hfst-tokenise –giella-cg tokeniser-disamb-gt-desc.pmhfst $ echo “(gáfe) ‘ja’ ja 3. ja? ц jaja ukjend "ukjend"” | hfst-tokenise –giella-cg tokeniser-disamb-gt-desc.pmhfst $ echo “márffibiillagáffe” | hfst-tokenise –giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
- lower-case ASCII
- upper-case ASCII
- some cyrillic
- select extended latin symbols
- mhr specific alphabest ASCII digits
-
select symbols
- Combining diacritics as individual symbols,
- various symbols from Private area (probably Microsoft), so far:
- U+F0B7 for “x in box”
TODO: Could use something like this, but built-in’s don’t include šžđčŋ:
Unknown handling
Unknowns are tagged ?? and treated specially with hfst-tokenise
hfst-tokenise –giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG, they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.thirties.pmscript
tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md
Grammar checker tokenisation for mhr
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
- Punct contains ASCII punctuation marks
- The symbol after m-dash is soft-hyphen
U+00AD - The symbol following {•} is byte-order-mark / zero-width no-break space
U+FEFF.
Whitespace contains ASCII white space and the List contains some unicode white space characters
- En Quad U+2000 to Zero-Width Joiner U+200d’
- Narrow No-Break Space U+202F
- Medium Mathematical Space U+205F
- Word joiner U+2060
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
- lower-case ASCII
- upper-case ASCII
- some cyrillic
- select extended latin symbols
- mhr specific alphabest ASCII digits
-
select symbols
- Combining diacritics as individual symbols,
- various symbols from Private area (probably Microsoft), so far:
- U+F0B7 for “x in box”
TODO: Could use something like this, but built-in’s don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise –giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:
- known word in context
- unknown (OOV) token in context
- sequence of word and punctuation
- URL in context
This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript
tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md
TTS tokenisation for smj
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just:
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
- Punct contains ASCII punctuation marks
- The symbol after m-dash is soft-hyphen
U+00AD - The symbol following {•} is byte-order-mark / zero-width no-break space
U+FEFF.
Whitespace contains ASCII white space and the List contains some unicode white space characters
- En Quad U+2000 to Zero-Width Joiner U+200d’
- Narrow No-Break Space U+202F
- Medium Mathematical Space U+205F
- Word joiner U+2060
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
- select extended latin symbols
- select symbols
- various symbols from Private area (probably Microsoft), so far:
- U+F0B7 for “x in box”
TODO: Could use something like this, but built-in’s don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise –giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get
This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript