Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-mhr
All doc-comment documentation in one large file.
This dep file is for sma, sme, smj, sje.
Sentence delimiters are the following: <.> <!> <?> <…> <¶>
N V A Adv CC CS Inf Sup Neg Num Po Pr
Pcle Prop
Pron IV TV COMMA DASH CITATION to keep colouring we add a “ HYPHEN QMARK PUNCT LEFT RIGHT CLB Ind Pot Impr ImprtII Cond ConNeg Caus causative eus VGen Interj ABBR ACR Prs Prt Cmpnd RCmpnd PrfPrc PrsPrc Actor Actio Ger Indef Nom Acc Ill Com Gen Ess
IM For fao
Correction rules
muitalit
XX
XX
XX
faoSumId=Rel
lgRemove removes the language tags
This (part of) documentation was generated from src/cg3/dependency.cg3
This is the Eastern Mari disambiguation file. It chooses the correct morphological analyses in any given sentence context.
The file first defines sentence delimiters, tags and sets. Thereafter come the rules; each rule is listed below.
The delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent
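As a rough illustration of what the delimiters do, the sketch below cuts a stream of wordforms into sentence windows whenever one of the five delimiter wordforms listed above is seen. The function name `split_windows` and the placeholder wordforms are assumptions made for this example; the real segmentation is done by the CG-3 engine, not by code like this.

```python
# Delimiter wordforms as listed in the documentation above.
DELIMITERS = {"<.>", "<!>", "<?>", "<…>", "<¶>"}


def split_windows(wordforms):
    """Group wordforms into windows, closing a window after each delimiter."""
    windows, current = [], []
    for wordform in wordforms:
        current.append(wordform)
        if wordform in DELIMITERS:
            windows.append(current)
            current = []
    if current:  # trailing material without a final delimiter
        windows.append(current)
    return windows


# Placeholder wordforms, not real Mari tokens.
stream = ["<a>", "<b>", "<.>", "<c>", "<?>", "<d>"]
windows = split_windows(stream)
```

Rules that scan "to the left" or "to the right" only see the cohorts inside one such window.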
The Tags section lists all the tags inherited from the fst, and defines them for use in the syntactic analysis. The tags are documented in the root.lexc file, and are listed here only for reference.
The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.
BOS EOS
N V A Adv CC CS Interj Pron Num Pcle Clt Po
WORD is the set of all POS
Prs Prt1 Prt2 Fut Imprt Ind Cond Des
Act ConNeg FutPrc Ger Inf Nec Neg NegPrc Pass Prc PrfPrc
Verbal person-number tags Sg1 Sg2 Sg3 Pl1 Pl2 Pl3
Sg Pl
Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)
Pers Refl Rel Interr Recipr Dem ABBR ACR
Pos (?) Superl Comp
PxSg1 PxSg2 PxSg3 PxPl1 PxPl2 PxPl3
Card Coll Ord Temp (?)
Qst Foc
CLB PUNCT LEFT RIGHT COMMA
Der/MWN Der/sa Der/Pur Der/Caus Der/Nom
CmpTest Err
Der/Date Der/Year Der/Hum Der/Lang Der/Domain Der/Feat-phys Der/Clth Der/Body Der/Act
Sem/Ani Sem/Fem Sem/Group Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt
onlyteve gives Pcle if two теве
PcleNotCC The clause begins with a Pcle
CCnotInterj
Posna деч посна
*InterrQ if question mark anywhere to the right
*Interr removes Rel if question mark to the right somewhere
Existential ulo
Infinitives
Ind selects Ind if no Ind to the right or to the left
1SgAgr selects (Ind Sg1) if Pron1Sg to the right or to the left
1SgAgr selects (Ind Sg1) if (Pron1Sg Nom) to the right or to the left
2SgAgr selects (Ind Sg2) if Pron2Sg to the right or to the left
IndAfterInf selects Ind if Inf to the left
NotImpWhenInd
NotImpWhenWords1
NotImpWhenWords2
*RemAdjBeforeProp removes A if Prop to the left
*AdjBeforeMo selects A if Interr to the right
AdjBeforeAN selects A if N or A to the right
RemN removes N if N to the right
*AdjBeforeConjAdj selects A if conjunction and A to the right
AdjNotAdv removes Adv if N to the right
AdjNotPron removes Pron Pers if N to the right
*AdjNotN removes N if Pron Pers anywhere to the left
*RemAdj2 removes A if no N or Pron in a clause
*RemNomIfPronLeft removes Nom if Pron Nom anywhere to the left
*RemNomIfPronRight removes Nom if Pron Nom anywhere to the right
*NomBeforeConjNom selects N Nom if conjoined with N Nom
*NafterDem selects N if Dem to the left (demonstratives tend to be sole modifiers)
*NotANoun
*NafterAbeforeEOS
*RemNafterAdv removes N if adverb to the left
RemDerMWN1 removes Der/MWN if N is an option
RemDerMWN2 removes Der/MWN if N to the right
Dersa if noun follows
SelDerMWN select Der/MWN if no noun follows
RemNomNif12left removes Nom with N if there is a verb with 1st or 2nd agreement to the left
RemNomNif12right removes Nom with N if there is a verb with 1st or 2nd agreement to the right
AccNeedsVerb prefers Nom (TODO: does this make sense? SASHA: it does but there was a typo, -1* instead of 1* in the third clause of the condition)
IkNumAN ik is num before A N Sg
NotImp in most тиде cases
NotInterr if Rel
Dem if noun follows
уке
Phrases
*RemGer removes Ger Gen if there is no verb to the right
FinNotGer removes Ger if there is a Ind Prt2 Sg3 in the clause
GerNotFin Ger if there is a Ind next
GerNotFin Ger if there is a Ger da Ger VFin
Sg1NotSg3 removes Prt1 Sg3 when Pers Sg1 Nom in same clause
da1 Adv initially
da2 CC elsewhere
AifVövny selects A if вӧвны somewhere to the left
NotPcle
NoErrOrth
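The rule descriptions above follow two patterns: "selects X if ..." and "removes X if ...". The sketch below is a simplified Python model of these two operations, using 1SgAgr as the example. It assumes a cohort is a list of readings and a reading is a set of tags; the helper names and the tag set used for the first person singular pronoun are assumptions for this example and are not taken from disambiguator.cg3.

```python
def has_tags(reading, tags):
    """True if the reading carries every tag in `tags`."""
    return set(tags) <= reading


def select(cohort, tags):
    """Keep only readings with `tags`; leave the cohort alone if none match."""
    kept = [r for r in cohort if has_tags(r, tags)]
    return kept if kept else cohort


def remove(cohort, tags):
    """Drop readings with `tags`, but never remove the last reading."""
    kept = [r for r in cohort if not has_tags(r, tags)]
    return kept if kept else cohort


def one_sg_agr(sentence):
    """1SgAgr: select (Ind Sg1) if a 1Sg pronoun occurs to the left or right."""
    def is_pron_1sg(cohort):
        # Assumed stand-in for the Pron1Sg set.
        return any(has_tags(r, {"Pron", "Pers", "Sg1"}) for r in cohort)

    result = []
    for i, cohort in enumerate(sentence):
        others = sentence[:i] + sentence[i + 1:]
        if any(is_pron_1sg(c) for c in others):
            cohort = select(cohort, {"Ind", "Sg1"})
        result.append(cohort)
    return result


# Two cohorts: an unambiguous pronoun and a verb with two competing readings.
sentence = [
    [{"Pron", "Pers", "Sg1", "Nom"}],
    [{"V", "Ind", "Prs", "Sg1"}, {"V", "Imprt", "Sg2"}],
]
disambiguated = one_sg_agr(sentence)
```

The guard in `select` and `remove` mirrors the general Constraint Grammar principle that a cohort always keeps at least one reading.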
This (part of) documentation was generated from src/cg3/disambiguator.cg3
SYNTACTIC FUNCTIONS FOR SÁMI
Sámi language technology project 2003-2024, University of Tromsø
This file adds syntactic functions. It is common for all the Saami
LEFT RIGHT because of apertium
Sets for POS sub-categories
Sets for Semantic tags
Sets for Morphosyntactic properties
!!Syntactic tags
!!Tag sets
** V is all readings with a V tag in them, REAL-V should be the ones without an N tag following the V. The REAL-V set thus awaits a fix to the preprocess V … N bug.
The set COPULAS is for predicative constructions
NP sets defined according to their morphosyntactic features
The PRE-NP-HEAD family of sets
These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
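The barrier scan can be sketched in Python as follows. This is a simplified model with one set of tags per cohort, not the CG-3 implementation. The contents of `NPMOD` are a hypothetical stand-in for the premodifier sets, and checking the target before the barrier at each position is an assumption of this sketch.

```python
# Hypothetical premodifier tags; the real PRE-NP-HEAD sets are defined in functions.cg3.
NPMOD = {"A", "Num", "Pron", "Adv"}


def is_noun(cohort):
    return "N" in cohort


def not_npmod(cohort):
    """Model of NOT-NPMOD: any word that is not an NP premodifier."""
    return not (cohort & NPMOD)


def scan_right(cohorts, start, target, barrier):
    """Model of (*1 target BARRIER barrier): index of the first match, or None."""
    for i in range(start + 1, len(cohorts)):
        if target(cohorts[i]):  # assumption: target is tested before the barrier
            return i
        if barrier(cohorts[i]):
            return None
    return None


cohorts = [{"V"}, {"A"}, {"Num"}, {"N"}, {"V"}, {"N"}]
head_from_verb = scan_right(cohorts, 0, is_noun, not_npmod)
head_from_noun = scan_right(cohorts, 3, is_noun, not_npmod)
```

Starting from the first verb, the scan skips the adjective and the numeral and stops at the noun. Starting from that noun, the following verb acts as a barrier, so the second noun is not reached.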
Miscellaneous sets
Border sets and their complements
ADLVCASE
These were the set types.
!!Numeral outside the sentence
!!HABITIVE MAPPING
hab1 hab aux leat
hab_numo1 hab copula comma comma N+Nom
hab_numo2 copula nu mo/go hab
leahab copula nu mo/go hab
hab2 hab auxv adv leat
hab3 (
hab3 (
hab3 (
hab3 (
hab_main (
habInf hab lea inf
habNomLeft Nom or Num + gen hab lea
habAdvl Ii han ovttasge du sogas leat dat namma.
hab4 hab cc hab leat
hab6 lea go hab – leago hab
hab7 lea go hab
hab5 This is not HAB Mánás gollot gieđat.
hab9 prop ord-hab leat
hab10 prop ord-hab leat
habDain2
habRel # before relative clause
habEllipse Buot gánddain lea dreassa, nieiddain fas gákti.
habGen (
habGenQst (
habRefl # with inf
n<titel1 (@N<) for (“jr”) or (“sr”); if first one to the left is Prop
n<titel2 (@N<) for INITIAL; if first one to the left is a noun, or if to the left of you is a single letter which is part of a noun conjunction ‘‘bustávas e ja f gáibiduvvo’’
n<:com (@N<) for (Sg Com); if first one to the left is Coll
>nAttr (@>N) for Attr; if there is a noun to your right
n>Indef (Pron Indef Attr); if eará is to the right
n>Indef (Pron Indef Com); if eará is to the right
>nNum (@>N) for numerals if; there is a noun to your right. You are not allowed to be (Sg Nom), (Sg Acc) or (Sem/Date)
noun>n (@>N) for Gen; if there is a noun to your right. Restrictions: Not if you are: a time related word. Not if you are OKTA with Pl Loc to your right. Not if CC is to your right followed by another Gen and then Po. Not if you are HUMAN and to your right is Actio Nom followed by a noun.
>nTime (@>N) for Gen TIME-N; if timenoun to your right. Restrictions: Not if you are an OKTA Nom with Pl Loc to your right. Not if CC followed by Gen, followed by Po to your right. Not if COMMA to your right
>ntittel (@>N) for (Sg Nom TIME-N) or (Nom Der/NomAg); if to your right is Sem/Mal, Sem/Fem, Sem/Sur
>nplc (@>N) for (Sg Nom Prop Sem/Plc), if to your right is Sem/Plc
>nALU (@>N) for Sg Acc numerals; when a measure-noun to the right
>NTime (@>N) for Gen; if you are TIME-N with BOC to your left, and PREGEN to your right
n<:Refl (@N<) for (Refl Nom); if to the left is (N Nom), or if first one to the left is a finite mainverb with a (N Nom) to the left
>pron1 (@>Pron) for GRADE-ADV, DUSSE, BUOT if; first one to the right is Pron
>pron2 (@>Pron) for (Refl Nom) if; first one to the right is Refl
>pron3 (@>Pron) for (Pron Recipr) if; first one to the right is (Pron Recipr)
vaikko (@>Pron) for vaikko if; first one to the right is Indef
vaikkoman (@>ADVL) for vaikko if; first one to the right is man
dasmaŋŋel (@>ADVL) for vaikko if; first one to the right is man
adv>advl (@>ADVL)
adv>advl (@>ADVL)
BOSvoc (@VOC) for HUMAN Nom; if sentence initial. To the right is comma. No nom-cased HUMAN followed by comma or CC is allowed to the right. There should not be a relative clause to the right, because then you are likely to be SUBJ
voc (@VOC) for Nom HUMAN; if comma to the left and a second person verb or pronoun to the left. To the right is the end of the sentence
__Particle<subj __ (@PCLE)
spred<obj (@SPRED<OBJ) for Acc; the object of an SPRED. Not to be mistaken for OPRED. If SPRED is to the left, and copulas is to the left of it. Nom or Hab are found sentence initially.
Hab<subj (
Hab<subj (
Hab>Advlcase<subj (
Nom>Advlcase<subj (
<extSubj (
<extSubj (
<extSubjA (
<extSubj (
<extSubj (
loc<extSubj (
<spred (@<SPRED) for Nom; if Nom to the left, copulas to the left of Nom, and a time related word to the left of it.
<extQst1 (
<extQst2 (
extQst3> (
extQst3> (
<extsubjcoor (
Sem/Year
<spredQst (@<SPRED) for Nom; in a typical question sentence; You are not allowed to be Pers or human. The special part is that Nom is not allowed to your right
<spredQst2 (@<SPRED) for (A Nom); in a typical question sentence; You are SPRED if (N Nom) is to your left and leat + qst is to the left
<spredQst3 (@<SPRED) for (A Nom); you are SPRED when you are (A Nom) and to your right is (N Nom). This is a Qst-sentence, so copulas is found to your left
<spredQst4 (@<SPRED) for Nom; but only in a qst-sentence where there is no chance of you being the subj
<NomBeforeSpred (@<SPRED) for (A Nom) if; Nom to the left, and copulas is to the left of Nom. There is no Nom allowed to the right of copulas! To avoid messing with coordination: ja, dahje and comma are not allowed to your left. Comma is not allowed to your right; if so then you are likely to be coordinated
<spred (@<SPRED) for A Nom or N Nom if; the subject Nom is on the same side of copulas as you: on the right side of copulas
<spredVeara (@<SPRED) for veara + Nom; if genitive immediately to the right, and intransitive mainverb to the right of genitive
leftCop<spred (@<SPRED) for Nom; if copulas is the main verb to the left, and there is no Ess found to the left of cop (note that Loc is allowed between target and cop). OR: if you are Coll or Sem/Group with copulas to your left.
<spredLocEXPERIMENT (@<SPRED) for material Loc; if you are to the right of copulas, and the Nom to the left of copulas is not a hab-actor
NumTime (@<SPRED) for A Nom
<spredSg (@<SPRED) for Sg Nom
<spredPg (@<SPRED) for Pl Nom
<spred (@<SPRED) for Nom; if copulas to the left, and Nom or sentence boundary to the left of copulas. First one to the right is EOS.
COP<spredEss (@<SPRED) for N Ess
spredEss> (@SPRED>) for N Ess; if copulas to the right of you, and if an NP with nom-case first one to your left.
GalleSpred> (@SPRED>) for Num Nom; if sentence initial
spredSgMII> (@SPRED>)
spredšaddat> (@SPRED>)
r492> (@SPRED>) for Interr Gen; consisting only of negations. You are not allowed to be MII. You are not allowed to have an adjective or noun to your right. You are not allowed to have a verb to your right, the exception being an aux.
AdjSpredSg> (@SPRED>) for A Sg Nom; if copulas to the right, but not if A or @<SPRED are found to the right of copulas
Spred>SubjInf (@SPRED>) for Nom; if copulas to the right, and the subject of copulas is an Inf to the right
spredCoord (@<SPRED) coordination for Nom; only if there already is a SPRED to the left of CNP. Not if there is some kind of comparison involved.
subj>Sgnr1 (@SUBJ>) for Nom Sg, including Indef Nom if; VFIN + Sg3 or Pl3 to the right (VFIN not allowed to the left)
subj>Pl (@SUBJ>) for plural nominatives, including Coll and Sem/Group. VFIN + Pl3 to the right.
subj>Pl (@SUBJ>) for plural nominatives
subj>Sg (@SUBJ>) for Nom Sg; if VFIN + Sg3 to the right.
Sg<subj (@<SUBJ) for Nom Sg; if VFIN Sg3 or Du2 to the left (no HAB allowed to the left).
Du<subj (@<SUBJ) for Nom Coll if; a dual third person verb is found to the left
PlDu<subj (@<SUBJ) for (N Nom Pl), (Sem/Group Nom), (Coll Nom), (Pron Nom Pl) if; a verb is Pl3 or Du3 to your left. The verb is not allowed to be copulas with a place, Loc or time noun to its left
copPl3<subj (@<SUBJ) for Nom Pl; you don’t need to be a noun, only Nom Pl. To the left is copulas and first one to the right is @<SPRED
-fsubj> (@-FSUBJ>) for HUMAN Gen; in a NP-clause. To your right is Actio Nom followed by a noun
f<advl (@-F<ADVL) for infinite adverbials
f<advl (@-F<ADVL) for infinite adverbials
s-boundary=advl> (@ADVL>) for ADVL that resemble s-boundaries. Mainverb to the right.
diibmuadvl> (@ADVL>) for (diibmu Nom) if first one to the right is Num
-fsubj (@-FSUBJ>) for HUMAN Acc after DADJAT verbs
-fobj> (@-FOBJ>) for Acc if in front of abessive, gerundium, actio locative, perfectum participle or infinitive. First one to the right not allowed to be Acc though
-fobj> (@-FOBJ>) for Acc if human with ADVL-case to the left and transitive infinitive OBJ to the right. First one to the right not allowed to be Acc though
advl>mainV (@ADVL>) if; finite mainverb not found to the left, but the finite mainverb is found to the right.
V<advl (@<ADVL) if; finite mainverb found to the left. Not if a comma is found immediately to the left and a finite mainverb is located somewhere to the right of this comma.
advl>v (@ADVL>) if; you are ADVL, time-noun or Sem/Route and there is a finite verb to the right in the clause, or if to your right is: de followed by a finite verb. OR: if you are a time-noun and to your right is: go or sentenceboundary followed by a finite verb
advlPoPr> (@<ADVL) for Po or Pr; if mainverb to the right.
BOSPo> (@ADVL>) for Po; if trapped between BOS to the right and S-BOUNDARY OR COMMA to the left, because the main verb will then automatically be on your right side.
<advlComIll (@<ADVL) only if; you are Com OR Ill. To your left is a mainverb, and to your right a sentenceboundary, because we don’t want there to be another mainverb you potentially could belong to
<advlEOS (@<ADVL) for Po or Pr or Loc; if you are found at the very end of a sentence. A mainverb is needed to the left though.
<advlGen (@<ADVL) for (N Gen) if mainverb to the left and no noun to the right
<opredgohcodit (@<OPRED) for Ess
advlEss> (@<ADVL) for weather and time Ess, if FMAINV to the left.
comma<advlEOS (@<ADVL) for Adv if; mainverb is to the left. Comma to the left and mainverb to the right in the same clause is not allowed
advl>inbetween (@ADVL>) for Adv; if inbetween two sentenceboundaries where no mainverb is present.
comma<advlEOS (@<ADVL) for Adv if; comma found to the left and the finite mainverb to the left of comma. To the right is the end of the sentence.
BOSadvl> (@ADVL>) if; you are N Loc or N Ill and found sentence initially and there is a main verb somewhere to the right. No barrier for the mainverb; based on the thought that first one to your right is probably a sentenceboundary.
cleanupILL<advl (@<ADVL) for N Ill if; there are no boundarysymbols to your left, if you aren't already @N< OR @APP-N<, and no mainverb is to your left.
cleanupPo (@ADVL) for Po: This rule tags all Po:s as ADVL if they haven’t gotten a tag somewhere along the way.
cleanupPr (@ADVL) for Pr: This rule tags all Pr:s as ADVL if they haven’t gotten a tag somewhere along the way.
-fsubj>asAcc (@-FSUBJ>) for HUMAN Acc; if there is a verb @-F<OBJ to your left
-f<obj (@-F<OBJ) for Acc if there is a transitive verb + SYN-V to your left
-fsubj>IV (@-FSUBJ>) for Acc; if there is an IV-verb acting as a @-F<OBJ to your right
-fsubj>IV (@-FSUBJ>) for Acc; if there is an TV-verb acting as a @-F<OBJ to your right followed by an Acc
-fsubj>asGen (@-FSUBJ>) for Gen;
f<subj (@-F<SUBJ) for Nom if; (V @-F<OBJ) to the left.
<opredAAcc (@<OPRED) for A Acc; if another accusative to the left, and a transitive verb to the left of it. OR: if a transitive verb to the left, and an accusative to the left of it.
!sma object
<advlMeasr (@<ADVL) for (Num Acc); if finite IV-mainverb to the left, measure-noun to the right
<objMeasr (@<OBJ) for Num Acc; if finite TV-mainverb to the left, measure-noun to the right
<advlMeasr2 (@<ADVL) for MEASR-N + Acc; if (Num Pl) to the left and mainverb to the left of it
advlMeasr> (@ADVL>) for Num Acc;
Obj> (@OBJ>) for Acc; if there is a finite mainverb to the right in the clause. A really simple rule with no other restrictions.
s-boun<obj (@<OBJ) for Acc; if sentenceboundary to your left and a transitive mainverb further to the left
<objIV (@<OBJ) for Acc; if there is an intransitive mainverb in the clause. Not for Rel or Num. Not if you are a numeral followed by a measure-noun
<advlEss (@<ADVL) for ESS-ADVL if; FMAINV to the left
IV<spredEss (@<SPRED) for N Ess if; FMAINV to the left is intransitive or bargat
<opredEss (@<OPRED) for (N Ess), (A Ess) if; transitive mainverb to the left in the clause. If accusative to the left or to the right, or if Inf or ahte to the right, or if there is a noun to the right followed by an Inf
Acc<opredEss (@<OPRED) for (N Ess), (A Ess) if; transitive mainverb to the left in the clause, and an accusative cased Rel left to the verb
onlyV<opred (@<OPRED) for (N Ess) if; there is a transitive mainverb to the left. Usually there needs to be an Acc to the left, but here it is not needed
onlyV<opred2 (@<OPRED) for (N Ess) if;
!!SUBJ MAPPING - leftovers
subj>ifV (@SUBJ>) for NP-HEAD-NOM, DUPRON or Num + Nom if; a finite mainverb is found to the right. This is a cleanup rule for subjects
hnoun>ifV (@SUBJ>) for NP-HEAD-NOM, DUPRON if. The counterpart of subj>ifV. You are HNOUN if there is a finite verb to your right, but NOT if there is a finite verb after a relative clause
!!OBJ MAPPING - leftovers
!!
!!HNOUN MAPPING
! missingX adds @X to all missings
! therestX adds @X to everything that is left, often erroneously disambiguated forms
!!For Apertium: The analyser gives double analyses because of optional semtags. We go for the one with the semtag.
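The preference for the semtag reading can be sketched as below. This is a minimal Python model, assuming readings are lists of tag strings and that semantic tags are recognised by the `Sem/` prefix used in the tag lists of this documentation. The function names are made up for this example, and the actual rule in functions.cg3 may use further conditions.

```python
def has_semtag(reading):
    """True if any tag in the reading is a semantic tag (prefix Sem/)."""
    return any(tag.startswith("Sem/") for tag in reading)


def prefer_semtag(readings):
    """Keep the readings with a semtag when there are any; otherwise keep all."""
    with_sem = [r for r in readings if has_semtag(r)]
    return with_sem if with_sem else readings


# The same analysis once without and once with the optional semantic tag.
double = [["N", "Sg", "Nom"], ["N", "Sem/Plc", "Sg", "Nom"]]
chosen = prefer_semtag(double)
```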
This (part of) documentation was generated from src/cg3/functions.cg3
SYNTACTIC FUNCTIONS FOR SÁMI
Sámi language technology project 2003-2014, University of Tromsø
Here we remove special tags for MT
Here we remove semantic tags for all other words than proper nouns.
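A minimal Python sketch of this clean-up step, assuming a reading is a list of tag strings, that semantic tags start with `Sem/`, and that proper nouns carry the tag `Prop`. The function name is hypothetical; korp.cg3 itself does this with CG-3 rules.

```python
def strip_semtags(reading):
    """Remove Sem/ tags from a reading unless it is a proper noun reading."""
    if "Prop" in reading:
        return list(reading)
    return [tag for tag in reading if not tag.startswith("Sem/")]


common_noun = strip_semtags(["N", "Sem/Plc", "Sg", "Nom"])
proper_noun = strip_semtags(["N", "Prop", "Sem/Plc", "Sg", "Nom"])
```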
This (part of) documentation was generated from src/cg3/korp.cg3
Adjective inflection
Meadow Mari adjectives
LEXICON A underscore
LEXICON A-a/e
LEXICON A/S-a/e redirect to A underscore
LEXICON A/S-VS redirect to A underscore
LEXICON A-VS redirect to A underscore
LEXICON A/S redirect to A underscore
This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc
LEXICON K
LEXICON K-imprt
This (part of) documentation was generated from src/fst/morphology/affixes/clitics.lexc
Meadow Mari noun inflection
Some postpositions in Mari take possessive suffixes. For now, all postpositions are allowed to take them, but we should eventually revisit this in the lexicon and classify postpositions into those that take Px and those that do not.
Also here: some adverbs that take possessive suffixes, like ӱстембалне on the table > ӱстембалнем on my table
LEXICON N_ redirects to N-ava_01
LEXICON N-continuation comes from Proper nouns
LEXICON N-ava_01 obl because of pronouns
LEXICON N-ava_01_obl_without-hyphens to obl only
LEXICON N-ava_01_obl_with-hyphens to obl only, also ООО-влак
DECLENSION
Each case-number-person has its own lexicon.
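The naming pattern of these lexica can be sketched in Python as a product of number, optional possessor and case. The three lists are read off the lexicon names in this section. The sketch only covers the regular pattern; it does not model the `N-LOCPL-*` lexica, the `_NB-first` variants, or any combination that happens to be absent from the list.

```python
from itertools import product

NUMBERS = ["SG", "PL"]
POSSESSORS = [None, "PxSg1", "PxSg2", "PxSg3", "PxPl1", "PxPl2", "PxPl3"]
CASES = ["NOM", "GEN", "DAT", "ACC", "CMPR", "COM", "INE", "ILL", "LAT", "ABE"]


def lexicon_name(number, possessor, case):
    """Build a name such as N-SG-NOM or N-PL-PxSg1-GEN."""
    parts = ["N", number]
    if possessor is not None:
        parts.append(possessor)
    parts.append(case)
    return "-".join(parts)


names = [lexicon_name(n, p, c) for n, p, c in product(NUMBERS, POSSESSORS, CASES)]
```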
LEXICON N-SG-NOM
LEXICON N-SG-GEN
LEXICON N-SG-DAT
LEXICON N-SG-ACC
LEXICON N-SG-CMPR
LEXICON N-SG-COM
LEXICON N-SG-INE
LEXICON N-SG-ILL
LEXICON N-SG-LAT
LEXICON N-SG-ABE
LEXICON N-LOCPL-ILL
LEXICON N-LOCPL-INE
LEXICON N-LOCPL-LAT
LEXICON N-LOCPL-NOM
LEXICON N-PL-NOM
LEXICON N-PL-GEN
LEXICON N-PL-DAT
LEXICON N-PL-ACC
LEXICON N-PL-CMPR
LEXICON N-PL-COM
LEXICON N-PL-INE
LEXICON N-PL-ILL
LEXICON N-PL-LAT
LEXICON N-PL-ABE
Here starts the Px stuff
LEXICON N-SG-PxSg1-NOM
LEXICON N-SG-PxSg1-GEN
LEXICON N-SG-PxSg1-DAT
LEXICON N-SG-PxSg1-ACC
LEXICON N-SG-PxSg1-CMPR
LEXICON N-SG-PxSg1-COM
LEXICON N-SG-PxSg1-INE
LEXICON N-SG-PxSg1-ILL
LEXICON N-SG-PxSg1-LAT
LEXICON N-SG-PxSg1-ABE
LEXICON N-PL-PxSg1-NOM
LEXICON N-PL-PxSg1-NOM_NB-first
LEXICON N-PL-PxSg1-GEN
LEXICON N-PL-PxSg1-GEN_NB-first
LEXICON N-PL-PxSg1-DAT
LEXICON N-PL-PxSg1-DAT_NB-first
LEXICON N-PL-PxSg1-ACC
LEXICON N-PL-PxSg1-ACC_NB-first
LEXICON N-PL-PxSg1-CMPR
LEXICON N-PL-PxSg1-CMPR_NB-first
LEXICON N-PL-PxSg1-COM
LEXICON N-PL-PxSg1-COM_NB-first
LEXICON N-PL-PxSg1-INE
LEXICON N-PL-PxSg1-INE_NB-first
LEXICON N-PL-PxSg1-ILL
LEXICON N-PL-PxSg1-ILL_NB-first
LEXICON N-PL-PxSg1-LAT
LEXICON N-PL-PxSg1-LAT_NB-first
LEXICON N-PL-PxSg1-ABE
LEXICON N-PL-PxSg1-ABE_NB-first
LEXICON N-SG-PxSg2-NOM
LEXICON N-SG-PxSg2-GEN
LEXICON N-SG-PxSg2-DAT
LEXICON N-SG-PxSg2-ACC
LEXICON N-SG-PxSg2-CMPR
LEXICON N-SG-PxSg2-COM
LEXICON N-SG-PxSg2-INE
LEXICON N-SG-PxSg2-ILL
LEXICON N-SG-PxSg2-LAT
LEXICON N-SG-PxSg2-ABE
LEXICON N-PL-PxSg2-NOM_NB-first
LEXICON N-PL-PxSg2-GEN
LEXICON N-PL-PxSg2-GEN_NB-first
LEXICON N-PL-PxSg2-DAT
LEXICON N-PL-PxSg2-DAT_NB-first
LEXICON N-PL-PxSg2-ACC
LEXICON N-PL-PxSg2-ACC_NB-first
LEXICON N-PL-PxSg2-CMPR
LEXICON N-PL-PxSg2-CMPR_NB-first
LEXICON N-PL-PxSg2-COM
LEXICON N-PL-PxSg2-COM_NB-first
LEXICON N-PL-PxSg2-INE
LEXICON N-PL-PxSg2-INE_NB-first
LEXICON N-PL-PxSg2-ILL
LEXICON N-SG-PxSg3-NOM
LEXICON N-SG-PxSg3-GEN
LEXICON N-SG-PxSg3-DAT
LEXICON N-SG-PxSg3-ACC
LEXICON N-SG-PxSg3-CMPR
LEXICON N-SG-PxSg3-COM
LEXICON N-SG-PxSg3-INE
LEXICON N-SG-PxSg3-ILL
LEXICON N-SG-PxSg3-LAT
LEXICON N-SG-PxSg3-ABE
LEXICON N-PL-PxSg3-NOM
LEXICON N-PL-PxSg3-NOM_NB-first
LEXICON N-PL-PxSg3-GEN
LEXICON N-PL-PxSg3-GEN_NB-first
LEXICON N-PL-PxSg3-DAT
LEXICON N-PL-PxSg3-DAT_NB-first
LEXICON N-PL-PxSg3-ACC
LEXICON N-PL-PxSg3-ACC_NB-first
LEXICON N-PL-PxSg3-CMPR
LEXICON N-PL-PxSg3-CMPR_NB-first
LEXICON N-PL-PxSg3-COM
LEXICON N-PL-PxSg3-COM_NB-first
LEXICON N-PL-PxSg3-INE
LEXICON N-PL-PxSg3-INE_NB-first
LEXICON N-PL-PxSg3-ILL
LEXICON N-PL-PxSg3-ILL_NB-first
LEXICON N-PL-PxSg3-LAT
LEXICON N-PL-PxSg3-LAT_NB-first
LEXICON N-PL-PxSg3-ABE
LEXICON N-PL-PxSg3-ABE_NB-first
LEXICON N-SG-PxPl1-NOM
LEXICON N-SG-PxPl1-GEN
LEXICON N-SG-PxPl1-DAT
LEXICON N-SG-PxPl1-ACC
LEXICON N-SG-PxPl1-CMPR
LEXICON N-SG-PxPl1-COM
LEXICON N-SG-PxPl1-INE
LEXICON N-SG-PxPl1-ILL
LEXICON N-SG-PxPl1-LAT
LEXICON N-SG-PxPl1-ABE
LEXICON N-PL-PxPl1-NOM
LEXICON N-PL-PxPl1-NOM_NB-first
LEXICON N-PL-PxPl1-GEN
LEXICON N-PL-PxPl1-GEN_NB-first
LEXICON N-PL-PxPl1-DAT
LEXICON N-PL-PxPl1-DAT_NB-first
LEXICON N-PL-PxPl1-ACC
LEXICON N-PL-PxPl1-ACC_NB-first
LEXICON N-PL-PxPl1-CMPR
LEXICON N-PL-PxPl1-CMPR_NB-first
LEXICON N-PL-PxPl1-COM
LEXICON N-PL-PxPl1-COM_NB-first
LEXICON N-PL-PxPl1-INE
LEXICON N-PL-PxPl1-INE_NB-first
LEXICON N-PL-PxPl1-ILL
LEXICON N-PL-PxPl1-ILL_NB-first
LEXICON N-PL-PxPl1-LAT
LEXICON N-PL-PxPl1-LAT_NB-first
LEXICON N-PL-PxPl1-ABE
LEXICON N-PL-PxPl1-ABE_NB-first
LEXICON N-SG-PxPl2-NOM
LEXICON N-SG-PxPl2-GEN
LEXICON N-SG-PxPl2-DAT
LEXICON N-SG-PxPl2-ACC
LEXICON N-SG-PxPl2-CMPR
LEXICON N-SG-PxPl2-COM
LEXICON N-SG-PxPl2-INE
LEXICON N-SG-PxPl2-LAT
LEXICON N-SG-PxPl2-ABE
LEXICON N-PL-PxPl2-NOM
LEXICON N-PL-PxPl2-NOM_NB-first
LEXICON N-PL-PxPl2-GEN
LEXICON N-PL-PxPl2-GEN_NB-first
LEXICON N-PL-PxPl2-DAT
LEXICON N-PL-PxPl2-DAT_NB-first
LEXICON N-PL-PxPl2-ACC
LEXICON N-PL-PxPl2-ACC_NB-first
LEXICON N-PL-PxPl2-CMPR
LEXICON N-PL-PxPl2-CMPR_NB-first
LEXICON N-PL-PxPl2-COM
LEXICON N-PL-PxPl2-COM_NB-first
LEXICON N-PL-PxPl2-INE
LEXICON N-PL-PxPl2-INE_NB-first
LEXICON N-PL-PxPl2-ILL
LEXICON N-PL-PxPl2-ILL_NB-first
LEXICON N-PL-PxPl2-LAT
LEXICON N-PL-PxPl2-LAT_NB-first
LEXICON N-PL-PxPl2-ABE
LEXICON N-PL-PxPl2-ABE_NB-first
LEXICON N-SG-PxPl3-NOM
LEXICON N-SG-PxPl3-GEN
LEXICON N-SG-PxPl3-DAT
LEXICON N-SG-PxPl3-ACC
LEXICON N-SG-PxPl3-CMPR
LEXICON N-SG-PxPl3-COM
LEXICON N-SG-PxPl3-INE
LEXICON N-SG-PxPl3-ILL
LEXICON N-SG-PxPl3-LAT
LEXICON N-SG-PxPl3-ABE
LEXICON N-PL-PxPl3-NOM
LEXICON N-PL-PxPl3-NOM_NB-first
LEXICON N-PL-PxPl3-GEN
LEXICON N-PL-PxPl3-GEN_NB-first
LEXICON N-PL-PxPl3-DAT
LEXICON N-PL-PxPl3-DAT_NB-first
LEXICON N-PL-PxPl3-ACC
LEXICON N-PL-PxPl3-ACC_NB-first
LEXICON N-PL-PxPl3-CMPR
LEXICON N-PL-PxPl3-CMPR_NB-first
LEXICON N-PL-PxPl3-COM
LEXICON N-PL-PxPl3-COM_NB-first
LEXICON N-PL-PxPl3-INE
LEXICON N-PL-PxPl3-INE_NB-first
LEXICON N-PL-PxPl3-ILL
LEXICON N-PL-PxPl3-ILL_NB-first
LEXICON N-PL-PxPl3-LAT
LEXICON N-PL-PxPl3-LAT_NB-first
LEXICON N-PL-PxPl3-ABE
LEXICON N-PL-PxPl3-ABE_NB-first
This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc
LEXICON QNT_
LEXICON KvMurt
LEXICON KvK cardinal numerals
LEXICON KvK_ATTR cardinal numerals in noun phrase scope
LEXICON KvKoll
LEXICON NUM-COLL_
LEXICON KvInd
LEXICON Kv-a/e
This (part of) documentation was generated from src/fst/morphology/affixes/numbers.lexc
LEXICON pronouns_not_from_xml
LEXICON MYJ
LEXICON TYJ
LEXICON TUDO
LEXICON TIDE
LEXICON SHKE
LEXICON Pimp
LEXICON Pmuu
LEXICON PRON-IR_
LEXICON PRON_
LEXICON PronRes
LEXICON PronI
LEXICON PronIR
LEXICON PronInd
LEXICON PRON-INDEF
LEXICON KAZHNE
LEXICON PronDem
LEXICON PronRef
This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc
Meadow Mari proper nouns inflect in the same cases as regular nouns, but with a colon (‘:’) as separator. (???)
LEXICON PROP-OLD-ORTH-SG-NOM_
: ENDLEX ; in attributive position ?SHOULD THIS HAVE an +Attr tag?
: N-ava_01 ; decline like common nouns
Check whether +Orth/Colloq is orthographically wrong
LEXICON PropNameMaleDer-J-0Evich
LEXICON PropNameMaleDer-IJ-Y0Evich
LEXICON PropNameMaleDer-IJ-I0Evich
LEXICON PropNameMaleDer-Y-0Evich
Вили:Вил
LEXICON PropNameMaleDer-I-YEvich
LEXICON PropNameMaleDer-Ovich
LEXICON Deriv-RUS-V_SURMAL Абдеев:Абдеев
LEXICON Deriv-RUS-IJ_SURMAL Багрий:Багр
LEXICON Deriv-RUS-KIJ_SURMAL Аморский:Аморск
LEXICON Deriv-RUS-OJ_SURMAL
LEXICON Deriv-RUS-YJ_SURMAL
LEXICON Deriv-RUS-AN_SURMAL
LEXICON Deriv-RUS-IN_SURMAL
LEXICON PROP_KAL_SURMAL
LEXICON PROP_KUDO_SURFEM
LEXICON CYRL-CONS_SUR
LEXICON PropSur-kal
LEXICON CYRL-T_SUR
LEXICON PropSur-kit
LEXICON CYRL-L_SUR
LEXICON CYRL-K_SUR
LEXICON PropSur-lak
LEXICON CYRL-SIBILANT_SUR
LEXICON PropSur-osh
LEXICON CYRL-VOW_SUR
LEXICON CYRL-A_SUR
LEXICON PROP_KAL_SUR
LEXICON PROP_KUDO_SUR
LEXICON PROP_KAL_MAL
LEXICON PROP_LAK_MAL
LEXICON PROP_OSH_MAL
LEXICON PROP_KUDO_MAL
LEXICON LEXC_PROP_KUDO_MAL
LEXICON PROP_OSH_PATRMAL
LEXICON PROP_KUDO_PATRFEM
LEXICON PROP_KAL_FEM
LEXICON PROP_OSH_FEM
LEXICON PROP_KUDO_FEM
LEXICON LEXC_PROP_KUDO_FEM
PLACE NAMES FROM TEMPLATE
This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc
This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc
Meadow Mari verb inflection.
Some of these are directed directly from root.lexc
LEXICON verbs_not_from_xml
LEXICON negverb TODO: fix
We divide the verbs in two, -am and -em
LEXICON V_am-N divides V_am into Mood and non-finites
LEXICON V_am divides V_am into Mood and non-finites
LEXICON Vam-Mood divides into Ind, Imprt, Des
LEXICON Vam-Ind gives all the Ind tenses
LEXICON Vam-Imp for the imperative mood (Russian: повелительное наклонение)
LEXICON Vam-Des for the desiderative mood (Russian: желательное наклонение)
First four lexica: V_em with Gerund, the rest without, all going to V_em_ALL to get derivation affixes.
LEXICON V_em divides V_em into Mood and non-finites
LEXICON V_em-1SYLL-j allow for literary norm until 1970 (Alhoniemi 1985: 105-106) кайше, кайшаш +Err/Orth: non-finites ; until 1972 reform
LEXICON V_em-1SYLL single syll V_em verbs, do not include bare-stem gerunds in their paradigms
Optional derivation: All verbs going to V_em_INFL
LEXICON Vem-Mood divides into Ind, Imprt, Des
LEXICON Vem-Ind gives all the Ind tenses
LEXICON Vem-Imp for the imperative mood (Russian: повелительное наклонение)
LEXICON Vem-Des for the desiderative mood (Russian: желательное наклонение)
LEXICON non-finites contains Mutual endings
V_am, возаш : воч
These need work 2012-09-21
This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc
Divvun & Giellatekno - open source grammars for Sámi and other languages
A special lexicon for handling proper noun compounding without hyphens (as that would allow compounding with words explicitly coded to disallow such compounds)
This (part of) documentation was generated from src/fst/morphology/compounding.lexc
This file documents the phonology.twolc file
This file contains rules for morphophonological alternations, such as vowel harmony, stem vowel changes, palatalisation, etc.
We define our symbols (Alphabet), some Sets, and then the Rules
other symbols
Archiphonemes for vowels, Giellatekno style
Archiphonemes for vowels, Apertium style
%{ӧы%}:ы Stem-final vowel variation when stress falls on non-final vowel
%{яы%}:ы Stem-final vowel variation when stress falls on non-final vowel
%{ыØ%}:0 PxSg3 onset
%{ьØ%}:0 for -ам verbs Prt1 Sg1, Sg2, Sg3, Pl3 л н
т2:т лект- лек# “leave/ уходить”
%^END:0 for -ам verb final, i.e. Imprf
%^Sonorant:0 for use with acronyms after hyphen Л | М | Н | Р | Ҥ |
%^Obstruent:0 for use with acronyms after hyphen С | Ф | Ъ | Ь |
%^FrontObstr:0 for use with acronyms after hyphen
Vow = Vo VO Ve ы Ы ;
Cns = б в г д ж з й к л м н ҥ п р с т ф х ц ч ш щ
з2 к2 н2 т2 ;
CnsAll = б в г д ж з й к л м н ҥ п р с т ф х ц ч ш щ
з2 к2 н2 т2 ;
CnsNoj = б в г д ж з к л м н ҥ п р с т ф х ц ч ш щ
Б В Г Д Ж З К Л М Н Ҥ П Р С Т Ф Х Ц Ч Ш Щ;
Cst = б в г д ж з к п с т ф х ц ч ш щ
Б В Г Д Ж З К П С Т Ф Х Ц Ч Ш Щ;
Ltrs = Vow Cns ъ ь Ъ Ь ;
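The relations between two of these sets can be checked with a short Python sketch. It copies `CnsNoj` and `Cst` as listed above and computes their difference, which comes out as the sonorant letters that the `%^Sonorant` comment names (Л, М, Н, Р, Ҥ) in both cases. The name `SONORANTS` is an assumption of this sketch and is not a set defined in phonology.twolc.

```python
# CnsNoj and Cst as listed above, lowercase and uppercase letters together.
CNS_NOJ = set("бвгджзклмнҥпрстфхцчшщ" "БВГДЖЗКЛМНҤПРСТФХЦЧШЩ")
CST = set("бвгджзкпстфхцчшщ" "БВГДЖЗКПСТФХЦЧШЩ")

# Consonants in CnsNoj that are not in Cst (hypothetical name).
SONORANTS = CNS_NOJ - CST

# Every Cst letter should also be a CnsNoj letter.
cst_is_subset = CST <= CNS_NOJ
```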
Punctuation bullet as such This rule prevents deleting of BULLET when it forms a token. BULLET as stress mark is deleted as before.
Palatal mark loss before vowel имне+N+Sg+Nom+Foc/Ат
Onset vowel loss in suffix after stem vowel
Onset vowel Е2 realized in suffix е
Onset vowel Е2 realized in suffix э
Onset vowel Е2 realized in suffix ZERO
Onset vowel Ы1 realized in suffix
suffix-final vowel loss after stem-final vowel
пуаш+V+Imprt+Sg2
кияш+V+Imprt+Sg2
suffix-final vowel loss after stem-final vowel
**suffix-final vowel realized as -Round in word-final position е **
шылаш+V+Imprt+Sg3 шыл%>жЫ2%^END шыл%>же0
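The example above pairs a lexical string with its surface string, symbol by symbol. The Python sketch below aligns the two and derives the visible word form by dropping `0` (zero) and the suffix boundary `%>`. The regular expressions cover only the multi-character symbols needed for this one example and are an assumption of the sketch; the full alphabet is declared in phonology.twolc.

```python
import re

# Multi-character symbols assumed here: %^NAME, %>, and Ы/Е followed by 1 or 2.
LEXICAL_SYMBOL = re.compile(r"%\^[A-Za-z0-9]+|%>|[ЫЕ][12]|.")
SURFACE_SYMBOL = re.compile(r"%>|.")


def align(lexical, surface):
    """Pair lexical and surface symbols; both sides must have equal length."""
    lex = LEXICAL_SYMBOL.findall(lexical)
    sur = SURFACE_SYMBOL.findall(surface)
    if len(lex) != len(sur):
        raise ValueError("lexical and surface strings differ in length")
    return list(zip(lex, sur))


def surface_form(pairs):
    """Join the surface side, leaving out zeros and boundary symbols."""
    return "".join(s for _, s in pairs if s not in ("0", "%>"))


pairs = align("шыл%>жЫ2%^END", "шыл%>же0")
word = surface_form(pairs)
```

In this alignment the pair `Ы2:е` shows the suffix-final vowel realised as е, and `%^END:0` shows the word-final trigger being deleted.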
**suffix-final vowel realized as +Back +Round in word-final position о **
**suffix-final vowel realized as +Front +Round in word-final position ӧ **
шӱртняш+V+ConNeg:
remove ʼ mod let apostrophe
%{ьØ%}:ь толам+V+Ind+Prt1+Sg1
suffix-final vowel realized after stem-final consonant
stem-final vowel realized as -Round in word-final position
stem-final vowel realized as +Back +Round in word-final position
stem-final vowel realized as +Front +Round in word-final position
**suffix-final vowel realized %{аы%}:ы **
stem-final vowel realized %{аы%}:а
stem-final vowel realized %{аы%}:а
Stem-final non-stressed vowel loss
Stem-final non-stressed %{еы%} loss
**suffix-final vowel realized %{еы%}:ы **
имне+N+Sg+PxSg3+Nom horse/hevonen
**suffix-final vowel realized Ы2:ы **
пӧрт+N+Sg+Ine+Foc/ys
пӧрт%>Ы1штЫ2%>Ы1с%^END
пӧрт%>ышты%>0с0
stem-final vowel realized %{еы%}:е
**suffix-final vowel realized %{ӧы%}:ы **
stem-final vowel realized %{ӧы%}:ӧ
**suffix-final vowel realized %{оы%}:ы **
stem-final vowel realized %{оы%}:о
**suffix-final vowel realized %{яы%}:ы **
stem-final vowel realized %{яы%}:я
**stem-internal glide realized in 0:й %{яы%}:ы **
Clitics in At and Ak take onset glide = a
Clitics in At and Ak take onset glide = ja
когыльо+N+Sg+Nom+Foc/Ат
Clitics in At and Ak take ZERO
й Deletion in front of я Suffix and others
й Deletion in front of я Suffix and others
й Deletion in front of я Suffix and others
ка0>я
**Onset consonant devoicing ж:ш **
**Onset consonant devoicing з:с **
Stem-final consonant loss т
Stem-final consonant loss к
Stem-final consonant loss н
колхоз
воч0
воз>аш
камвоз>аш
Stem-final consonant variation з2:з
Stem-final consonant variation з2:з
**Disallow Sg+Ine in тЫ2 everywhere except after stem-final ш ** йӧратымаш+N+Sg+Ine
**Disallow Sg+Ill in кЫ2 everywhere except after stem-final ш ** авалтымаш+N+Sg+Ine
**Disallow PxSg3 in ыж everywhere except after ш **
**Disallow PxSg3 in ыж everywhere except after ш **
**Disallow %^V2IMPRT й-final Imprt+Sg2 single-syllable -em verbs **
и0мн000>ят
(Eng. м н ʼ ь %{еы%} > A2 т)и 0
(Eng. м н 0 0 0 > я т)
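The lexical/surface pairs listed above (for example шыл%>жЫ2%^END realized as шыл%>же0) can be read symbol by symbol. The following Python sketch is only an illustration and not part of the twolc source: the tokenising regular expressions and the helper names `align` and `surface_form` are assumptions made here, and `0` is taken to stand for an empty surface symbol.

```python
import re
import unicodedata

# Assumed symbol inventory on the lexical side: %{..%} archiphonemes,
# %^TRIGGER symbols, the %> morpheme boundary, letter+digit symbols
# such as Ы2, and otherwise single characters.
LEXICAL = re.compile(r"%\{.*?%\}|%\^[A-Z0-9]+|%>|[^\W\d_][0-9]|.")
# On the surface side only the boundary is a multi-character symbol.
SURFACE = re.compile(r"%>|.")


def align(lexical, surface):
    """Pair each lexical symbol with its surface realisation."""
    lex = LEXICAL.findall(unicodedata.normalize("NFC", lexical))
    srf = SURFACE.findall(unicodedata.normalize("NFC", surface))
    if len(lex) != len(srf):
        raise ValueError("lexical and surface strings differ in length")
    return list(zip(lex, srf))


def surface_form(surface):
    """Drop boundaries and zeros to get the written word form."""
    symbols = SURFACE.findall(unicodedata.normalize("NFC", surface))
    return "".join(s for s in symbols if s not in ("%>", "0"))


pairs = align("пӧрт%>Ы1штЫ2%>Ы1с%^END", "пӧрт%>ышты%>0с0")
# Show only the positions where lexical and surface symbols differ.
print([(lex, srf) for lex, srf in pairs if lex != srf])
print(surface_form("пӧрт%>ышты%>0с0"))
```

Listing only the differing pairs makes it easy to see which two-level rules were involved in a given form.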
This (part of) documentation was generated from src/fst/morphology/phonology.twolc
This file consists of three parts:
The morphological analyses of the wordforms of the Eastern Mari language are presented in this system in terms of the following symbols. (Following existing standards is strongly recommended when adding new tags.)
+WORK = nouns
%^Sonorant for use with acronyms after hyphen Л | М | Н | Р | Ҥ |
%^Obstruent for use with acronyms after hyphen С | Ф | Ъ | Ь |
%^FrontObstr for use with acronyms after hyphen С | Ф | Ъ | Ь |
The parts of speech are further split up into:
+Pr = prepositions
+AssocColl = Collective associative numerals with obligatory possessive suffixes -нь-
Have a look at these:
The nominals are inflected in the following numbers
The nominals are inflected in the following Case and Number
The possession is marked as such:
Suffix ordering tags:
The comparative forms are:
Numerals are classified under:
Note the attributive tag, in different contexts
Verb moods are:
Verb tenses are:
Verb personal forms are: (also used with personal pronouns)
+Pl3 =
Other verb forms are
Question and Focus particles:
+Foc =
+EX/IV = change to other transitivity
All non-positional derivations should be preceded by this tag, to make it possible to target regular expressions at all derivations in a language-independent way: just specify +Der|+Der1 .. +Der5 and you are set.
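As a rough illustration of such a regular expression, the Python sketch below picks out the bare +Der and +Der1 .. +Der5 tags in an analysis string. The analysis strings and the name `DER_TAG` are hypothetical examples, not taken from the analyser.

```python
import re

# Matches +Der or +Der1..+Der5 as whole tags; suffix-specific tags such
# as +Der/Caus are deliberately not matched.
DER_TAG = re.compile(r"\+Der[1-5]?(?![A-Za-z0-9/])")

# Hypothetical analysis strings, for illustration only.
print(DER_TAG.findall("lemma+V+Der1+Der/Caus+V+Inf"))
print(DER_TAG.findall("lemma+N+Der+Der/sa+A"))
```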
Abbreviated words are classified with:
Special symbols are classified with:
The verbs are syntactically split according to transitivity:
Special multiword units are analysed with:
Non-dictionary words can be recognised with:
These are especially for verbs. Note that this is not a semantic distinction, we talk about paradigms deviating here and there in the inflection pattern.
The usage extents are marked using the following tags:
+MWESplit Split point for MWE
Multiple Semantic tags:
Semantics are classified with
Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.
Morphophonology To represent phonologic variations in word forms we use the following symbols in the lexicon files:
%{ыØ%} PxSg3 onset
%{ьØ%} for -ам verbs Prt1 Sg1, Sg2, Sg3, Pl3 л н
я1 =
And the following triggers to control variation
%^END for -ам verb final, i.e. Imprf
%-
%^Sonorant for use with acronyms after hyphen Л | М | Н | Р | Ҥ |
%^Obstruent for use with acronyms after hyphen С | Ф | Ъ | Ь |
(escaped with square brackets, to avoid collision with > as morpheme boundary)
< (escaped with square brackets, to avoid collision with < as morpheme boundary)
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics. Compounds with verbs are only allowed if the verb is further derived into a noun again:
Flag diacritic | Explanation |
---|---|
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
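To make the interplay of the three NeedNoun flags concrete, here is a small Python sketch of how flag diacritics are commonly evaluated along a path: P sets a feature, D blocks the path when the feature has the given value (or any value, if none is given), C clears it, and U and R unify with or require a value. The function name `path_is_allowed` and the example paths are hypothetical, and the sketch is a simplification that leaves out the N operator; it does not reproduce how the lexicon actually places these flags.

```python
import re

FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(symbols):
    """Return True if the flag diacritics on a path are compatible."""
    state = {}  # feature name -> current value
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol, no effect on the flag state
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":    # positive set
            state[feature] = value
        elif op == "C":  # clear
            state.pop(feature, None)
        elif op == "D":  # disallow
            if current is not None and value in (None, current):
                return False
        elif op == "R":  # require
            if current is None or value not in (None, current):
                return False
        elif op == "U":  # unify
            if current is not None and current != value:
                return False
            state[feature] = value
    return True


# Hypothetical paths: a verb in a compound, without and with nominalisation.
blocked = ["@P.NeedNoun.ON@", "verb", "@D.NeedNoun.ON@"]
allowed = ["@P.NeedNoun.ON@", "verb", "nominaliser", "@C.NeedNoun@", "@D.NeedNoun.ON@"]
print(path_is_allowed(blocked), path_is_allowed(allowed))
```

In this reading, the P flag marks that a noun is still needed, the C flag withdraws that requirement once the verb has been nominalised, and the D flag rejects any path where the requirement is still pending.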
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
Flag diacritic | Explanation |
---|---|
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
Flag diacritic | Explanation |
---|---|
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
Flag diacritic | Explanation |
---|---|
@U.number.one@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.two@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.three@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.four@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.five@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.six@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.seven@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.eight@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.nine@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.zero@ | Flag used to give arabic numerals in smj different cases ; |
Here it all starts
The word forms in the Meadow Mari language start from the lexeme roots of the following basic word classes:
Continuation lexica
Here comes a set of ragbag continuation lexica.
LEXICON CONJ_ TODO: why +WORK? All CONJ_ should be identified as either CC or CS or both, work in progress
LEXICON CC_ conjunctions
LEXICON CS_ subjunctions
LEXICON DESCR_ = descriptive something
LEXICON DESCR-AUD_ these are audible, others may be visible or otherwise sensed, but for now just calling them Interj+Descr should suffice
LEXICON AD-A also adverbs
LEXICON INTERJ_ interjections
LEXICON Puh-a/e XXX do not know
LEXICON Puh XXX do not know
LEXICON PCLE_ particles, check these
LEXICON X for N attributes
This (part of) documentation was generated from src/fst/morphology/root.lexc
Eastern Mari acronym file
Here is the list of lexicalised Sem/Org acronym proper nouns. These are also generated by the Acrogenerator.
This (part of) documentation was generated from src/fst/morphology/stems/acronyms.lexc
NOUNS
KIN TERMS
Single-syllable nouns in У Ӱ Ю
VERBS
This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc
MARI-LIKE NAMES
PLACE NAMES
This (part of) documentation was generated from src/fst/morphology/stems/mhr-propernouns.lexc
This is where new words are added as lexc entries before they are added to the xml source files. автор:автор N_ “(eng) /(fin) /(rus) “ ;
ADD NOUNS BELOW
PROPER NAMES
This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc
LEXICON Numeral initial lexica
LEXICON ARABIC arabic numerals
LEXICON ARABICLOOP
LEXICON ARABICLOOPORD_Back ordinals
LEXICON ARABICLOOPORD_Front ordinals
LEXICON ARABICLOOPORD_FrontRound ordinals
LEXICON ARABICDELIMITERORD_Back ordinals
LEXICON ARABICDELIMITERORD_Front ordinals
LEXICON ARABICDELIMITERORD_FrontRound ordinals
The Roman numerals
LEXICON ROMAN roman numerals
LEXICON ROM-THOUSAND
LEXICON ROM-THOUSAND-TAG
LEXICON ROM-HUNDRED
LEXICON ROM-HUNDRED-TAG
LEXICON ROM-TEN
LEXICON ROM-TEN-TAG
LEXICON ROM-ONE
LEXICON ROM-ONE-TAG
LEXICON ROM-SPLIT
LEXICON 2ROMAN
LEXICON 2ROM-THOUSAND
LEXICON 2ROM-THOUSAND-TAG
LEXICON 2ROM-HUNDRED
LEXICON 2ROM-HUNDRED-TAG
LEXICON 2ROM-TEN
LEXICON 2ROM-TEN-TAG
LEXICON 2ROM-ONE
LEXICON 2ROM-ONE-TAG
LEXICON ROMNUMTAG
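The Roman numeral lexica above are organised by place value (thousands, hundreds, tens, ones). As an illustration only, the same place-wise structure can be sketched in Python; the pattern below, including the upper limit of three M, is an assumption and may differ from what the lexc lexica actually accept. The name `roman_to_int` is hypothetical.

```python
import re

# One group per place value, mirroring the THOUSAND/HUNDRED/TEN/ONE split.
ROMAN = re.compile(
    r"(?P<thousand>M{0,3})"
    r"(?P<hundred>CM|CD|D?C{0,3})"
    r"(?P<ten>XC|XL|L?X{0,3})"
    r"(?P<one>IX|IV|V?I{0,3})"
)
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text):
    """Return the value of a well-formed Roman numeral, else None."""
    if not text or ROMAN.fullmatch(text) is None:
        return None
    total = 0
    for i, ch in enumerate(text):
        value = VALUES[ch]
        # A smaller digit before a larger one is subtracted (IV, XC, ...).
        if i + 1 < len(text) and VALUES[text[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("MCMXCIV"), roman_to_int("IIII"))
```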
This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc
retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628
bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614
alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\
labiodental approximant P (or v\)
alveolar approximant r\
retroflex approximant r\`
velar approximant M\
retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L\
Clicks
bilabial O\ (O = capital letter)
dental |\
(post)alveolar !\
palatoalveolar =\
alveolar lateral |\|\
Ejectives, implosives
ejective _> e.g. ejective p p_>
implosive _< e.g. implosive b b_<
Vowels
close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U
close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7
schwa ə @
open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O
ash (ae digraph) {
open schwa (turned a) 6
open front rounded &
open back unrounded A
open back rounded Q
Other symbols
voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\
alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _
Suprasegmentals
primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -
Tones and word accents
level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)
contour, rising _R
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic
marks can alternatively be placed in a separate tier, set off
by < >, as recommended for the next two symbols.)
global rise <R>
global fall <F>
voiceless _0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=) e.g. n= (or n_=)
non-syllabic _^
rhoticity `
breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j) e.g. t' (or t_j)
velarized _G
pharyngealized _?\
dental _d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A_~)
nasal release _n
lateral release _l
no audible release _}
velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
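A few rows of the consonant table above are enough to sketch how an X-SAMPA string can be converted to IPA. The Python below is an illustration under stated assumptions, not the logic of txt2ipa.xfscript: the table covers only a handful of the listed symbols, and `to_ipa` is a hypothetical helper that uses longest-match-first lookup and passes unknown characters through unchanged.

```python
# X-SAMPA symbol -> Unicode code point, copied from a few table rows.
XSAMPA_TO_CODEPOINT = {
    "F": 0x0271,    # labiodental nasal
    "J": 0x0272,    # palatal nasal
    "N": 0x014B,    # velar nasal
    "N\\": 0x0274,  # uvular nasal
    "S": 0x0283,    # postalveolar fricative, voiceless
    "Z": 0x0292,    # postalveolar fricative, voiced
    "z`": 0x0290,   # retroflex fricative, voiced
    "T": 0x03B8,    # dental fricative, voiceless
    "D": 0x00F0,    # dental fricative, voiced
}
# Try longer symbols first so that "N\" wins over "N".
KEYS = sorted(XSAMPA_TO_CODEPOINT, key=len, reverse=True)


def to_ipa(text):
    """Replace known X-SAMPA symbols; keep everything else as it is."""
    out = []
    i = 0
    while i < len(text):
        for key in KEYS:
            if text.startswith(key, i):
                out.append(chr(XSAMPA_TO_CODEPOINT[key]))
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_ipa("SaN"))
```

Sorting the keys by length is what keeps two-character symbols such as N\ from being split into N plus a stray backslash.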
This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript
We describe here how abbreviations in Eastern Mari are read out, e.g. for text-to-speech systems.
For example:
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc
MEADOW MARI GRAMMAR CHECKER
The delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent
The Tags section lists all the tags inherited from the fst, and defines them for use in the syntactic analysis. The tags are documented in the root.lexc file, and here only listed for reference.
The next section, Sets, contains sets defined on the basis of the tags listed here, those set names are not visible in the output.
BOS EOS
N V A Adv CC CS Interj Pron Num Pcle Clt Po
ABBR ACR
CLB LEFT RIGHT WEB LEFT RIGHT because of apertium
WORD is the set of all POS
Prs Prt1 Prt2 Fut Imprt Ind Cond Des
Act ConNeg FutPrc Ger Inf Nec Neg NegPrc Pass Prc PrfPrc
Verbal person-number tags Sg1 Sg2 Sg3 Pl1 Pl2 Pl3
Sg Pl
Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)
Pers Refl Rel Interr Recipr Dem ABBR
Pos (?) Superl Comp
Attr
PxSg1 PxSg2 PxSg3 PxPl1 PxPl2 PxPl3
Card Coll Ord Temp (?)
Der/MWN Der/sa
Qst Foc
CmpTest Err
Grammarchecker rules begin here
This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3
Usage:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
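Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a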
Unknowns are made of:
Unknowns are tagged ?? and treated specially with hfst-tokenise.
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG; they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
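As a hedged illustration of what treating unknowns specially can look like downstream, the Python sketch below collects the wordforms whose readings carry only a question-mark tag. The sample stream and the exact shape of an unknown reading (a lemma followed by `?` or `??`) are assumptions made for this example, and `unknown_wordforms` is a hypothetical helper; check the real output of hfst-tokenise before relying on it.

```python
def unknown_wordforms(cg_stream):
    """List wordforms whose readings consist only of question-mark tags."""
    unknowns = []
    wordform = None
    for line in cg_stream.splitlines():
        if line.startswith('"<') and line.rstrip().endswith('>"'):
            # Cohort line: "<wordform>"
            wordform = line.strip()[2:-2]
        elif line.startswith("\t") and wordform is not None:
            # Reading line: the tags follow the quoted lemma.
            tags = line.split('"')[-1].split()
            if tags and all(set(tag) == {"?"} for tag in tags):
                unknowns.append(wordform)
    return unknowns


# Hypothetical CG stream, for illustration only.
SAMPLE = '"<ja>"\n\t"ja" CC\n"<ukjend>"\n\t"ukjend" ?\n'
print(unknown_wordforms(SAMPLE))
```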
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Issues:
More usage examples:
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
select symbols
TODO: Could use something like this, but built-ins don’t include šžđčŋ:
Unknowns are tagged ?? and treated specially with hfst-tokenise.
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG; they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.thirties.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
select symbols
TODO: Could use something like this, but built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
TODO: Could use something like this, but built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get
This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript