Eastern Mari NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-mhr

Eastern Mari language model documentation

All doc-comment documentation in one large file.


src-cg3-dependency.cg3.md

COMMON SÁMI DEPENDENCY GRAMMAR

This dep file is for sma, sme, smj, sje.

DELIMITERS

Sentence delimiters are the following: <.> <!> <?> <…> <¶>

TAGS AND SETS

N V A Adv CC CS Inf Sup Neg Num Po Pr

Pcle Prop

Pron IV TV COMMA DASH CITATION HYPHEN QMARK PUNCT LEFT RIGHT CLB Ind Pot Impr ImprtII Cond ConNeg Caus (causative, eus) VGen Interj ABBR ACR Prs Prt Cmpnd RCmpnd PrfPrc PrsPrc Actor Actio Ger Indef Nom Acc Ill Com Gen Ess

IM For fao

POS sub-categories

Syntactic tags and sets

Syntactic tags in input to this file

Syntactic tags added in this file

fao syntags

kal syntags

eus syntags

Syntactic set definitions

Dep grammar

Correction rules

The finite verb

Mapping rules

lgRemove removes the language tags before proceeding to the dep file.


This (part of) documentation was generated from src/cg3/dependency.cg3


src-cg3-disambiguator.cg3.md

This is the Eastern Mari disambiguation file. It chooses the correct morphological analyses in any given sentence context.

The file first defines sentence delimiters, tags and sets. Thereafter come the rules; each rule is listed below.

Sentence delimiters

The delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent
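As a rough illustration of how such delimiters cut a token stream into sentence windows, here is a minimal Python sketch. It is not part of the grammar: the function name `split_windows` and the placeholder wordforms are assumptions made for this example, and the actual windowing is performed by the CG-3 engine.

```python
# Wordforms listed as sentence delimiters in the documentation above.
DELIMITERS = {"<.>", "<!>", "<?>", "<…>", "<¶>"}


def split_windows(wordforms):
    """Group a flat list of CG wordforms into windows ending at a delimiter."""
    windows, current = [], []
    for wordform in wordforms:
        current.append(wordform)
        if wordform in DELIMITERS:
            # A delimiter closes the current window.
            windows.append(current)
            current = []
    if current:
        # Trailing material without a final delimiter forms its own window.
        windows.append(current)
    return windows


if __name__ == "__main__":
    # Placeholder wordforms, not real Mari data.
    stream = ["<a>", "<b>", "<.>", "<c>", "<?>", "<d>"]
    for window in split_windows(stream):
        print(window)
```

Each rule in the grammar then sees only one such window at a time, which is why the choice of delimiters bounds how far a contextual test can scan.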

The Tags section lists all the tags inherited from the fst, and defines them for use in the syntactic analysis. The tags are documented in the root.lexc file, and here only listed for reference.

The next section, Sets, contains sets defined on the basis of the tags listed here. Those set names are not visible in the output.

Tags

Beginning and end of sentence

BOS EOS

Clause boundary

Parts of speech tags

N V A Adv CC CS Interj Pron Num Pcle Clt Po

WORD is the set of all POS

Verbal tense and mood tags

Prs Prt1 Prt2 Fut Imprt Ind Cond Des

Other verbal tags

Act ConNeg FutPrc Ger Inf Nec Neg NegPrc Pass Prc PrfPrc

Verbal person-number tags Sg1 Sg2 Sg3 Pl1 Pl2 Pl3

Number tags

Sg Pl

Case tags

Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)

Other nominal tags

Pers Refl Rel Interr Recipr Dem ABBR ACR

Adjective comparison tags

Pos (?) Superl Comp

Possessive suffix tags

PxSg1 PxSg2 PxSg3 PxPl1 PxPl2 PxPl3

Numeral tags

Card Coll Ord Temp (?)

Particles

Qst Foc

Punctuation marks

CLB PUCT LEFT RIGHT COMMA

Derivation tags

Der/MWN Der/sa Der/Pur Der/Caus Der/Nom

Tags for internal testing

CmpTest Err

Sets

Der/Date Der/Year Der/Hum Der/Lang Der/Domain Der/Feat-phys Der/Clth Der/Body Der/Act

Sem/Ani Sem/Fem Sem/Group Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

Rule section

Early, word-internal rules

CC or Pcle

Particles

*InterrQ if question mark anywhere to the right

*Interr removes Rel if question mark to the right somewhere

Verbs

Existential ulo

Infinitives

Adjectives

*RemAdjBeforeProp removes A if Prop to the left

*AdjBeforeMo selects A if Interr to the right

*AdjBeforeConjAdj selects A if conjunction and A to the right

*AdjNotN removes N if Pron Pers anywhere to the left

*RemAdj2 removes A if no N or Pron in a clause

Nouns

*RemNomIfPronLeft removes Nom if Pron Nom anywhere to the left

*RemNomIfPronRight removes Nom if Pron Nom anywhere to the right

*NomBeforeConjNom selects N Nom if conjoined with N Nom

*NafterDem selects N if Dem to the left (demonstratives tend to be sole modifiers)

*NotANoun

*NafterAbeforeEOS

*RemNafterAdv removes N if adverb to the left

Derivations

Cases

Proper nouns

Numerals

Pronouns

Conjunctions

Postpositions

Adverbs

Phrases

Verbs

Finite verb or Gerundium

*RemGer removes Ger Gen if there is no verb to the right

First or third person

ConNeg or not

да

и

Interjection

Predicative

AifVövny selects A if вӧвны somewhere to the left

Conjunctions


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions.cg3.md

SYNTACTIC FUNCTIONS FOR SÁMI

Sámi language technology project 2003-2024, University of Tromsø

This file adds syntactic functions. It is common for all the Saami

LEFT RIGHT because of apertium

Syntactic tags

Tag sets

V is all readings with a V tag in them; REAL-V should be the ones without an N tag following the V. The REAL-V set thus awaits a fix to the preprocess V … N bug.

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) …, meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
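The scan described above can be sketched in Python as follows. This is only an approximation of the CG-3 contextual test: each cohort is reduced to a single set of tags (real cohorts can carry several readings), and the contents of `PREMODIFIERS` are a hypothetical stand-in for whatever the grammar actually allows in front of an NP head.

```python
# Hypothetical premodifier tags; the real set is defined in functions.cg3.
PREMODIFIERS = {"A", "Num", "Pron"}


def is_not_npmod(tags):
    """Stand-in for NOT-NPMOD: true for anything that cannot premodify a noun."""
    return not (tags & PREMODIFIERS)


def scan_right(cohorts, start, target, barrier):
    """Approximate `*1 target BARRIER barrier`, scanning right from `start`.

    Returns the index of the first cohort carrying `target`, or None if a
    barrier cohort (or the end of the window) is reached first. The target is
    tested before the barrier on each cohort, which is a simplifying choice.
    """
    for index in range(start + 1, len(cohorts)):
        tags = cohorts[index]
        if target in tags:
            return index
        if barrier(tags):
            return None
    return None


if __name__ == "__main__":
    cohorts = [{"V"}, {"A"}, {"Num"}, {"N"}, {"Adv"}, {"N"}]
    print(scan_right(cohorts, 0, "N", is_not_npmod))  # adjective and numeral are skipped
    print(scan_right(cohorts, 3, "N", is_not_npmod))  # the adverb blocks the scan
```

Defining the barrier as "everything except premodifiers" means new word classes are treated as barriers by default, which is the conservative choice for NP-internal scanning.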

ADLVCASE

These were the set types.

Numeral outside the sentence

HABITIVE MAPPING

sma object

SUBJ MAPPING - leftovers

OBJ MAPPING - leftovers

MAPPING for MT - experimental

HNOUN MAPPING

missingX adds @X to all missings

therestX adds @X to whatever is left, often erroneously disambiguated forms

For Apertium: The analysis gives double analyses because of optional semtags. We go for the one with the semtag.


This (part of) documentation was generated from src/cg3/functions.cg3


src-cg3-korp.cg3.md

SYNTACTIC FUNCTIONS FOR SÁMI

Sámi language technology project 2003-2014, University of Tromsø

For Korp:

Here we remove special tags for MT

smeRemove removes the language tags before proceeding to the dep file.

Here we remove semantic tags for all other words than proper nouns.


This (part of) documentation was generated from src/cg3/korp.cg3


src-fst-morphology-affixes-adjectives.lexc.md

Adjective inflection

Meadow Mari adjectives

LEXICON A underscore


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-clitics.lexc.md

Eastern Mari Clitics


This (part of) documentation was generated from src/fst/morphology/affixes/clitics.lexc


src-fst-morphology-affixes-nouns.lexc.md

Noun inflection

Meadow Mari noun inflection

a final lexica

Some postpositions in Mari take possessive suffixes. For now, we allow this for all of them, but we should eventually revisit this in the lexicon and classify postpositions into those that take Px and those that do not.

Also here: some adverbs that take possessive suffixes, like ӱстембалне ‘on the table’ > ӱстембалнем ‘on my table’

DECLENSION

Case suffixes

Each case-number-person has its own lexicon.

Sg Sg1

Here starts the Px stuff

Pl Sg1

Sg Sg2

Pl Sg2

Sg Sg3

Pl Sg3

Sg Pl1

Pl Pl1

Sg Pl2

Sg Pl3


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-numbers.lexc.md

East Mari Numeral inflection


This (part of) documentation was generated from src/fst/morphology/affixes/numbers.lexc


src-fst-morphology-affixes-pronouns.lexc.md

Eastern Mari pronoun inflection

Lexica directed from root.lexc

Pronoun lexica from xml


This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper noun inflection

Meadow Mari proper nouns inflect in the same cases as regular nouns, but with a colon (‘:’) as separator. (???)

Male given name for deriving patronyms

Check whether +Orth/Colloq is orthographically wrong

Вили:Вил

Russian type Surnames

Female Given names

PLACE NAMES FROM TEMPLATE


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Meadow Mari verb inflection.

Verbal continuation lexica

Auxiliary verbs

Some of these are directed directly from root.lexc

LEXICON verbs_not_from_xml

LEXICON negverb TODO: fix

Regular verbs

We divide the verbs into two classes, -am and -em

The -am class

LEXICON V_am-N divides V_am into Mood and non-finites

LEXICON V_am divides V_am into Mood and non-finites

LEXICON Vam-Mood divides in Ind, Imprt, Des

LEXICON Vam-Ind gives all the Ind tenses

LEXICON Vam-Imp for imperative, Повелительное наклонение:

LEXICON Vam-Des for desiderative, Желательное наклонение:

The -em class

First four lexica: V_em with Gerund, the rest without, all going to V_em_ALL to get derivation affixes.

LEXICON V_em divides V_em into Mood and non-finites

LEXICON V_em-1SYLL-j allow for literary norm until 1970 (Alhoniemi 1985: 105-106) кайше, кайшаш +Err/Orth: non-finites ; until 1972 reform

LEXICON V_em-1SYLL single syll V_em verbs, do not include bare-stem gerunds in their paradigms

Optional derivation: All verbs going to V_em_INFL

LEXICON Vem-Mood divides in Ind, Imprt, Des

LEXICON Vem-Ind gives all the Ind tenses

LEXICON non-finites contains Mutual endings

Special verbs

V_am, возаш : воч

These need work 2012-09-21


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-compounding.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

A special lexicon for handling proper noun compounding without hyphens (as that would allow compounding with words explicitly coded to disallow such compounds)


This (part of) documentation was generated from src/fst/morphology/compounding.lexc


src-fst-morphology-phonology.twolc.md

Eastern Mari twol file

This file documents the phonology.twolc file

This file contains rules for morphophonological alternations, such as vowel harmony, stem vowel changes, palatalisation, etc.

We define our symbols (Alphabet), some Sets, and then the Rules

Letters of the alphabet

other symbols

Archiphonemes for vowels, Giellatekno style

Archiphonemes for vowels, Apertium style

Archiphonemes for consonants

Sets

Rules

Punctuation bullet as such

This rule prevents deletion of BULLET when it forms a token. BULLET as a stress mark is deleted as before.

Palatal mark loss before vowel
имне+N+Sg+Nom+Foc/Ат

Onset vowel loss in suffix after stem vowel

Onset vowel Е2 realized in suffix е

Onset vowel Е2 realized in suffix э

Onset vowel Е2 realized in suffix ZERO

Onset vowel Ы1 realized in suffix

suffix-final vowel loss after stem-final vowel
пуаш+V+Imprt+Sg2

кияш+V+Imprt+Sg2

suffix-final vowel loss after stem-final vowel

suffix-final vowel realized as -Round in word-final position е

шылаш+V+Imprt+Sg3 шыл%>жЫ2%^END шыл%>же0
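The test line above appears to give the analysis, then the lexical (upper) string, then the surface (lower) string. A minimal Python sketch of how such a pair lines up symbol by symbol follows. The multicharacter symbol list and the helper names are assumptions made for this example (the authoritative alphabet is declared in phonology.twolc), and `0` is taken to stand for a deleted symbol.

```python
# Assumed multicharacter symbols; the real alphabet lives in phonology.twolc.
MULTICHAR = ["%^END", "%>", "Ы1", "Ы2"]


def tokenise(string, multichar=MULTICHAR):
    """Split a string into symbols, preferring the longest multichar symbol."""
    ordered = sorted(multichar, key=len, reverse=True)
    symbols, i = [], 0
    while i < len(string):
        for symbol in ordered:
            if string.startswith(symbol, i):
                symbols.append(symbol)
                i += len(symbol)
                break
        else:
            # No multichar symbol starts here, so take a single character.
            symbols.append(string[i])
            i += 1
    return symbols


def align(lexical, surface):
    """Pair lexical and surface symbols one to one, as two-level rules do."""
    upper, lower = tokenise(lexical), tokenise(surface)
    if len(upper) != len(lower):
        raise ValueError("lexical and surface strings differ in length")
    return list(zip(upper, lower))


if __name__ == "__main__":
    pairs = align("шыл%>жЫ2%^END", "шыл%>же0")
    # Show only the positions where a rule changed something.
    print([(up, low) for up, low in pairs if up != low])
```

Listing only the non-identical pairs makes it easy to see which archiphoneme realisation (here Ы2 as е) a given rule is responsible for.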

suffix-final vowel realized as +Back +Round in word-final position о

suffix-final vowel realized as +Front +Round in word-final position ӧ
шӱртняш+V+ConNeg:

Remove ʼ (modifier letter apostrophe)

%{ьØ%}:ь толам+V+Ind+Prt1+Sg1

suffix-final vowel realized after stem-final consonant

stem-final vowel realized as -Round in word-final position

stem-final vowel realized as +Back +Round in word-final position

stem-final vowel realized as +Front +Round in word-final position

suffix-final vowel realized %{аы%}:ы

stem-final vowel realized %{аы%}:а
stem-final vowel realized %{аы%}:а

Stem-final non-stressed vowel loss

Stem-final non-stressed %{еы%} loss

suffix-final vowel realized %{еы%}:ы
имне+N+Sg+PxSg3+Nom horse/hevonen

suffix-final vowel realized Ы2:ы
пӧрт+N+Sg+Ine+Foc/ys пӧрт%>Ы1штЫ2%>Ы1с%^END пӧрт%>ышты%>0с0

stem-final vowel realized %{еы%}:е

suffix-final vowel realized %{ӧы%}:ы

stem-final vowel realized %{ӧы%}:ӧ

suffix-final vowel realized %{оы%}:ы

stem-final vowel realized %{оы%}:о

suffix-final vowel realized %{яы%}:ы

stem-final vowel realized %{яы%}:я

stem-internal glide realized in 0:й %{яы%}:ы

Clitics in At and Ak take onset glide = a

Clitics in At and Ak take onset glide = ja
когыльо+N+Sg+Nom+Foc/Ат

Clitics in At and Ak take ZERO

й Deletion in front of я Suffix and others

й Deletion in front of я Suffix and others

й Deletion in front of я Suffix and others

Onset consonant devoicing ж:ш

Onset consonant devoicing з:с

Stem-final consonant loss т

Stem-final consonant loss к

Stem-final consonant loss н

Stem-final consonant variation з2:з

Stem-final consonant variation з2:з

Disallow Sg+Ine in тЫ2 everywhere except after stem-final ш
йӧратымаш+N+Sg+Ine

Disallow Sg+Ill in кЫ2 everywhere except after stem-final ш
авалтымаш+N+Sg+Ine

Disallow PxSg3 in ыж anywhere except after ш

Disallow PxSg3 in ыж anywhere except after ш

Disallow %^V2IMPRT й-final Imprt+Sg2 single-syllable -em verbs


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Morphology

This file consists of four parts:

  1. Multichar Symbols declaration
  2. The Root lexicon
  3. A set of lexica for minor parts of speech
  4. A set of unfinished lexica, to be either deleted or expanded.

Declaration of Multichar_Symbols

Analysis symbols

The morphological analyses of the wordforms of the Eastern Mari language are presented in this system in terms of the following symbols. (It is highly recommended to follow existing standards when adding new tags.)

The parts-of-speech are:

POS subtags

The parts of speech are further split up into:

Have a look at these:

The nominals are inflected in the following numbers

The nominals are inflected in the following Case and Number

The possession is marked as such:

Suffix ordering tags:

The comparative forms are:

Numerals are classified under:

Note the attributive tag, in different contexts

Verb moods are:

Verb tenses are:

Verb personal forms are: (also used with personal pronouns)

Other verb forms are

Question and Focus particles:

Tags distinguishing different versions of the same lemma (before POS)

Derivations

All non-positional derivations should be preceded by this tag, to make it possible to target regular expressions at all derivations in a language-independent way: just specify +Der|+Der1 .. +Der5 and you are set.
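To make the idea concrete, here is a minimal Python sketch of such a language-independent pattern. The analysis strings in the example are hypothetical and only illustrate the tag shapes; the regular expression matches `+Der` and `+Der1` .. `+Der5` while leaving suffix-specific tags such as `+Der/Caus` alone.

```python
import re

# +Der or +Der1..+Der5, not followed by "/" or further word characters,
# so that tags like +Der/Caus or +Deriv are not matched.
DER_TAG = re.compile(r"\+Der[1-5]?(?![/\w])")


def derivation_markers(analysis):
    """Return the generic derivation tags found in an analysis string."""
    return DER_TAG.findall(analysis)


if __name__ == "__main__":
    # Hypothetical analysis strings, not taken from the analyser.
    print(derivation_markers("lemma+V+Der+Der/Caus+V+Inf"))
    print(derivation_markers("lemma+N+Der2+Der/sa+A"))
    print(derivation_markers("lemma+N+Sg+Nom"))
```

Because the pattern only refers to the generic tags, it keeps working when a language adds or renames its suffix-specific `+Der/...` tags.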

Abbreviated words are classified with:

Special symbols are classified with:

The verbs are syntactically split according to transitivity:

Special multiword units are analysed with:

Non-dictionary words can be recognised with:

Homonymy tags

These are especially for verbs. Note that this is not a semantic distinction, we talk about paradigms deviating here and there in the inflection pattern.

Usage tags

The usage extents are marked using the following tags:

Semantic tags

Multiple Semantic tags:

Semantics are classified with

Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

Morphophonology

To represent phonological variation in word forms we use the following symbols in the lexicon files:

And the following triggers to control variation

Symbols that need to be escaped on the lower side (towards twolc):

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
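The three NeedNoun flags above combine the P (positive set), D (disallow) and C (clear) operations. The following Python sketch shows how a sequence of such flags along a path is accepted or rejected. It is a simplified model written for illustration, not HFST code: only the P, D, C and U operations are handled, and the function names are assumptions of this example.

```python
import re

FLAG_RE = re.compile(r"@([PNRDCU])\.([^.@]+)(?:\.([^.@]+))?@")


def parse_flag(symbol):
    """Split a flag diacritic into (operation, feature, value or None)."""
    match = FLAG_RE.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a flag diacritic: {symbol!r}")
    return match.groups()


def path_is_valid(flags):
    """Check a sequence of flag diacritics (P, D, C and U only)."""
    state = {}
    for symbol in flags:
        operation, feature, value = parse_flag(symbol)
        current = state.get(feature)
        if operation == "P":            # set the feature to the value
            state[feature] = value
        elif operation == "C":          # clear the feature
            state.pop(feature, None)
        elif operation == "D":          # fail if the feature has this value
            if value is None:
                if current is not None:
                    return False
            elif current == value:
                return False
        elif operation == "U":          # unify: set if unset, else must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
        else:
            raise NotImplementedError(f"operation {operation} not modelled")
    return True


if __name__ == "__main__":
    print(path_is_valid(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
    print(path_is_valid(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```

In this model a path that sets NeedNoun and then meets the D flag is blocked, while a path that clears the flag in between (as a nominalising derivation would) is allowed through.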

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Flag diacritic Explanation
@U.number.one@ Flag used to give Arabic numerals in smj different cases
@U.number.two@ Flag used to give Arabic numerals in smj different cases
@U.number.three@ Flag used to give Arabic numerals in smj different cases
@U.number.four@ Flag used to give Arabic numerals in smj different cases
@U.number.five@ Flag used to give Arabic numerals in smj different cases
@U.number.six@ Flag used to give Arabic numerals in smj different cases
@U.number.seven@ Flag used to give Arabic numerals in smj different cases
@U.number.eight@ Flag used to give Arabic numerals in smj different cases
@U.number.nine@ Flag used to give Arabic numerals in smj different cases
@U.number.zero@ Flag used to give Arabic numerals in smj different cases

The Root lexicon

Here it all starts

The word forms of the Meadow Mari language start from the lexeme roots of the following basic word classes:

Continuation lexica

Here comes a set of ragbag continuation lexica.


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-acronyms.lexc.md

Eastern Mari acronym file

Here is the list of lexicalised Sem/Org acronym proper nouns. These are also generated by the Acrogenerator.


This (part of) documentation was generated from src/fst/morphology/stems/acronyms.lexc


src-fst-morphology-stems-exceptions.lexc.md

NOUNS

KIN TERMS

Single-syllable nouns in У Ӱ Ю

VERBS


This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


src-fst-morphology-stems-mhr-propernouns.lexc.md

The Meadow and Eastern Mari proper noun lexicon

MARI-LIKE NAMES

PLACE NAMES


This (part of) documentation was generated from src/fst/morphology/stems/mhr-propernouns.lexc


src-fst-morphology-stems-nouns_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

автор:автор N_ "(eng) /(fin) /(rus) " ;
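The entry shown above has the shape lemma:stem, continuation lexicon, quoted gloss, semicolon. A small Python sketch that pulls these fields apart may help when checking new entries before they go into the xml sources. It assumes straight double quotes and one entry per line, and it does not cover the full lexc syntax (for example entries without a colon, or escaped characters).

```python
import re

# lemma:stem CONTLEX "gloss" ;   (the gloss is optional)
ENTRY_RE = re.compile(
    r'^(?P<upper>[^:\s]+):(?P<lower>\S+)\s+'
    r'(?P<contlex>\S+)'
    r'(?:\s+"(?P<gloss>[^"]*)")?\s*;\s*$'
)


def parse_entry(line):
    """Return the fields of a simple lexc entry as a dict, or None."""
    match = ENTRY_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    entry = parse_entry('автор:автор N_ "(eng) /(fin) /(rus) " ;')
    print(entry["upper"], entry["lower"], entry["contlex"], repr(entry["gloss"]))
```

A line that does not fit this simplified pattern returns `None`, which makes the sketch usable as a quick sanity filter rather than a full parser.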

ADD NOUNS BELOW

PROPER NAMES


This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc


src-fst-morphology-stems-numerals.lexc.md

Meadow & Eastern Mari numerals

The initial lexica

The Roman numerals


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-phonetics-txt2ipa.xfscript.md

retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v)
alveolar approximant r\
retroflex approximant r`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L

Clicks

bilabial O\ (O = capital letter)
dental |
(post)alveolar !\
palatoalveolar =\
alveolar lateral ||

Ejectives, implosives

ejective > e.g. ejective p p>
implosive < e.g. implosive b b<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa ə @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress “
secondary stress %
long :
half-long :\
extra-short _X
linking mark -

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L

contour, rising-falling _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise
global fall

Diacritics

voiceless 0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _”
syllabic = (or _=) e.g. n= (or n=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ‘ (or _j) e.g. t’ (or t_j)
velarized _G
pharyngealized _?\

dental d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in Eastern Mari are read out, e.g. for text-to-speech systems.

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

MEADOW MARI GRAMMAR CHECKER

DELIMITERS

The delimiters are: “<.>” “<!>” “<?>” “<…>” “<¶>” sent

The Tags section lists all the tags inherited from the fst, and defines them for use in the syntactic analysis. The tags are documented in the root.lexc file, and here only listed for reference.

The next section, Sets, contains sets defined on the basis of the tags listed here. Those set names are not visible in the output.

Tags

Beginning and end of sentence

BOS EOS

Clause boundary

Parts of speech tags

N V A Adv CC CS Interj Pron Num Pcle Clt Po

ABBR ACR

Punctuation marks

CLB LEFT RIGHT WEB (LEFT RIGHT because of apertium)

WORD is the set of all POS

Verbal tense and mood tags

Prs Prt1 Prt2 Fut Imprt Ind Cond Des

Other verbal tags

Act ConNeg FutPrc Ger Inf Nec Neg NegPrc Pass Prc PrfPrc

Verbal person-number tags Sg1 Sg2 Sg3 Pl1 Pl2 Pl3

Number tags

Sg Pl

Case tags

Nom Gen Abl Dat Com Cns Acc Ins Ine Ill Cmpr (case)

Other nominal tags

Pers Refl Rel Interr Recipr Dem ABBR

Adjective comparison tags

Pos (?) Superl Comp

Attr

Possessive suffix tags

PxSg1 PxSg2 PxSg3 PxPl1 PxPl2 PxPl3

Numeral tags

Card Coll Ord Temp (?)

Derivation tags

Der/MWN Der/sa

Particles

Qst Foc

Tags for internal testing

CmpTest Err

Sets

Grammarchecker rules begin here

Grammarchecker sets

Grammarchecker rules

Speller rules

Agreement rules

Negation verb rules

Postposition rules

NP internal rules

Punctuation rules


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for mhr

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

Unknowns are made of:
    • lower-case ASCII
    • upper-case ASCII
    • some cyrillic
    • select extended latin symbols
    • mhr-specific alphabets
    • ASCII digits
    • select symbols

Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
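The treatment of empty analyses described above can be pictured with a short Python sketch. This is a conceptual model only, not the hfst-tokenise implementation: the function name `resolve_readings` and the sample analysis string are assumptions made for the example.

```python
def resolve_readings(readings):
    """Model how empty analyses are handled for one token.

    `readings` is a list of analysis strings where "" is an empty analysis.
    Returns (kept_readings, is_unknown):
      * if any non-empty analysis exists, the empty ones are dropped;
      * if only empty analyses exist, the token counts as unknown.
    """
    non_empty = [reading for reading in readings if reading]
    if non_empty:
        return non_empty, False
    return [], True


if __name__ == "__main__":
    # "ja+CC" is a hypothetical analysis string used for illustration.
    print(resolve_readings(["ja+CC", ""]))
    print(resolve_readings([""]))
```

Resolving this at the tokeniser level means later CG rules never have to test for readings that carry no tags at all.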

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-disamb-gt-desc.thirties.pmscript.md

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Issues:

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.thirties.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for mhr

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc) Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript