Plains Cree NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-crk

Plains Cree language model documentation

All doc-comment documentation in one large file.


src-cg3-disambiguator.cg3.md

Plains Cree disambiguator


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-textanalysis.cg3.md

Plains Cree disambiguator


This (part of) documentation was generated from src/cg3/textanalysis.cg3


src-derivation-crk-drv.lexc.md

Nouns Verbs


This (part of) documentation was generated from src/derivation/crk-drv.lexc


src-fst-morphology-affixes-noun_affixes.lexc.md

NOUN_ENDLEX for wrapping up various things

End of noun affixes code


This (part of) documentation was generated from src/fst/morphology/affixes/noun_affixes.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper noun inflection

Plains Cree proper nouns inflect in the same cases as regular nouns, but with a colon (‘:’) as separator.


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verb_affixes.lexc.md

Plains Cree verb morphology

Plains Cree verbs are divided into four groups:

  1. AI: Animate intransitive
  2. II: Inanimate intransitive
  3. TA: Transitive animate
  4. TI: Transitive inanimate
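
The four class codes recur in the lexicon names further down (`VII…`, `VAI…`, `VTI…`, `VTA…`). As a minimal illustrative sketch, not taken from the lang-crk sources, the grouping can be written as a lookup keyed by code. The `VerbClass` type and the `VERB_CLASSES` name are hypothetical, and the reading of "animate" as subject animacy for intransitives and object animacy for transitives follows the usual Algonquianist convention rather than anything stated in this file.

```python
from typing import NamedTuple


class VerbClass(NamedTuple):
    """Hypothetical record describing one of the four verb groups."""
    transitive: bool
    animate: bool  # subject animacy (intransitive) or object animacy (transitive)
    label: str


# Keys are the class codes used in the list above.
VERB_CLASSES = {
    "AI": VerbClass(False, True, "Animate intransitive"),
    "II": VerbClass(False, False, "Inanimate intransitive"),
    "TA": VerbClass(True, True, "Transitive animate"),
    "TI": VerbClass(True, False, "Transitive inanimate"),
}


def describe(code: str) -> str:
    """Return the human-readable label for a class code."""
    return VERB_CLASSES[code].label
```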

Prefixes

LEXICON VerbPrefixes divides the lexicon into four modes: independent, conjunct, imperative and future conditional

LEXICON INDEPENDENT gives flags and prefixes for personprefix Hypotheticals

LEXICON IND_TENSE gives flags and prefixes for tense

LEXICON FUTURE_CONDITIONAL gives flags for future conditional (no prefix)

LEXICON CONJUNCT gives flag for conjunct and combined tense preverbs

LEXICON CNJ_TENSE gives prefixes and flags for tense in conjunct

LEXICON IMPERATIVE gives flag for imperative (no prefixes)
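
The prefix lexica above can be summarised as a small table. The sketch below is only an illustration in Python, not part of the LEXC source. The `OrderInfo` type and `ORDERS` name are hypothetical. The prefix column follows the descriptions above together with the later note that the conjunct always has the ê- prefix, and the tense column follows the later notes that imperative and future conditional do not take tense.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrderInfo:
    """Hypothetical summary of one of the four modes."""
    lexicon: str                  # LEXICON name as listed above
    has_prefix: bool              # whether the mode comes with a prefix
    tense_lexicon: Optional[str]  # lexicon handling tense, if any


ORDERS = {
    "independent": OrderInfo("INDEPENDENT", True, "IND_TENSE"),
    "conjunct": OrderInfo("CONJUNCT", True, "CNJ_TENSE"),
    "imperative": OrderInfo("IMPERATIVE", False, None),
    "future conditional": OrderInfo("FUTURE_CONDITIONAL", False, None),
}


def orders_without_tense():
    """Return the modes that go straight to the person lexica."""
    return sorted(name for name, info in ORDERS.items()
                  if info.tense_lexicon is None)
```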

Preverbs

LEXICON VERBPREFIXES just adds the prefix boundary

Now, LEXC directs us to the ../stems/verb_stems.lexc file, where we find all the verbal stems. The suffixes are then found in the section “Suffixes” right underneath.

Suffixes

Intransitive inanimate (II)

LEXICON VIIn

LEXICON VIIn_SG

LEXICON VIIw_PL

= LEXICON VIIw_PL NO LONGER NEEDED FROM AROK +V+II: VIIw_PL_WICI ; NO LONGER NEEDED FROM AROK

LEXICON VIIw

LEXICON VIIw_SG

LEXICON VIIn_PL NO LONGER NEEDED FROM AROK NO LONGER NEEDED FROM AROK

 NO LONGER NEEDED FROM AROK @U.wici.NULL@ VIIw_PL_ORDER ; NO LONGER NEEDED FROM AROK

@U.wici.NULL@ VIIw_PL_ORDER ;

LEXICON VIIw_SGPL_ORDER

LEXICON VIIw_SG_ORDER singular only

LEXICON VIIw_PL_ORDER plural only

= LEXICON VIIw_PL_ORDER plural only @U.order.indep@+Ind:@U.order.indep@ VIIw_PL_IND_PERSON ; ! @U.order.cnj@+Cnj:@U.order.cnj@ VIIw_PL_CNJ_PERSON ; ! @U.order.FutCon@+Fut+Cond:@U.order.FutCon@ VIIw_PL_FUT_CON_PERSON ;!

LEXICON VIIn_SGPL_ORDER

LEXICON VIIn_SG_ORDER singular only

LEXICON VIIn_PL_ORDER plural only

LEXICON VIIw_SG_IND_TENSE plural only

LEXICON VIIw_SG_CNJ_TENSE plural only

LEXICON VIIw_PL_IND_TENSE plural only

LEXICON VIIw_PL_CNJ_TENSE plural only

= LEXICON VIIw_PL_CNJ_TENSE plural only @U.tense.Prs@+Prs:@U.tense.Prs@ VIIw_PL_IND_PERSON ; ! @U.tense.Prt@+Prt:@U.tense.Prt@ VIIw_PL_IND_PERSON ; ! @U.tense.FutDef@+Fut+Def:@U.tense.FutDef@ VIIw_PL_IND_PERSON ; ! @U.tense.FutInt@+Fut+Int:@U.tense.FutInt@ VIIw_PL_IND_PERSON ; !

= LEXICON VIIw_PL_CNJ_TENSE plural only @U.tense.Prs@+Prs:@U.tense.Prs@ VIIw_PL_CNJ_PERSON ; ! @U.tense.Prt@+Prt:@U.tense.Prt@ VIIw_PL_CNJ_PERSON ; ! @U.tense.FutInt@+Fut+Int:@U.tense.FutInt@ VIIw_PL_CNJ_PERSON ; !

LEXICON VIIn_SG_IND_TENSE plural only

LEXICON VIIn_SG_CNJ_TENSE plural only

LEXICON VIIn_PL_IND_TENSE plural only

LEXICON VIIn_PL_CNJ_TENSE plural only

LEXICON VIIw_SGPL_IND_PERSON

LEXICON VIIw_SGPL_CNJ_PERSON

LEXICON VIIw_SGPL_FUT_CON_PERSON

LEXICON VIIw_SG_IND_PERSON

LEXICON VIIw_SG_CNJ_PERSON

LEXICON VIIw_SG_FUT_CON_PERSON

LEXICON VIIw_PL_IND_PERSON

LEXICON VIIw_PL_CNJ_PERSON

LEXICON VIIw_PL_FUT_CON_PERSON

= LEXICON VIIw_PL_FUT_CON_PERSON plural only @U.person.NULL@ VIIw_IND_PL_SUFFIX ;

= LEXICON VIIw_PL_FUT_CON_PERSON plural only @U.person.NULL@ VIIw_CNJ_PL_SUFFIX ;

= LEXICON VIIw_PL_FUT_CON_PERSON plural only @U.person.NULL@ VIIw_FUT_CON_PL_SUFFIX ;

LEXICON VIIn_SGPL_IND_PERSON

LEXICON VIIn_SGPL_CNJ_PERSON

LEXICON VIIn_SGPL_FUT_CON_PERSON

LEXICON VIIn_SG_IND_PERSON

LEXICON VIIn_SG_CNJ_PERSON

LEXICON VIIn_SG_FUT_CON_PERSON

LEXICON VIIn_PL_IND_PERSON plural only

LEXICON VIIn_PL_CNJ_PERSON plural only

LEXICON VIIn_PL_FUT_CON_PERSON plural only

LEXICON VIIn_SGPL_IND_NULL

LEXICON VIIn_SG_IND_SUFFIX singular

LEXICON VIIn_PL_IND_SUFFIX plural

LEXICON VIIw_SGPL_IND_NULL

LEXICON VIIw_SG_IND_SUFFIX w final singular

LEXICON VIIw_PL_IND_SUFFIX w final plural

LEXICON VIIn_SGPL_CNJ_NULL

LEXICON VIIn_SG_CNJ_SUFFIX singular

LEXICON VIIn_PL_CNJ_SUFFIX plural

LEXICON VIIw_SGPL_CNJ_NULL

LEXICON VIIw_SG_CNJ_SUFFIX w final singular

LEXICON VIIw_PL_CNJ_SUFFIX w final plural

LEXICON VIIn_SGPL_FUT_CON_NULL

LEXICON VIIn_SG_FUT_CON_SUFFIX singular

LEXICON VIIn_PL_FUT_CON_SUFFIX plural

LEXICON VIIw_SGPL_FUT_CON_NULL

LEXICON VIIw_SG_FUT_CON_SUFFIX w final singular

LEXICON VIIw_PL_FUT_CON_SUFFIX w final plural

Intransitive animate (AI)

LEXICON VAIw_PL stems that end in â or ê

LEXICON VAIae stems that end in â or ê

LEXICON VAIio stems that end in i, î, o, ô

LEXICON VAIn

LEXICON VAIn_PL

LEXICON VAIm These are VTI3 in Arok’s database
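
The stem-final criteria given for VAIae and VAIio can be sketched as a lookup on the last character of the stem. This is a hypothetical Python helper for illustration only; in the analyser the assignment is made in the stem lexicon, not computed. The example stems in the usage note are assumed illustrations, and n-final and m-final stems are deliberately left unclassified because the text above gives no criterion for them.

```python
import unicodedata

# Stem-final vowel -> continuation lexicon, as described above.
VAI_CLASS_BY_FINAL = {
    "â": "VAIae", "ê": "VAIae",
    "i": "VAIio", "î": "VAIio", "o": "VAIio", "ô": "VAIio",
}


def vai_class(stem: str):
    """Return the lexicon suggested by the stem-final vowel, or None."""
    # Normalise so that a vowel typed with a combining circumflex
    # compares equal to the precomposed character.
    stem = unicodedata.normalize("NFC", stem)
    if not stem:
        return None
    return VAI_CLASS_BY_FINAL.get(stem[-1])
```

For instance, `vai_class("nipâ")` gives `"VAIae"` and `vai_class("api")` gives `"VAIio"`.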

LEXICON VAIn_ORDER

LEXICON VAIn_PL_ORDER plural only

LEXICON VAIae_ORDER

LEXICON VAIw_PL_ORDER plural only

LEXICON VAIio_ORDER

LEXICON VAIn_PL_IND_TENSE plural only

LEXICON VAIn_PL_CNJ_TENSE plural only

LEXICON VAIw_PL_IND_TENSE plural only

LEXICON VAIw_PL_CNJ_TENSE plural only

LEXICON VAIn_IND_PERSON

LEXICON VAIn_CNJ_PERSON

LEXICON VAIn_FUT_CON_PERSON

LEXICON VAIn_IMP_PERSON

LEXICON VAIn_PL_IND_PERSON plural only

LEXICON VAIn_PL_CNJ_PERSON plural only

LEXICON VAIn_PL_FUT_CON_PERSON plural only

LEXICON VAIn_PL_IMP_PERSON plural only

LEXICON VAIw_PL_IND_PERSON plural only

LEXICON VAIw_PL_CNJ_PERSON plural only

LEXICON VAIw_PL_FUT_CON_PERSON plural only

LEXICON VAIw_PL_IMP_PERSON plural only

LEXICON VAIae_IND_PERSON

LEXICON VAIae_CNJ_PERSON

LEXICON VAIw_FUT_CON_PERSON

LEXICON VAIw_IMP_PERSON

LEXICON VAIio_IND_PERSON

LEXICON VAIio_CNJ_PERSON

LEXICON VAIw_IND_NI

LEXICON VAIw_IND_NI_SG_SUFFIX

LEXICON VAIw_IND_NI_PL_SUFFIX

LEXICON VAIw_IND_KI

LEXICON VAIw_IND_KI_SG_SUFFIX

LEXICON VAIw_IND_KI_PL_SUFFIX

LEXICON VAIae_IND_NULL

LEXICON VAIio_IND_NULL

LEXICON VAIw_IND_NULL_PL_SUFFIX

LEXICON VAIn_IND_NI

LEXICON VAIn_IND_NI_SG_SUFFIX

LEXICON VAIn_IND_NI_PL_SUFFIX

LEXICON VAIn_IND_KI

LEXICON VAIn_IND_KI_SG_SUFFIX

LEXICON VAIn_IND_KI_PL_SUFFIX

LEXICON VAIn_IND_NULL

LEXICON VAIn_IND_NULL_SG_SUFFIX

LEXICON VAIn_IND_NULL_PL_SUFFIX

LEXICON VAIae_CNJ_NULL

LEXICON VAIio_CNJ_NULL

LEXICON VAIae_CNJ_NULL_SG_SUFFIX

LEXICON VAIio_CNJ_NULL_SG_SUFFIX

LEXICON VAIw_CNJ_NULL_PL_SUFFIX

LEXICON VAIn_CNJ_NULL

LEXICON VAIn_CNJ_NULL_SG_SUFFIX

LEXICON VAIn_CNJ_NULL_PL_SUFFIX

LEXICON VAIae_FUT_CON_NULL

LEXICON VAIw_FUT_CON_NULL_SG_SUFFIX

+X+4Sg:%>yiki # ;

LEXICON VAIw_FUT_CON_NULL_PL_SUFFIX

+X+4Pl:%>yikwâwi # ;

LEXICON VAIn_FUT_CON_NULL

LEXICON VAIn_FUT_CON_NULL_SG_SUFFIX

+X+4Sg:%>iyiki # ;

LEXICON VAIn_FUT_CON_NULL_PL_SUFFIX

+X+4Pl:%>iyikwâwi # ;
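
Entries such as the ones quoted above follow the general LEXC shape `upper:lower CONTINUATION ;`, where `%` escapes the following character and `#` marks the end of the word. The following Python function is a rough sketch of how such a line can be taken apart; `parse_lexc_entry` is a hypothetical helper, not part of any HFST or foma tooling, and it ignores comments and quoted glosses.

```python
import re


def parse_lexc_entry(line: str):
    """Split 'upper:lower CONT ;' into (upper, lower, continuation)."""
    body = line.strip()
    if body.endswith(";"):
        body = body[:-1].strip()
    # The continuation class is the last whitespace-separated token.
    data, continuation = body.rsplit(None, 1)
    # Split at the first colon that is not escaped with '%'.
    parts = re.split(r"(?<!%):", data, maxsplit=1)
    upper = parts[0]
    lower = parts[1] if len(parts) == 2 else parts[0]

    def unescape(text: str) -> str:
        # '%>' in LEXC stands for a literal '>' (here the suffix boundary).
        return re.sub(r"%(.)", r"\1", text)

    return unescape(upper), unescape(lower), continuation
```

Applied to `+X+4Sg:%>yiki # ;` this yields the analysis side `+X+4Sg`, the surface side `>yiki` and the continuation `#`.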

Transitive inanimate (TI)

LEXICON VTIm_ORDER

LEXICON VTIm_PL_ORDER plural only NOTE: imperative and fut con go straight to person lexica

LEXICON VTIm_PL_IND_TENSE plural only

LEXICON VTIm_PL_CNJ_TENSE plural only

LEXICON VTIm_IND_PERSON

LEXICON VTIm_CNJ_PERSON

LEXICON VTIm_FUT_CON_PERSON

LEXICON VTIm_IMP_PERSON

LEXICON VTIm_PL_IND_PERSON plural only

LEXICON VTIm_PL_CNJ_PERSON plural only

LEXICON VTIm_PL_FUT_CON_PERSON plural only

LEXICON VTIm_PL_IMP_PERSON plural only

LEXICON VTIm_IND_NI

LEXICON VTIm_IND_NI_SG_SUFFIX

LEXICON VTIm_IND_NI_PL_SUFFIX

LEXICON VTIm_IND_KI

LEXICON VTIm_IND_KI_SG_SUFFIX

LEXICON VTIm_IND_KI_PL_SUFFIX

LEXICON VTIm_IND_NULL

LEXICON VTIm_IND_NULL_SG_SUFFIX NOTE: X actor will eventually derive to VII, so it is not yet included as per Arok’s paradigm

Derives to VIIn

LEXICON VTIm_IND_NULL_PL_SUFFIX

Derives to VIIn

LEXICON VTIm_CNJ_NULL

LEXICON VTIm_CNJ_NULL_SG_SUFFIX

+X+4Sg:%>mihiyik # ;

LEXICON VTIm_CNJ_NULL_PL_SUFFIX

+X+4Pl:%>mihiyiki # ;

LEXICON VTIm_FUT_CON_NULL

LEXICON VTIm_FUT_CON_NULL_SG_SUFFIX

+X+4Sg:%>mihiyiki # ;

LEXICON VTIm_FUT_CON_NULL_PL_SUFFIX

+X+3Sg:%>mihkwâwi # ; +X+4Sg:%>mihiyikwâwi # ;

LEXICON VTA_ORDER Note: Imp and Fut Con don’t take tense

LEXICON VTA_PL_ORDER Note: Imp and Fut Con don’t take tense

LEXICON VTAi_ORDER Note: Imp and Fut Con don’t take tense ; Conjugates as TA regular except in 2sg IMM IMP

LEXICON VTAt_ORDER Note: Imp and Fut Con don’t take tense ; Conjugates as TA regular except in 2sg IMM IMP

LEXICON VTA_IND_TENSE plural only

LEXICON VTA_CNJ_TENSE plural only

LEXICON VTA_PL_IND_TENSE plural only

LEXICON VTA_PL_CNJ_TENSE plural only

LEXICON VTA_IND_PERSON

LEXICON VTA_CNJ_PERSON

LEXICON VTA_FUT_CON_PERSON

LEXICON VTA_IMP_PERSON

LEXICON VTA_PL_IND_PERSON

LEXICON VTA_PL_CNJ_PERSON

LEXICON VTA_PL_FUT_CON_PERSON

LEXICON VTA_PL_IMP_PERSON

LEXICON VTAt_IMP_PERSON no -i in 2sg+3SgO

LEXICON VTAi_IMP_PERSON

LEXICON VTA_IND_NI NOTE: No local, as local forms are always with ki-

LEXICON VTA_IND_NI_SG_SUFFIX

LEXICON VTA_IND_NI_PL_SUFFIX

LEXICON VTA_IND_KI

LEXICON VTA_IND_KI_SG_SUFFIX

LEXICON VTA_IND_KI_PL_SUFFIX

LEXICON VTA_IND_NULL NOTE: never local

LEXICON VTA_IND_NULL_SG_SUFFIX

LEXICON VTA_IND_NULL_PL_SUFFIX

~~~~~~~~~~~~~~~~~~~~~~

End of verb affixes LEXC code


This (part of) documentation was generated from src/fst/morphology/affixes/verb_affixes.lexc


src-fst-morphology-phonology.xfscript.md

Definitions

Rules

VG>i2 -> VV

OUTSIDE RULES

INITIAL CHANGE

Composing the rules together


This (part of) documentation was generated from src/fst/morphology/phonology.xfscript


src-fst-morphology-root.lexc.md

Plains Cree morphological analyser

INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Plains Cree LANGUAGE.

Definitions for Multichar_Symbols

Analysis symbols

The morphological analyses of wordforms of Plains Cree are presented in this system in terms of the following symbols. (Following existing standards is strongly recommended when adding new tags.)

POS

Nominal morphology

Particles

ordinals

Verbal MSP

Person prefix fragment features

Nominal morphosyntactic features

Verb conjugation (transitivity + animacy classes)

Noun animacy and dependency classes

Preverbs

Auxiliary symbols

These symbols either shape or govern the morphophonological structure

Symbols that need to be escaped on the lower side (towards twolc):

Special characters for morphophonology

Triggers for various morphophonological phenomena. Mostly, these are not themselves realized as any grapheme or phoneme.

Usage tags

These tags distinguish different special-purpose analysers and generators from each other. Thus, for example, we have normative and descriptive analysers, and generators for different purposes.

Flagdiacritics

These are documented in Chapter 8 of Beesley/Karttunen, e.g. p. 456.

For indicative, there are prefixes, so here we need one flag for each person-number combination. Note that for the inverse objective conjugation, the flag refers to the prefix, not to the subject. So indsg1 refers to either subject = 1Sg or object = 1Sg. The 3-3 forms are prefixless.

The conjunct form always has the ê- prefix, and future conditional never has a prefix.
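
The flags used in the affix lexica (for example `@U.order.indep@` and `@U.tense.Prs@`) are unification flags: the first occurrence of a feature sets its value, and every later occurrence of the same feature must carry the same value or the path is rejected. A minimal Python sketch of that check, assuming only U-type flags and a hypothetical helper name `flags_unify`, could look as follows; in the real analyser the flags are resolved by the FST runtime.

```python
import re

# Matches unification flags of the form @U.feature.value@
FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def flags_unify(path: str) -> bool:
    """Return True if all @U.feature.value@ flags along a path agree."""
    seen = {}
    for feature, value in FLAG.findall(path):
        # setdefault records the first value seen for the feature and
        # returns the recorded value on later occurrences.
        if seen.setdefault(feature, value) != value:
            return False
    return True
```

A path that sets `order` to `indep` in the prefix and meets `@U.order.indep@` again in the suffix is accepted, whereas a path combining `@U.order.indep@` with `@U.order.cnj@` is blocked. This is how a prefix and a suffix on opposite sides of the stem are kept consistent.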

Prefixes with a certain phonological content:

Order

Tense

New multichar symbols for nouns

End of new and all Multichar_Symbols

LEXICON Root is where it all starts


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-abbreviations.lexc.md

File containing abbreviations

Lexica for adding tags and periods

Splitting in 4 + 1 groups, because of the preprocessor

The sublexica

Dividing between abbreviations with and without final period

The lexicons that add tags

The abbreviation lexicon itself

For abbreviations that take numerals as complements but do not necessarily take other words. This group treats the abbreviation as transitive before Arabic numerals but as intransitive before letters.

This lexicon is for abbreviations that always have a constituent following them.


This (part of) documentation was generated from src/fst/morphology/stems/abbreviations.lexc


src-fst-morphology-stems-derivation_stems.lexc.md

Place-holder for inserting derivational FST after prefixes and before suffixes

DRV-FST is the place-holder character

Linking verb stems from Derivational FST to their inflectional suffixes

Nouns

Verbs


This (part of) documentation was generated from src/fst/morphology/stems/derivation_stems.lexc


src-fst-morphology-stems-noun_header.lexc.md

Test lemma/stem set for nouns according to the new crk FST

Complete extraction of lemma:stem info from CW/AEW (2023) and (MD 2023), according to LEXC structure in the new crk FST.


This (part of) documentation was generated from src/fst/morphology/stems/noun_header.lexc


src-fst-morphology-stems-noun_stems.lexc.md

Test lemma/stem set for nouns according to the new crk FST

Complete extraction of lemma:stem info from CW/AEW (2023) and (MD 2023), according to LEXC structure in the new crk FST.


This (part of) documentation was generated from src/fst/morphology/stems/noun_stems.lexc


src-fst-morphology-stems-numeral_symbols.lexc.md

Plains Cree numerals

The file for Arabic, Roman, and other symbolic numerals


This (part of) documentation was generated from src/fst/morphology/stems/numeral_symbols.lexc


src-fst-morphology-stems-numerals.lexc.md

Plains Cree numerals

The file for numerals

Here start the 999 numbers


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-morphology-stems-particle_header.lexc.md

Plains Cree particles

The file contains the following lexicons, with content as described:


This (part of) documentation was generated from src/fst/morphology/stems/particle_header.lexc


src-fst-morphology-stems-particles.lexc.md

Plains Cree particles

The file contains the following lexicons, with content as described:


This (part of) documentation was generated from src/fst/morphology/stems/particles.lexc


src-fst-morphology-stems-pronouns.lexc.md

Plains Cree pronouns

There are more pronouns to be added here.

LEXICON Pronoun

LEXICON Personal
niýa+Pron+Pers+1Sg:niýa # ;
kiýa+Pron+Pers+2Sg:kiýa # ;

LEXICON Interrogative
awîna+Pron+Interr+A+Sg:awîna # “who,whose” ;
awîniki+Pron+Interr+A+Pl:awîniki # “who” ;
awînihi+Pron+Interr+A+Obv:awînihi # “who” ;
awîniwâ+Pron+Interr+A+Obv:awîniwâ # “who” ;

LEXICON Indefinite
awiyak+Pron+Indef+A+Sg:awiyak # “someone” ;
awiyak+Pron+Indef+A+Pl:awiyakak # “some people” ;

LEXICON Demonstrative
ANIMATE
awa+Pron+Dem+Prox+A+Sg:awa # “this” ;
ôki+Pron+Dem+Prox+A+Pl:ôki # “these” ;
ôhi+Pron+Dem+Prox+A+Obv:ôhi # “this/these” ;

INANIMATE
ôma+Pron+Dem+Prox+I+Sg:ôma # “this” ;
ôhi+Pron+Dem+Prox+I+Pl:ôhi # “these” ;

ôma+Pron+Def+Prox+I+Sg:ôma # “this one” ;
ôhi+Pron+Def+Prox+I+Pl:ôhi # “these ones” ;
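
The analysis side of each pronoun entry is a lemma followed by `+`-prefixed tags. As a rough sketch, and assuming that every `+` in the string introduces a tag (which holds for the entries listed here but is not guaranteed for every multichar symbol in the analyser), the analysis can be split and filtered in Python. The helper names `split_analysis` and `has_tags` are hypothetical.

```python
def split_analysis(analysis: str):
    """Split 'lemma+Tag1+Tag2' into (lemma, [Tag1, Tag2])."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


def has_tags(analysis: str, *wanted: str) -> bool:
    """Return True if the analysis carries all of the wanted tags."""
    _, tags = split_analysis(analysis)
    return all(tag in tags for tag in wanted)


# Analyses copied from the entries above.
ANALYSES = [
    "niýa+Pron+Pers+1Sg",
    "awa+Pron+Dem+Prox+A+Sg",
    "ôki+Pron+Dem+Prox+A+Pl",
    "ôma+Pron+Dem+Prox+I+Sg",
]

# Select the animate demonstratives.
animate_dem = [a for a in ANALYSES if has_tags(a, "Dem", "A")]
```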


This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


src-fst-morphology-stems-verb_header.lexc.md

Model verb lemmas and stems for new crk FST

Complete extraction of lemma:stem info from CW/AEW (2023) and (MD 2023), according to LEXC structure in the new crk FST.


This (part of) documentation was generated from src/fst/morphology/stems/verb_header.lexc


src-fst-morphology-stems-verb_stems.lexc.md

Model verb lemmas and stems for new crk FST

Complete extraction of lemma:stem info from CW/AEW (2023) and (MD 2023), according to LEXC structure in the new crk FST.


This (part of) documentation was generated from src/fst/morphology/stems/verb_stems.lexc


src-fst-phonetics-txt2ipa.xfscript.md

Hyphenator and text to ipa for Plains Cree

Defining sets

The rules

Long vowels


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-phonology-morph-bound.twolc.md


ReduplCRule1

ReduplCRule2

ReduplY2Rule

ReduplY3Rule

INITIAL CHANGE


This (part of) documentation was generated from src/fst/phonology-morph-bound.twolc


src-fst-transcriptions-phonology-eng.xfscript.md


This (part of) documentation was generated from src/fst/transcriptions/phonology-eng.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in Plains Cree are read out, e.g. for text-to-speech systems.

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


src-fst-transcriptions-transcriptor-cw-eng-noun-entry2inflected-phrase-w-flags.xfscript.md

Old code


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-cw-eng-noun-entry2inflected-phrase-w-flags.xfscript


src-fst-transcriptions-transcriptor-cw-eng-verb-entry2inflected-phrase-w-flags.xfscript.md

Word-specific explicit solution for verb morphology - Option 1 (works only in FOMA)

Word-specific solution

Word-specific explicit solution for verb morphology - Option 2 (works)

Rule-based solution for verb morphology

Irregular verb forms


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-cw-eng-verb-entry2inflected-phrase-w-flags.xfscript


src-fst-transcriptions-transcriptor-cw-eng-verb-entry2inflected-phrase.xfscript.md

Word-specific explicit solution for verb morphology

Rule-based morphological solution


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-cw-eng-verb-entry2inflected-phrase.xfscript


src-fst-transcriptions-transcriptor-eng-phrase2crk-features.xfscript.md

GENERAL DEFINITIONS

VERBS

NOUNS

Continue list of irregular noun plurals here

Assigning +V (+II/AI/TI/TA) or +N as part-of-speech


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-eng-phrase2crk-features.xfscript


tools-grammarcheckers-grammarchecker.cg3.md

[ L A N G U A G E ] G R A M M A R C H E C K E R

DELIMITERS

TAGS AND SETS

Tags

This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT PUNCT

COMMA ¶

Tags for POS sub-categories

Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

Tags for morphosyntactic properties

Nom Acc Gen Ill Loc Com Ess Ess Sg Du Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

Err/Orth

Semantic tags

Sem/Act Sem/Ani Sem/Atr Sem/Body Sem/Clth Sem/Domain Sem/Feat-phys Sem/Fem Sem/Group Sem/Lang Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

HUMAN

PROP-ATTR PROP-SUR

TIME-N-SET

Syntactic tags

@+FAUXV @+FMAINV @-FAUXV @-FMAINV @-FSUBJ> @-F<OBJ @-FOBJ> @-FSPRED<OBJ @-F<ADVL @-FADVL> @-F<SPRED @-F<OPRED @-FSPRED> @-FOPRED> @>ADVL @ADVL< @<ADVL @ADVL> @ADVL @HAB> @<HAB @>N @Interj @N< @>A @P< @>P @HNOUN @INTERJ @>Num @Pron< @>Pron @Num< @OBJ @<OBJ @OBJ> @OPRED @<OPRED @OPRED> @PCLE @COMP-CS< @SPRED @<SPRED @SPRED> @SUBJ @<SUBJ @SUBJ> SUBJ SPRED OPRED @PPRED @APP @APP-N< @APP-Pron< @APP>Pron @APP-Num< @APP-ADVL< @VOC @CVP @CNP OBJ

-OTHERS SYN-V @X

## Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the sourcefile itself to inspect the sets, what follows here is an overview of the set types.

### Sets for Single-word sets

INITIAL

### Sets for word or not

WORD NOT-COMMA

### Case sets

ADLVCASE CASE-AGREEMENT CASE NOT-NOM NOT-GEN NOT-ACC

### Verb sets

NOT-V

### Sets for finiteness and mood

REAL-NEG MOOD-V NOT-PRFPRC

### Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

### Pronoun sets

### Adjectival sets and their complements

### Adverbial sets and their complements

### Sets of elements with common syntactic behaviour

### NP sets defined according to their morphosyntactic features

### The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression **WORD - premodifiers**.

### Border sets and their complements

### Grammarchecker sets

* * *

This (part of) documentation was generated from [tools/grammarcheckers/grammarchecker.cg3](https://github.com/giellalt/lang-crk/blob/main/tools/grammarcheckers/grammarchecker.cg3)

---

# tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

# Tokeniser for crk

Usage:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200d
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`

Unknowns are made of:

* lower-case ASCII
* upper-case ASCII
* select extended latin symbols
* ASCII digits
* select symbols
* Combining diacritics as individual symbols,
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

## Unknown handling

Unknowns are tagged ?? and treated specially with `hfst-tokenise`.

hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.pmscript](https://github.com/giellalt/lang-crk/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

# Grammar checker tokenisation for crk

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200d
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript](https://github.com/giellalt/lang-crk/blob/main/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

# TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200d
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-tts-cggt-desc.pmscript](https://github.com/giellalt/lang-crk/blob/main/tools/tokenisers/tokeniser-tts-cggt-desc.pmscript)
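
The Unicode whitespace characters enumerated in the tokeniser descriptions above can be collected into a set for quick inspection. The snippet below is a hypothetical Python sketch for checking input text against that list; the pmscript files define the list in their own syntax, and the names `EXTRA_WHITESPACE` and `find_extra_whitespace` are assumptions made for this illustration.

```python
# Code points named in the description: the range En Quad (U+2000)
# to Zero-Width Joiner (U+200D), plus three individual characters.
EXTRA_WHITESPACE = frozenset(
    [chr(cp) for cp in range(0x2000, 0x200D + 1)]
    + [chr(0x202F),   # Narrow No-Break Space
       chr(0x205F),   # Medium Mathematical Space
       chr(0x2060)]   # Word Joiner
)


def find_extra_whitespace(text: str):
    """Return (index, 'U+XXXX') pairs for listed characters found in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if ch in EXTRA_WHITESPACE]
```

Running `find_extra_whitespace` over a test sentence before piping it to `hfst-tokenise` can help explain unexpected token boundaries caused by invisible characters.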