Plains Cree NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources


How to handle variation in lexc

Different orthographies

The lexc source can be written in SRO (Standard Roman Orthography).

From this source one can compile alternative FSTs with systematic variation, as a basis for generating spell checkers and ICALL programs.

For an analyser that is used to analyse running text, one can apply spell relax so that the analyser accepts all orthographies. With spell relax, the output contains no tags that tell which orthography was used.
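The effect of spell relax can be pictured as a many-to-one mapping applied before lookup. The following is a minimal Python sketch, not the actual FST implementation: the character mapping (macron long vowels folded to circumflex ones), the `RELAX` and `ANALYSES` names, and the one-entry analysis table with its tag string are all assumptions made for illustration.

```python
import unicodedata

# Assumption: the lexicon uses circumflex long vowels, and macron
# spellings should be accepted too (macron -> circumflex).
RELAX = str.maketrans({
    "\u0101": "\u00e2",  # a with macron -> a with circumflex
    "\u0113": "\u00ea",  # e with macron -> e with circumflex
    "\u012b": "\u00ee",  # i with macron -> i with circumflex
    "\u014d": "\u00f4",  # o with macron -> o with circumflex
})

# Hypothetical one-entry analysis table; the tag string is illustrative only.
ANALYSES = {"nâpêw": ["nâpêw+N+A+Sg"]}


def analyse(word):
    """Look up a word after folding variant spellings to the lexicon form."""
    relaxed = unicodedata.normalize("NFC", word).translate(RELAX)
    return ANALYSES.get(relaxed, [])


# Both spellings give the same analysis, and nothing in the output
# records which orthography the input used.
print(analyse("n\u00e2p\u00eaw"))
print(analyse("n\u0101p\u0113w"))
```

In the real setup the relaxation is part of the FST itself rather than a preprocessing step, but the observable behaviour is the same: several spellings in, one untagged analysis out.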

Words with another orthography, from other dialects:

Non-normative forms: Err/Sub

These are forms that do not follow the orthographic principles but still occur in texts that we want to analyse.

Example from North Saami: “bázáhus” is a non-normative form of the lemma “bázahus”.

The normative form should be on the left side of the colon, so that the lemma in the analysis is the normative form and can be found, for example, in the dictionary.

bázahus:bázahuss JOHTOLAT "remainder" ;
bázahus+Err/Orth:bázáhuss JOHTOLAT "remainder" ;

The descriptive FST will inflect both “bázahus” and “bázáhus”, but the string with the error tag (Err/Orth in this example) is removed from the normative analyser/generator during the compilation process.

bázahusat 
bázahusat	bázahus+N+Pl+Nom

bázáhusat
bázáhusat	bázahus+Err/Orth+N+Pl+Nom

The normative analyser:

bázahusat 
bázahusat	bázahus+N+Pl+Nom

bázáhusat
bázáhusat	bázáhusat	+?

The word itself is non-normative: Err/Lex

Example: “brillefutterála”, which is a slightly adapted loanword from Norwegian into North Saami. The normative word is “čalbmelássaskuohppu”.

brillefutterála+Err/Lex:brille#futterál SOSIAL "spectacle case" ;

The descriptive FST will inflect “brillefutterála”, but the line with the tag Err/Lex is removed from the normative analyser/generator during the compilation process.

brillefutterálat 
brillefutterálat	brillefutterála+N+Err/Lex+Pl+Nom

The normative analyser:

brillefutterálat
brillefutterálat	brillefutterálat	+?
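The relation between the descriptive and the normative analyser in the two examples above can be sketched in a few lines of Python. This is only an illustration of the idea: the real removal happens on the FSTs at compile time, and the `DESCRIPTIVE` table and the `normative` helper are hypothetical names, filled with the analyses shown above.

```python
# Toy descriptive analyser: surface form -> analyses (from the examples above).
DESCRIPTIVE = {
    "bázahusat": ["bázahus+N+Pl+Nom"],
    "bázáhusat": ["bázahus+Err/Orth+N+Pl+Nom"],
    "brillefutterálat": ["brillefutterála+N+Err/Lex+Pl+Nom"],
}


def normative(table):
    """Drop every analysis that carries an +Err/... tag."""
    result = {}
    for surface, analyses in table.items():
        kept = [a for a in analyses if "+Err/" not in a]
        if kept:
            result[surface] = kept
    return result


NORMATIVE = normative(DESCRIPTIVE)

for word in DESCRIPTIVE:
    # Mimic the "+?" line shown for unknown words in the examples above.
    print(word, NORMATIVE.get(word, [word + "\t+?"]))
```

Only “bázahusat” survives in the normative table; the two error-tagged forms fall through to the unknown-word case, as in the normative analyser output above.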

Lexical homonymy: how to identify the correct lemma, e.g. in a dictionary

Two lemmas can have identical base forms, but different paradigms and semantics.

The lemmas belong to different stem categories: add morphological tags

Example from North Saami. G3 = Grade 3 in consonant gradation

beassi:beassi BEARRI "nest" ;
beassi+G3:beas'si AIGI "birchbark" ;

Analysis:

beassi
beassi	beassi+N+G3+Sg+Nom
beassi	beassi+N+G3+Sg+Acc
beassi	beassi+N+G3+Sg+Gen
beassi	beassi+N+Sg+Nom

beasi
beasi	beassi+N+Sg+Gen
beasi	beassi+N+Sg+Acc

Example from North Saami. NomAg tag for derivation Nomen Agentis

vuovdi+NomAg:vuovdi ACTOR "salesman" ;
vuovdi:vuov'di AIGI "forest" ;

Analysis:

vuovdi
vuovdi	vuovdi+N+NomAg+Sg+Nom
vuovdi	vuovdi+N+NomAg+Sg+Acc
vuovdi	vuovdi+N+NomAg+Sg+Gen
vuovdi	vuovdi+N+Sg+Nom

vuovddi
vuovddi	vuovdi+N+Sg+Gen
vuovddi	vuovdi+N+Sg+Acc

Instead of morphological tags, one can add homonymy tags

Example from South Saami, two verbs:

govledh+Hom1:govl TJOEHPEDH_TV "hear" ;
govledh+Hom2:govl VÅÅJNEDH "sound" ;

Analysis:

gåvla
gåvla	govledh+Hom1+V+TV+Ind+Prs+Sg3

govloe
govloe	govledh+Hom2+V+IV+Ind+Prs+Sg3
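One way a dictionary lookup could use these tags is to key each entry on the lemma plus its distinguishing tag. The sketch below is a hypothetical Python illustration: the `DICTIONARY` table, the `DISTINGUISHING` tag set and the `gloss` function are assumed names, and the glosses are the ones given in the lexc entries above. The untagged “forest” entry follows the analysis output shown above, where that reading carries no extra tag.

```python
# Tags that separate homonymous lemmas in the examples above.
DISTINGUISHING = {"G3", "NomAg", "Hom1", "Hom2"}

# Hypothetical dictionary keyed on (lemma, distinguishing tag or None).
DICTIONARY = {
    ("beassi", None): "nest",
    ("beassi", "G3"): "birchbark",
    ("vuovdi", "NomAg"): "salesman",
    ("vuovdi", None): "forest",
    ("govledh", "Hom1"): "hear",
    ("govledh", "Hom2"): "sound",
}


def gloss(analysis):
    """Return the gloss for an analysis string such as 'beassi+N+G3+Sg+Nom'."""
    lemma, *tags = analysis.split("+")
    # The first distinguishing tag found (if any) selects the entry.
    marker = next((t for t in tags if t in DISTINGUISHING), None)
    return DICTIONARY[(lemma, marker)]


print(gloss("beassi+N+G3+Sg+Nom"))
print(gloss("beassi+N+Sg+Gen"))
print(gloss("govledh+Hom2+V+IV+Ind+Prs+Sg3"))
```

Without the extra tag, the two readings of a form such as “beassi” would point to the same lemma string and the dictionary could not tell them apart.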

Orthographic variants (all normative) of the same lemma: tags v1, v2…

One lemma can have orthographic variants of the base form and of at least parts of the inflection paradigm. We can add a variant tag to help recognise the correct base form for the paradigm.

Example from North Saami:

mandáhta+v2:mandáhtta GOAHTI-A "mandate" ;
mandáhta+v1:mandáhta STAHTA "mandate" ;

Generation with normative generator gives:

mandáhta+v2+N+Ess
mandáhta+v2+N+Ess	mandáhttan

mandáhta+v1+N+Ess
mandáhta+v1+N+Ess	mandáhtan
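When generating, the variant tag selects which base form the paradigm is built on. As a small hypothetical Python sketch (the `GENERATOR` table merely stands in for the normative generator and contains only the two forms shown above; `variants` is an assumed helper name), one could request the same form once per variant tag:

```python
# Toy stand-in for the normative generator, filled with the output shown above.
GENERATOR = {
    "mandáhta+v1+N+Ess": "mandáhtan",
    "mandáhta+v2+N+Ess": "mandáhttan",
}


def variants(lemma, tags, variant_tags=("v1", "v2")):
    """Generate the requested form once per variant tag, skipping missing ones."""
    forms = {}
    for v in variant_tags:
        key = f"{lemma}+{v}+{tags}"
        if key in GENERATOR:
            forms[v] = GENERATOR[key]
    return forms


print(variants("mandáhta", "N+Ess"))
```

Keeping the variant tag directly after the lemma, as in the analyses above, makes it straightforward to build such request strings.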

If the base forms are identical, but there are variants in the inflection, we don’t use these tags.