How to handle variation in lexc
Different orthographies
The lexc source can be written in SRO (the Standard Roman Orthography).
From this source one can compile alternative FSTs with systematic variation, as a basis for generating spell checkers and ICALL programs:
- an FST with macrons
- an FST with circumflexes
- an FST without length marking
- an FST that converts to syllabics
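The systematic relation between the first three variants can be illustrated outside the FST toolchain. The following is a minimal Python sketch, not the actual compilation step. It assumes that the base orthography marks long vowels with a macron, and the sample word in the usage note is an illustration that does not come from this text.

```python
import unicodedata

# Assumption: the base lexicon marks long vowels with a macron.
MACRON = "\u0304"      # combining macron
CIRCUMFLEX = "\u0302"  # combining circumflex


def to_circumflex(word: str) -> str:
    """Replace every macron with a circumflex."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(MACRON, CIRCUMFLEX))


def without_length_marking(word: str) -> str:
    """Drop macrons and circumflexes, leaving plain vowels."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = decomposed.replace(MACRON, "").replace(CIRCUMFLEX, "")
    return unicodedata.normalize("NFC", stripped)
```

For example, to_circumflex("nēhiyawēwin") gives "nêhiyawêwin", and without_length_marking applied to either spelling gives "nehiyawewin". Conversion to syllabics needs a full character mapping and is left out of this sketch.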
For an analyser intended for text analysis, one can use spell relax so that the analyser accepts all orthographies. With spell relax, the output contains no tags indicating which orthography was used.
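The effect of spell relax can be imitated with a small Python sketch. It is only an illustration of the idea, not how spell relax is implemented in the FST. The function names and the sample word list are hypothetical, and the sketch assumes that the orthographies differ only in macrons and circumflexes.

```python
import unicodedata


def fold(word: str) -> str:
    """Remove macrons and circumflexes so that all spellings compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = [ch for ch in decomposed if ch not in ("\u0304", "\u0302")]
    return unicodedata.normalize("NFC", "".join(kept))


def relaxed_lookup(word: str, base_forms: list[str]) -> list[str]:
    """Return every base form that matches `word` once length marks are ignored."""
    key = fold(word)
    return [form for form in base_forms if fold(form) == key]


# Hypothetical word list written in the base orthography (macrons assumed).
BASE_FORMS = ["nēhiyawēwin"]
```

Here relaxed_lookup("nêhiyawêwin", BASE_FORMS) and relaxed_lookup("nehiyawewin", BASE_FORMS) both return the same base form, and nothing in the result records which spelling was given as input.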
Words with another orthography, from other dialects:
- if systematic, it can be handled in the compilation process
- if not systematic, one can use tags, e.g. Dial/Mask. They will be included in the compilation when one asks for it.
Non-normative forms: Err/Orth
These are forms that do not follow the orthographic principles but still occur in texts that we want to analyse.
Example from North Saami: “bázáhus” is a non-normative form of the lemma “bázahus”.
The normative form should be on the left side of the entry, so that the lemma in the analysis is the normative form and can be found, e.g., in a dictionary.
bázahus:bázahuss JOHTOLAT "remainder" ;
bázahus+Err/Orth:bázáhuss JOHTOLAT "remainder" ;
The descriptive FST will inflect both “bázahus” and “bázáhus”, but the string with the tag Err/Orth is removed from the normative analyser/generator during the compilation process.
bázahusat
bázahusat bázahus+N+Pl+Nom
bázáhusat
bázáhusat bázahus+Err/Orth+N+Pl+Nom
The normative analyser:
bázahusat
bázahusat bázahus+N+Pl+Nom
bázáhusat
bázáhusat bázáhusat +?
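The difference between the descriptive and the normative analyser can be imitated at source level with a short Python sketch. This is an illustration only: in the real compilation the tagged strings are removed from the FST rather than from the lexc lines, and the function name is hypothetical.

```python
import re

# The two entries from the example above, as plain strings.
DESCRIPTIVE_ENTRIES = [
    'bázahus:bázahuss JOHTOLAT "remainder" ;',
    'bázahus+Err/Orth:bázáhuss JOHTOLAT "remainder" ;',
]

# Matches any tag that starts with +Err/, e.g. +Err/Orth or +Err/Lex.
ERR_TAG = re.compile(r"\+Err/\w+")


def normative_entries(entries):
    """Keep only the entries whose left (lemma) side carries no +Err/ tag."""
    kept = []
    for entry in entries:
        left_side = entry.split(":", 1)[0]
        if not ERR_TAG.search(left_side):
            kept.append(entry)
    return kept
```

Applied to DESCRIPTIVE_ENTRIES, the function keeps only the first entry, which corresponds to the normative analyser failing to recognise “bázáhusat”.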
The word itself is non-normative: Err/Lex
Example: “brillefutterála” is a slightly adapted Norwegian loanword in North Saami. The normative word is “čalbmelássaskuohppu”.
brillefutterála+Err/Lex:brille#futterál SOSIAL "spectacle case" ;
The descriptive FST will inflect “brillefutterála”, but the line with the tag Err/Lex is removed from the normative analyser/generator during the compilation process.
brillefutterálat
brillefutterálat brillefutterála+N+Err/Lex+Pl+Nom
The normative analyser:
brillefutterálat
brillefutterálat brillefutterálat +?
Lexical homonymy: how to identify the correct lemma, e.g. in a dictionary
Two lemmas can have identical base forms, but different paradigms and semantics.
The lemmas belong to different stem categories: add morphological tags
Example from North Saami. G3 = Grade 3 in consonant gradation
beassi:beassi BEARRI "nest" ;
beassi+G3:beas'si AIGI "birchbark" ;
Analysis:
beassi
beassi beassi+N+G3+Sg+Nom
beassi beassi+N+G3+Sg+Acc
beassi beassi+N+G3+Sg+Gen
beassi beassi+N+Sg+Nom
beasi
beasi beassi+N+Sg+Gen
beasi beassi+N+Sg+Acc
Example from North Saami. NomAg = tag for the derivation Nomen Agentis
vuovdi+NomAg:vuovdi ACTOR "salesman" ;
vuovdi+G3:vuov'di AIGI "forest" ;
Analysis:
vuovdi
vuovdi vuovdi+N+NomAg+Sg+Nom
vuovdi vuovdi+N+NomAg+Sg+Acc
vuovdi vuovdi+N+NomAg+Sg+Gen
vuovdi vuovdi+N+Sg+Nom
vuovddi
vuovddi vuovdi+N+Sg+Gen
vuovddi vuovdi+N+Sg+Acc
Instead of morphological tags, one can add homonymy tags
Example from South Saami, two verbs:
govledh+Hom1:govl TJOEHPEDH_TV "hear" ;
govledh+Hom2:govl VÅÅJNEDH "sound" ;
Analysis:
gåvla
gåvla govledh+Hom1+V+TV+Ind+Prs+Sg3
govloe
govloe govledh+Hom2+V+IV+Ind+Prs+Sg3
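The tags in these examples make it possible to pick the right dictionary entry from an analysis string. The following Python sketch shows one way to do that. The dictionary structure and the function name are hypothetical, and the glosses are the ones given in the lexc entries above.

```python
# Tags that distinguish homonymous lemmas in the examples of this section.
DISAMBIGUATING_TAGS = {"G3", "NomAg", "Hom1", "Hom2"}


def dictionary_key(analysis: str):
    """Split an analysis into (lemma, disambiguating tag or None)."""
    lemma, *tags = analysis.split("+")
    marker = next((tag for tag in tags if tag in DISAMBIGUATING_TAGS), None)
    return lemma, marker


# Hypothetical dictionary keyed on lemma plus disambiguating tag.
DICTIONARY = {
    ("beassi", None): "nest",
    ("beassi", "G3"): "birchbark",
    ("govledh", "Hom1"): "hear",
    ("govledh", "Hom2"): "sound",
}
```

With this key, DICTIONARY[dictionary_key("beassi+N+G3+Sg+Nom")] gives "birchbark", while DICTIONARY[dictionary_key("beassi+N+Sg+Gen")] gives "nest".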
Orthographic variants (all normative) of the same lemma: tags v1, v2…
A lemma can have orthographic variants in the base form and in at least parts of the inflection paradigm. We can add a variant tag to help identify the correct base form for each paradigm.
Example from North Saami:
mandáhta+v2:mandáhtta GOAHTI-A "mandate" ;
mandáhta+v1:mandáhta STAHTA "mandate" ;
Generation with the normative generator gives:
mandáhta+v2+N+Ess
mandáhta+v2+N+Ess mandáhttan
mandáhta+v1+N+Ess
mandáhta+v1+N+Ess mandáhtan
If the base forms are identical, but there are variants in the inflection, we don’t use these tags.
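For the mandáhta example above, the following Python sketch groups the generated forms under the analysis with its variant tag removed, which makes it visible that both forms realise the same lemma and the same grammatical analysis. The function name is hypothetical, and the input pairs are copied from the generator output shown above.

```python
import re
from collections import defaultdict

# (analysis, generated form) pairs from the generator output above.
GENERATED = [
    ("mandáhta+v2+N+Ess", "mandáhttan"),
    ("mandáhta+v1+N+Ess", "mandáhtan"),
]

# Matches a variant tag such as +v1 or +v2.
VARIANT_TAG = re.compile(r"\+v\d+")


def forms_by_analysis(pairs):
    """Group surface forms under the analysis with its +vN tag removed."""
    grouped = defaultdict(list)
    for analysis, form in pairs:
        grouped[VARIANT_TAG.sub("", analysis, count=1)].append(form)
    return dict(grouped)
```

Calling forms_by_analysis(GENERATED) returns a single key, "mandáhta+N+Ess", with both generated forms listed under it.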