Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-sme
These are notes from someone learning how to add new entries to the finite state lexicon (lexc) files for North Sámi. Adding an entry means adding the new lemma together with the morphological information that the analyser and the generator need to work correctly: the surface and underlying forms, and information about compounding, morphophonology, inflection paradigm and grammaticality.
How to add new words to lexc files:
The following ones exist already:
fáktor+N+Err/Orth+Sem/Semcon:faktora STAHTA ;
fáktor+N+Sem/Semcon:fáktor GAHPIR ;
fáktor+N+Err/Orth+Sem/Semcon:faktu STRUKTUR ;
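The entries above all follow the pattern `lemma+tags:stem CONTINUATION-LEXICON ;`. As a reading aid, here is a minimal Python sketch that splits one entry into those parts. It assumes the simple shape seen in these notes (exactly one unescaped colon, an optional trailing `!` comment); `parse_entry` and `LexcEntry` are hypothetical names and not part of any GiellaLT tool.

```python
from typing import List, NamedTuple


class LexcEntry(NamedTuple):
    lemma: str        # upper side without tags, e.g. "fáktor"
    tags: List[str]   # e.g. ["N", "Sem/Semcon"]
    stem: str         # lower side, e.g. "fáktor"
    contlex: str      # continuation lexicon, e.g. "GAHPIR"


def parse_entry(line: str) -> LexcEntry:
    """Split 'lemma+tags:stem CONTLEX ;' into its parts (assumed simple shape)."""
    body = line.split("!", 1)[0]              # drop a trailing "! comment"
    body = body.strip().rstrip(";").strip()   # drop the closing ";"
    upper, rest = body.split(":", 1)          # upper side : lower side + contlex
    stem, contlex = rest.rsplit(None, 1)      # contlex is the last token
    lemma, *tags = upper.split("+")
    return LexcEntry(lemma, tags, stem, contlex)


if __name__ == "__main__":
    print(parse_entry("fáktor+N+Err/Orth+Sem/Semcon:faktu STRUKTUR ;"))
```

Splitting on the last whitespace keeps escaped spaces (`% `) inside multiword stems intact, since the continuation lexicon name never contains a space in these notes.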
Search for -uvra, then for -ktuvra. Take away -vra from the stem, since it is already in the continuation lexicon STRUKTUR:
DGT-vuođđostruktuvra+N+CmpNP/First+Sem/Dummytag:DGT-vuođđo#struktu STRUKTUR ;
Juvdu:Juvºdu MARJA-U-plc ; becomes Juvdu:Juvdu MARJA-U-plc ;
A) use another continuation lexicon, B) take away the Konrad Nielsen mark º
For º, see twol: Gradation: Cluster Non-nasal Sonorant + Non-sonorant
“jº” is the underlying form and “i” is how we write it
Search for -hit verbs
gieđahit:gieđah MUITAL_TV ;
diskehallat+OLang/NOB:diskehalla RAIMMAHALLA_IV ; ! ^LOAN
Google it first: it is the capital of Myanmar (Burma). Then search for -aw in smi-propernouns.lexc:
Saginaw+OLang/UND:Saginaw BERN-plc ;
Nay% Pyi% Taw+MWE+OLang/UND:Nay% Pyi% Taw BERN-plc ;
This is a proper noun, so it goes in smi-propernouns.lexc:
Frivillighet% Norge+MWE+CmpNP/First+OLang/NOB:Frivillighet% Norge ACCRA-org ;
This is a proper noun in smi (smi-propernouns.lexc). Since there is no entry ending in -boy, I search for -y:
Roy+CmpNP/None+OLang/UND:Roy BERN-mal ;
but Playboy should be -org or -obj
so better:
FairPlay+CmpNP/First+OLang/ENG:FairPlay BERN-obj ;
Playboy+Err/Orth+CmpNP/First+OLang/ENG:Play-boy BERN-obj ; !SUB
Google first
It is like Slaatten+OLang/UND:Slaatten9 LONDON-sur ; but since it is a place it needs LONDON-plc instead of LONDON-sur.
This one goes to shared-smi, as Sem/Org:
Tønsberg% Blad+MWE+CmpNP/First+OLang/NOB:Tønsberg% Blad9 BERN-org ;
Mollekleiv is a last name.
It has the same ending as Hynnekleiv+OLang/NOB:Hynne^kleiv BERN-sur ; so the new entry is:
Mollekleiv+OLang/NOB:Molle^kleiv BERN-sur ;
Butler+OLang/ENG:Butler LONDON-sur ;
Hoge+OLang/NOB:Hoge ACCRA-sur ;
ráŋggáštanriekteteoriija+N+Sem/Dummytag:ráŋggáštan#riekte#te^ori IIJA ;
perfomativitehtateoriija+N+Sem/Prod-cogn:perfomativitehta#te^ori IIJA ;
It is a noun, so try to find another entry with the same ending:
AWG-lágideapmi+N+CmpN/SgN+CmpNP/First:AWG-lágid EAPMI_default_sem ;
But this one has a hyphen, and hyphenated entries are special, so try to find one without:
beassášávvudeapmi+N+CmpN/SgN+Sem/Event:beassáš#ávvud EAPMI_lex_sem ;
biebmoguollešaddadeapmi+N+CmpN/SgN:biebmo#guolle#šaddad EAPMI_default_sem ;
vuorbádeapmi+N+CmpN/SgN+CmpNP/First:vuorbád EAPMI_default_sem ;
How to know which one to use: EAPMI_default_sem gets a default semantic tag, while EAPMI_lex_sem gets a manually assigned semantic tag, for example Sem/Event in the entry above.
The new word means self-harm, so it should use EAPMI_default_sem.
Test the word: `echo iežasnájadeapmi | hfst-lookup -q src/analyser-gt-desc.hfstol`
+CmpNP/First means that the word can only be the first part of a compound; in any other position it is marked as a spelling error. We use this tag when the word is easily confused with another compound that is more common. We also use it if the word is an MWE, for example Sámi% Dáiddaguovddáš+MWE+CmpNP/First:Sámi% Dáidda#guovddáž LONDON-org ;
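To make the MWE convention concrete, here is a small Python sketch that assembles an entry with `+MWE+CmpNP/First` from a list of words. It only reproduces the shape of the MWE entries shown in these notes; `make_mwe_entry` and its parameters are hypothetical, and the choice of continuation lexicon still has to be made by hand.

```python
from typing import List, Optional


def make_mwe_entry(words: List[str], contlex: str,
                   olang: Optional[str] = None,
                   stem: Optional[str] = None) -> str:
    """Build 'Word% Word+MWE+CmpNP/First[+OLang/XX]:stem CONTLEX ;'."""
    lemma = "% ".join(words)             # a literal space is written "% " in lexc
    tags = ["MWE", "CmpNP/First"]        # CmpNP/First directly after MWE
    if olang:
        tags.append(f"OLang/{olang}")    # optional origin-language tag
    upper = "+".join([lemma] + tags)
    lower = stem if stem is not None else lemma  # default: stem equals lemma
    return f"{upper}:{lower} {contlex} ;"


if __name__ == "__main__":
    print(make_mwe_entry(["Frivillighet", "Norge"], "ACCRA-org", olang="NOB"))
```

Pass `stem=` explicitly when the lower side needs boundary marks or a different final consonant, as in `Sámi% Dáidda#guovddáž`.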
next word:
+N+CmpN/SgN:ele#rávdnje#dálkkod EAPMI_default_sem ;
It's a proper noun. Since it is specific to North Sámi (sme) and would get translated in the other Sámi languages, it goes to lang-sme/src/fst/stems/sme-propernouns.lexc (instead of shared-smi).
Davvi% álbmogiid% guovddáš+MWE+CmpNP/First:Davvi% álbmogiid% guovddáž LONDON-org ;
Sámi% Dáiddaguovddáš+MWE+CmpNP/First:Sámi% Dáidda#guovddáž LONDON-org ;
Árbediehtoguovddáš+MWE+CmpNP/First:Sámi% Dáidda#guovddáž LONDON-org ;
Árbediehtoguovddáš+MWE+CmpNP/First:Árbe#diehto#guovddáž LONDON-org ;
It’s an adjective:
guhkesagat+A+Sem/Dummytag:guhkes#ag AGAdj ;
Be careful not to confuse this with similar -agat words where the preceding consonant is part of the word, like lagat+A+Sem/Dummytag+Gram/Comp:laga OVDDIT ;
These are tags that say that the entry can be the first part of a compound, either in the nominative singular (+CmpN/SgN) or in the genitive plural (+CmpN/PlG):
- nominative singular: could be buotagatsearvi
- genitive plural: could be buotagagiidsearvi (this is used in combination with the Sem/Hum tag)
Be aware of morphophonological processes before the # boundary.
It’s an adjective, but not yet in the normative lexicon
unnibuš unni+A+Der/Comp+A+Der/Dimin+A+Attr 0,000000
unnibuš unni+A+Der/Comp+A+Der/Dimin+A+Sg+Nom 0,000000
unnibuš unni+A+Err/Orth+Der/Comp+A+Der/Dimin+A+Attr 0,000000
unnibuš unni+A+Err/Orth+Der/Comp+A+Der/Dimin+A+Sg+Nom 0,000000
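When there are many analyses, it can help to separate the ones tagged `+Err/Orth` from the rest. The following Python sketch parses lookup output of the three-column shape shown above (surface form, analysis, weight). It assumes tab-separated columns as printed by `hfst-lookup`, falls back to whitespace for pasted text, and accepts a decimal comma in the weight; `parse_lookup`, `without_errors` and `Analysis` are hypothetical helper names.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    surface: str
    lemma: str
    tags: List[str]
    weight: float


def parse_lookup(output: str) -> List[Analysis]:
    """Parse 'surface<TAB>lemma+Tag+Tag<TAB>weight' lines."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Prefer tabs; fall back to whitespace for copy-pasted output.
        cols = line.split("\t") if "\t" in line else line.split()
        if len(cols) != 3:
            continue  # skip lines that do not have the expected three columns
        surface, analysis, weight = cols
        lemma, *tags = analysis.split("+")
        rows.append(Analysis(surface, lemma, tags,
                             float(weight.replace(",", "."))))
    return rows


def without_errors(rows: List[Analysis]) -> List[Analysis]:
    """Keep only analyses that carry no Err/Orth tag."""
    return [r for r in rows if "Err/Orth" not in r.tags]


if __name__ == "__main__":
    sample = "unnibuš\tunni+A+Err/Orth+Der/Comp+A+Der/Dimin+A+Attr\t0,000000"
    print(parse_lookup(sample))
```

The output of the `hfst-lookup` command can be piped into a script like this; running the lookup itself is left out here so that the sketch does not depend on a compiled analyser.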
search for the ending -buš:
stuorebuš+A+Sem/Hum:stuorebužž STUORIBUS ;
Copy the entry and exchange the parts that differ:
unnibuš+A+Sem/Hum:unnibužž STUORIBUS ;
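The copy-and-exchange step used throughout these notes can be sketched in Python as follows. This is only an illustration under the assumption that the differing part appears verbatim on both the lemma side and the stem side; `adapt_entry` is a hypothetical helper, and the result still needs to be checked against the analyser.

```python
def adapt_entry(model: str, old: str, new: str) -> str:
    """Copy a model entry, replacing `old` with `new` on both sides of the colon.

    The continuation lexicon is left untouched and any trailing
    '! comment' on the model entry is dropped.
    """
    head = model.split("!", 1)[0]
    entry = head.strip().rstrip(";").strip()
    pair, contlex = entry.rsplit(None, 1)   # 'lemma+tags:stem', 'CONTLEX'
    if old not in pair:
        raise ValueError(f"{old!r} does not occur in {pair!r}")
    return f"{pair.replace(old, new)} {contlex} ;"


if __name__ == "__main__":
    print(adapt_entry("stuorebuš+A+Sem/Hum:stuorebužž STUORIBUS ;",
                      "stuore", "unni"))
```

Semantic tags are copied along with everything else, so they need a second look: the model's tag (here Sem/Hum) may not fit the new word.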
It is a proper noun for all Sámi languages, so open shared-smi/src/fst/stems/smi-propernouns.lexc
and search for -fjället:
Borkafjället+OLang/UND:Borka^fjället LONDON-LOAN-plc ;
Middagsfjället+OLang/UND:Middags^fjället LONDON-plc ;
It is a surname.
Open shared-smi/src/fst/stems/smi-propernouns.lexc
and search for the ending -ius:
Iskanius+OLang/UND:Iskani^us BERN-sur ;
What on earth is that? Google knows an island in Nicaragua. There is nothing similar in the lexicon. The following continuation lexica mean the following:
- BERN-plc: Bernas, Bernii
- LONDON-plc: Londonis, Londonii
- ACCRA-plc: Accras, Accrai
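One way to pick between the three continuation lexica is to preview what each would do to the new name and see which result looks right. The Python sketch below does that with endings read off the three example pairs only (Bernas/Bernii, Londonis/Londonii, Accras/Accrai). It is an assumption that plain suffixing is enough; the real lexica may do more, and `ENDINGS` and `preview_forms` are hypothetical names.

```python
# Endings inferred from the three example pairs in these notes only.
ENDINGS = {
    "BERN": ("as", "ii"),     # Bern   -> Bernas,   Bernii
    "LONDON": ("is", "ii"),   # London -> Londonis, Londonii
    "ACCRA": ("s", "i"),      # Accra  -> Accras,   Accrai
}


def preview_forms(name: str, family: str) -> tuple:
    """Return the two example forms `name` would get in the given family."""
    first, second = ENDINGS[family]
    return name + first, name + second


if __name__ == "__main__":
    # Compare all three candidates for a new name side by side.
    for family in ENDINGS:
        print(family, preview_forms("Bay", family))
```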
Search for -ay: Bay+CmpNP/None+OLang/UND:Bay BERN-plc ; Then add +MWE and change +CmpNP/None to +CmpNP/First.
Google it: a Hungarian festival.
Sziget+OLang/UND:Sziget9 LONDON-org ;
## what does the number 9 mean?
bargiidbellodatpolitihkar+v1+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+OLang/NOB+Sem/Hum:bargiid9#bellodat#politihkkar MATTAR ;
bargiid#bellodat#politihkkar MATTAR ;
slamlaguna+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Plc:slam9#laguna GOAHTI-A ;
eaŋkilváikkuhangaskaoapmi+v5+N+OLang/NOB+Sem/Dummytag:eaŋkal#váikkuhan#gask9#oapmi GOAHTI-I ;
The 9 prevents changes at the end of the part it follows, such as these:
- bargiid- changes to bargiit-
- slam- changes to slan- (slanlaguna)
- en- changes to something else
- gask- changes to gas-
- vowel + d/m/k/s/h
- some continuation lexicons are an exception - Sotaniemi+OLang/FIN:Sota^niem NIEMI ;
## What does ^ mean?
`Playboy+CmpNP/First+OLang/ENG:Play^boy BERN-obj ;`
It's a soft hyphen used in names, as opposed to the hard hyphen # used in anything else.
## What does º mean?
billu+N+Sem/Dummytag:bilºlu GOAHTI-U ;
It means that the genitive cannot be “bilu”, so it is third grade/second grade.
We always add +CmpNP/First after +MWE. It means that the word can only be the first part of a compound, e.g. “Frivillighet Norge-organisašuvdna”.