Finite state and Constraint Grammar based analysers, proofing tools and other resources
How to add new words to lexc files:
Search for -hit verbs
gieđahit:gieđah MUITAL_TV ;
diskehallat: Norwegian diskvalifisere, "to disqualify" (passive meaning)
diskehallat+OLang/NOB:diskehalla RAIMMAHALLA_IV ; ! ^LOAN
Google says: the capital of Myanmar (Burma). Search for -aw in smi-propernouns.lexc:
Saginaw+OLang/UND:Saginaw BERN-plc ;
Nay% Pyi% Taw+MWE+OLang/UND:Nay% Pyi% Taw BERN-plc ;
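When copying an existing entry as a template, it helps to see which part is the lemma, which parts are tags, and which is the continuation lexicon. The following is a minimal Python sketch, not part of the GiellaLT tooling; `LexcEntry` and `parse_entry` are hypothetical names, and the parser assumes the simple `lemma+Tags:stem CONTLEX ;` shape used on this page (no escaped colons, no empty stems).

```python
from typing import NamedTuple


class LexcEntry(NamedTuple):
    lemma: str        # citation form, with "% " turned back into a plain space
    tags: list        # e.g. ["MWE", "OLang/UND"]
    stem: str         # lower side; markers such as ^ # 9 are kept as they are
    contlex: str      # continuation lexicon, e.g. "BERN-plc"


def parse_entry(line: str) -> LexcEntry:
    """Split one `lemma+Tags:stem CONTLEX ;` line into its parts (hypothetical helper)."""
    body = line.split("!", 1)[0].strip()       # drop a trailing ! comment
    body = body.rstrip(";").strip()            # drop the closing semicolon
    upper, lower = body.split(":", 1)          # lemma+tags : stem and continuation lexicon
    placeholder = "\u0000"
    lower = lower.replace("% ", placeholder)   # protect escaped spaces inside the stem
    stem, contlex = lower.rsplit(None, 1)      # the last token is the continuation lexicon
    lemma, *tags = upper.split("+")
    return LexcEntry(lemma.replace("% ", " "), tags,
                     stem.replace(placeholder, " "), contlex)


entry = parse_entry("Nay% Pyi% Taw+MWE+OLang/UND:Nay% Pyi% Taw BERN-plc ;")
print(entry.lemma, entry.tags, entry.contlex)
```

For the entry above this separates the lemma `Nay Pyi Taw`, the tags `MWE` and `OLang/UND`, and the continuation lexicon `BERN-plc`.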
propernoun in smi-propernouns.lexc
Frivillighet% Norge+MWE+CmpNP/First+OLang/NOB:Frivillighet% Norge ACCRA-org ;
It is a proper noun in smi (smi-propernouns.lexc). Since there is no entry ending in -boy, I search for -y:
Roy+CmpNP/None+OLang/UND:Roy BERN-mal ;
but Playboy should be -org or -obj
so better:
FairPlay+CmpNP/First+OLang/ENG:FairPlay BERN-obj ;
Playboy+Err/Orth+CmpNP/First+OLang/ENG:Play-boy BERN-obj ; !SUB
Google first
like Slaatten+OLang/UND:Slaatten9 LONDON-sur ; but since it is a place it needs to be LONDON-plc
to shared-smi Sem/Org
Tønsberg% Blad+MWE+CmpNP/First+OLang/NOB:Tønsberg% Blad9 BERN-org ;
Mollekleiv - last name
same ending as Hynnekleiv+OLang/NOB:Hynne^kleiv BERN-sur ;
Mollekleiv+OLang/NOB:Molle^kleiv BERN-sur ;
Butler+OLang/ENG:Butler LONDON-sur ;
Hoge+OLang/NOB:Hoge ACCRA-sur ;
ráŋggáštanriekteteoriija+N+Sem/Dummytag:ráŋggáštan#riekte#te^ori IIJA ;
perfomativitehtateoriija+N+Sem/Prod-cogn:perfomativitehta#te^ori IIJA ;
is a noun: try to find another one with the same ending
AWG-lágideapmi+N+CmpN/SgN+CmpNP/First:AWG-lágid EAPMI_default_sem ;
but this one has a hyphen, those are special, so try to find one without
beassášávvudeapmi+N+CmpN/SgN+Sem/Event:beassáš#ávvud EAPMI_lex_sem ;
biebmoguollešaddadeapmi+N+CmpN/SgN:biebmo#guolle#šaddad EAPMI_default_sem ;
vuorbádeapmi+N+CmpN/SgN+CmpNP/First:vuorbád EAPMI_default_sem ;
How to know which one:
- EAPMI_default_sem gets a default semantic tag
- EAPMI_lex_sem gets a manual semantic tag, for example Sem/Event in this case
It means self-harm, so it should use EAPMI_default_sem.
test word - echo iežasnájadeapmi | hfst-lookup -q src/analyser-gt-desc.hfstol |
+CmpNP/First means that the entry can only be the first part of a compound; if it occurs elsewhere, we mark it as a spelling error. The tag is used when the word is easily confused with another, more common compound. We also use it if the word is an MWE:
Sámi% Dáiddaguovddáš+MWE+CmpNP/First:Sámi% Dáidda#guovddáž LONDON-org ;
next word:
+N+CmpN/SgN:ele#rávdnje#dálkkod EAPMI_default_sem ;
It is a proper noun. Since it is a sme-specific proper noun, which would get translated in the other Sámi languages, it goes to lang-sme/src/fst/stems/sme-propernouns.lexc (instead of shared-smi).
Davvi% álbmogiid% guovddáš+MWE+CmpNP/First:Davvi% álbmogiid% guovddáž LONDON-org ;
Sámi% Dáiddaguovddáš+MWE+CmpNP/First:Sámi% Dáidda#guovddáž LONDON-org ;
Árbediehtoguovddáš+MWE+CmpNP/First:Sámi% Dáidda#guovddáž LONDON-org ;
Árbediehtoguovddáš+MWE+CmpNP/First:Árbe#diehto#guovddáž LONDON-org ;
It’s an adjective:
guhkesagat+A+Sem/Dummytag:guhkes#ag AGAdj ;
Careful: do not confuse these with similar -agat words where the preceding consonant is part of the word, like:
lagat+A+Sem/Dummytag+Gram/Comp:laga OVDDIT ;
These tags say that the entry can be the first part of a compound, either in nominative singular or in genitive plural.
+CmpN/SgN
— could be buotagatsearvi
+CmpN/PlG
— could be buotagagiidsearvi (this is used in combination with Sem/Hum tag)
be aware of morphophonological processes: before the hashtag:
It’s an adjective, but not yet in the normative lexicon
unnibuš unni+A+Der/Comp+A+Der/Dimin+A+Attr 0,000000
unnibuš unni+A+Der/Comp+A+Der/Dimin+A+Sg+Nom 0,000000
unnibuš unni+A+Err/Orth+Der/Comp+A+Der/Dimin+A+Attr 0,000000
unnibuš unni+A+Err/Orth+Der/Comp+A+Der/Dimin+A+Sg+Nom 0,000000
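Output like the above can also be checked with a small script: if no reading uses the word itself as its lemma, the word is only reached through derivation and has no lexicon entry of its own. The following Python sketch illustrates this; `parse_lookup` and `has_own_entry` are hypothetical helpers, and the handling of `+?` for unknown words is an assumption about the hfst-lookup output format.

```python
def parse_lookup(output):
    """Turn hfst-lookup output into (form, lemma, tags) triples."""
    readings = []
    for line in output.splitlines():
        # hfst-lookup separates fields with tabs; fall back to spaces for pasted text
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 2:
            continue  # blank line between input words
        lemma, *tags = fields[1].split("+")
        readings.append((fields[0], lemma, tags))
    return readings


def has_own_entry(readings, word):
    """True if at least one reading uses `word` itself as the lemma."""
    # Assumption: unknown words come back as `word+?`, so "?" is excluded.
    return any(lemma == word and "?" not in tags for _, lemma, tags in readings)


sample = """\
unnibuš unni+A+Der/Comp+A+Der/Dimin+A+Attr 0,000000
unnibuš unni+A+Der/Comp+A+Der/Dimin+A+Sg+Nom 0,000000
"""
print(has_own_entry(parse_lookup(sample), "unnibuš"))  # False
```

Every reading in the sample has the lemma `unni`, so the check reports that `unnibuš` has no entry of its own yet.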
search for the ending -buš:
stuorebuš+A+Sem/Hum:stuorebužž STUORIBUS ;
exchange parts of it
unnibuš+A+Sem/Hum:unnibužž STUORIBUS ;
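The "exchange parts of it" step is a plain text substitution on both sides of the colon. A minimal Python sketch of it follows; `adapt_entry` is a hypothetical helper, not an existing tool, and it only swaps the text, so the continuation lexicon and the semantic tag still have to be checked by hand.

```python
def adapt_entry(template: str, old: str, new: str) -> str:
    """Return a copy of `template` with `old` swapped for `new` in lemma and stem."""
    upper, lower = template.split(":", 1)      # lemma+tags : stem and continuation lexicon
    if old not in upper or old not in lower:
        raise ValueError(f"{old!r} must occur in both the lemma and the stem")
    # Replace only the first occurrence on each side
    return upper.replace(old, new, 1) + ":" + lower.replace(old, new, 1)


print(adapt_entry("stuorebuš+A+Sem/Hum:stuorebužž STUORIBUS ;", "stuore", "unni"))
# unnibuš+A+Sem/Hum:unnibužž STUORIBUS ;
```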
it is a propernoun for all Sámi languages
open shared-smi/src/fst/stems/smi-propernouns.lexc
search for fjället
find:
Borkafjället+OLang/UND:Borka^fjället LONDON-LOAN-plc ;
adapt:
Middagsfjället+OLang/UND:Middags^fjället LONDON-plc ;
is a surname
open shared-smi/src/fst/stems/smi-propernouns.lexc
search for the ending -ius
Iskanius+OLang/UND:Iskani^us BERN-sur ;
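Searching for an ending, as done above for -ius, can be sketched in Python as a filter over the lines of a lexc file. `entries_with_ending` is a hypothetical helper; it assumes the lemma is the text before the first `+` or `:` and ignores comment lines and lines without a stem side.

```python
def entries_with_ending(lines, ending):
    """Yield lexc entry lines whose lemma ends in `ending`."""
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("!") or ":" not in entry:
            continue  # skip blanks, comments and lines without a stem side
        lemma = entry.split(":", 1)[0].split("+", 1)[0]
        if lemma.endswith(ending):
            yield entry


sample = [
    "LEXICON ProperNoun",
    "Iskanius+OLang/UND:Iskani^us BERN-sur ;",
    "Butler+OLang/ENG:Butler LONDON-sur ;",
    "Borkafjället+OLang/UND:Borka^fjället LONDON-LOAN-plc ;",
]
for hit in entries_with_ending(sample, "ius"):
    print(hit)
```

To run it on the real file, pass an open file object instead of the sample list, for example `open("shared-smi/src/fst/stems/smi-propernouns.lexc", encoding="utf-8")`.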
What on earth is that? Google knows an island in Nicaragua. There is nothing similar in the lexicon.
The continuation lexica mean the following:
- BERN-plc: Bernas, Bernii
- LONDON-plc: Londonis, Londonii
- ACCRA-plc: Accras, Accrai
Search for “ay”:
Bay+CmpNP/None+OLang/UND:Bay BERN-plc ;
Then add +MWE and change +CmpNP/None to +CmpNP/First.
google - Hungarian festival
Sziget+OLang/UND:Sziget9 LONDON-org ;
## what does the number 9 mean?
bargiidbellodatpolitihkar+v1+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+OLang/NOB+Sem/Hum:bargiid9#bellodat#politihkkar MATTAR ;
bargiid#bellodat#politihkkar MATTAR ;
slamlaguna+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Plc:slam9#laguna GOAHTI-A ;
eaŋkilváikkuhangaskaoapmi+v5+N+OLang/NOB+Sem/Dummytag:eaŋkal#váikkuhan#gask9#oapmi GOAHTI-I ;
The 9 prevents changes like these:
- bargiid- changes to bargiit-
- slam- changes to slan- (slanlaguna)
- en- changes to something else
- gask- changes to gas-
- vowel + d/m/k/s/h
- some continuation lexicons are an exception - Sotaniemi+OLang/FIN:Sota^niem NIEMI ;
## What does ^ mean?
`Playboy+CmpNP/First+OLang/ENG:Play^boy BERN-obj ;`
It is a soft hyphen used in names, as opposed to the hard hyphen # used in everything else.
## What does º mean?
billu+N+Sem/Dummytag:bilºlu GOAHTI-U ;
It means that the genitive cannot be “bilu”, so it is third grade/second grade.
## What does +CmpNP/First mean?
We always add it after +MWE. It means that the entry can only be the first part of a compound, e.g. “Frivillighet Norge-organisašuvdna”.
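The convention that +CmpNP/First directly follows +MWE can be checked mechanically before committing. The following Python sketch is one way to do that; `missing_cmpnp_first` is a hypothetical helper, and the `Example% Name` entry is invented for illustration only.

```python
import re


def missing_cmpnp_first(entry: str) -> bool:
    """True if the lemma side has +MWE that is not directly followed by +CmpNP/First."""
    upper = entry.split(":", 1)[0]             # only look at the lemma+tags side
    # Negative lookahead: +MWE must be followed by +CmpNP/First and then + or end
    return re.search(r"\+MWE(?!\+CmpNP/First(?:\+|$))", upper) is not None


ok = "Frivillighet% Norge+MWE+CmpNP/First+OLang/NOB:Frivillighet% Norge ACCRA-org ;"
bad = "Example% Name+MWE+OLang/UND:Example% Name BERN-org ;"   # invented entry
print(missing_cmpnp_first(ok), missing_cmpnp_first(bad))
```

The first entry follows the convention, while the invented second one is flagged because +MWE is followed by +OLang/UND.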
hit-verbs