Introduction
File structure
The file format is documented in the Xerox manuals, especially in Karttunen 1993 Finite-State Lexicon Compiler, but see also the Beesley and Karttunen book. The file itself consists of a section defining Multichar_symbols, and of a large number of lexica, 183 lexica according to the present count (19.10.01). The file sme-lex.txt contains a.o. the continuation lexica for nouns, verbs and adjectives, whereas the bulk of the lexicon is divided into different files, as indicated below.
In the sme-lex.txt file, the Multichar_Symbols section contains all grammatical tags, and all multicharacter members of the alphabet (the latter set is taken from the grammar file).
The Root lexicon points to the lexica of the different parts of speech: (for each sublexicon there is a pointer to the relevant file containing the sublexicon)
LEXICON Root
NounRoot ;      ! -> noun-sme-lex.txt
ProperNoun ;    ! -> the file sme-lex.txt itself
AdjectiveRoot ; ! -> adj-sme-lex.txt
VerbRoot ;      ! -> verb-sme-lex.txt
Pronoun ;       ! -> closed-sme.lex.txt
Adverb ;        ! -> adv-sme-lex.txt
Particles ;     ! -> closed-sme.lex.txt
Subjunction ;   ! -> closed-sme.lex.txt
Conjunction ;   ! -> closed-sme.lex.txt
Adposition ;    ! -> pp-sme.lex.txt
Postposition ;  ! -> pp-sme.lex.txt
Preposition ;   ! -> pp-sme.lex.txt
Interjection ;  ! -> closed-sme.lex.txt
The different part of speech lexica are documented here, in the order just given.
Nouns
The NounRoot lexicon
The structure of the noun-sme.txt file
The file contains the following sections:
- LEXICON GuessNoun
 - LEXICON NounRoot
    
- Stray forms
 - Compounds
 - Multiword nouns
 - The noun stems themselves, sorted alphabetically (the alphabetical sorting is according to the computer algorithm, and not according to the Saami one)
 
 
The GuessNoun lexiconbelongs to the guess-sme binary file. The idea behind this is to have a binary file that guesses the grammatical form of nouns on the basis of their suffix, a hypothetical form like plineraiguin could for example be hypothetised to be the Comitative Plural of plinera. The guesser has not been used, and it is unclear whether it works at all. TODO: Check this.
The lexicon NounRoot first has some temporary sections. There is a set “compounds, awaiting a solution on the shorting question”, such like sámegiel, oar, beal, etc., reduced first- and middlepart compounds. The long-term solution is to build a syllable shortener for these compounds, for the time being the lexicalised ones are listed here. This section also contain some loanwords that typically have a shorter, non-sámi form as compounds, and a longer, Saami form when used as independent word. An illustrative example is the word sosial. In isolation, we find sosiála, but as compound, we find sosialdepartemeantta, etc. Thus, these words are listed here, pointing directly to the compound lexicon R.
Finally, the noun stem section itself is the large one, it contains about 26500 nouns, they are divided into different noun stem classes. They are documented in the next section.
The noun stem classes
The sublexica, ordered by inflectional type
Bisyllabic nouns:
GOAHTI !Bisyll. V-Nouns. Short nom-compound-forms goahte-,long/short
gen
GOAHTILONG !Long nom-compound-forms,long/short gen
GOAHTILONGSHORT !Sometimes long nom-compound-forms,long/short gen
ALBMI !Bisyll. V-Nouns. Short nom-compound-forms, long gen.
ALBMILONG !Bisyll. V-Nouns. Long nom-compound-forms, long gen.
ALBMILONGSHORT !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long
gen.
AIGI !Bisyll. V-Nouns. Short nom-compound-forms, short gen
ACTOR !Long compound-forms
ACTORLONGSHORT !Sometimes long compound-forms
ACTORTV !deverbal nouns from transitive verbs
ACTORLONGSHORTTV !deverbal nouns from transitive verbs
STAHTA !Bisyll. Non-Gradating a-Nouns; i-Illative
IIJA !Bisyll. Non-Gradating a-Nouns; a-Illative
LUONDU !this word (+vuohta) because of behavior in compounds
RUOKTU !only this word because of its behavior in compounds
MANNI !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long
gen. ILL:mánnii/mánnái
BASSI !Bisyll. V-Nouns. Short nom-compound-forms, short gen., long
heapmi-caritive
EADNI !Bisyll. V-Nouns. Short nom-compound-forms, long gen.short
caritive
RAFI !Bisyll. V-Nouns. Long nom-compound-forms, long gen. short
heapmi-caritive
DUOHKI !only this word so far for disamb. reasons
JOHTOLAT
LEXMUSH
Actios:
BOAHTALADDAN ! No compound-forms
BOAHTALADDANTV !Words coming from transitive verbs, No compound-forms
IHTALUDDAMAT ! No compound-forms, plural only
UPMI
EAPMI
MUITTASJEAPMI
EAPMITV !words stemming from transitives
MUITTASJEAPMITV !words stemming from transitives
Without loc and ill sg:
OLGU !Bisyll. V-Nouns. Short nom-compound-forms, short gen.
MIEHTI !Bisyll. V-Nouns. Short nom-compound-forms, long gen.
LULLI !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long gen
With comparatives:
GADDI !Bisyll. V-Nouns with Comparative Forms. Short nom-compound-forms,
long gen.
GADDILONG !long compound forms
GADDISHORT !Bisyll. V-Nouns with Comparative Forms. Short
nom-compound-forms, SHORT gen.
With comparatives, without loc and ill sg:
GADDILONGSHORT !NB! No SgIll and SgLoc because davvi is the only word
this far Bisyll. V-Nouns with Comp. Forms, long-short nomcmp, long
gencmpOARJI !Bisyll. V-Nouns with Comparative Forms. Short
nom-compound-forms, long gen.
LULLILONG !long compound forms
Plurals:
BUVSSAT !Pl.
BUVSSATLONG
BORALMASAT !like JOHTOLAT but plural only
Trisyllabic nouns:
Ending with consonant, gradating:
MATTAR !Tris. Anim. Gradating C-Nouns
MALIS !Tris. Inanim. Gradating C-Nouns, Short compound-forms
MALISLONG !Tris. Inanim. Gradating C-Nouns. Long compound-forms
MALISLONGSHORT !Tris. Inanim. Gradating C-Nouns. Long and short
compound-forms
OVCCIS ! Collective numerals
DAIVVAS !Tris. Gradating C-Nouns, The Troms declension: dáivvaš:dáivaha,
bearaš:bearraha, njunuš:njunnoha
Ending with vowel, gradating:
BEANA !Trisyll. Anim. Gradating 0-Nouns
SEAMU !Short compound-forms. Trisyll. Inanim. Gradating 0-Nouns
SEAMULONG !Long compound-forms. Trisyll. Inanim. Gradating 0-Nouns
Ending with consonant, non-gradating:
GAHPIR !Short compound-forms. Trisyll. Non-Gradating C-Nouns
GAHPIRLONGSHORT !Long and short compound-forms. Trisyll. Non-Gradating
C-Nouns
GAHPIRLONG !Long compound-forms. Trisyll. Non-Gradating C-Nouns
LEXDIMINC !Diminutives
Actios:
BOAHTINLONGSHORT ! Deverbal nouns
BOAHTIN ! Long compound-forms
BOAHTINSHORT ! Short compound-forms
BOAHTINLONGSHORTTV !!Words coming from transitive verbs Deverbal nouns
BOAHTINTV !!Words coming from transitive verbs Long compound-forms
BOAHTINSHORTTV !!Words coming from transitive verbs Short
compound-forms
IHTAMAT !Actio plurals
Plurals:
BADJOSAT !Pl. bajus:badjosat
BADJOSATLONG !Pl. bajus:badjosat
ALIMAT !Pl. alin:alimat
CEAKCAGAT !Like seamu but plural only
VUOIGNAHAT !Like DAIVVAS but only Pl. vuoiŋŋaš:vuoigŋahat
Contracted nouns:
BOAZU !Anim. Contracted 0-Nouns
SUOLU !Short compound-forms. Inanim. Contracted 0-Nouns
SUOLULONG !Long compound-forms. Inanim. Contracted 0-Nouns
GISTTA !The Noun gistta, gist -
FALIS !Contracted Anim. C-Nouns
LASIS !Contracted Inanim. C-Nouns
GUOVTTIS ! Collective numerals
GUOVTTU !Here because other lexicons don’t fit for it
Plurals:
DURVAT !like LASIS but pl. only
GIRKOSADDOT !LIKE SATTU but pl, only
Miscellaneous noun types:
GENTLEMAN ! cns-final bisyllabic loanwords (stem mana-
MADII !Another peculiar word that deserves its own lex
MASAI !heavy fin syll !gen = -a ill = -ii !parallel to NYSTØ
BUDEITA !Rather special word
GARGIA ! light fin syll, bisyll. on -o- that doesn’t have change o:u in
front of j (i): Kino
BUFFALO !heavy fin syll !gen = -a ill = -ii !parallel to NYSTØ
KULTUR !Recent loanwords on -vra with short cmp-form
KANTUR
MASKIN !Recent loanwords on -iidna with short cmp-form
TELEFON !Recent loanwords on -vdna with short cmp-form
SOSIAL !Recent loanwords on -ála with both short and long cmp-form
OLLUVUOHTA !Exceptional vuohta-Nouns
MAGASH
EGEZHAGAT
Nominal sublexica
tbw. Here we document the nominal section of the sme-lex.txt file.
khjh
The ProperNoun lexicon
The proper nouns are stored in gt/sme/src/propernoun-sme-lex.txt.
The file structure
Propernoun is one file, with the inflectional lexica first, followed by the over 36000 items long proper noun lexicon. This lexicon points to several sublexica. They are shown below, ordered according to the phonological and semantical properties of the stem. After the table comes a list of lexica not yet integrated in the table.
Type                            
StemCoda  CG  IllChange  Loc          Lexicon 
--------------------------------------------------------
Monosyllabic stems
HeavyVow  no   no        -as          NYSTØ  
HeavyCns  no   no        -as          BERN
Bisyllabic stems
LightVow  no   yes       -s           ACCRA  
LightVow  yes  yes       -s           MARJA
LightVow  yes  yes       -s           SUOPMA
Light e   no   no        -s           SIJTE
LightCns  no                          HEANDARAT
LightVow  no   no        -s           NIKOSIIJA
LightVow  yes  yes       -s           HEIM
LightVow  yes  yes       -s           SUND
Trisyllabic stems
LightCns  no   no        -is          LONDON
LightCns  no   no        -is          ANAR
LightCns  yes  yes       -is          DUORTNUS
LightVow  yes                         GUOLBBA
LightCns  no   no        -is          PLACE-DIM
LightCns  yes  yes       -is          RANES
LightCns  yes  yes       -is          CAVKKUS
LightCns  no   no        -is          BALAK
LightCns  no   no        -is          SARAK
Contracted stems
DAVVISUOLU !Inanim. Contracted 0-Names - from SUOLU
GEAVNNIS
Mixed stems
-nen      no   no        -as/nenis  no    C-FI-NEN
HAWAII
SKOHTERMADII
Plural stems
ALEUHTAT is for bisyllabic, would-have-been vowel final stems
VARGGAT  is as aleuhtat, but with Sg Gen and Loc substandard forms
EATNAMAT is for trisyllabic plural stems
ADJAGAT  is for trisyllabic plural stems
SULLOT   is for contracted plural stems
LASSAT   is for contracted plural stems
SKANIK   is for plurals on -k
---------------------------------------------------------
Notes:
Comp means "has comparative forms"
This table is still not completed. The lexica are there, but they are not all properly characterised.
Looking at the distribution, and ignoring the semantic subtags, we have a (slightly outdated) distribution of names over lexicon types as follows:
9114 ACCRA
8482 BERN
8371 MARJA
6480 LONDON
1390 C-FI-NEN
 865 NYSTØ
 296 PLACE
 133 DUORTNUS
  98 DAVVISUOLU
  68 NIILLAS
  62 GUOLBBA
  59 ANAR
  48 ALEUHTAT
  31 SULLOT
  28 HEANDARAT
  27 GIEDDI
  13 VARGGAT
  12 GEAVNNIS
   7 EATNAMAT
Saami geographical names
Saami names have been added from Norway, in 2002. Since that, Statens Kartverk has translated most (all?) Saami names from the old Bergsland/Ruong to the new 1979 orthography. Now, the remaining Sámi names should be added to the list, in the following way:
- All North Saami names should be extracted from the map base and run against the transducer
 - The names that do not get a +Prop tag should be extracted, and then added to the base.
 
Saami names from Finland and Sweden have not been added, but they are underway.
Saami names from outside Sápmi, as listed in Sámi Atlas are added. Other Saami names from outside Sápmi should also be added, but we don’t know of good sources for such names (language councils are one possibility)
Names in other Saami languages are increasingly being used in North Saami texts, e.g. when referring to South Saami or Lule Saami institutions. These ones should be added and organised in lexica.
Adjectives
The file adj-sme-lex.txt contains 4270 entries.
AdjRoot
The adjective sublexica
In the lexicon file adj-sme-lex.txt, the sublexica are distributed in the following way (30.06.05) (ordered after frequency, thereafter after declension type):
Making linguistic sense of the system (Sammallahti's codes aaa etc.):
 899 DABALAS       aad
 477 JEAGOHEAPMI   bae
 358 BOAKKAS       c
 329 At            attributes
 280 BEAKKAN       aab
 218 GAPPUS        aab
 205 JOHTIL        babaa
 191 NUORRA        aaa
 165 AKTIIVA
 157 JIEDNAI       bad
 150 GARAS         bbb
 139 MEAHTTUS      aab
 114 LAIKI         baa
 106 DEARVVASLAS   aad
  71 ISSORAS       aad
  59 VIELG         babab
  38 CAHKK         babaa babab
  31 BUOREMUS      aae
  30 JUHKKIS       aaa
  26 GUOHCA        aab
  26 EATTAS        babba
  19 LODJI         baa bac
  14 GEARGGUS      aab
  13 SEARRA        baa
  13 DILDDAS       babba
   8 VUDDJII       bad
   8 CIENAL        babbb
   7 LINIS         bbe
   7 DEAHTIS       bbe
   7 BIEKKUS       babba
   7 ASEHAS        baf
   6 OVDDIT        aae
   6 NUOLUS        aab
   5 VATTIS        aab
   5 NJUORAS       bbb
   5 HEAHKAS       babba
   4 VIISSIS       aab aac
   4 LIEKKUS       aab
   4 FINJU-        
   3 UHCC          bba
   3 SUVRRIS       bbe
   3 JALGAT        bbc babab
   3 BUORRE         ab
   3 BU/MUS        
   3 ATTR
   3 ALLAT         bbc
   2 NJALGGAT      bbe
   2 NAMAT
a.................No morphologically distinct attr. form
  aa..............attr. is not inflected
    aaa...........Bisyll.
                  NUORRA      
                  JUHKKIS     Non-gradating 
    aab...........Trisyll.
                  BEAKKAN     Gradating 
                  GAPPUS      Gradating 
                  MEAHTTUS    Compounded non-gradating 
                  GUOHCA      Gradating 
                  GEARGGUS    Gradating 
                  NUOLUS      Gradating 
                  VATTIS      Gradating 
                  VIISSIS     Gradating 
                  LIEKKUS     Gradating 
                  IPMAHA      Gradating
    aac...........Contracted 
                  VIISSIS     
    aad...........Qadrisyll.
                  DABALAS      
                  DEARVVASLAS 
                  ISSORAS     
    aae...........Comparative forms
                  BUOREMUS    
                  OVDDIT      
  ab..............Attr. in partial congruence with the noun
                  BUORRE     
b.................Morphologically distinct attr. form  
  ba..............Attr. form ends with -s
    baa...........Bisyll.
                  LAIKI       Attr. form -es in WeG
                  LODJI       Attr. form -es and -is in WeG
                  SEARRA      Attr. form -s in WeG
    bab...........Trisyll.
      baba........Non-gradating
        babaa.....No contraction in attr. form
                  JOHTIL  is-Attr.
                  CAHKK   is-Attr.
        babab.....Contraction (and ensuing strengthening) in attr. fom
                  VIELG   es-Attr. 
                  CAHKK   es-Attr.
                  JALGAT  es-Attr.   
      babb........Grade alternation in stem
        babba.....s-stems with WeG attr-form
                  EATTAS  is-Attr.
                  DILDDAS is-Attr.
                  BIEKKUS is-Attr
                  HEAHKAS is-Attr
        babbb.....Attr. form in StG
                  CIENAL  is-Attr.
    bac...........Vocalic contracetd attr. form
                  LODJI       -es and -is Attr. in WeG
    bad...........Contracted stems
                  JIEDNAI     Non-gradating 
                  VUDDJII     Non-gradating 
    bae...........Caritives with attr- -his.
                  JEAGOHEAPMI 
    baf...........Qadrisyll. Adj., with Attr.-his
                  ASEHAS      
  bb..............Attr. form ends with -a
    bba...........Bisyll. ending in -i (or -a sometimes) 
                  UHCC        StG attr.
   
    bbb...........Trisyll. ending -as
                  GARAS       Gradating with StG attr.
                  NJUORAS     Gradating with StG attr.
    bbc...........Trisyll. ending -at
                  JALGAT      Non-gradating with StG attr.
                  ALLAT       Non-gradating with StG attr.
    bbe...........Trisyll. ending -is
                  LINIS       Gradating with StG Attr. 
                  DEAHTIS     Contracted with StG Attr.
                  SUVRRIS     Gradating with WeG Attr. 
                  NJALGGAT    Gradating with StG Attr.
c.................No attr.
                  BOAKKAS         Trisyll. 
Verbs
The VerbRoot lexicon
The lexicon is stored in the verb-sme.txt file.
VerbRoot contains 49 sublexica divided into three stem types:
- lexicon for impersonal verbs
 - lexicon for verbs with personal passives, Transitives
 - lexicon for verbs without personal passives, Intransitives
 - lexicon without Personal Passive but with Acc obj
 - lexicon for inherent passives
 
Modals:
**
GALGA
FERTE
Bisyllabic verbs:
Impersonals:
ARVI arvit sataa !Bisyllabic Impersonal Verbs
ARVALADDAT arváladdat sataa !Already derived bisyllabic Impersonal
Verbs
Transitives:
DIEHTI diehtit tietaa !Bisyllabic i-Verbs with Personal Passive
BORRA !Bisyllabic a- and u-verbs with Personal Passive
DEAKCU !as BORRA for u-verbs with dim -astit, and a-verbs with dim
-istit that are hardcoded
DIEHTISHORT !Short actio compound-form DIEHTILONGSHORT !Long and short
actio compound-form DIEHTALADDAT diehtáladdat tietaa !Already derived
bisyllabic Verbs with Personal Passive
HAHTTIT !Four-syll kausatives on -httit
DAHTU dáhtut ! As diehti, but -ut verbs, thus without short passive
BOLTU ! As dáhtu but with dim -astit that are harcoded
DOAROSTUVVA !INCHOATIVES on -stuvvat
Intransitives:
ALLU ! -ut verbs, thus without short passive BOAHTI boahtit tulla
!Bisyllabic i-Verbs without Personal Passive
BOAHTILONGSHORT !!Long and short actio compound-form DIEVVA ! Bisyllabic
a- and u-verbs without Personal Passive but with Actor BOAZZU !as DIEVVA
for u-verbs with dim -astit, that are hardcoded BINDU !as DIEVVA (but
without short passive) for u-verbs with dim -astit, that are hardcoded
BOAHTALADDA boahtáladda tulla !Already derived bisyllabic Verbs without
Personal Passive
RAIMMAHALLA !passives on -hallat and INCHOATIVES on -stuvvat
Personal Passive but with Acc obj:
MAHTI máhttiit ! Bisyllabic Verbs without Personal Passive but with Acc
obj.
MAHTALADDA máhtáladdat ! Already derived bisyllabic Verbs without
Personal Passive but with Acc obj.
Inherent passives:
RAIMMAHALLA !passives on -hallat and INCHOATIVES on -stuvvat UVVA
!passives -uvvat
Contracted verbs:
Impersonals:
BORGE borget tehda pyry !Contracted Impersonal Verbs
Transitives:
DOHPPE dohppet tarttua !Contracted Verbs with Personal Passive
MUITA ! Inchoatives and translatives on -á, -o, -e with Personal
Passive
Intransitives:
GILLE gillet viitsia !Contracted Verbs without Personal Passive
CIRRO ! Inchoatives and translatives on -á, -o, -e without Personal
Passive
Personal Passive but with Acc obj:
MAHTA máhttát !Contracted Verbs without Personal Passive but with Acc
obj.
Inherent passives:
BASSO bassot ! Bisyllabic, inherently passive -ot verbs
Trisyllabic verbs:
Impersonals:
COASKKIT čoaskkidiit !Trisyllabic impersonal verbs
ARVVASJ arvvašit !Trisyllabic impersonal verbs ending -šit, -skit, smit,
-idit, -ldit, -git and 5-syllables
ARVIL arvilit !Trisyllabic impersonal verbs ending -lit
Transitives:
MUITAL muitalit !Trisyllabic Verbs with Personal Passive
MUITTASJ !Words ending -šit, -skit, -ldit - !directed here as well:
!Reciprocals on -dit !Momentatives on -dit, -ádit, -ihit, -e7hit
!Frequentatives on -(u)hit !Continuatives on -nit !Inchoatives in -nit
HALIID !Words ending -smit, -idit, -git - BONJAT !!Cont/Freq on -dit,
Continuatives on -(u)hit, Reciprocals, momentatives and frequentatives
ending -alit
VUORDIL !Trisyllabic Verbs ending -lit, -rit with Personal Passive
Intransitives:
ALIST alistit !Trisyllabic Verbs without Personal Passive
BEAGASJ !Words ending -šit, -skit -ldit, transitives on -hit
!directed here as well: !Reciprocals on -dit !Momentatives on -dit,
-ádit, -ihit, -e7hit !Frequentatives on -(u)hit !Continuatives on -nit
!Inchoatives in -nit
JORGGIID !Words ending -smit, -idit, -git
BALAT !!Cont/Freq on -dit, Continuatives on -(u)hit, Reciprocals,
momentatives and frequentatives ending -alit
SUOTNJAL suotnjalit !Trisyllabic Verbs without Personal Passive ending
-lit
BOTNJAS botnjasit !Trisyllabic Verbs without Personal Passive ending
-nit and -sit
LASSAN !Trisyllabic Verbs ending -nit without Personal Passive
Personal Passive but with Acc obj:
GEAGAT ! Trisyllabic Verbs without Personal Passive but with Acc obj.
BUOVVAL buovvalit ! Trisyllabic Verbs without Personal Passive but with
Acc obj ending -lit
The stems are distributed numerically as follows (the -it class includes both even-syllable and odd-syllable verbs):
-at            3722
even-syll  -it 1035 
-ut             794
total          5551
3-syllabic -it 5376
-át            297
-et            2310
-ot             111
total          2718
Verbal sublexica
tbw.
Verbal derivation
Here documenting the main even-syll ones, the other ones are similar. DIEHTI is transitive, BOAHTI is intransitive.
DIEHTI -> +V: DIEHTIStem ; +V: DeverbalVerbsDIEHTI ;
BOAHTI -> +V: BOAHTIStem ; +V: DeverbalVerbsBOAHTI ;
DIEHTIStem -> :Y7j PASSIVE ; BOAHTIINCH ;
BOAHTIStem -> SG3PASSV ; BOAHTIINCH ;
BOAHTIINCH -> DeverbalNounsV ; +goah0ti:goah'ti BOAHTICnj ; BOAHTICnj ;
BOAHTICnj -> +Ind+Prs: PrsV ; +Ind+Prt: PrtV ; +Pot+Prs:Q7z1 PrsC ;
      +Cond: CondV1 ; +Imprt: ImprtVA ; NominalFormsV ;
NominalFormsV -> :X1 NominalFormsV1 ; :X4 NominalFormsV2 ;
      :Q6 NominalFormsV3 ; :X2 NominalFormsV4 ; :Q3 NominalFormsV5 ;
      :Y1 NominalFormsV6 ;
PASSIVE ->  +Pass:uvvo DOHPPEINCH ; +Pass+meahttun+A:uvvomeahttum
      MEAHTTUN ; +Pass+PrfPrc:un K ; +Pass+eaddji+N+Actor:uvvojeaddji¤
      DEVNVCASE ; +Pass+upmi+N:upmi DEVNVCASE ;
DeverbalVerbsDIEHTI ->
 +st:X8st MUITALStem ;
 +st+alla:X6stalla DIEHTIStem ;
 +st+adda:X6stadda DIEHTIStem ;
 +l:l MUITALStem ;
 +l+adda:X2ladda DIEHTIStem ;
 +l+ahtti:lahtti DIEHTIStem ;
 +l+asti:las'ti DIEHTIStem ;
 +h:X4h MUITALStem ;
 +h+alla:X6halla DIEHTIStem ;
 +h+adda:X6hadda DIEHTIStem ;
 +h+asti:X4has'ti DIEHTIStem ;
 +stuvva:X8stuvva SG3PASSV ;
 +d:Q8d MUITALStem ;
DeverbalVerbsBOAHTI ->
 +st:X8st ALISTStem ;
 +st+alla:X6stalla BOAHTIStem ;
 +st+adda:X6stadda BOAHTIStem ;
 +l:l ALISTStem ;
 +l+adda:X2ladda BOAHTIStem ;
 +l+ahtti:lahtti BOAHTIStem ;
 +l+asti:las'ti BOAHTIStem ;
 +h:X4h MUITALStem ;
 +h+alla:X6halla DIEHTIStem ;
 +h+adda:X6hadda DIEHTIStem ;
 +h+asti:X4has'ti DIEHTIStem ;
 +stuvva:X8stuvva SG3PASSV ;
 +d:Q8d ALISTStem ;
Comments to the verb sublexica
Within each of the main groups, there are FIVE types, impersonal verbs, verbs with and without personal passives, verbs without Personal Passive but with Acc obj (+ two more lexicas, see above The VerbRoot lexicon), and inherent passives. The difference between i/a/u and e/á/o verbs is handeled in the rules file, and not in the lexicon file.
The with / without Personal Passive distinction shows up in one sublexicon. DOHPPE has PASSIVE, where GILLE has SG3PASS. So, this is (probably) a transivity difference, cf. also diehtit vs. boahtit. It seems thus that the difference is one of transitivity: 0, 1 and 2 valence.
At present, the file verb-sme-lex.txt comtains all the verbs. In the beginning of the file, all sublexica are exemplified. Then follows the bulk of the verbs, twosyllabic even, manysyllabic even, odd and contracted verbs.
Pronouns
The tag system follows the outline in Nickel.
All Pronouns have the initial lexicon path Root -> Pronoun -> …
Personal pronouns
Lexicon path:
Personal
 firstperspron
  firstperspronsg -> wordforms -> K
  firstpersprondu -> wordforms -> K
  perspronpl -> wordforms -> K
 nonfirstperspron
  nonfirsperspronsg -> wordforms -> K
  nonfirstpersrondu -> wordforms -> K
  perspronpl -> wordforms -> K
Note that 3rd person is identical for all three persons. Not all forms were different for the sg and du forms, but the lexica were split for consistency.
Interrogative pronouns
Mii, Gii, Guhte, Guhtemuš, Makkár, MAn Láhkái. The sublexicon Interrogative contains one entry for Sg Nom, and points the rest to the case paradigms.
Interrogative +Sg+Nom -> K (one entry for gii and one for mii)
        oblintercas (one entry for gii and one for mii) demcas
Demonstrative pronouns
The lexicon path:
Demonstrative
 demcas (one entry for each stem)
  demcassg
   nomdemcassg -> wordforms -> K
   obldemcassg -> wordforms -> K
  demcaspl
   nomdemcaspl -> wordforms -> K
   obldemcaspl -> wordforms -> K
Reflexive pronouns
The Nominative forms are just listed. The oblique ones are directed to the sublexicon reflobl, and there directed via different case stems to appropriate Px sublexica. These sublexica are the same as the ones for nouns, they are found in the sme-lex.txt file. The only exception are some sublexica that are used only for plural forms, these are duplicated here from the sme-lex file, in order not to revise the main lexicon.
Reciprocal pronouns
The section on reciprocal pronouns consists of three parts. The first 6 entries handle the first element of the recipr. The next 12 handle the 2nd part of the non-Px recipr. Finally, the members of the third section point to special Px sublexica, designed for the reciprocal pronouns, and found in the same section.
Relative pronouns
Formally, the relative and interrogative pronouns are identical. In this parser, we skip the separate chapter on relative pronouns, and instead we use the interrogative pronouns.
Indefinite pronouns
We divide the indefinite pronouns into three groups, with a fourth group of leftovers waiting for a better destiny:
- Declineable indefinite pronouns with case + clitic (mihkkege, giige, guhtemušge)
 - Declineable indefinites with normal case paradigms (eará, eanas, muhtin)
 - Indeclineable indefinites (buot, eatnat, guhtet)
 - TODO: A set of ideosyncratic cases
 
Declineable indefinite pronouns with case + clitic
These pronouns have two stems, one nominative, and one oblique, and the
clitic -ge attached to the case ending. The initial lexicon splits
them in two, one hard-coded nominative (e.g. giige), and one oblique
stem (e.g. gea-). Then, the case + clitic sequence is treated as a
single suffix (e.g. locative -sge, etc.). Since the clitic slot has
already been filled, they are directed to # rather than to K.
Declineable indefinites with normal case paradigms
This section hosts a seemingly complicated system of taylored sublexica. It contains three sections: First a section where the pronouns themselved are split into different continuation lexica, then a section with intoermadiate lexica, and finally a section with the case suffixes themselves. The lexica are partly modelled upon nominal lexica.
Naming convention for the sublexica:
- -c, -v:
Consonant stem, vowel stem 
- -n, -ne:
nominative, nominative and essive 
Indeclineable indefinites
There is first a list of multiword indefinites. These are picked out by the preprocessor and copied onto a file abbr.txt and put in the bin/ catalogue. In thie closed-sme-lex.txt they behave like the other indeclineable indefinites.
The ideosyncratic rest category
Indefinite pronouns are complex, and the grammars are not always explicit enough, so this section hosts a set of pronouns, partly with a hard-coded tag, partly just commented out. They are awaiting a principled linguistic solution, but in order to do that, we need more info than we can get from the reference grammars.
TODO: Have a linguistic/native speaker-look at this section.
Numerals
Overview of the lexicon structure
The numeral lexica are formed as a generator, generating all possible
numerals. The basic lexicon is Numeral, and it looks like this:
LEXICON Numeral
MILJON ; ! a noun of its own
UNDERDUHAT ; ! for generator under 1000
JUSTDUHAT ; ! going via 1000
OVERDUHAT ; ! for generator over 1000
OLD ; ! for "thirteen hundred, etc.
!num-basic ; ! replaced by the 5 lexica above
!num-derived ; ! still unimplemented
num-imprecise ;! still almost unimplemented
ARABIC ; ! for the arabic numerals
ROMAN ; ! for the roman numerals
MILJON is a noun. OLD is the old way of counting. num-ordinal act like adjectives, they are not finished yet. ARABIC and ROMAN contain number generators.
So, what is the reason for the three different lexica around 1000?
The reason is that the numeric system turns at the thousand mark. Numbers above it and numbers below it behave in the same way, thus we have both twentyfour and twentyfourthousand, etc.
The path is OVERDUHAT -> JUSTDUHAT -> UNDERDUHAT. OVERDUHAT generates the part of the numeral that is over 1000, and all these lexica then point to JUSTDUHAT. That lexicon has an optional “(one) thousand” before it leads either to DUHAT and via the relevant case paradigm to K, or to UNDERDUHAT. UNDERDUHAT contains the numerals 1-999. UNDERDUHAT starts with the lexicon for one, and gives each group of numerals its own lexicon.
Cardinals and ordinals
The cardinal and ordinal numbers are split at the final lexica, the
OKTAF and 2TO9F lexica. This generates both numbers as second and
fiftysecond.
Indeclinable words
All the lexica for indeclinable words are made the same way:
LEXICON Root
Adverb ;
 LEXICON Adverb
 áđamusat adv ;
  LEXICON adv
  +Adv:0 K ;
The Root lexicon points to the POS lexica (Adverb etc.). Each of the POS lexica lists the entries, with a pointer to an arbitrarily named sublexicon (here “adv”). This sublexicon contains the grammatical tag for the POS in question (the tag has no surface form, hence “:0”), and eventually a pointer towards the cliticon lexicon K. Adverbs can have clitics added, hence K, whereas subjunctions do not, hence no K.
[XXX At the moment particles are not directed to K, perhaps they should be. TODO: Check with corpus and native speakers.]
Adverbs
They are explained in the intro to the section “Indeclinable words” above.
Particles
These are in the closed-sme-lex.txt file. Their tag is +Pcle and th|qeir lexicon path is:
Root -> Particle -> pcle -> #
Subjunctions
Subjunctions are ahte, juos, etc. These are in the closed-sme-lex.txt file. Their lexicon path is:
Root -> Subjunction -> -> #
Conjunctions
Conjunctions are ja, dahjege, etc. These are in the closed-sme-lex.txt file. Their tag is +CC and their lexicon path is:
Root -> Conjunction -> Cc -> #
P-positions
There are three different classes here: Postpositions, occuring after their complement, prepositions, occuring before, and adpositions, occuring both before and after. This could have been done the Lingsoft way as well: Having +Adp as a common tag for both, and eventually +Prep and +Postp as subtags, no subtag would indicate both ways (or both subtags could be used). At the moment, they are left as 3 distinct groups. The classification is based upon Nickel, p-positions found only in Sammallahti’s dictionary and not in Nickel were put in the Adposition group. Empirical studies will probably lead to rearrangement of the present division, this should be looked into in connection with the morphological disambiguator (cg grammar).
Adpositions
Adpositions are are bajil, birra, gaskal, etc. These are in the pp-sme-lex.txt file. Their tag is +Adp and their lexicon path is:
Root -> Adposition -> Pp -> #
Postpositions
Postpositions are are bokte, lusa, etc. These are in the pp-sme-lex.txt file. Their tag is +Po and their lexicon path is:
Root -> Postposition -> Postp -> #
Prepositions
Prepositions are are aisttan, earet, etc. These are in the pp-sme-lex.txt file. Their tag is +Pr and their lexicon path is:
Root -> Preposition -> Prep -> #
Interjections
Interjections are are hoi, huh, kyš-kyš, etc. These are in the closed-sme-lex.txt file. Their tag is +Interj and their lexicon path is:
Root -> Interjection -> Ij -> #
Abbreviations
There is a file called abbr-sme-lex.txt.