North Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sme

Introduction

File structure

The file format is documented in the Xerox manuals, especially in Karttunen 1993 Finite-State Lexicon Compiler, but see also the Beesley and Karttunen book. The file itself consists of a section defining Multichar_symbols, and of a large number of lexica, 183 lexica according to the present count (19.10.01). The file sme-lex.txt contains a.o. the continuation lexica for nouns, verbs and adjectives, whereas the bulk of the lexicon is divided into different files, as indicated below.

In the sme-lex.txt file, the Multichar_Symbols section contains all grammatical tags, and all multicharacter members of the alphabet (the latter set is taken from the grammar file).

The Root lexicon points to the lexica of the different parts of speech: (for each sublexicon there is a pointer to the relevant file containing the sublexicon)

LEXICON Root
NounRoot ;      ! -> noun-sme-lex.txt
ProperNoun ;    ! -> the file sme-lex.txt itself
AdjectiveRoot ; ! -> adj-sme-lex.txt
VerbRoot ;      ! -> verb-sme-lex.txt
Pronoun ;       ! -> closed-sme.lex.txt
Adverb ;        ! -> adv-sme-lex.txt
Particles ;     ! -> closed-sme.lex.txt
Subjunction ;   ! -> closed-sme.lex.txt
Conjunction ;   ! -> closed-sme.lex.txt
Adposition ;    ! -> pp-sme.lex.txt
Postposition ;  ! -> pp-sme.lex.txt
Preposition ;   ! -> pp-sme.lex.txt
Interjection ;  ! -> closed-sme.lex.txt

The different part of speech lexica are documented here, in the order just given.

Nouns

The NounRoot lexicon

The structure of the noun-sme.txt file

The file contains the following sections:

  1. LEXICON GuessNoun
  2. LEXICON NounRoot
    1. Stray forms
    2. Compounds
    3. Multiword nouns
    4. The noun stems themselves, sorted alphabetically (the alphabetical sorting is according to the computer algorithm, and not according to the Saami one)

The GuessNoun lexiconbelongs to the guess-sme binary file. The idea behind this is to have a binary file that guesses the grammatical form of nouns on the basis of their suffix, a hypothetical form like plineraiguin could for example be hypothetised to be the Comitative Plural of plinera. The guesser has not been used, and it is unclear whether it works at all. TODO: Check this.

The lexicon NounRoot first has some temporary sections. There is a set “compounds, awaiting a solution on the shorting question”, such like sámegiel, oar, beal, etc., reduced first- and middlepart compounds. The long-term solution is to build a syllable shortener for these compounds, for the time being the lexicalised ones are listed here. This section also contain some loanwords that typically have a shorter, non-sámi form as compounds, and a longer, Saami form when used as independent word. An illustrative example is the word sosial. In isolation, we find sosiála, but as compound, we find sosialdepartemeantta, etc. Thus, these words are listed here, pointing directly to the compound lexicon R.

Finally, the noun stem section itself is the large one, it contains about 26500 nouns, they are divided into different noun stem classes. They are documented in the next section.

The noun stem classes

The sublexica, ordered by inflectional type

Bisyllabic nouns:

GOAHTI !Bisyll. V-Nouns. Short nom-compound-forms goahte-,long/short gen
GOAHTILONG !Long nom-compound-forms,long/short gen
GOAHTILONGSHORT !Sometimes long nom-compound-forms,long/short gen
ALBMI !Bisyll. V-Nouns. Short nom-compound-forms, long gen.
ALBMILONG !Bisyll. V-Nouns. Long nom-compound-forms, long gen.
ALBMILONGSHORT !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long gen.
AIGI !Bisyll. V-Nouns. Short nom-compound-forms, short gen

ACTOR !Long compound-forms
ACTORLONGSHORT !Sometimes long compound-forms
ACTORTV !deverbal nouns from transitive verbs
ACTORLONGSHORTTV !deverbal nouns from transitive verbs

STAHTA !Bisyll. Non-Gradating a-Nouns; i-Illative
IIJA !Bisyll. Non-Gradating a-Nouns; a-Illative

LUONDU !this word (+vuohta) because of behavior in compounds
RUOKTU !only this word because of its behavior in compounds
MANNI !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long gen. ILL:mánnii/mánnái
BASSI !Bisyll. V-Nouns. Short nom-compound-forms, short gen., long heapmi-caritive
EADNI !Bisyll. V-Nouns. Short nom-compound-forms, long gen.short caritive
RAFI !Bisyll. V-Nouns. Long nom-compound-forms, long gen. short heapmi-caritive
DUOHKI !only this word so far for disamb. reasons

JOHTOLAT
LEXMUSH

Actios:

BOAHTALADDAN ! No compound-forms
BOAHTALADDANTV !Words coming from transitive verbs, No compound-forms
IHTALUDDAMAT ! No compound-forms, plural only
UPMI
EAPMI
MUITTASJEAPMI
EAPMITV !words stemming from transitives
MUITTASJEAPMITV !words stemming from transitives

Without loc and ill sg:

OLGU !Bisyll. V-Nouns. Short nom-compound-forms, short gen.
MIEHTI !Bisyll. V-Nouns. Short nom-compound-forms, long gen.
LULLI !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long gen

With comparatives:

GADDI !Bisyll. V-Nouns with Comparative Forms. Short nom-compound-forms, long gen.
GADDILONG !long compound forms
GADDISHORT !Bisyll. V-Nouns with Comparative Forms. Short nom-compound-forms, SHORT gen.

With comparatives, without loc and ill sg:

GADDILONGSHORT !NB! No SgIll and SgLoc because davvi is the only word this far Bisyll. V-Nouns with Comp. Forms, long-short nomcmp, long gencmpOARJI !Bisyll. V-Nouns with Comparative Forms. Short nom-compound-forms, long gen.
LULLILONG !long compound forms

Plurals:

BUVSSAT !Pl.
BUVSSATLONG

BORALMASAT !like JOHTOLAT but plural only

Trisyllabic nouns:

Ending with consonant, gradating:

MATTAR !Tris. Anim. Gradating C-Nouns
MALIS !Tris. Inanim. Gradating C-Nouns, Short compound-forms
MALISLONG !Tris. Inanim. Gradating C-Nouns. Long compound-forms
MALISLONGSHORT !Tris. Inanim. Gradating C-Nouns. Long and short compound-forms

OVCCIS ! Collective numerals
DAIVVAS !Tris. Gradating C-Nouns, The Troms declension: dáivvaš:dáivaha, bearaš:bearraha, njunuš:njunnoha

Ending with vowel, gradating:

BEANA !Trisyll. Anim. Gradating 0-Nouns
SEAMU !Short compound-forms. Trisyll. Inanim. Gradating 0-Nouns
SEAMULONG !Long compound-forms. Trisyll. Inanim. Gradating 0-Nouns

Ending with consonant, non-gradating:

GAHPIR !Short compound-forms. Trisyll. Non-Gradating C-Nouns
GAHPIRLONGSHORT !Long and short compound-forms. Trisyll. Non-Gradating C-Nouns
GAHPIRLONG !Long compound-forms. Trisyll. Non-Gradating C-Nouns

LEXDIMINC !Diminutives

Actios:

BOAHTINLONGSHORT ! Deverbal nouns
BOAHTIN ! Long compound-forms
BOAHTINSHORT ! Short compound-forms
BOAHTINLONGSHORTTV !!Words coming from transitive verbs Deverbal nouns
BOAHTINTV !!Words coming from transitive verbs Long compound-forms
BOAHTINSHORTTV !!Words coming from transitive verbs Short compound-forms
IHTAMAT !Actio plurals

Plurals:

BADJOSAT !Pl. bajus:badjosat
BADJOSATLONG !Pl. bajus:badjosat
ALIMAT !Pl. alin:alimat
CEAKCAGAT !Like seamu but plural only
VUOIGNAHAT !Like DAIVVAS but only Pl. vuoiŋŋaš:vuoigŋahat

Contracted nouns:

BOAZU !Anim. Contracted 0-Nouns
SUOLU !Short compound-forms. Inanim. Contracted 0-Nouns
SUOLULONG !Long compound-forms. Inanim. Contracted 0-Nouns

GISTTA !The Noun gistta, gist -

FALIS !Contracted Anim. C-Nouns
LASIS !Contracted Inanim. C-Nouns

GUOVTTIS ! Collective numerals
GUOVTTU !Here because other lexicons don’t fit for it

Plurals:

DURVAT !like LASIS but pl. only
GIRKOSADDOT !LIKE SATTU but pl, only

Miscellaneous noun types:

GENTLEMAN ! cns-final bisyllabic loanwords (stem mana-
MADII !Another peculiar word that deserves its own lex
MASAI !heavy fin syll !gen = -a ill = -ii !parallel to NYSTØ
BUDEITA !Rather special word
GARGIA ! light fin syll, bisyll. on -o- that doesn’t have change o:u in front of j (i): Kino
BUFFALO !heavy fin syll !gen = -a ill = -ii !parallel to NYSTØ
KULTUR !Recent loanwords on -vra with short cmp-form
KANTUR
MASKIN !Recent loanwords on -iidna with short cmp-form
TELEFON !Recent loanwords on -vdna with short cmp-form
SOSIAL !Recent loanwords on -ála with both short and long cmp-form
OLLUVUOHTA !Exceptional vuohta-Nouns
MAGASH
EGEZHAGAT

Nominal sublexica

tbw. Here we document the nominal section of the sme-lex.txt file.

khjh

The ProperNoun lexicon

The proper nouns are stored in gt/sme/src/propernoun-sme-lex.txt.

The file structure

Propernoun is one file, with the inflectional lexica first, followed by the over 36000 items long proper noun lexicon. This lexicon points to several sublexica. They are shown below, ordered according to the phonological and semantical properties of the stem. After the table comes a list of lexica not yet integrated in the table.

Type                            
StemCoda  CG  IllChange  Loc          Lexicon 
--------------------------------------------------------
Monosyllabic stems
HeavyVow  no   no        -as          NYSTØ  
HeavyCns  no   no        -as          BERN

Bisyllabic stems
LightVow  no   yes       -s           ACCRA  
LightVow  yes  yes       -s           MARJA
LightVow  yes  yes       -s           SUOPMA
Light e   no   no        -s           SIJTE
LightCns  no                          HEANDARAT
LightVow  no   no        -s           NIKOSIIJA
LightVow  yes  yes       -s           HEIM
LightVow  yes  yes       -s           SUND

Trisyllabic stems
LightCns  no   no        -is          LONDON
LightCns  no   no        -is          ANAR
LightCns  yes  yes       -is          DUORTNUS
LightVow  yes                         GUOLBBA
LightCns  no   no        -is          PLACE-DIM
LightCns  yes  yes       -is          RANES
LightCns  yes  yes       -is          CAVKKUS
LightCns  no   no        -is          BALAK
LightCns  no   no        -is          SARAK

Contracted stems
DAVVISUOLU !Inanim. Contracted 0-Names - from SUOLU
GEAVNNIS

Mixed stems
-nen      no   no        -as/nenis  no    C-FI-NEN
HAWAII
SKOHTERMADII

Plural stems
ALEUHTAT is for bisyllabic, would-have-been vowel final stems
VARGGAT  is as aleuhtat, but with Sg Gen and Loc substandard forms
EATNAMAT is for trisyllabic plural stems
ADJAGAT  is for trisyllabic plural stems
SULLOT   is for contracted plural stems
LASSAT   is for contracted plural stems
SKANIK   is for plurals on -k
---------------------------------------------------------
Notes:
Comp means "has comparative forms"

This table is still not completed. The lexica are there, but they are not all properly characterised.

Looking at the distribution, and ignoring the semantic subtags, we have a (slightly outdated) distribution of names over lexicon types as follows:

9114 ACCRA
8482 BERN
8371 MARJA
6480 LONDON
1390 C-FI-NEN
 865 NYSTØ
 296 PLACE
 133 DUORTNUS
  98 DAVVISUOLU
  68 NIILLAS
  62 GUOLBBA
  59 ANAR
  48 ALEUHTAT
  31 SULLOT
  28 HEANDARAT
  27 GIEDDI
  13 VARGGAT
  12 GEAVNNIS
   7 EATNAMAT

Saami geographical names

Saami names have been added from Norway, in 2002. Since that, Statens Kartverk has translated most (all?) Saami names from the old Bergsland/Ruong to the new 1979 orthography. Now, the remaining Sámi names should be added to the list, in the following way:

  1. All North Saami names should be extracted from the map base and run against the transducer
  2. The names that do not get a +Prop tag should be extracted, and then added to the base.

Saami names from Finland and Sweden have not been added, but they are underway.

Saami names from outside Sápmi, as listed in Sámi Atlas are added. Other Saami names from outside Sápmi should also be added, but we don’t know of good sources for such names (language councils are one possibility)

Names in other Saami languages are increasingly being used in North Saami texts, e.g. when referring to South Saami or Lule Saami institutions. These ones should be added and organised in lexica.

Adjectives

The file adj-sme-lex.txt contains 4270 entries.

AdjRoot

The adjective sublexica

In the lexicon file adj-sme-lex.txt, the sublexica are distributed in the following way (30.06.05) (ordered after frequency, thereafter after declension type):

Making linguistic sense of the system (Sammallahti's codes aaa etc.):

 899 DABALAS       aad
 477 JEAGOHEAPMI   bae
 358 BOAKKAS       c
 329 At            attributes
 280 BEAKKAN       aab
 218 GAPPUS        aab
 205 JOHTIL        babaa
 191 NUORRA        aaa
 165 AKTIIVA
 157 JIEDNAI       bad
 150 GARAS         bbb
 139 MEAHTTUS      aab
 114 LAIKI         baa
 106 DEARVVASLAS   aad
  71 ISSORAS       aad
  59 VIELG         babab
  38 CAHKK         babaa babab
  31 BUOREMUS      aae
  30 JUHKKIS       aaa
  26 GUOHCA        aab
  26 EATTAS        babba
  19 LODJI         baa bac
  14 GEARGGUS      aab
  13 SEARRA        baa
  13 DILDDAS       babba
   8 VUDDJII       bad
   8 CIENAL        babbb
   7 LINIS         bbe
   7 DEAHTIS       bbe
   7 BIEKKUS       babba
   7 ASEHAS        baf
   6 OVDDIT        aae
   6 NUOLUS        aab
   5 VATTIS        aab
   5 NJUORAS       bbb
   5 HEAHKAS       babba
   4 VIISSIS       aab aac
   4 LIEKKUS       aab
   4 FINJU-        
   3 UHCC          bba
   3 SUVRRIS       bbe
   3 JALGAT        bbc babab
   3 BUORRE         ab
   3 BU/MUS        
   3 ATTR
   3 ALLAT         bbc
   2 NJALGGAT      bbe
   2 NAMAT




a.................No morphologically distinct attr. form
  aa..............attr. is not inflected

    aaa...........Bisyll.
                  NUORRA      
                  JUHKKIS     Non-gradating 

    aab...........Trisyll.
                  BEAKKAN     Gradating 
                  GAPPUS      Gradating 
                  MEAHTTUS    Compounded non-gradating 
                  GUOHCA      Gradating 
                  GEARGGUS    Gradating 
                  NUOLUS      Gradating 
                  VATTIS      Gradating 
                  VIISSIS     Gradating 
                  LIEKKUS     Gradating 
                  IPMAHA      Gradating

    aac...........Contracted 
                  VIISSIS     

    aad...........Qadrisyll.
                  DABALAS      
                  DEARVVASLAS 
                  ISSORAS     

    aae...........Comparative forms
                  BUOREMUS    
                  OVDDIT      


  ab..............Attr. in partial congruence with the noun
                  BUORRE     





b.................Morphologically distinct attr. form  
  ba..............Attr. form ends with -s

    baa...........Bisyll.
                  LAIKI       Attr. form -es in WeG
                  LODJI       Attr. form -es and -is in WeG
                  SEARRA      Attr. form -s in WeG

    bab...........Trisyll.
      baba........Non-gradating
        babaa.....No contraction in attr. form
                  JOHTIL  is-Attr.
                  CAHKK   is-Attr.

        babab.....Contraction (and ensuing strengthening) in attr. fom
                  VIELG   es-Attr. 
                  CAHKK   es-Attr.
                  JALGAT  es-Attr.   


      babb........Grade alternation in stem
        babba.....s-stems with WeG attr-form
                  EATTAS  is-Attr.
                  DILDDAS is-Attr.
                  BIEKKUS is-Attr
                  HEAHKAS is-Attr

        babbb.....Attr. form in StG
                  CIENAL  is-Attr.


    bac...........Vocalic contracetd attr. form
                  LODJI       -es and -is Attr. in WeG

    bad...........Contracted stems
                  JIEDNAI     Non-gradating 
                  VUDDJII     Non-gradating 

    bae...........Caritives with attr- -his.
                  JEAGOHEAPMI 

    baf...........Qadrisyll. Adj., with Attr.-his
                  ASEHAS      




  bb..............Attr. form ends with -a
    bba...........Bisyll. ending in -i (or -a sometimes) 
                  UHCC        StG attr.
   
    bbb...........Trisyll. ending -as
                  GARAS       Gradating with StG attr.
                  NJUORAS     Gradating with StG attr.

    bbc...........Trisyll. ending -at
                  JALGAT      Non-gradating with StG attr.
                  ALLAT       Non-gradating with StG attr.

    bbe...........Trisyll. ending -is
                  LINIS       Gradating with StG Attr. 
                  DEAHTIS     Contracted with StG Attr.
                  SUVRRIS     Gradating with WeG Attr. 
                  NJALGGAT    Gradating with StG Attr.







c.................No attr.
                  BOAKKAS         Trisyll. 

Verbs

The VerbRoot lexicon

The lexicon is stored in the verb-sme.txt file.

VerbRoot contains 49 sublexica divided into three stem types:

Modals:
**

GALGA
FERTE

Bisyllabic verbs:

Impersonals:
ARVI arvit sataa !Bisyllabic Impersonal Verbs
ARVALADDAT arváladdat sataa !Already derived bisyllabic Impersonal Verbs

Transitives:
DIEHTI diehtit tietaa !Bisyllabic i-Verbs with Personal Passive
BORRA !Bisyllabic a- and u-verbs with Personal Passive
DEAKCU !as BORRA for u-verbs with dim -astit, and a-verbs with dim -istit that are hardcoded
DIEHTISHORT !Short actio compound-form DIEHTILONGSHORT !Long and short actio compound-form DIEHTALADDAT diehtáladdat tietaa !Already derived bisyllabic Verbs with Personal Passive
HAHTTIT !Four-syll kausatives on -httit
DAHTU dáhtut ! As diehti, but -ut verbs, thus without short passive
BOLTU ! As dáhtu but with dim -astit that are harcoded
DOAROSTUVVA !INCHOATIVES on -stuvvat

Intransitives:
ALLU ! -ut verbs, thus without short passive BOAHTI boahtit tulla !Bisyllabic i-Verbs without Personal Passive
BOAHTILONGSHORT !!Long and short actio compound-form DIEVVA ! Bisyllabic a- and u-verbs without Personal Passive but with Actor BOAZZU !as DIEVVA for u-verbs with dim -astit, that are hardcoded BINDU !as DIEVVA (but without short passive) for u-verbs with dim -astit, that are hardcoded BOAHTALADDA boahtáladda tulla !Already derived bisyllabic Verbs without Personal Passive
RAIMMAHALLA !passives on -hallat and INCHOATIVES on -stuvvat

Personal Passive but with Acc obj:
MAHTI máhttiit ! Bisyllabic Verbs without Personal Passive but with Acc obj.
MAHTALADDA máhtáladdat ! Already derived bisyllabic Verbs without Personal Passive but with Acc obj.

Inherent passives:
RAIMMAHALLA !passives on -hallat and INCHOATIVES on -stuvvat UVVA !passives -uvvat

Contracted verbs:

Impersonals:
BORGE borget tehda pyry !Contracted Impersonal Verbs

Transitives:
DOHPPE dohppet tarttua !Contracted Verbs with Personal Passive
MUITA ! Inchoatives and translatives on -á, -o, -e with Personal Passive

Intransitives:
GILLE gillet viitsia !Contracted Verbs without Personal Passive
CIRRO ! Inchoatives and translatives on -á, -o, -e without Personal Passive

Personal Passive but with Acc obj:
MAHTA máhttát !Contracted Verbs without Personal Passive but with Acc obj.

Inherent passives:
BASSO bassot ! Bisyllabic, inherently passive -ot verbs

Trisyllabic verbs:

Impersonals:
COASKKIT čoaskkidiit !Trisyllabic impersonal verbs
ARVVASJ arvvašit !Trisyllabic impersonal verbs ending -šit, -skit, smit, -idit, -ldit, -git and 5-syllables
ARVIL arvilit !Trisyllabic impersonal verbs ending -lit

Transitives:
MUITAL muitalit !Trisyllabic Verbs with Personal Passive
MUITTASJ !Words ending -šit, -skit, -ldit - !directed here as well: !Reciprocals on -dit !Momentatives on -dit, -ádit, -ihit, -e7hit !Frequentatives on -(u)hit !Continuatives on -nit !Inchoatives in -nit

HALIID !Words ending -smit, -idit, -git - BONJAT !!Cont/Freq on -dit, Continuatives on -(u)hit, Reciprocals, momentatives and frequentatives ending -alit
VUORDIL !Trisyllabic Verbs ending -lit, -rit with Personal Passive

Intransitives:
ALIST alistit !Trisyllabic Verbs without Personal Passive
BEAGASJ !Words ending -šit, -skit -ldit, transitives on -hit
!directed here as well: !Reciprocals on -dit !Momentatives on -dit, -ádit, -ihit, -e7hit !Frequentatives on -(u)hit !Continuatives on -nit !Inchoatives in -nit

JORGGIID !Words ending -smit, -idit, -git
BALAT !!Cont/Freq on -dit, Continuatives on -(u)hit, Reciprocals, momentatives and frequentatives ending -alit
SUOTNJAL suotnjalit !Trisyllabic Verbs without Personal Passive ending -lit
BOTNJAS botnjasit !Trisyllabic Verbs without Personal Passive ending -nit and -sit
LASSAN !Trisyllabic Verbs ending -nit without Personal Passive

Personal Passive but with Acc obj:
GEAGAT ! Trisyllabic Verbs without Personal Passive but with Acc obj. BUOVVAL buovvalit ! Trisyllabic Verbs without Personal Passive but with Acc obj ending -lit

The stems are distributed numerically as follows (the -it class includes both even-syllable and odd-syllable verbs):

-at            3722
even-syll  -it 1035 
-ut             794
total          5551

3-syllabic -it 5376

-át            297
-et            2310
-ot             111
total          2718

Verbal sublexica

tbw.

Verbal derivation

Here documenting the main even-syll ones, the other ones are similar. DIEHTI is transitive, BOAHTI is intransitive.

DIEHTI -> +V: DIEHTIStem ; +V: DeverbalVerbsDIEHTI ;
BOAHTI -> +V: BOAHTIStem ; +V: DeverbalVerbsBOAHTI ;
DIEHTIStem -> :Y7j PASSIVE ; BOAHTIINCH ;
BOAHTIStem -> SG3PASSV ; BOAHTIINCH ;
BOAHTIINCH -> DeverbalNounsV ; +goah0ti:goah'ti BOAHTICnj ; BOAHTICnj ;
BOAHTICnj -> +Ind+Prs: PrsV ; +Ind+Prt: PrtV ; +Pot+Prs:Q7z1 PrsC ;
      +Cond: CondV1 ; +Imprt: ImprtVA ; NominalFormsV ;
NominalFormsV -> :X1 NominalFormsV1 ; :X4 NominalFormsV2 ;
      :Q6 NominalFormsV3 ; :X2 NominalFormsV4 ; :Q3 NominalFormsV5 ;
      :Y1 NominalFormsV6 ;
PASSIVE ->  +Pass:uvvo DOHPPEINCH ; +Pass+meahttun+A:uvvomeahttum
      MEAHTTUN ; +Pass+PrfPrc:un K ; +Pass+eaddji+N+Actor:uvvojeaddji¤
      DEVNVCASE ; +Pass+upmi+N:upmi DEVNVCASE ;
DeverbalVerbsDIEHTI ->
 +st:X8st MUITALStem ;
 +st+alla:X6stalla DIEHTIStem ;
 +st+adda:X6stadda DIEHTIStem ;
 +l:l MUITALStem ;
 +l+adda:X2ladda DIEHTIStem ;
 +l+ahtti:lahtti DIEHTIStem ;
 +l+asti:las'ti DIEHTIStem ;
 +h:X4h MUITALStem ;
 +h+alla:X6halla DIEHTIStem ;
 +h+adda:X6hadda DIEHTIStem ;
 +h+asti:X4has'ti DIEHTIStem ;
 +stuvva:X8stuvva SG3PASSV ;
 +d:Q8d MUITALStem ;
DeverbalVerbsBOAHTI ->
 +st:X8st ALISTStem ;
 +st+alla:X6stalla BOAHTIStem ;
 +st+adda:X6stadda BOAHTIStem ;
 +l:l ALISTStem ;
 +l+adda:X2ladda BOAHTIStem ;
 +l+ahtti:lahtti BOAHTIStem ;
 +l+asti:las'ti BOAHTIStem ;
 +h:X4h MUITALStem ;
 +h+alla:X6halla DIEHTIStem ;
 +h+adda:X6hadda DIEHTIStem ;
 +h+asti:X4has'ti DIEHTIStem ;
 +stuvva:X8stuvva SG3PASSV ;
 +d:Q8d ALISTStem ;

Comments to the verb sublexica

Within each of the main groups, there are FIVE types, impersonal verbs, verbs with and without personal passives, verbs without Personal Passive but with Acc obj (+ two more lexicas, see above The VerbRoot lexicon), and inherent passives. The difference between i/a/u and e/á/o verbs is handeled in the rules file, and not in the lexicon file.

The with / without Personal Passive distinction shows up in one sublexicon. DOHPPE has PASSIVE, where GILLE has SG3PASS. So, this is (probably) a transivity difference, cf. also diehtit vs. boahtit. It seems thus that the difference is one of transitivity: 0, 1 and 2 valence.

At present, the file verb-sme-lex.txt comtains all the verbs. In the beginning of the file, all sublexica are exemplified. Then follows the bulk of the verbs, twosyllabic even, manysyllabic even, odd and contracted verbs.

Pronouns

The tag system follows the outline in Nickel.

All Pronouns have the initial lexicon path Root -> Pronoun -> …

Personal pronouns

Lexicon path:

Personal
 firstperspron
  firstperspronsg -> wordforms -> K
  firstpersprondu -> wordforms -> K
  perspronpl -> wordforms -> K
 nonfirstperspron
  nonfirsperspronsg -> wordforms -> K
  nonfirstpersrondu -> wordforms -> K
  perspronpl -> wordforms -> K

Note that 3rd person is identical for all three persons. Not all forms were different for the sg and du forms, but the lexica were split for consistency.

Interrogative pronouns

Mii, Gii, Guhte, Guhtemuš, Makkár, MAn Láhkái. The sublexicon Interrogative contains one entry for Sg Nom, and points the rest to the case paradigms.

Interrogative +Sg+Nom -> K (one entry for gii and one for mii)
        oblintercas (one entry for gii and one for mii) demcas

Demonstrative pronouns

The lexicon path:

Demonstrative
 demcas (one entry for each stem)
  demcassg
   nomdemcassg -> wordforms -> K
   obldemcassg -> wordforms -> K
  demcaspl
   nomdemcaspl -> wordforms -> K
   obldemcaspl -> wordforms -> K

Reflexive pronouns

The Nominative forms are just listed. The oblique ones are directed to the sublexicon reflobl, and there directed via different case stems to appropriate Px sublexica. These sublexica are the same as the ones for nouns, they are found in the sme-lex.txt file. The only exception are some sublexica that are used only for plural forms, these are duplicated here from the sme-lex file, in order not to revise the main lexicon.

Reciprocal pronouns

The section on reciprocal pronouns consists of three parts. The first 6 entries handle the first element of the recipr. The next 12 handle the 2nd part of the non-Px recipr. Finally, the members of the third section point to special Px sublexica, designed for the reciprocal pronouns, and found in the same section.

Relative pronouns

Formally, the relative and interrogative pronouns are identical. In this parser, we skip the separate chapter on relative pronouns, and instead we use the interrogative pronouns.

Indefinite pronouns

We divide the indefinite pronouns into three groups, with a fourth group of leftovers waiting for a better destiny:

  1. Declineable indefinite pronouns with case + clitic (mihkkege, giige, guhtemušge)
  2. Declineable indefinites with normal case paradigms (eará, eanas, muhtin)
  3. Indeclineable indefinites (buot, eatnat, guhtet)
  4. TODO: A set of ideosyncratic cases

Declineable indefinite pronouns with case + clitic

These pronouns have two stems, one nominative, and one oblique, and the clitic -ge attached to the case ending. The initial lexicon splits them in two, one hard-coded nominative (e.g. giige), and one oblique stem (e.g. gea-). Then, the case + clitic sequence is treated as a single suffix (e.g. locative -sge, etc.). Since the clitic slot has already been filled, they are directed to # rather than to K.

Declineable indefinites with normal case paradigms

This section hosts a seemingly complicated system of taylored sublexica. It contains three sections: First a section where the pronouns themselved are split into different continuation lexica, then a section with intoermadiate lexica, and finally a section with the case suffixes themselves. The lexica are partly modelled upon nominal lexica.

Naming convention for the sublexica:

Indeclineable indefinites

There is first a list of multiword indefinites. These are picked out by the preprocessor and copied onto a file abbr.txt and put in the bin/ catalogue. In thie closed-sme-lex.txt they behave like the other indeclineable indefinites.

The ideosyncratic rest category

Indefinite pronouns are complex, and the grammars are not always explicit enough, so this section hosts a set of pronouns, partly with a hard-coded tag, partly just commented out. They are awaiting a principled linguistic solution, but in order to do that, we need more info than we can get from the reference grammars.

TODO: Have a linguistic/native speaker-look at this section.

Numerals

Overview of the lexicon structure

The numeral lexica are formed as a generator, generating all possible numerals. The basic lexicon is Numeral, and it looks like this:

LEXICON Numeral
MILJON ; ! a noun of its own
UNDERDUHAT ; ! for generator under 1000
JUSTDUHAT ; ! going via 1000
OVERDUHAT ; ! for generator over 1000
OLD ; ! for "thirteen hundred, etc.
!num-basic ; ! replaced by the 5 lexica above
!num-derived ; ! still unimplemented
num-imprecise ;! still almost unimplemented
ARABIC ; ! for the arabic numerals
ROMAN ; ! for the roman numerals

MILJON is a noun. OLD is the old way of counting. num-ordinal act like adjectives, they are not finished yet. ARABIC and ROMAN contain number generators.

So, what is the reason for the three different lexica around 1000?

The reason is that the numeric system turns at the thousand mark. Numbers above it and numbers below it behave in the same way, thus we have both twentyfour and twentyfourthousand, etc.

The path is OVERDUHAT -> JUSTDUHAT -> UNDERDUHAT. OVERDUHAT generates the part of the numeral that is over 1000, and all these lexica then point to JUSTDUHAT. That lexicon has an optional “(one) thousand” before it leads either to DUHAT and via the relevant case paradigm to K, or to UNDERDUHAT. UNDERDUHAT contains the numerals 1-999. UNDERDUHAT starts with the lexicon for one, and gives each group of numerals its own lexicon.

Cardinals and ordinals

The cardinal and ordinal numbers are split at the final lexica, the OKTAF and 2TO9F lexica. This generates both numbers as second and fiftysecond.

Indeclinable words

All the lexica for indeclinable words are made the same way:

LEXICON Root
Adverb ;
 LEXICON Adverb
 áđamusat adv ;
  LEXICON adv
  +Adv:0 K ;

The Root lexicon points to the POS lexica (Adverb etc.). Each of the POS lexica lists the entries, with a pointer to an arbitrarily named sublexicon (here “adv”). This sublexicon contains the grammatical tag for the POS in question (the tag has no surface form, hence “:0”), and eventually a pointer towards the cliticon lexicon K. Adverbs can have clitics added, hence K, whereas subjunctions do not, hence no K.

[XXX At the moment particles are not directed to K, perhaps they should be. TODO: Check with corpus and native speakers.]

Adverbs

They are explained in the intro to the section “Indeclinable words” above.

Particles

These are in the closed-sme-lex.txt file. Their tag is +Pcle and th|qeir lexicon path is:

Root -> Particle -> pcle -> #

Subjunctions

Subjunctions are ahte, juos, etc. These are in the closed-sme-lex.txt file. Their lexicon path is:

Root -> Subjunction -> -> #

Conjunctions

Conjunctions are ja, dahjege, etc. These are in the closed-sme-lex.txt file. Their tag is +CC and their lexicon path is:

Root -> Conjunction -> Cc -> #

P-positions

There are three different classes here: Postpositions, occuring after their complement, prepositions, occuring before, and adpositions, occuring both before and after. This could have been done the Lingsoft way as well: Having +Adp as a common tag for both, and eventually +Prep and +Postp as subtags, no subtag would indicate both ways (or both subtags could be used). At the moment, they are left as 3 distinct groups. The classification is based upon Nickel, p-positions found only in Sammallahti’s dictionary and not in Nickel were put in the Adposition group. Empirical studies will probably lead to rearrangement of the present division, this should be looked into in connection with the morphological disambiguator (cg grammar).

Adpositions

Adpositions are are bajil, birra, gaskal, etc. These are in the pp-sme-lex.txt file. Their tag is +Adp and their lexicon path is:

Root -> Adposition -> Pp -> #

Postpositions

Postpositions are are bokte, lusa, etc. These are in the pp-sme-lex.txt file. Their tag is +Po and their lexicon path is:

Root -> Postposition -> Postp -> #

Prepositions

Prepositions are are aisttan, earet, etc. These are in the pp-sme-lex.txt file. Their tag is +Pr and their lexicon path is:

Root -> Preposition -> Prep -> #

Interjections

Interjections are are hoi, huh, kyš-kyš, etc. These are in the closed-sme-lex.txt file. Their tag is +Interj and their lexicon path is:

Root -> Interjection -> Ij -> #

Abbreviations

There is a file called abbr-sme-lex.txt.