Lule Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-smj

Introduction

This file describes the Lule Saami lexicon files. The documentation is not complete. Status quo for the lexicon files is as follows (030310):

The file structure for the lexicon files is the same as for the other Saami lgs, cf. the North Sámi flowchart.

Lexicons

sme   smj      
ACCRA ACCRA
MARJA MARJA
NYSTØ NYSTØ
BERN BERN
LONDON LONDON
C-FI-NEN LONDON
ANAR ANAR
ALEUHTAT ALEUHTAT
SUND BERN
HEIM BERN
HAWAII ACCRA

The Propernouns

The proper nouns are extracted out of the common file proper-noun.xml. This file takes sme as its starting point. The continuation lexicon correspondence is as follows

Type 
StemCoda    CG  IllChange    Loc      Lexicon  Status
-----------------------------------------------------
LightVow    no   yes         -s        ACCRA     
LightVow    yes  yes         -s        MARJA 
                                      
HeavyVow    no   no          -as       NYSTØ     tmp
                                      
HeavyCns    no   no          -as       BERN
LightCns    no   no          -is       LONDON

DUORTNUS ! Trisyll. Cns-final, cons.grad.
ANAR     ! Trisyll. Cns-final, no cons.grad.
SARAK    ! Trisyll. Cns-final, no cons.grad

BASUDIS  ! Bisyll. Non-Gradating C-Proper Names

GEAVNNIS   ! Contracted. Gradating C-Proper Names
DAVVISUOLU ! Contracted. Gradating C-Proper Names

Plural names

VARGGAT
ALEUHTAT
SULLOT
EATNAMAT

The Nouns

The nouns are divided in the following sublexica (the noun-smj-lex.txt file is organized in chapters according to these sublexica).

Spiik    LEXICON       Type
--------------------------------------------------------------
even     MUORRA        2syll stem with cg (note Q1)
even     GÁDDE         2syll stem with cg (note Q1) with comparatives
even     ÅLGGO         Like MUORRA but no inessive/elative/illative sg, with comparatives
even     MIEHTE        Like MUORRA but no locative/elative/illative sg
even     NOADE         2syll stem without cg (note missing Q1)
even     KINO          2syll stem without any morphophonology
even     VUOSA         2syll stem with cg only pluralforms
even     AHKA          words like tjerastahka with short compound-form
even     OADÁDAGÁ      pl. words like tjerastahka with short compound-form
even     VUOHTA
even     VSBST-EVEN
even     ACTOR
even     ACTORdiehtet
even 4sl GÅNÅGIS       4syll stem with cg
even 4sl CELSIUS       4syll stem without cg
even 4sl BERRAHATTJA   like GÅNÅGIS but pl.
even 4sl JIHPELIJ      gen:jihpelahá
even 4sl OARJJILIJ     gen:oarjjilihá
even 4sl VIESSOMUJ     gen:viessumuhe
even 4sl SIJDDALAHÁ    plural
even 4sl SISSNELUHÁ    plurals
odd      ÅRES          oddsyll, cg, C-final, 2ndsyll vowchange
odd      GÁMAS         oddsyll, cg, C-final, no 2ndsyll vowchange
odd      SÅHKÅR        oddsyll, cg, C-final, no 2ndsyll vowchange and only long essive
odd      GAHPER        oddsyll, no cg, no vowchange
odd      BENA          oddsyll, cg, vowchange, V-final nominative
odd      DÁRBBAGA      like BENA BUT ONLY PLURAL
odd      VÁLLDÁRA      oddsyllable pluralforms only
odd      SNJIERÁGA     oddsyllable pluralforms only
odd      BÆLLJASA      GÁMAS but plural
odd      MANEBU        oddsyllable pluralforms only
odd      BAVSEV        gen:baksima
odd      GÅRDAL        gen:goarddala
odd      RÁBEV         gen:ráhpuga
odd      SEKKAS        gen:sæggása
odd      DERDAJ        gen:dærddaga
odd      BØSOJ         gen:bøssuga
odd      GUOVSOJVUOJOJ gen:vuodjom
odd      RITJAS        Like GÁMAS but without stem a-lengthening for grade I (underlying long -i-)
odd      TJÅLKES       gen:tjoalkkasa
odd      BÁRNEP        gen:bárnebu
odd      SÅGAS         gen:sågaska
odd      SÁGE          gen:sáhkaha
odd      BUTJES        gen:buttjása
odd      VSBST-ODD
cntr     SARVES        cntr C-final with cg
cntr     SUOLOJ
cntr     GUOMOJ       III-I cg
cntr     GÅHKES       gen:gåhkkå
cntr     SJUOKKAJ     gen:sjuoggá
cntr     GISTÁ

loan word lexica:

LEXICON KULTUR !Recent loanwords on -vrra with long and short compound-form
LEXICON FAKTOR !Recent loanwords on -åvrrå with long and short compound-form
LEXICON INSPEKTØR !Recent loanwords on -ørra with long and short compound-form
LEXICON TELEFON !Recent loanwords on -åvnnå with long and short compound-form
LEXICON INSTITUTION !Recent loanwords on -sjåvnnå with long and short compound-form: -TION IN SWEDISH 
LEXICON MISSION !Recent loanwords on -sjåvnnå with long and short compound-form: -SSION IN SWEDISH
LEXICON PENSION !Recent loanwords on -sjåvnnå with long and short compound-form: -SION IN SWEDISH
LEXICON BENSIN !Recent loanwords on -ijnna with long and short compound-form
LEXICON MASKIN !Recent loanwords on -sjijnna with long and short compound-form: -SKIN
LEXICON ADJEKTIV !Recent loanwords on -ijvva with long and short compound-form
LEXICON ANTIKK !Recent loanwords on -hkka with long and short compound-form on -kk/-k
LEXICON APOTEK !Recent loanwords on -hkka with long and short compound-form on -k
LEXICON ADVOKAT !Recent loanwords on -áhtta with long and short compound-form
LEXICON BUDSJETT !Recent loanwords on -æhtta with long and short compound-form
LEXICON INTERNET !Recent loanwords on -æhtta with long and short compound-form: -ET IN SWEDISH
LEXICON IDENTITET !Recent loanwords on -ehtta, -uhtta, -ihtta with long and short compound-form on -t
LEXICON INSTITUTT !Recent loanwords on -ehtta, -uhtta, -ihtta with long and short compound-form on -tt(NOR)/-t(SW)
LEXICON PARTISIPP !Recent loanwords on -hppa with long and short compound-form on -pp
LEXICON ALLEGORI !Recent loanwords on -ija with long and short compound-form
LEXICON PATENT !Recent loanwords on -ænnta with long and short compound-form
LEXICON VARIANT !Recent loanwords on -ánnta with long and short compound-form
LEXICON PARADIS !Recent loanwords on -jssa with long and short compound-form
LEXICON KOLLEKT !Recent loanwords on -ækta with long and short compound-form
LEXICON TEKSTIL !Recent loanwords on -jlla with long and short compound-form
LEXICON SYSTEM !Recent loanwords on -ebma with long and short compound-form
LEXICON ADVERB !Recent loanwords on -ærbba with long and short compound-form
LEXICON AGRONOM !Recent loanwords on -åvmma with long and short compound-form
LEXICON TABELL !Recent loanwords on -äl0la with long and short compound-form
LEXICON AREAL !Recent loanwords on -álla with long and short compound-form
LEXICON TELEGRAM !Recent loanwords on -ám'ma with long and short compound-form
LEXICON ORGAN !Recent loanwords on -ádna with long and short compound-form
LEXICON SEMINAR !Recent loanwords on -árra with long and short compound-form
LEXICON SBSTANS !Recent loanwords on -ánssa with long and short compound-form
LEXICON VALENS !Recent loanwords on -ænssa with long and short compound-form
LEXICON TURIST !Recent loanwords on -ssta with long and short compound-form
LEXICON FANATISM !Recent loanwords on -ssma with long and short compound-form
LEXICON DEMAGOG !Recent loanwords on -ga with long and short compound-form

NB! Note that grade III nouns of the biel’lo type must be added as such in the lexicon, in order not to get wrong CG results. Also, non-gradating nouns must go to different sublexica, and these lexica must point to CG-free case sublexica.

The Adjectives

The adjective lexica follow Spiik’s numbering system, in the following way: /

Spiik   LEXICON      Type
--------------------------------------------------------------------
even 1a GIEVRRA      Attr WeG, -s
even 1b NUORRA       Attr =
even 1c HÁVSSKE      Attr w/o WeG, -s
even 1d UNNE         Attr e > a~e, WeG
even 1e SÁVADAHTTE   Causative-participles
even 1e JUHKKE       participles
even 2  ÁRMMOGIS     Attr = Pred. Nom. Sg.
even 2  MÁHTULASJ    as ÁRMMOGIS but no attr
even 2  ÅLLAGASJ     as ÁRMMOGIS but no comparatives.
misc.   ASIDASJ
misc.   DÁRBBOLASJ   no attr
misc.   TJALMEDIBME
even    SUOLASIEHKE  as GIEVRRA but no attribute
even    NUORTTALABBO inherent comparatives
even    GASSKALAMOS  inherent superlatives
even    TJAVGGÁMUS   !Inherent superlatives
even    DÁBBELIJ     gen:dábbelahá
odd a.  TJIEGOS      Adj. on -k, negating on -dahkes, sm adjs on -os
odd b.  BASSTEL      Adj. on -et, -l, -r
odd c.  MUTTÁK       Adj. on -ák,  {aáå}s, Attr -is or ==
odd d.  ALLAK        Adj. on -ak, Attr on -a
odd d2. GÅBDDÅK      Adjs on -åk, attr. on -å, short comparatives
odd e.  GALMAS       Adj. on WeG -as, Attr on StrG -a
odd f.  SUOHKAT      Adj. on -at
odd f2.
odd f3. LÅSSÅT       Two sets of comparatives and two attr. forms here
odd g.  DIMES        Adj. on -es Attr dibma
odd g.  BITJES       bitjes:bihtjas-
odd g2. OAMES        Adj. on -es Attr =
odd h.  TJALMMIS     Adj. on -is with Attr -is and =
odd     STUORAK      Adjective stuorak, stuoragat, comp:stuoráp
odd     SJÆVNNJAT    miscellanious uneven No attribute, stemvowel change
odd     RIHTSOK      miscellanious uneven No attribute, no stemvowel change
odd     OANEP        Inherent comparatives
odd     TSIBTSA      attr =
odd     IENNILS      no comparatives
odd     GÅNTSAS
odd     DABÁR        like BASSTEL but with CG
odd     SNÁVES       snáves:snávvas-
contr.  SÁDNES       attr =
contr.  GOAVSOS      attr =
contr.  SUVRES       attr suvra
---------------------------------------------------------------------

The Attribute

The case forms

Comparative and superlative

The verbs

Introduction

Auxiliary verbs and main verbs are treated separately. Note the concatenated entries for auxiliary + negation.

The main verbs are divided in even (-at, -åt, -ot, -et), odd (-it) and contracted (-át, -åt~ut, -it) stems.

The auxiliaries

The auxiliaries have the lexica NEG, ÅRROT, LIEHKET. The verb stems are declared initially in the verb-smj.txt file, and the continuation lexica themselves are declared towards the end of the smj-lex.txt file. Since the auxiliaries are irregular, the suffixes are just listed.

The main verbs

VerbRoot lexicon

lexicon for impersonal verbs
lexicon for verbs with personal passives, Transitives
lexicon for verbs without personal passives, Intransitives
lexicon without Personal Passive but with Acc obj
lexicon for inherent passives

Modals:

GALGGAT VIERTTIT

Even-syllable:

Impersonal verbs: BIEGGAT !Impersonals
GALSSJOT !Impersonal o-verbs
BIEKKASTALLAT !Allready derived Impersonals

Intransitives:
RAVGGAT !a- and å-verbs
BUOLLET !e-verbs
VIEDJET !e-verbs GRADE II-I WITH IE DIPHT.
BÅRSSJOT !o-verbs
VILSSJOT !o-verbs with dim -astit that are hardcoded
RAVGGALASSTET !Like RAVGGAT for already derived words (except words ending -uššat)

Transitives:
BASSAT !a- and å-verbs
JUHKAT !a-verbs with dim -istit that are hardcoded
LÁHPPET !e-verbs
JIEHKET !e-verbs GRADE II-I WITH IE DIPHT.
BASSALASSTET !Like BASSAT for already derived words (except words ending -uššat)
DIEHTET !Only this one word, unusual diphtong behavior
GÁDJOT !o-verbs
JÅRGGOT !o-verbs with dim -astit that are hardcoded

*Personal Passive but with Acc obj
MÁHTTET* !transitive verbs without personal passive

Passives:
GUOTTEDALLAT !passives

Uneven-syllable:

Impersonal verbs:
BIEKKASTIT !Impersonals

Transitives:
UNNEDIT ! Transitives
MUJTATJIT !Words ending -tjit, -jdit, reciprocals on -dit, momentatives on -dit, -edit, continuatives on -ldit, -nit, essives on -hit and 5-syllables
BÅNJÅDIT !continuatives on -dit, frequentatives on -odit, reciprocals, momentatives and frequentatives ending -alit
VUORDDELIT !Trisyllabic Verbs ending -lit Intransitives:
JÅLLÅRIT !At the moment IV, we may perhaps change IV/TV.
BEGATJIT !Words ending -tjit, -jdit, reciprocals on -dit, momentatives on -dit, -edit, continuatives on -ldit, -nit, essives on -hit and 5-syllables
BALÁDIT !continuatives on -dit, frequentatives on -odit, reciprocals, momentatives and frequentatives ending -alit
SUOGNALIT !Trisyllabic Verbs ending -lit
LASSÁNIT !verbs ending -nit, -sit
BÁHTARIT !verbs ending -rit

Contracted stems:

Impersonal verbs: SJIERRIT !Impersonals

Intransitives: OADDÁT ! Do not make nouns via -ár derivation DULLUT ! Do not make nouns via -ár derivation VALLIT ! Make nouns via -ár derivation

Transitives: TJUOLLÁT ! Do not make nouns via -ár derivation STRÁFFUT ! Do not make nouns via -ár derivation TSIEGGIT ! Make nouns via -ár derivation Passives: BASSUT !Passives

The disposition

The disposition is as follows:

  1. The stem class lexica and their immediate daughter lexica
    1. Even-syllabic stems
      1. Modals
      2. Ordinary main verbs
    2. Odd-syllabic stems
    3. Contracted stems
  2. Main inflectional categories
  3. Intermediate suffix lexica
  4. Final suffix lexica
  5. Derivation

Inflection

The stem class lexica

Each main stem type has its own section, for the lexica with pointers from the verb file. These lecixa first split into a non-derivational and a derivational part, with BASSATStem and DeverbalVerbsBASSAT (etc., for the other lexica). The latter point to the derivational lexica, cf. next xection. The former is a set of lexica with pointers towards passive (EVENPASSIVE) and non-passive (BASSATINCH) continuation lexica.

The BASSATINCH etc. lexica point twice to INFL-EVEN etc., both directly and via the inchoative -goahte- suffix.

The main inflectional categories

The INFL-EVEN, INFL-o-, INFL-ODD, INFL-CONTR lexica then point to the different person-number sublexica. Note that there in Lule Saami are two person-number suffix series. One is used in even-syllabic verbs for present tense and in odd-syllabic verbs for past tense, this is named with roman numeral I in the sublexica to follow. The other one is used in even-syllabic verbs for past tense and potential, lexica with this suffix set are named with roman numerals II. The reason for this abstraction is that we do not need to write the same suffixes several times, rather, we may refer to the same suffix set from different stem-tense-mode combinations.

The pointers to the person-number lexica are done on blocks of nine and nine.

The intermediate suffix lexica

Here, the person-number blocks are split in different groups, according to morphophonological processes.

The final suffix lexica

At last, we are in a position to assign person-number suffixes.

Derivation

The passive lexica add the passive suffixes, and thereafter point further on to the inflection lexica.

The derivational system takes the North Saami system as a starting point, but does not (yet) take transitivity into account.

From the stems of each stemtype, there are pointers to lexica

DeverbalVerbsBASSAT,DeverbalVerbsJUHKAT, DeverbalVerbsLÁHPPET, DeverbalNounsBASSAT,DeverbalNounsBASSALASSTET, DeverbalVerbsBIEGGAT, DeverbalVerbsDIEHTET, DeverbalNounsDIEHTET, DeverbalVerbsJÅLLÅRIT, DeverbalNounsJÅLLÅRIT, DeverbalNounsMUJTATJIT, DeverbalVerbsVUORDDELIT, DeverbalVerbsLASSÁNIT, DeverbalVerbsBIEKKASTIT, DeverbalNounsOADDÁT, DeverbalVerbsOADDÁT, DeverbalNounsSTRÁFFUT, DeverbalVerbsSJIERRIT.

All these lexica are found in a separate section “Verbal derivation” following the section on verbal inflection.

The closed classes

Pronouns

Personal pronouns

Lule Saami personal pronouns are added to the file closed-smj.lex.txt. They differ from the corresponding North Sami ones in having uniform sublexica across the different persons. The initial lexicon Personal splits into nine sublexica, 3 for each person, each pointing to one sublexicon for each number. The 3 number sublexica (persproncasesg, etc. then give the case forms.

Demonstrative pronouns

This lexicon contains both declineable and indeclineable entries. The declineable ones point to five sublexica, dakkárcase, demcas, juohkkahasjcase, duobbacase and dátcas. Of these, the former splits into number sublexica.

Reflexive pronouns

In the Reflexive sublexicon, the reflexive pronoun iesj gets spelled-out forms for the Px-less nominative, whereas the oblique forms are directed to sublexica for sg and pl. There they get appropriate stems, and the singular forms are redirected to the ordinary nominal Px lexica. Acc PxSg1 and Gen PxSg1 have exceptional forms, they are listed as such.

The plural reflexives do have deviating Px forms, they are listed in the closed-smj-lex.txt file itself, first by establishing dual and plural substems for each person (in the lexicon PxCiPlstem), and then by giving the personal suffixes (in the lexicon PxCPlstem.).

Reciprocal pronouns

Various cases of the word gasska with dual and plural Px-forms.

Interrogative pronouns

These are makkár/makkir, mij, guhtimusj, galla/galli, galles, goabbá and guhti. Makkár/makkir points to dakkárcase (demonstrative). Some case-forms are idiosyncraticand are given separately.The rest are generated in sublexa.

Indefinite pronouns

These are a whole bunch. Typically the nominative sg. and one more case (illative sg. or nominative pl. for example) are given separately, while the other cases are generated in specific lexica. Indefinites that don´t inflect are for example vehi and juohkka.

Subjunctions and Conjunctions

10-20 subjunctions and conjunctions are listed, and given appropriate tags.

Numerals

The numeral lexica are formed as a generator, generating all possible numerals. The basic lexicon is Numeral, and it looks like this:

LEXICON Numeral MILJON ; ! a noun of its own UNDERDUHAT ; ! for generator under 1000 JUSTDUHAT ; ! going via 1000 OVERDUHAT ; ! for generator over 1000 OLD ; ! for “thirteen hundred, etc. !num-ordinal ; ! The basic ordinal numbers !Integrated in card. !num-derived ; ! still unimplemented num-imprecise ; ARABIC ; ! for the arabic numerals !^C^ ROMAN ; ! for the roman numerals !^C^ NUM-PREFIXES ; ! for §34 etc. !^C^ ISOLATED-NUMEXP ; ! for ½ etc. !^C^

MILJON is a noun. OLD is the old way of counting. num-ordinal act like adjectives, they are not finished yet. ARABIC and ROMAN contain number generators.

So, what is the reason for the three different lexica around 1000?

The reason is that the numeric system turns at the thousand mark. Numbers above it and numbers below it behave in the same way, thus we have both twentyfour and twentyfourthousand, etc.

The path is OVERDUHAT -> JUSTDUHAT -> UNDERDUHAT. OVERDUHAT generates the part of the numeral that is over 1000, and all these lexica then point to JUSTDUHAT. That lexicon has an optional “(one) thousand” before it leads either to DUHAT and via the relevant case paradigm to K, or to UNDERDUHAT. UNDERDUHAT contains the numerals 1-999. UNDERDUHAT starts with the lexicon for one, and gives each group of numerals its own lexicon.

Particles and Interjections

140-150 particles and interjections are listed, and given tags.

Adverbs

The file adv-smj-lex.txt contains lexical adverbs. Still missing is defect case paradigms, and adverbs derived from adjectives.

Adpositions

The file differs between pre- and postpositions.