Finite state and Constraint Grammar based analysers, proofing tools and other resources
This file describes the Lule Saami lexicon files. The documentation is not complete. Status quo for the lexicon files is as follows (030310):
The file structure for the lexicon files is the same as for the other Saami lgs, cf. the North Sámi flowchart.
sme smj
ACCRA ACCRA
MARJA MARJA
NYSTØ NYSTØ
BERN BERN
LONDON LONDON
C-FI-NEN LONDON
ANAR ANAR
ALEUHTAT ALEUHTAT
SUND BERN
HEIM BERN
HAWAII ACCRA
The proper nouns are extracted out of the common file proper-noun.xml. This file takes sme as its starting point. The continuation lexicon correspondence is as follows
Type
StemCoda CG IllChange Loc Lexicon Status
-----------------------------------------------------
LightVow no yes -s ACCRA
LightVow yes yes -s MARJA
HeavyVow no no -as NYSTØ tmp
HeavyCns no no -as BERN
LightCns no no -is LONDON
DUORTNUS ! Trisyll. Cns-final, cons.grad.
ANAR ! Trisyll. Cns-final, no cons.grad.
SARAK ! Trisyll. Cns-final, no cons.grad
BASUDIS ! Bisyll. Non-Gradating C-Proper Names
GEAVNNIS ! Contracted. Gradating C-Proper Names
DAVVISUOLU ! Contracted. Gradating C-Proper Names
Plural names
VARGGAT
ALEUHTAT
SULLOT
EATNAMAT
The nouns are divided in the following sublexica (the noun-smj-lex.txt file is organized in chapters according to these sublexica).
Spiik LEXICON Type
--------------------------------------------------------------
even MUORRA 2syll stem with cg (note Q1)
even GÁDDE 2syll stem with cg (note Q1) with comparatives
even ÅLGGO Like MUORRA but no inessive/elative/illative sg, with comparatives
even MIEHTE Like MUORRA but no locative/elative/illative sg
even NOADE 2syll stem without cg (note missing Q1)
even KINO 2syll stem without any morphophonology
even VUOSA 2syll stem with cg only pluralforms
even AHKA words like tjerastahka with short compound-form
even OADÁDAGÁ pl. words like tjerastahka with short compound-form
even VUOHTA
even VSBST-EVEN
even ACTOR
even ACTORdiehtet
even 4sl GÅNÅGIS 4syll stem with cg
even 4sl CELSIUS 4syll stem without cg
even 4sl BERRAHATTJA like GÅNÅGIS but pl.
even 4sl JIHPELIJ gen:jihpelahá
even 4sl OARJJILIJ gen:oarjjilihá
even 4sl VIESSOMUJ gen:viessumuhe
even 4sl SIJDDALAHÁ plural
even 4sl SISSNELUHÁ plurals
odd ÅRES oddsyll, cg, C-final, 2ndsyll vowchange
odd GÁMAS oddsyll, cg, C-final, no 2ndsyll vowchange
odd SÅHKÅR oddsyll, cg, C-final, no 2ndsyll vowchange and only long essive
odd GAHPER oddsyll, no cg, no vowchange
odd BENA oddsyll, cg, vowchange, V-final nominative
odd DÁRBBAGA like BENA BUT ONLY PLURAL
odd VÁLLDÁRA oddsyllable pluralforms only
odd SNJIERÁGA oddsyllable pluralforms only
odd BÆLLJASA GÁMAS but plural
odd MANEBU oddsyllable pluralforms only
odd BAVSEV gen:baksima
odd GÅRDAL gen:goarddala
odd RÁBEV gen:ráhpuga
odd SEKKAS gen:sæggása
odd DERDAJ gen:dærddaga
odd BØSOJ gen:bøssuga
odd GUOVSOJVUOJOJ gen:vuodjom
odd RITJAS Like GÁMAS but without stem a-lengthening for grade I (underlying long -i-)
odd TJÅLKES gen:tjoalkkasa
odd BÁRNEP gen:bárnebu
odd SÅGAS gen:sågaska
odd SÁGE gen:sáhkaha
odd BUTJES gen:buttjása
odd VSBST-ODD
cntr SARVES cntr C-final with cg
cntr SUOLOJ
cntr GUOMOJ III-I cg
cntr GÅHKES gen:gåhkkå
cntr SJUOKKAJ gen:sjuoggá
cntr GISTÁ
loan word lexica:
LEXICON KULTUR !Recent loanwords on -vrra with long and short compound-form
LEXICON FAKTOR !Recent loanwords on -åvrrå with long and short compound-form
LEXICON INSPEKTØR !Recent loanwords on -ørra with long and short compound-form
LEXICON TELEFON !Recent loanwords on -åvnnå with long and short compound-form
LEXICON INSTITUTION !Recent loanwords on -sjåvnnå with long and short compound-form: -TION IN SWEDISH
LEXICON MISSION !Recent loanwords on -sjåvnnå with long and short compound-form: -SSION IN SWEDISH
LEXICON PENSION !Recent loanwords on -sjåvnnå with long and short compound-form: -SION IN SWEDISH
LEXICON BENSIN !Recent loanwords on -ijnna with long and short compound-form
LEXICON MASKIN !Recent loanwords on -sjijnna with long and short compound-form: -SKIN
LEXICON ADJEKTIV !Recent loanwords on -ijvva with long and short compound-form
LEXICON ANTIKK !Recent loanwords on -hkka with long and short compound-form on -kk/-k
LEXICON APOTEK !Recent loanwords on -hkka with long and short compound-form on -k
LEXICON ADVOKAT !Recent loanwords on -áhtta with long and short compound-form
LEXICON BUDSJETT !Recent loanwords on -æhtta with long and short compound-form
LEXICON INTERNET !Recent loanwords on -æhtta with long and short compound-form: -ET IN SWEDISH
LEXICON IDENTITET !Recent loanwords on -ehtta, -uhtta, -ihtta with long and short compound-form on -t
LEXICON INSTITUTT !Recent loanwords on -ehtta, -uhtta, -ihtta with long and short compound-form on -tt(NOR)/-t(SW)
LEXICON PARTISIPP !Recent loanwords on -hppa with long and short compound-form on -pp
LEXICON ALLEGORI !Recent loanwords on -ija with long and short compound-form
LEXICON PATENT !Recent loanwords on -ænnta with long and short compound-form
LEXICON VARIANT !Recent loanwords on -ánnta with long and short compound-form
LEXICON PARADIS !Recent loanwords on -jssa with long and short compound-form
LEXICON KOLLEKT !Recent loanwords on -ækta with long and short compound-form
LEXICON TEKSTIL !Recent loanwords on -jlla with long and short compound-form
LEXICON SYSTEM !Recent loanwords on -ebma with long and short compound-form
LEXICON ADVERB !Recent loanwords on -ærbba with long and short compound-form
LEXICON AGRONOM !Recent loanwords on -åvmma with long and short compound-form
LEXICON TABELL !Recent loanwords on -äl0la with long and short compound-form
LEXICON AREAL !Recent loanwords on -álla with long and short compound-form
LEXICON TELEGRAM !Recent loanwords on -ám'ma with long and short compound-form
LEXICON ORGAN !Recent loanwords on -ádna with long and short compound-form
LEXICON SEMINAR !Recent loanwords on -árra with long and short compound-form
LEXICON SBSTANS !Recent loanwords on -ánssa with long and short compound-form
LEXICON VALENS !Recent loanwords on -ænssa with long and short compound-form
LEXICON TURIST !Recent loanwords on -ssta with long and short compound-form
LEXICON FANATISM !Recent loanwords on -ssma with long and short compound-form
LEXICON DEMAGOG !Recent loanwords on -ga with long and short compound-form
NB! Note that grade III nouns of the biel’lo type must be added as such in the lexicon, in order not to get wrong CG results. Also, non-gradating nouns must go to different sublexica, and these lexica must point to CG-free case sublexica.
The adjective lexica follow Spiik’s numbering system, in the following way: /
Spiik LEXICON Type
--------------------------------------------------------------------
even 1a GIEVRRA Attr WeG, -s
even 1b NUORRA Attr =
even 1c HÁVSSKE Attr w/o WeG, -s
even 1d UNNE Attr e > a~e, WeG
even 1e SÁVADAHTTE Causative-participles
even 1e JUHKKE participles
even 2 ÁRMMOGIS Attr = Pred. Nom. Sg.
even 2 MÁHTULASJ as ÁRMMOGIS but no attr
even 2 ÅLLAGASJ as ÁRMMOGIS but no comparatives.
misc. ASIDASJ
misc. DÁRBBOLASJ no attr
misc. TJALMEDIBME
even SUOLASIEHKE as GIEVRRA but no attribute
even NUORTTALABBO inherent comparatives
even GASSKALAMOS inherent superlatives
even TJAVGGÁMUS !Inherent superlatives
even DÁBBELIJ gen:dábbelahá
odd a. TJIEGOS Adj. on -k, negating on -dahkes, sm adjs on -os
odd b. BASSTEL Adj. on -et, -l, -r
odd c. MUTTÁK Adj. on -ák, {aáå}s, Attr -is or ==
odd d. ALLAK Adj. on -ak, Attr on -a
odd d2. GÅBDDÅK Adjs on -åk, attr. on -å, short comparatives
odd e. GALMAS Adj. on WeG -as, Attr on StrG -a
odd f. SUOHKAT Adj. on -at
odd f2.
odd f3. LÅSSÅT Two sets of comparatives and two attr. forms here
odd g. DIMES Adj. on -es Attr dibma
odd g. BITJES bitjes:bihtjas-
odd g2. OAMES Adj. on -es Attr =
odd h. TJALMMIS Adj. on -is with Attr -is and =
odd STUORAK Adjective stuorak, stuoragat, comp:stuoráp
odd SJÆVNNJAT miscellanious uneven No attribute, stemvowel change
odd RIHTSOK miscellanious uneven No attribute, no stemvowel change
odd OANEP Inherent comparatives
odd TSIBTSA attr =
odd IENNILS no comparatives
odd GÅNTSAS
odd DABÁR like BASSTEL but with CG
odd SNÁVES snáves:snávvas-
contr. SÁDNES attr =
contr. GOAVSOS attr =
contr. SUVRES attr suvra
---------------------------------------------------------------------
Auxiliary verbs and main verbs are treated separately. Note the concatenated entries for auxiliary + negation.
The main verbs are divided in even (-at, -åt, -ot, -et), odd (-it) and contracted (-át, -åt~ut, -it) stems.
The auxiliaries have the lexica NEG, ÅRROT, LIEHKET. The verb stems are
declared initially in the verb-smj.txt
file, and the continuation
lexica themselves are declared towards the end of the smj-lex.txt
file. Since the auxiliaries are irregular, the suffixes are just listed.
lexicon for impersonal verbs
lexicon for verbs with personal passives, Transitives
lexicon for verbs without personal passives, Intransitives
lexicon without Personal Passive but with Acc obj
lexicon for inherent passives
Modals:
GALGGAT VIERTTIT
Even-syllable:
Impersonal verbs: BIEGGAT !Impersonals
GALSSJOT !Impersonal o-verbs
BIEKKASTALLAT !Allready derived Impersonals
Intransitives:
RAVGGAT !a- and å-verbs
BUOLLET !e-verbs
VIEDJET !e-verbs GRADE II-I WITH IE DIPHT.
BÅRSSJOT !o-verbs
VILSSJOT !o-verbs with dim -astit that are hardcoded
RAVGGALASSTET !Like RAVGGAT for already derived words (except words
ending -uššat)
Transitives:
BASSAT !a- and å-verbs
JUHKAT !a-verbs with dim -istit that are hardcoded
LÁHPPET !e-verbs
JIEHKET !e-verbs GRADE II-I WITH IE DIPHT.
BASSALASSTET !Like BASSAT for already derived words (except words ending
-uššat)
DIEHTET !Only this one word, unusual diphtong behavior
GÁDJOT !o-verbs
JÅRGGOT !o-verbs with dim -astit that are hardcoded
*Personal Passive but with Acc obj
MÁHTTET* !transitive verbs without personal passive
Passives:
GUOTTEDALLAT !passives
Uneven-syllable:
Impersonal verbs:
BIEKKASTIT !Impersonals
Transitives:
UNNEDIT ! Transitives
MUJTATJIT !Words ending -tjit, -jdit, reciprocals on -dit, momentatives
on -dit, -edit, continuatives on -ldit, -nit, essives on -hit and
5-syllables
BÅNJÅDIT !continuatives on -dit, frequentatives on -odit, reciprocals,
momentatives and frequentatives ending -alit
VUORDDELIT !Trisyllabic Verbs ending -lit Intransitives:
JÅLLÅRIT !At the moment IV, we may perhaps change IV/TV.
BEGATJIT !Words ending -tjit, -jdit, reciprocals on -dit, momentatives
on -dit, -edit, continuatives on -ldit, -nit, essives on -hit and
5-syllables
BALÁDIT !continuatives on -dit, frequentatives on -odit, reciprocals,
momentatives and frequentatives ending -alit
SUOGNALIT !Trisyllabic Verbs ending -lit
LASSÁNIT !verbs ending -nit, -sit
BÁHTARIT !verbs ending -rit
Contracted stems:
Impersonal verbs: SJIERRIT !Impersonals
Intransitives: OADDÁT ! Do not make nouns via -ár derivation DULLUT ! Do not make nouns via -ár derivation VALLIT ! Make nouns via -ár derivation
Transitives: TJUOLLÁT ! Do not make nouns via -ár derivation STRÁFFUT ! Do not make nouns via -ár derivation TSIEGGIT ! Make nouns via -ár derivation Passives: BASSUT !Passives
The disposition is as follows:
Each main stem type has its own section, for the lexica with pointers
from the verb file. These lecixa first split into a non-derivational and
a derivational part, with BASSATStem
and DeverbalVerbsBASSAT
(etc.,
for the other lexica). The latter point to the derivational lexica, cf.
next xection. The former is a set of lexica with pointers towards
passive (EVENPASSIVE
) and non-passive (BASSATINCH
) continuation
lexica.
The BASSATINCH
etc. lexica point twice to INFL-EVEN
etc., both
directly and via the inchoative -goahte-
suffix.
The INFL-EVEN, INFL-o-, INFL-ODD, INFL-CONTR
lexica then point to the
different person-number sublexica. Note that there in Lule Saami are two
person-number suffix series. One is used in even-syllabic verbs for
present tense and in odd-syllabic verbs for past tense, this is named
with roman numeral I in the sublexica to follow. The other one is used
in even-syllabic verbs for past tense and potential, lexica with this
suffix set are named with roman numerals II. The reason for this
abstraction is that we do not need to write the same suffixes several
times, rather, we may refer to the same suffix set from different
stem-tense-mode combinations.
The pointers to the person-number lexica are done on blocks of nine and nine.
Here, the person-number blocks are split in different groups, according to morphophonological processes.
At last, we are in a position to assign person-number suffixes.
The passive lexica add the passive suffixes, and thereafter point further on to the inflection lexica.
The derivational system takes the North Saami system as a starting point, but does not (yet) take transitivity into account.
From the stems of each stemtype, there are pointers to lexica
DeverbalVerbsBASSAT,DeverbalVerbsJUHKAT, DeverbalVerbsLÁHPPET, DeverbalNounsBASSAT,DeverbalNounsBASSALASSTET, DeverbalVerbsBIEGGAT, DeverbalVerbsDIEHTET, DeverbalNounsDIEHTET, DeverbalVerbsJÅLLÅRIT, DeverbalNounsJÅLLÅRIT, DeverbalNounsMUJTATJIT, DeverbalVerbsVUORDDELIT, DeverbalVerbsLASSÁNIT, DeverbalVerbsBIEKKASTIT, DeverbalNounsOADDÁT, DeverbalVerbsOADDÁT, DeverbalNounsSTRÁFFUT, DeverbalVerbsSJIERRIT
.
All these lexica are found in a separate section “Verbal derivation” following the section on verbal inflection.
Lule Saami personal pronouns are added to the file closed-smj.lex.txt. They differ from the corresponding North Sami ones in having uniform sublexica across the different persons. The initial lexicon Personal splits into nine sublexica, 3 for each person, each pointing to one sublexicon for each number. The 3 number sublexica (persproncasesg, etc. then give the case forms.
This lexicon contains both declineable and indeclineable entries. The declineable ones point to five sublexica, dakkárcase, demcas, juohkkahasjcase, duobbacase and dátcas. Of these, the former splits into number sublexica.
In the Reflexive sublexicon, the reflexive pronoun iesj gets spelled-out forms for the Px-less nominative, whereas the oblique forms are directed to sublexica for sg and pl. There they get appropriate stems, and the singular forms are redirected to the ordinary nominal Px lexica. Acc PxSg1 and Gen PxSg1 have exceptional forms, they are listed as such.
The plural reflexives do have deviating Px forms, they are listed in the closed-smj-lex.txt file itself, first by establishing dual and plural substems for each person (in the lexicon PxCiPlstem), and then by giving the personal suffixes (in the lexicon PxCPlstem.).
Various cases of the word gasska with dual and plural Px-forms.
These are makkár/makkir, mij, guhtimusj, galla/galli, galles, goabbá and guhti. Makkár/makkir points to dakkárcase (demonstrative). Some case-forms are idiosyncraticand are given separately.The rest are generated in sublexa.
These are a whole bunch. Typically the nominative sg. and one more case (illative sg. or nominative pl. for example) are given separately, while the other cases are generated in specific lexica. Indefinites that don´t inflect are for example vehi and juohkka.
10-20 subjunctions and conjunctions are listed, and given appropriate tags.
The numeral lexica are formed as a generator, generating all possible
numerals. The basic lexicon is Numeral
, and it looks like this:
LEXICON Numeral MILJON ; ! a noun of its own UNDERDUHAT ; ! for generator under 1000 JUSTDUHAT ; ! going via 1000 OVERDUHAT ; ! for generator over 1000 OLD ; ! for “thirteen hundred, etc. !num-ordinal ; ! The basic ordinal numbers !Integrated in card. !num-derived ; ! still unimplemented num-imprecise ; ARABIC ; ! for the arabic numerals !^C^ ROMAN ; ! for the roman numerals !^C^ NUM-PREFIXES ; ! for §34 etc. !^C^ ISOLATED-NUMEXP ; ! for ½ etc. !^C^
MILJON is a noun. OLD is the old way of counting. num-ordinal act like adjectives, they are not finished yet. ARABIC and ROMAN contain number generators.
So, what is the reason for the three different lexica around 1000?
The reason is that the numeric system turns at the thousand mark. Numbers above it and numbers below it behave in the same way, thus we have both twentyfour and twentyfourthousand, etc.
The path is OVERDUHAT -> JUSTDUHAT -> UNDERDUHAT. OVERDUHAT generates the part of the numeral that is over 1000, and all these lexica then point to JUSTDUHAT. That lexicon has an optional “(one) thousand” before it leads either to DUHAT and via the relevant case paradigm to K, or to UNDERDUHAT. UNDERDUHAT contains the numerals 1-999. UNDERDUHAT starts with the lexicon for one, and gives each group of numerals its own lexicon.
140-150 particles and interjections are listed, and given tags.
The file adv-smj-lex.txt contains lexical adverbs. Still missing is defect case paradigms, and adverbs derived from adjectives.
The file differs between pre- and postpositions.