Lule Sami language model documentation

Semantic tags

Go for minimal weight (requires --with-backend-format=openfst-tropical)

Removing Err/Orth

Continuation lexicons for abbreviations

Lexica for adding tags and periods

The sublexica

Continuation lexicons for abbreviations both with and without final period

Lexicons without final period

Lexicons with final period

Sublexica for Adjective

Even-syllable stems

LEXICON GIEVRRA Adjectives with attribute in WeG and -s. As 1a in Spiik. Sg Acc: gievrav, Attr: gievras.

LEXICON NUORRA Adjectives with attribute same as pred. As 1b in Spiik. Sg Acc: nuorav, Attr: nuorra.

LEXICON GALLJE Adjectives on -e, the attribute is in WeG and e > a. As 1d in Spiik. Sg Acc: galjev, Attr: galja.

LEXICON TJÁBBE Adjectives on -e, the attribute is in WeG and e > a. Same as GALLJE only different adv derivation. Sg Acc: tjáppev, Attr: tjáppa.

LEXICON VILLDA Adjectives with attribute same as pred, without CG. As 1b in Spiik. Sg Acc: nuorav, Attr: nuorra.

LEXICON HÁVSSKE Adjectives with attribute -s, without WeG. As 1c in Spiik. Sg Acc: hávsskev, Attr: hávsskes.

LEXICON TJUODDJE Adjectives with attribute -is, without WeG. Presently only “tjuoddje”. Sg Acc: tjuoddjev, Attr: tjuoddjis.


LEXICON SÁVADAHTTE Causative participles. No attribute. No comparison. As 1e in Spiik. Sg Acc: sávadahttev. PrsPrc of causative verbs “expresses that the action can be done or is worth doing” (Kintel 1991).

LEXICON JUHKKE Participles with the -s attributive. No comparison. As 1e in Spiik. Sg Acc: juhkkev, Attr: juhkkes. Spiik: with the attributive form in -s, the present participle means “someone who is good at, quick to, or inclined to perform the action”.

LEXICON BÅRRE Participles without the -s attributive. As 1e in Spiik. Sg Acc: bårrev, Attr: bårre. Spiik: with the attributive form without -s, the present participle means “those who perform the action”.

Test data:
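A minimal sketch of how the Sg Acc and Attr forms quoted in the lexicon descriptions above could be collected into test pairs. The surface forms are copied from this section. The Python names (`Forms`, `EXPECTED`, `test_pairs`), the lemma taken as the lower-cased lexicon name, and the analysis strings `+A+Sg+Acc` and `+A+Attr` are illustrative assumptions, not the format of the actual test files.

```python
from typing import Iterator, NamedTuple, Optional, Tuple


class Forms(NamedTuple):
    sg_acc: str
    attr: Optional[str]  # None when the lexicon gives no attribute


# Forms as quoted in the lexicon descriptions; keys are lexicon names.
EXPECTED = {
    "GIEVRRA": Forms("gievrav", "gievras"),
    "NUORRA": Forms("nuorav", "nuorra"),
    "GALLJE": Forms("galjev", "galja"),
    "TJÁBBE": Forms("tjáppev", "tjáppa"),
    "HÁVSSKE": Forms("hávsskev", "hávsskes"),
    "TJUODDJE": Forms("tjuoddjev", "tjuoddjis"),
    "SÁVADAHTTE": Forms("sávadahttev", None),
    "JUHKKE": Forms("juhkkev", "juhkkes"),
    "BÅRRE": Forms("bårrev", "bårre"),
}


def test_pairs() -> Iterator[Tuple[str, str]]:
    """Yield (analysis, surface form) pairs.

    Assumption: the lemma is the lower-cased lexicon name, and the tag
    strings below are placeholders for whatever the analyser really uses.
    """
    for lexicon, forms in EXPECTED.items():
        lemma = lexicon.lower()
        yield f"{lemma}+A+Sg+Acc", forms.sg_acc
        if forms.attr is not None:
            yield f"{lemma}+A+Attr", forms.attr
```

Lexicons without an attribute, such as SÁVADAHTTE, simply produce no attribute pair, which mirrors the "No attribute" note in their description.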

Loanword lexica

Correctly assimilated loanwords, derived from a real noun.

LEXICON METÅVDÅLASJ LOAN! Foreign -isk adjectives adapted in the updated normative way, with the smj ending -alasj; the adjective is truly derived from a noun. Mekanisk-mekanihkka-mekanihkalasj, instead of mekánalasj, which goes to MEKÁNALASJ_BADASS. Pred and attr are both -alasj. Attr same as pred. With comparatives.


LEXICON MEKANIHKA_MEKANIJKA_LASJ LOAN! Same type of adjectives as METÅVDÅLASJ, only for adjectives that become mekanihkalasj in Norway and mekanijkalasj in Sweden, because of the differences mekanik vs mekanikk > mekanijkka vs mekanihkka. Attr same as pred. With comparatives.

LEXICON IJJALASJ Just like METÅVDÅLASJ, only for words ending in ijjalasj/iddjalasj, so that we don’t need a lot of Area and Err tags in the stems file.


LEXICON OGIJJALASJ Just like IJJALASJ, only for words ending in ogijjalasj/ogiddjalasj, so that we don’t need a lot of Err tags in the stems files. For words like “pedagogijjalasj”, which also have the variants “pedagåvgålasj” (not really a wrong derivation, but it doesn’t mean pedagogisk) and “pedagogalasj”, Err-tagged.


LEXICON SJÅNÅLASJ_SJONAL -sjonal/sjonell and -tional/tionel loanwords. Only for words that work as nouns, so that they are REAL derivations, as nasjonal-nasjåvnnå-nasjåvnålasj. NOT for words like “rasjonell”, with no real noun. Words like “rasjonell>rasjonálla-rasjonálalasj” go to lexicon ÁLLA. The fake derivation “nasjonálalasj” is Err-tagged, and so is the strange “nasjonálla/nasjunálla”.


LEXICON SJÅNÅLASJ_SJONELL -sjonal/sjonell and -tional/tionel loanwords. Only for words that work as nouns, so that they are REAL derivations, as nasjonal-nasjåvnnå-nasjåvnålasj. NOT for words like “rasjonell”, with no real noun. Words like “rasjonell>rasjonálla-rasjonálalasj” go to lexicon ÁLLA. The fake derivation “nasjonálalasj” is Err-tagged, and so is the strange “nasjonálla/nasjunálla”.


Badly assimilated loanwords, some against the norm, others with no norm

LEXICON MEKÁNALASJ_BADASS LOAN! Wrongly assimilated -lasj adjectives from SE/NO -isk. They look derived but are not, since there is no real noun to derive them from. Like mekanisk-mekánalasj, but “mekádna” is no real noun! Like METÅVDÅLASJ, but gives the Err/Der tag, so it is only for these wrongly derived or non-derived loan adjectives.

LEXICON ARKTALASJ_CMP_INFL Foreign -isk adjectives that are not real derivations. Same as MEKÁNALASJ_BADASS, but no +Use/-Spell tag since there is no “right” way to assimilate these. This is a question for GG. Adapted to smj by simply adding -alasj in place of -isk. These are not real derivations, but citation-borrowed loan adjectives. Only words without a noun base, like arktisk and syntetisk. Pred and attr are both -lasj. No comparatives.
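The "-alasj in place of -isk" adaptation described for ARKTALASJ_CMP_INFL can be sketched as a plain suffix swap. This is only an illustration: `adapt_isk_adjective` is a hypothetical helper, not part of the lexc sources, and it ignores the vowel changes seen in forms such as mekánalasj.

```python
def adapt_isk_adjective(source_adj: str) -> str:
    """Replace a final -isk with -alasj (hypothetical helper).

    Assumption: the adaptation is a pure suffix swap; vowel changes in
    the stem (as in mekanisk > mekánalasj) are deliberately not modelled.
    """
    suffix = "isk"
    if not source_adj.endswith(suffix):
        raise ValueError(f"{source_adj!r} does not end in -{suffix}")
    return source_adj[: -len(suffix)] + "alasj"


# "arktisk" gives the form the lexicon is named after.
print(adapt_isk_adjective("arktisk"))
```

Adjectives that do not end in -isk are rejected rather than guessed at, since the section assigns them to other lexicons.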

LEXICON ORÁNSSJA Loan adjectives, not -isk. Used without the -lasj. Adjectives with attribute same as pred. So far only for oránssja.

LEXICON DEMONSTRATIJVA_LASJ_NO_NORM Loan adjectives from Norwegian/Swedish (not adjectives ending in -isk). Words like demonstrativ, transitiv, dupleks, informativ, analog, privat. Gives both “demonstratijvva” and “demonstratijvalasj”. Two ways of adapting these adjectives are in use; adding -lasj is not acceptable because it is a false derivation, but GG has not decided how these should be handled. The word looks like a noun instead of an adjective when adapted without the -lasj ending. Attr is in weak grade; the strong grade is used as pred, even though this seems a little odd: “Værbba l transitijvva”.


LEXICON ÁLA_LASJ_NO_NORM Same as DEMONSTRATIJVA_LASJ_NO_NORM, only for adjectives ending in -al. Words like digital, liberal, lokal. Gives both “eksponentiálla” and “eksponentiálalasj”. Different lexicon for these -al adjectives because of Err/Orth tags. NB: “dialektal” is assimilated as “dialevtalasj” and goes to lexicon METÅVDÅLASJ.


LEXICON ELLA_LASJ_NO_NORM Loanwords, same as ÁLA_LASJ_NO_NORM and DEMONSTRATIJVA_LASJ_NO_NORM. For NO and SE adjectives ending in -ell: eksperimentell, ideell, parallell. The short form is nom parallælla, attr parallella. The long form is paralellalasj, attr parallellalasj. Different lexicon for these -ell adjectives because of Err/Orth tags. NB: “individuell” is assimilated as “indivijdalasj” and goes to lexicon METÅVDÅLASJ.




Inherent comparatives and superlatives lexica

LEXICON OANEP Inherent comparatives, gives comp and superl. Adjectives that are lexicalized in their comparative (and superlative) forms, like sisŋep, bárep. Some entries are likely incorrect compared forms of other adjectives, like ådåp and ruvvap (more research needed).

LEXICON TJAVGGÁMUS Inherent superlatives, only gives superl. Some words are lexicalized in their superlative forms, like dájvvámus. Some are likely incorrect superlative forms, like tjábbámus (more research is needed).

4-syllable miscellaneous stems

LEXICON ÁRMMOGIS Adjectives on -is, attribute same as pred. Odd-syllable comparison. As 2 in Spiik. Sg Acc: ármmogisáv, Attr: ármmogis.

LEXICON SÆHKÁLAK Adjectives on -álak, attribute same as pred. Odd-syllable comparison. So far only for “sæhkálak”.

LEXICON ÅLLAGSJ_CMP_INFL Adjectives on -asj, attribute same as pred. No comparatives. 2 in Spiik. Sg Acc: ållagattjav, Attr: ållagasj.

LEXICON DÁRBULASJ_CMP_INFL Adjectives on -asj, attribute same as pred. Odd-syllable comparison. Sg Acc: dárbulattjav, Attr: dárbulasj. Essive -attjan; -adtjan is sub-tagged. Err/Orth also -ahttja.

LEXICON ASIDASJ_CMP_INFL Adjectives on -asj, -is attr. Odd-syllable comparison. Sg Acc: asidattjav, Attr: asidis.

LEXICON UDNODIBME Adjectives on -dibme, attribute on -is. Odd-syllable comparison. Sg Acc: udnodimev, Attr: udnodis.

LEXICON TJALMEDIBME Like UDNODIBME but no comparatives. Sg Acc: tjalmedimev, Attr: tjalmedis.

LEXICON SUOLASIEHKE -siehke. Sg Acc: suolasiegev, attr: suolasiek

Odd-syllable stems

With CG, sorted by attr

LEXICON TJIEGOS Attr same as pred. For adjectives with -e in second syllable e>á: divtes>diktásav in StrG. As a. in Spiik. Sg Acc: tjiehkusav, Attr: tjiegos. Consonant gradation.

LEXICON LINES Attr ending in -a. Adjectives ending in -es. Does the same as TJIEGOS, but with attr -a. As g. in Spiik. Sg Acc: lidnásav, Attr: lidna. Consonant gradation.

LEXICON GALMAS Attr ending in -a or -å. Adjectives on -as, -ås and -ás. As e. in Spiik. Sg Acc: galmmasav, Attr: galmma. Consonant gradation.

LEXICON OAMES Attr ending in -e. Adjectives on -es with attribute -e. As g2. in Spiik. Sg Acc: oabmásav, Attr: oabme. Consonant gradation.

LEXICON SUOHKAT Attr III -is, not suohkkadis but suohkkis. With CG to attr, but without CG between nom and acc. Same as JALGGAT, only with this CG. Adjectives on -at and -åt, with attribute III -is. As f. in Spiik. Sg Acc: suohkadav, Attr: suohkkis.

LEXICON MÅJDÅS Adjectives with no attr. With CG. Sg Acc: måjddåsav. If there is an attribute that doesn’t fit any lexicon, it must be hardcoded.

Without CG

LEXICON VIEKSES Attr same as pred. Without CG, but with vowel changes. Sg Acc: væksásav, Attr: viekses. Like TJIEGOS, only without the CG but with vowel changes. Maybe change this to a lexicon without attr and then hardcode attr?

LEXICON ALEK Attr same as pred. Without CG, without any vowel changes. Like TJIEGOS, only without the CG and vowel changes.

LEXICON BASSTEL Attr ending in -is. Without CG. Adjs on -et, -l, -r, sm om -k, -sj with attr -is and no consonant gradation. As b. in Spiik. Sg Acc: basstelav, Attr: basstelis. Many of these entries might be instances of derivations, like belak, deblak, and maybe also basstel, bargán.

LEXICON MUTTÁK Two attr endings: -is and same as pred. Without CG. Adjs on -ák/-ak/-ek, two attr: -is and same as pred. As c. in Spiik. Sg Acc: muttágav, Attr: muttágis and mutták. These seem to be instances of the adjectival -k derivation. It is unclear whether such derivations have different attr forms or not, and that is maybe why some of these derivations are found in the BASSTEL lexicon.

LEXICON JALGGAT Attr III -is, not jalggadis but jalggis. Without CG. Adjectives on -at, with attribute III -is. As f. in Spiik. Sg Acc: jalggadav, Attr: jalggis.

LEXICON TJÅRGGÅT Attr III -is, not tjårggådis but tjårggis. Without CG. Same as JALGGAT, only for adjectives ending in -åt. Adjectives on -åt, with attribute III -is. As f. in Spiik. Sg Acc: jalggadav, Attr: jalggis.

LEXICON RIHTSOK No attr, without CG and also without any vowel changes. The lexicon gives no attribute, either because the adjective doesn’t have attr, because there is a stem vowel change in attr that the lexicon can’t handle, or because there are strange attributes that don’t fit any other lexicon (these attributes are hardcoded). Sg Acc: rihtsogav.

Exception lexicons for odd-syll

LEXICON IENNILS No comparatives, attr same as pred.

LEXICON RÁDAS Presently only used for “rádas”. This word has special consonant gradation d>dd. Attr same as pred. Sg Acc: ráddasav, Attr: rádas. Consonant gradation.

LEXICON LUOBES Err/Orth lexicon! Does the same as TJIEGOS, only e>a instead of the usual e>á; must be some Err/Orth. Sg Acc: luohpasav, Attr: luobes. Consonant gradation. NO Attr, must be hardcoded.

LEXICON LÅSSÅT Two attr, two comp. As f3. in Spiik. So far the only word in this lexicon is “låssåt”, because both låssis and låsså are attr, and the comparative is both låsep (hybrid?) and låssådabbo.

LEXICON STUORAK Only for stuorak. It has two attributes. Has even-syllable comparison: stuoráp and stuorámus. Sg Acc: stuoragav, Attr: stuor and stuorra. This might be a -k derivation of the adjective stuorre, attr stuor(ra). The comparison is thus based on the original adjective, and so it is naturally an even-syllable comparison.

LEXICON ALLAK Adjs on -ak, attr on -a. Have both gasep/gaggagabbo and alep/allagabbo as comparatives. As d. in Spiik. So far only the adjectives “allak” and “gassak” go to this lexicon.

LEXICON GÅBDDÅK Adjs on -åk, attr. on -å. Has even-syllable comparison: gåbdep and gåbdemus. So far “gåbddåk” is the only word in this lexicon. As d2. in Spiik. Sg Acc: gåbddågav, Attr: gåbddå.

Inherent comparatives and superlatives

LEXICON NUORTTALABBO Inherent comparatives, gives both comp and superl. Most of the words are the compared forms of -el(a) words, like nuorttal, lullel.

LEXICON GASSKALAMOS Inherent superlatives, gives only superl. Words that are lexicalized in their superlative forms.

Contracted stems

LEXICON SÁDNES Attr same as pred. Sg Acc: sáddnáv, Attr: sádnes.

LEXICON GOAVSOS Attr same as pred. Sg Acc: goaksuv, Attr: goavsos. (goavsos is so far the only word in this lexicon)

LEXICON SUVRES Sg Acc: suvrráv, Attr: suvra.

LEXICON GÅLMAKTES Attr same as pred. Without cg but with vowel changes. Sg Acc: gålmaktáv, Attr: gålmaktes. VIEKSES does the same thing for odd-syllable stems.


LEXICON BU/MUS Comparison for even-syll adjectives. Also derives diminutives and adverbs from the comparisons.

LEXICON ABBO/AMOS Comparison for odd-syll adjectives. Also derives diminutives and adverbs from the comparisons.

LEXICON BUStem Comparative even-syll, case and attr.

LEXICON ABBO Comparative odd-syll, get case and attr. With the dialect differences “-ubbo” and “-æbbo”.


LEXICON BUOREMUS Superlative even-syll, get attr and nom case.

LEXICON AMOS Superlative odd-syll, get case and attr. With the dialect differences “-umos” and “-æmos”.

Comparative and Superlative sub-lexica



LEXICON ATTR Sends attributes to

LEXICON ATTR_PrsPrc Attr without -vuohta derivation.

Derivation of adjectives

LEXICON DenominalAdjsV1 ! even noun stems are sent here

LEXICON DenominalAdjsV1_1 ! even noun stems without grade alternation are sent here

LEXICON DenominalAdjsV2 ! even noun stems are sent here. -asj derivation

LEXICON DenominalAdjsKINO ! unassimilated nouns are sent here

LEXICON DenominalAdjsODD ! gives derivation -ahtes

LEXICON DenominalAdjsContr

Derivations to adjectives, hardcoded in adjectives stems file

LEXICON DIEHTEMAHTES ! odd syllable. For hardcoded -ahtes words. Derived from odd-syll NomAct (bårråt>bårråm-bårråmahtes), or from odd-syll verbs, as buorránit>buorránahtes. Might want to split the lexicon in two.


LEXICON BÁJNUK ! hardcoded denominal derivations, latus has changed from o>u, a>a, e>á (bájnno>bájnuk, juolgge>juolgák, giella>gielak). Attr same as pred, no comp in this lexicon.

LEXICON TSÅHPÅK ! hardcoded denominal derivations, latus has changed from o>u, a>a, e>á AND -GIS attr. Attr same as pred is Err/Orth-tagged. No comp in this lexicon.

LEXICON GIEVLEK ! hardcoded derivations, not the same as BÁJNUK since latus has an unexpected vowel. Latus hasn’t changed o>u, a>a, e>á. Goes directly to BÁJNUK, only made to sort these different kinds of derivations. Many of these may be derived from verbs or other adjectives.

LEXICON SJERVAK ! hardcoded derivations, not the same as TSÅHPÅK since latus has an unexpected vowel. Latus hasn’t changed o>u, a>a, e>á. Goes directly to TSÅHPÅK, only made to sort these different kinds of derivations. Many of these may be derived from verbs or other adjectives.

LEXICON DIBME ! even and contracted

LEXICON LIS ! Agent nouns in -is?

LEXICON Ahkásasj ! lexicalized and denominal -asj derivations

LEXICON STÁVVALIS ! Must be “stávvalis” in both pred and attr, as “guovddelis”. OK& Kintel 2012: stávval, attr stávvalis; this is Err/Orth-tagged, also as second compound part. No comparison.

Derivations to adjectives, continuation lexicon not for hardcoded adjectives

LEXICON AHTES ! odd syllable, only a continuation lexicon for words that are not in the adjective stems file. Just like DIEHTEMAHTES, only with the +A tag that adjectives already get in the stems file.


LEXICON AGAdj ! denominal derivations go here, attr same as pred, no comp in this lexicon

Sublexica for Noun

Even-syllable stems

2syll stems

LEXICON MUORRA Standard even stems with cg (note Q1). NB: Nouns with invisible 3>2 cg (as bus’sa) go to this lexicon.

LEXICON TÁLLA Same as MUORRA, but for words with ’ (extra length). Not in MUORRA because of other Err/Orths.

LEXICON ALMME Same as MUORRA, but with special -lasj derivation. For nouns that have strong-grade -lasj: “almmelasj” instead of “almálasj”, which is Err/Orth-tagged.

LEXICON NOADE Even stem without cg. NB: No nouns with invisible 3>2 cg (as bus’sa) in this lexicon. NB: Because denominal nouns take a weak-grade stem, entries in grade 3 are given the gradation mark ’ in order to prevent alternation to weak grade. We should consider creating a separate denominal nouns lexicon for NOADE instead.

LEXICON KÁFFA For even-syll words with cg III-I: káf’fa-káfav, jáf’fo-jáfo. No vowel changes yet; needs new twolc code.

LEXICON LINNJA Only for the loanword “linnja”. Because it is a loanword, the “nnj” is pronounced “nn-j” and does not behave like the regular Lule Sami “nj” sound; it therefore does not follow the rule that makes a:á in 1st grade with a short vowel in the first syllable (it is not like linnja-linjáv or birás-birrasav). This word is therefore sub-tagged. Norwegian/Swedish words with a short “i” followed by two different consonants are assimilated to Lule Sami in different ways according to the consonants in question, but the word is always in grade III (Morén-Duolljá 2014). Both the Err/Orth form and the correct form are part of this lexicon.

LEXICON BOAKSA Only for word “boaksa”. Both boaksa-båvsa and Err/Orth boaksa-båksa are part of lexicon.

LEXICON SÁMEGIEL Compounds on -giella, with short -giel as middle compound (sámegielåhpadiddje)

LEXICON AHKA Words like tjerastahka, with short compound form

LEXICON DARRHA Only for “darrha” or compounds that end on “darrha”.

Nouns with comparatives

LEXICON GÁDDE 2 syllable stems with cg (note Q1) with comparatives, like MUORRA

LEXICON VUODO 2 syllable stems without cg (note Q1) with comparatives, like NOADE

LEXICON SJIEVNNJET Like GAHPER but with comparatives. Odd-syllable C-final noun without cg, no vowel change, no short Ess.

LEXICON ÅLGGO Like MUORRA, but with comparatives. This lexicon was previously without sg ill/ine/elat, but these nouns can be inflected for the regular location cases. However, “adverbs” like ålggot (from outside), nuorttan (at north), oarjas (to south), etc., are more commonly used to denote location/direction (we should therefore maybe consider sub-tagging the regular location case forms).

LEXICON MIEHTE Like MUORRA but no locative/elative/illative sg. Presently no words in this lexicon except for the error sub-tagged nuortto.

Plural stems

LEXICON BÅVSÅ Like MUORRA, only in plural. All, except ganta, juvdá and ávta, have regular, singular stem counterparts.

LEXICON LÅHTSASA Like GAHPER, only in plural. Without derivations, these should maybe be added.

Partially assimilated loanwords. The first part of the word is “citation borrowed” and keeps its Norwegian/Swedish orthography; only the last two syllables are adapted to Sami.

LEXICON MUORRA_LOAN For loanwords that do not fit in a loanword lexicon because of a wrong short cmp, for partially assimilated loanwords without separate lexica (medállja), or for Err/Orths assimilated with cg but with other errors. This lexicon gives no short compound forms. Potential short cmps must therefore be hardcoded into the FirstComponent lexicon. This also applies to compounds with partially assimilated loanwords. Examples of problem words: sirup>siráhppa and stetoskop>stetoskoahppa.

LEXICON MUORRA_LOAN_NO_LASJ Like MUORRA_LOAN without -lasj derivation. This lexicon is made for Sem/Hum words like økonåvmmå, biolåvggå, agronåvmmå and so on. We don’t want agronåvmålasj since it means something other than “agronomisk”; agronåvmålasj is barely used in that meaning but gets confused with “agronomijjalasj”.

LEXICON MUORRA_LOAN_EXTRA_LENGTH Same as MUORRA_LOAN, just for words with ’ (extra length).

LEXICON KAFIEDJA_CMP_INFL Recent loanwords on -edja. Ends in -é in Norwegian. Short and long cmp. “Kafea” and “kaféa” are sub-tagged. See comments about the -ie/-e dial tags in ALFABIEHTTA.

LEXICON ALLEGORIJJA_CMP_INFL Recent loanwords ending in -i in NOR/SWE, with long and short compound form. Standardized as -iddja (SWE) and -ij’ja (NOR). Previously often assimilated as -ija (or just -ia), but both forms are ungrammatical: short vowels cannot both precede and follow a single intervocalic consonant. -ija is thus ungrammatical, as the short a would be lengthened to á, like “idja-ijá”.

LEXICON TEKSTIJLLA_CMP_INFL Recent loanwords on -ijlla with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON ASIJLLA_CMP_INFL Recent loanwords on -ijlla, from NOR and SWE words ending in -yl. With long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON BENSIJNNA Recent loanwords on -ijnna with long and short compound-form

LEXICON BENSIJNNA_CMP_INFL Recent loanwords on -ijnna with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON MASJIJNNA_CMP_INFL Recent loanwords on -sjijnna with long and short compound-form: -SKIN

LEXICON ADJEKTIJVVA_CMP_INFL Recent loanwords on -ijvva with long and short compound-form

LEXICON PARADIJSSA_CMP_INFL Recent loanwords on -ijssa with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON TELEFÅVNNÅ_CMP_INFL Recent loanwords on -åvnnå with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON INSTITUSJÅVNNÅ_CMP_INFL Recent loanwords on -sjåvnnå with long and short compound-form: -TION IN SWEDISH. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON MISJÅVNNÅ_CMP_INFL Recent loanwords on -sjåvnnå with long and short compound-form: -SSION IN SWEDISH. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON PENSJÅVNNÅ_CMP_INFL Recent loanwords on -sjåvnnå with long and short compound-form: -SION IN SWEDISH. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON PARTISIHPPA_CMP_INFL Recent loanwords from swe -cip and nor -sipp, becoming -sihppa in Norway, both -sijppa and -sihppa are used in Sweden (Particip vs partisipp). Short and long compound-form.

LEXICON ALKOHÅVLLÅ_CMP_INFL Recent loanwords on -åvllå with long and short compound-form. The old standardization form “alkohola” is sub-tagged. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON AGRONÅVMMÅ_CMP_INFL Recent loanwords on -åvmmå with long and short compound-form. -lasj derivation is error-tagged. The old standardization form -oma, which does not follow Lule Sami rules, is sub-tagged.

LEXICON DEMAGÅVGGÅ_CMP_INFL Recent loanwords ending in -og with long and short compound form. Assimilated to smj as -åvggå. -lasj derivation is error-tagged. The old standardization -oga, which does not follow Lule Sami rules, is sub-tagged.

LEXICON LAKTÅVSSÅ_CMP_INFL Recent loanwords ending in -ose in Norwegian and -os in Swedish, with long and short compound form. Assimilated to smj as -åvsså. The old standardization -oga, which does not follow Lule Sami rules, is sub-tagged.

LEXICON FAKTÅVRRÅ_CMP_INFL Recent loanwords on -åvrrå with long and short compound-form.

LEXICON MIKROSKÅVPPÅ_CMP_INFL Recent loanwords on -åvppå (-op in NOB/SWE) with long and short compound-form. Long vowel and short consonant is assimilated with njuoban, but somehow a lot of -op words are assimilated as -oahppa (biskop is pronounced as -opp, so that’s different; maybe some have used “biskop” as a template), so this is Err/Orth-tagged.

LEXICON KULTUVRRA_CMP_INFL Recent loanwords on -vrra with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON TERAPÆVTTA_CMP_INFL Recent loanwords on -ævtta/ievtta with long and short compound-form. No -lasj derivation. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON ADVÆRBBA_CMP_INFL Recent loanwords on -ærbba with long and short compound-form

LEXICON SUBSTÁNSSA_CMP_INFL Recent loanwords on -ánssa with long and short compound-form. Originally -ans in SWE and NOR. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON VALÆNSSA_CMP_INFL Recent loanwords on -ænssa with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON ADVOKÁHTTA_CMP_INFL Recent loanwords on -áhtta with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON ALFABIEHTTA_CMP_INFL Recent loanwords originally on -et in both Norway and Sweden. Assimilation differences, however, create two Lule Sami forms: -iehtta in NOR and -æhtta in SWE. Long -e is assimilated differently in Norway and Sweden: in Norway it becomes -ie, and in Sweden -e. Tiedja/tedja, systiebma/systebma and so on. This is especially apparent in assimilated words with long e in grade three: e becomes æ in grade three, so we get “universitæhtta” in SWE, but this is very strange to people on the Norwegian side of the border, as they want “universitiehtta”. Both -ie and -e are dial-tagged in lexicons HYDROGIEDNA, APOTIEHKKA, SYSTIEBMA, KAFÉ. Previously people often wrote -ehtta in Norway, but this is incorrect, as e always becomes æ in grade three.
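The three endings discussed for ALFABIEHTTA words can be told apart with a small suffix check. This is a sketch only: the function name and the labels "NOR", "SWE" and "nonstandard" are illustrative and are not the tags used in the lexc files. It applies only to words of this class, since other lexicons (such as INTERNÆHTTA and TABLÆHTTA) use -æhtta in both countries.

```python
from typing import Optional

# Longest ending first, so that "-iehtta" is not mistaken for "-ehtta".
ENDINGS = (
    ("iehtta", "NOR"),         # Norwegian-side form
    ("æhtta", "SWE"),          # Swedish-side form (e > æ in grade three)
    ("ehtta", "nonstandard"),  # earlier spelling, described as incorrect
)


def classify_et_loan(word: str) -> Optional[str]:
    """Return an illustrative label for an ALFABIEHTTA-type word, or None."""
    for ending, label in ENDINGS:
        if word.endswith(ending):
            return label
    return None


for form in ("universitiehtta", "universitæhtta", "universitehtta"):
    print(form, classify_et_loan(form))
```

Checking the longer ending first matters because every word in -iehtta also ends in -ehtta.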

LEXICON INTERNÆHTTA_CMP_INFL Recent loanwords on -æhtta with long and short compound-form: -ET in Swedish, -ETT in Norwegian. Differs from ALFABIEHTTA because -ehtta isn’t used in NOR.

LEXICON TABLÆHTTA_CMP_INFL Recent loanwords on -æhtta with long and short compound-form. -ETT in both Norwegian and Swedish.

LEXICON INSTITUHTTA_CMP_INFL Recent loanwords on -uhtta, with long and short compound-form on -utt (NOR)/-ut (SWE). The Swedish -ut also gets uvtta, as ANTIHKKA-antijkka, but instituhtta is also used in Sweden, so no Area/NO tag.

LEXICON SATELIHTTA_CMP_INFL Recent loanwords on -ihtta, with long and short compound-form on -itt (NOR)/-it (SWE). The Swedish -it also gets ijtta, as ANTIHKKA-antijkka, but satelihtta is also used in Sweden, so no Area/NO tag.

LEXICON APOTIEHKKA_CMP_INFL Recent loanwords on -iehkka in NOR, -æhkka in SWE. -ehkka as sub. With long and short compound-form on -k. See comments about the -ie/-e dial tags in ALFABIEHTTA.

Old “apotehkka”: long e is not allowed in grade III; even though it is in dictionaries, it is wrong.

LEXICON ANTIHKKA_CMP_INFL Recent loanwords on -hkka in Norway, both -ijkka and -hkka are used in Sweden (Antik vs antikk). With long and short compound-form on -kk/-k. The swedish forms were earlier added to stems for the Swedish version, but now added here.

LEXICON SEMINÁRRA_CMP_INFL Recent loanwords on -árra with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON AREÁLLA_CMP_INFL Recent loanwords on -álla with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON AMBASSADERRA_CMP_INFL Recent loanwords on -ør with long and short compound-form. Standardized by Giellagálldo 05.05.14 as -erra. -ørra is sub-tagged.

LEXICON VETERINERRA_CMP_INFL Recent loanwords on -erra. Words ending in -ær in both SWE and NOR. Both long and short compound-form. The old standardization form -æra, without cg, is sub-tagged, as are -ær’ra and -ærra.

LEXICON ATMOSFERRA_CMP_INFL Recent loanwords on -erra, but with different endings in SE and NO: -ære, -ær in NOR and -är, -ära in SWE (ingefær in NO, ingefära in SE). Only long compound-form; the short form must be hardcoded in the FirstComponent lexicon. The old standardization forms -æra and -era, without cg, are sub-tagged, as are -ær’ra and -ærra.

LEXICON KARAKTIERRA_CMP_INFL Recent loanwords on -ierra in NOR, -erra in SWE, because long e assimilates in different ways. Words ending in -er in NOR, and -er or -är in SWE. Only long compound-form; the short form must be hardcoded in the FirstComponent lexicon.

LEXICON TABÆLLA_CMP_INFL Recent loanwords on -äl’la with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON TELEGRÁMMA_CMP_INFL Recent loanwords on -ám’ma with long and short compound-form

LEXICON TOPOGRÁFFA_CMP_INFL Recent loanwords on -áf’fa with long and short compound-form, no -lasj derivation since most of these words denote humans.

LEXICON SYSTIEBMA_CMP_INFL Recent loanwords on -ebma/-iebma with long and short compound-form. -em in NOR and SWE. See comments about the -ie/-e dial tags in ALFABIEHTTA. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON ORGÁDNA_CMP_INFL Recent loanwords on -ádna with long and short compound-form

LEXICON KOLLÆKTA_CMP_INFL Recent loanwords on -ækta with long and short compound-form

LEXICON HYDROGIEDNA_CMP_INFL Recent loanwords on -iedna in NOR and -edna in SWE. Both long and short compound-form. Norwegian/Swedish -en. The old standardization form -ena, without cg, is sub-tagged. See comments about the -ie/-e dial tags in ALFABIEHTTA.

LEXICON PATÆNNTA_CMP_INFL Recent loanwords on -ænnta with long and short compound-form. The -ennta form (used in “Ådå testamennta”) is tagged as sub (e always becomes æ in grade three).

LEXICON VARIÁNNTA_CMP_INFL Recent loanwords on -ánnta with long and short compound-form. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON FANATISSMA_CMP_INFL Recent loanwords on -ssma with long and short compound-form.

LEXICON TURISSTA_CMP_INFL Recent loanwords on -ssta with long and short compound-form. -lasj derivation is error-tagged. Frequent typos that do not follow Lule Sami rules are sub-tagged; these forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

Loanwords becoming odd-syll

LEXICON PRIEMIJ_CMP_INFL Assimilated loanwords on -ie/-y, like premie and bandy. They become odd-syllable loanwords with cg, like “riebij”. Nom: premij, Gen: prebmiha. Long and short essive.

Loanwords becoming contracted-syll

See further down: ÅLMÅJ_LOAN

Error-lexicons, made to not get too many entries with both Err/Orth and correct

LEXICON A_CMP_INFL Sub-forms. Lexicon for giving sub-variation conjugation by simply adding an -a to the Norwegian/Swedish word. No cg. Like “alkohola” and “agronoma”. These forms go against the standardization rule, but are found because of earlier standardization rules and dictionaries.

LEXICON ERR/ORTH_EVEN_WEAK_CASES Even stem Err/orth lexicon without nominative, illative and essive. Only for entries with ERR/ORTH tag. Made so that we don’t get entries that are both norm and with error tag. Entries like “ålggo” have no grade alternation; a common error is to write them as if they had: ålggo>ålgov, tálla>tálav, klimáksa>klimáksav, prefiksa>prefiksav, barggo>barggov

LEXICON ERR/ORTH_EVEN_WEAK_CASES2 Even stem Err/orth lexicon without nominative, illative and essive, AND ALSO without Sg+Gen, Sg+Ine, Pl+Nom, Pl+Com and Pl+Gen (to avoid homonymies).

LEXICON ERR/ORTH_EVEN_STRONG_CASES Even stem Err/orth lexicon with only nominative, illative and essive. Only for entries with ERR/ORTH tag. Made so that we don’t get entries that are both norm and with error tag. Hydrogena is used as nom and is err/orth, hydrogena>hydrogenav is not err/orth. marináda-nom, banána-nom
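The split between the three Err/Orth lexicons above can be pictured as a partition of the paradigm slots. The following is a minimal sketch, not the lexc implementation. The case inventory is simplified (no abessive), and the reading that "nominative, illative and essive" means Sg+Nom, Sg+Ill and Ess is an assumption made for illustration.

```python
# Hypothetical, simplified slot inventory; tag names mirror those used in this documentation.
CASES = ["Nom", "Gen", "Acc", "Ill", "Ine", "Ela", "Com"]
ALL_FORMS = {f"{num}+{case}" for num in ("Sg", "Pl") for case in CASES} | {"Ess"}

# Assumption: the slots covered by ERR/ORTH_EVEN_STRONG_CASES.
STRONG_CASES = {"Sg+Nom", "Sg+Ill", "Ess"}

# ERR/ORTH_EVEN_WEAK_CASES covers everything else.
WEAK_CASES = ALL_FORMS - STRONG_CASES

# ERR/ORTH_EVEN_WEAK_CASES2 additionally drops the slots named in its description.
WEAK_CASES2 = WEAK_CASES - {"Sg+Gen", "Sg+Ine", "Pl+Nom", "Pl+Com", "Pl+Gen"}
```

Under these assumptions the strong and weak sets never overlap, which is the point of the design: an entry sent to one of these lexicons cannot yield a form that is both normative and error-tagged.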

LEXICON ERR/ORTH_ODD Err/Orth lexicon doing the opposite of what odd-syllable nouns do. Strong grade in nom and weak in all others. fiehpar-fiebara

Badly assimilated loanwords

LEXICON NOADE_BADASS 2 syll stems without cg. Badly or wrongly assimilated words, i.e. assimilated in a way that isn’t Lule Sami. (Same as NOADE) Most of the words are Err/Orth tagged with a standardized lemma. Some are Err/Lex tagged. 5.9.2019: EJP/SNM: removed +Use/-Spell. Even though we do not like these words, we want to make sure that they are written correctly according to the smj orthography! Most of the words are tagged with +Err/Orth anyway :)

LEXICON C_ILL_IJ_BADASS Badly or wrongly assimilated words. Last letter is a consonant, no cg, no vowchange, with illative -ij. (Same as GAHPER) Assimilated in a way that isn’t Lule Sami. Most of the words are Err/Orth tagged with a standardized lemma. Some are Err/Lex tagged, and some only receive the +Use/-Spell tag from the lexicon.

LEXICON C_ILL_AJ_BADASS Badly or wrongly assimilated words. Last letter is a consonant, no cg, no vowchange, with illative -aj. Should have been assimilated to even-syll, but are used as odd-syll, and are mostly just assimilated by changing a letter to á. So almost the same as CELSIUS_UNASS.

Unassimilated loanwords

LEXICON KINO_UNASS_CMP_INFL V-final unassimilated loanwords. Not Lule Sami. No diacritics whatsoever. Words that aren’t assimilated at all. Really just Norwegian words with a kind of Sami inflection. Get even syllable case marking. Are part of the spell checker.

LEXICON C_ILL_IJ_UNASS C-final unassimilated loanwords, gives illative -ij. Not Lule Sami. No diacritics whatsoever. Really just foreign words with a kind of Sami inflection. Odd syllable case marking (like GAHPER). Are part of the spell checker.

LEXICON C_ILL_AJ_UNASS C-final unassimilated loanwords, gives illative -aj. Also odd-syll words ending on the letter i, such as selleri. Not Lule Sami. No diacritics whatsoever. Really just Norwegian words with a kind of Sami inflection. Case marking like standard even 4 syllable stems (see proper nouns file on the case marking of foreign words with stressed last syllable). Are part of the spell checker.

+Der4+Der/ahtes:e»g AHTES ; Only for odd-syllable stems

4syll stems

LEXICON GÅNÅGIS Standard C-final 4-syllabic stems

LEXICON BERULASJ For words ending on -asj. Same as GÅNÅGIS but with strong essive and illative -adjtan and -adtjaj sub-tagged, same with PX “-adjtam”. These forms are barely used today. -lahttja is also Err/Orth-tagged.

LEXICON BEDNAGASJ Like BERULASJ, but for derived nouns in diminutive. No cg, no vowchange, no short Ess. Has only one dimin derivation since these words already are dimin, i.e. no double dim as for GAHPER. No abessive. Not totally sure about this; I think we must use the postposition dagi when it’s diminutive.

LEXICON HÁVSAGUSJ Like BEDNAGASJ, but not diminutive. No cg, no vowchange, no short Ess. Has only one dimin derivation. No abessive. Not totally sure about this; I think we must use the postposition dagi when it’s diminutive.

LEXICON JIHPELIJ gen:jihpelahá

LEXICON OARJJILIJ gen:oarjjilihá

LEXICON VIESSOMUJ gen:viessumuhá

4 syllable plurals

LEXICON OADÁDAGÁ Plural forms of words like tjerastahka with short compound-form

LEXICON BERRAHATTJA Plural stems. Like IEDNITJA, these do not have corresponding singular stems. Most stems here have the same form as the pl nom form of diminutive derivations, but (while it may have originated as a diminuitive derivation) it is not the same derivation (today) and it does not have a singular form.


LEXICON SISSNELUHÁ plurals. presently only for sissŋeluhá

LEXICON DAGI_SINGULAR Earlier we generated “bijladagi” and “bijlajdagi” as abessive. This has been fixed, but to be able to analyse what we earlier generated, we needed this lexicon. Only singular. Gives an Err/ tag to “bijladagi” and makes the correct “bijla dagi”.

LEXICON DAGI_PLURAL Earlier we generated “bijladagi” and “bijlajdagi” as abessive. This has been fixed, but to be able to analyse what we earlier generated, we needed this lexicon. Only plural. Gives an Err/ tag to “bijlajdagi” and makes the correct “bijlaj dagi”.
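The effect of the two DAGI lexicons can be illustrated with a small string-level sketch. This is not how the FST does it; the function name `split_dagi` is hypothetical, and the sketch flags any single token ending in "dagi", which is a much cruder test than the lexicon paths.

```python
def split_dagi(word: str) -> tuple[str, bool]:
    """Return (suggested form, was_error) for a possibly fused abessive."""
    suffix = "dagi"
    # A fused form is one token that ends in "dagi" but is longer than "dagi" itself.
    if " " not in word and word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)] + " " + suffix, True
    return word, False


# Forms quoted in the lexicon descriptions above.
singular = split_dagi("bijladagi")
plural = split_dagi("bijlajdagi")
```

The already correct two-word form and the bare postposition pass through unchanged.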

Adjectival sublexicas. Give 4 syll adjectives inflection


LEXICON N-EVENWEAKSTEM-NO-ABE Same as N-EVENWEAKSTEM but without abessive (abessive is Err/Infl-tagged). Used for 4-syll nouns

Compound lexicas

Odd-syllable stems

without cg

LEXICON GAHPER Odd-syllable C-final noun without cg, no vowchange, no short Ess. Spiik A3

with cg

LEXICON ÅRES Odd-syllable C-final noun with CG, 2ndsyll vowchange. Long and short essive. Spiik A1

LEXICON SÅHKÅR Odd-syllable C-final noun with CG and 2ndsyll vowelchange. Has only long essive. Spiik 2b

LEXICON MIEHTAR Only for word “miehtar”. Same as SÅHKÅR but with Area-differences and a lot of Err/Orths.

LEXICON GÁMAS Odd-syllable C-final noun with CG, no 2ndsyll vowchange (OBS: a does not change). Long and short essive. Spiik A2

LEXICON BENA Odd-syllable V-final noun with cg, no 2ndsyll vowchange. Deletes g. Long and short essive. Spiik 2a

Irregular stems

LEXICON SUOBDE gen: suobddega. Presently only for “suobde”. For some reason -e doesn’t become á. So not in lexicon BENA. Long and short essive.

LEXICON SÁGE gen: sáhkaha. Presently only for “ságe”. Long and short essive.

LEXICON BAVSEV Ends on -v and last vowel changes to i: bavsev:baksIma. Not like gierkav gierkkAma and birev birEma.

LEXICON RÁBEV rábev:ráhpuga. Presently only for “rábev”.

LEXICON RITJAS Like GÁMAS but without stem a-lengthening for grade I (underlying long -i-). Presently only for “ritjas”.

LEXICON SÅGAS gen: sågaska. Presently only for “sågas”.

LEXICON SJUVÁJ Presently only for “sjuváj”. sjuváj-sjuvvaga. Only this word

LEXICON BØSOJ Because of bösoj in O.Korhonen, and bæsoj-bessuga. Only for these two words. J becomes g.

LEXICON GUOVSOJVUOJOJ vuojoj:vuodjom. Presently only for “guovsojvuojoj”.

LEXICON BUTJES butjes-buttjása. Presently only for “butjes”. This is a sub. Korhonen has this form, but in Grundström it’s buttjes-budtjasa. It must be a typo in Korhonen, because ttj-tj doesn’t exist in smj. This form is err-subbed in the stems file.

LEXICON TJÅLKES tjålkes:tjoalkkas- Presently only for “tjålkes” and “tsålkes”. This must be wrong, and it doesn’t exist in Grundström. Å in 1. syll isn’t possible with e in 2. syll. Must be tjoalkes-tjoalkkása or tjålkas-tjoalkkasa. This form is err-subbed in the stems file.

LEXICON VÁJES vájes:vádjas- Presently only for “báhkovájes”. It’s a sub: 2. syll e doesn’t become a. Must be vájes-vádjása or vájas-vádjasa. The second is used in NT, so I believe that’s the right one. This form is err-subbed in the stems file.

Derived stems

LEXICON BADJEL Derived nouns with acc -elav, ill -elij, elat -elas, etc. These were previously categorized as adpositions and adverbs, but according to Bruce Morén-Duolljá (2014) they are actually case forms of nouns derived from certain location nouns. Derived from even strong stems (badje -> badjel). Odd syllable inflection, but only singular nominative-elative (not clear if they take comitative and essive case). With comparatives. No Px.

LEXICON BÁRNEP bárnep:bárnebu-. Comparison of nouns. No -ahtá abessive.

LEXICON OAPPÁSJ Like GAHPER, but for derived nouns in diminutive that have an underived form. Doesn’t get abessive -ahtá or -ahtes derivation. Oddsyll, no cg, no vowchange, no short Ess. Has only one dimin derivation since these words already are dimin, i.e. no double dim as in GAHPER.

LEXICON FIERUN Like GAHPER, but instruments derived from verbs. Fierrot>fierun. No short essive.

LEXICON GUOLLÁR Like GAHPER, but actor derived from contracted verbs (ACTOR for evensyll verbs). Guollit>guollár. No short essive.

LEXICON IELLEM Nomen actionis derived from even verbs. Earlier these went directly to VSBST-ODD, now they get the tag Gram/NomAct before going there. Can’t put it in the VSBST-ODD lexicon because of paths from verb lexicons.

LEXICON TJIEKTJAMA Pl Nomen actionis derived from even verbs. Earlier these went directly to VSBST-ODD-PL, now they get the tag Gram/NomAct before going there. Can’t put it in the VSBST-ODD-PL lexicon because of paths from verb lexicons.

LEXICON AKTIDIBME Nomen actionis derived from uneven verbs, ending DIBME. Earlier these went directly to VSBST-EVEN, now they get the tag Gram/NomAct before going there. Can’t put it in the VSBST-ODD lexicon because of paths from verb lexicons.

LEXICON BERUSTIBME Nomen actionis derived from uneven verbs, ending STIBME; DIBME is Err/Orth-tagged. Earlier these went directly to VSBST-EVEN, now they get the tag Gram/NomAct before going there. Can’t put it in the VSBST-ODD lexicon because of paths from verb lexicons.

Plural odd-syll

LEXICON DÁRBBAGA Like BENA, but plural. Presently only for “dárbbaga”, has singular stem counterpart.

LEXICON BÆLLJASA Like GÁMAS, but plural. These have corresponding singular stems.

LEXICON IEDNITJA Odd syllable pluralforms only. These do not have a singular form.

LEXICON SNJIERÁGA Odd syllable pluralforms only. These have corresponding singular stems.

LEXICON MANEBU oddsyllable plural only. presently only for “maŋebu”.

Contracted stems

LEXICON SUOLOJ C-final with cg II-III: ålmåj:ålmmå

LEXICON ÅLMÅJ_LOAN Same as SUOLOJ, only for loan words. Follows Ráhka/Mikkelsen’s Bårjås 2014. C-final with cg II-III: ålmåj:ålmmå

LEXICON GUOMOJ C-final with cg I-III: guomoj:guobbmu

LEXICON SARVES C-final with cg II-III. sarves:sarvvá

LEXICON SVÁLES C-final with cg I-III. sváles:svállá (lºl)

LEXICON GÅHKES C-final with cg II-III with vowel harmony (a/á=å). gåhkes:gåhkkå. Presently only for “gåhkes”.

LEXICON SJUOKKAJ sjuokkaj:sjuoggá. Presently only for “sjuokkaj”.

LEXICON GISTÁ gistá:gisstá. Presently only for “gistá”.

Contracted stems sublexica

Px lexica

LEXICON DUOLMUN Fierrot>fierun, instruments derived from verbs, used only for verb derivation, not for lexicalized lemmas. No short essive.

This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc

+Use/NG+Gen:n NAMÁK ; ! adjectival -k derivation does not take pronouns
+Use/NG+Ela:sstága K ; ! Can’t find this anywhere. Maybe this is really dástága/dastagá? in “dáhtakcas”

+Use/NG+Gen: NAMÁK ; ! adjectival -k derivation does not take pronouns

+Use/NG+Gen:aj NAMÁK ; ! adjectival -k derivation does not take pronouns
+Ine:a%>jna K-s ;
+Abe+Use/NG:a%>jdak K ; ! covered in non-idiosync
+Abe+Use/NG:a%>jdagi K ; ! covered in non-idiosync
+Abe+Use/NG:a%>jdagá K ; ! covered in non-idiosync
+Abe+Use/NG:a%>jtagá K ; ! covered in non-idiosync

This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc

Lule Sáme Proper noun morphology

Even syllable proper nouns

Unstressed last syllable

Words in ACCRA lexicons end on a vowel, have no CG and get “even-syllable” case marking where case suffixes are added directly. Illative e:i, but not o:u. Last syllable is unstressed. Both non-assimilated and assimilated stems (although not all are fully, or correctly, assimilated)

LEXICON ACCRA-ani Vowel-final names where case endings are added directly, no cg. Illative e changes to i. Animals.

LEXICON ACCRA-obj Vowel-final names where case endings are added directly, no cg. Object names

LEXICON ACCRA-org Vowel-final names where case endings are added directly, no cg. Organizations

LEXICON ACCRA-mal Vowel-final names where case endings are added directly, no cg. Male names

LEXICON ACCRA-fem Vowel-final names where case endings are added directly, no cg. Female names

LEXICON ACCRA-femsur Vowel-final names where case endings are added directly, no cg. Female names also used as surnames

LEXICON ACCRA-malfem Vowel-final names where case endings are added directly, no cg. Names that can be both female and male names

LEXICON ACCRA-objplc Vowel-final names where case endings are added directly, no cg. Names that can be both objects and place names

LEXICON ACCRA-femplc Vowel-final names where case endings are added directly, no cg. Names that can be both female and place names

LEXICON ACCRA-sur Vowel-final names where case endings are added directly, no cg. Surnames

LEXICON ACCRA-malsur Vowel-final names where case endings are added directly, no cg. Names that can be both male names and surnames

LEXICON ACCRA-plc Vowel-final names where case endings are added directly, no cg. Place names

LEXICON ACCRA_MWE-plc Vowel-final names where case endings are added directly, no cg. Place names

LEXICON GIRUNA-plc For proper Kiruna. Same as ACCRA. Different lexicon because of sma.

LEXICON ACCRA-LOAN-org Only nominatives. Vowel-final names where case endings are added directly, no cg. Organizations

LEXICON ACCRA-LOAN-obj Only nominatives. Vowel-final names where case endings are added directly, no cg. Object names

LEXICON ACCRA-LOAN-plc Only nominatives. Vowel-final names where case endings are added directly, no cg. Place names

In smj RONDANE is the same as ACCRA; it is in use in smi because of differences in sme. No -lasj or -k. Last syllable is unstressed. Non-assimilated stems.

LEXICON RONDANE-plc E-final names, with no cg. elative -s, ill -ij. Place names

LEXICON RONDANE-SG-plc E-final names, with no cg. elative -s, ill -ij. Place names

LEXICON RONDANE-LOAN Only nominative. Place names

LEXICON RONDANE-SG-LOAN Only nominative. Place names



LEXICON RONDANE-org Organizations

LEXICON RONDANE-mal Male names

LEXICON RONDANE-fem Female names

These sublexica are irrelevant for ACCRA, but added for the sake of the lexicon MARJA

GATA are Norwegian place names that end on -gata. Gets even-syllable casemarking. Last syllable is unstressed. Non-assimilated stems.

LEXICON GATA-plc Norwegian place names that end on -gata. Gets even-syllable casemarking. Last syllable is unstressed.

Words in MARJA end on vowel, with CG, even-syllable case marking. Illative change e to á, illative i stays i. Last syllable is unstressed. Real lule sami stems.

LEXICON MARJA-fem Odd-syllable with cg. Female names

LEXICON MARJA-ani Animal names

LEXICON MARJA-mal Male names


LEXICON MARJA-org Organizations

LEXICON MARJA-plc Vowel final names with Gradation and Ill change (place names)

LEXICON MARJA-sur Surnames

LEXICON MARJA-plc-der = place name derivations and corresponding flag. Presently not used in SMJ.

LEXICON SUOBMA-plc Placenames. Like MARJA but no derivation

LEXICON SUOBMA-org Placenames. Like MARJA but no derivation

Stressed last syllable

These proper nouns are in essence partly assimilated loanwords. Foreign words with a stressed last syllable are assimilated to Sami by (often adapting the stressed syllable vowel, and) adding an unstressed syllable consisting of adapted (or, if necessary, added) consonants and ending in the vowel a (Morén-Duollja 2014). Proper nouns are only partly assimilated in that the stressed syllable vowel is not adapted in any way, nor are consonants inserted; only the final “a” remains. These proper nouns therefore work like regular a-stem nouns and get an even syllable case marking.

Words in lexicon NYSTØ end on vowel, no cg. Non-assimilated stems

LEXICON NYSTØ-fem Female names

LEXICON NYSTØ-mal Male names


LEXICON NYSTØ-org Organizations

LEXICON NYSTØ-LOAN-org Organizations loan

LEXICON NYSTØ-sur Surnames

LEXICON NYSTØ-LOAN-plc Place names loan

LEXICON NYSTØ-plc Place names

LEXICON NYSTØ_MWE-plc Place names

Words in DUBAI lexicon end on vowel+vowel and have no cg. Last syllable is stressed. Get even syllable case marking. Non-assimilated stems. Not sure if this lexicon is necessary, at least for smj’s sake.

LEXICON DUBAI-fem I-final names. No cg. Female names

LEXICON DUBAI-obj I-final names. No cg. Object names

LEXICON DUBAI-org Organizations

LEXICON DUBAI-mal Male names

LEXICON DUBAI-sur Surnames

LEXICON DUBAI-plc Place names

Words in lexicon BERN end on a consonant, no cg, even syllable case marking with -av, -aj, -as, etc. Last syllable is stressed. Both assimilated and non-assimilated stems.

LEXICON BERN-ani Animals

LEXICON BERN-mal Male names

LEXICON BERN-surmal Names that are both surnames and male names

LEXICON BERN-fem Female names

Different lexicon for female persons. Audhild.

LEXICON BERN-sur Surnames

LEXICON BERN-plc Placenames

LEXICON BERN_MWE-plc Placenames

LEXICON BERN-objsur Names used as both objects and surnames.

LEXICON BERN-orgsur Names used for both organizations and surnames.

LEXICON BERN-obj Objects. Obs: Different lexicon for organisations. Microsoft.

LEXICON BERN-org Organizations

LEXICON BERN-LOAN-org Organizations loan.

LEXICON BERN-LOAN-plc Placenames loan.

LEXICON BERN-LOAN-obj Objects loan.

Different lexicon for names that are both surnames and places.

Lexicons OY work as BERN lexicons

Words in LONDONBERN are sent to both LONDON and BERN lexicons. Non-assimilated stems.

4-syllable stems

Words in lexicon BASUDIS are trisyllabic in sg nom, and work like standard 4-syllable nouns. End on a consonant and have cg. Even syllable case marking with acc -áv, ill -áj, ela -ás, etc. Real Lule Sami stems.

LEXICON BASUDIS-org Only singular. Placenames

LEXICON BASUDIS-mal Male names

LEXICON BASUDIS-plc Place names


Words in lexicon VARGGAT are even-syllable Sámi plurals.

LEXICON VARGGAT-plc Plural stems, sáme names. Place names

LEXICON VARGGAT-org Plural stems, sáme names.

Words in lexicon ALEUHTAT are even-syllable assimilated plurals.

LEXICON ALEUHTAT-plc Plural names, not sami names. like -váre, -gårtje

Odd syllable case marking

Words in lexicon LONDON end on a consonant, no cg, case marking with -av, -ij, -is, etc. Last syllable is unstressed. Gets a regular odd syllable case marking. Real Lule Sami stems, assimilated stems and non-assimilated stems
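The difference between the BERN and LONDON case marking, and what it means for a LONDONBERN word to be sent to both, can be sketched as plain suffix concatenation. This is only an illustration: the assignment of -av, -aj/-ij and -as/-is to accusative, illative and elative follows the labels given for BASUDIS and PIPPI elsewhere in this file and is an assumption here, and the function names are hypothetical.

```python
# Assumed case labels for the suffixes quoted in the BERN and LONDON descriptions.
SUFFIXES = {
    "BERN": {"Acc": "av", "Ill": "aj", "Ela": "as"},    # even-syllable case marking
    "LONDON": {"Acc": "av", "Ill": "ij", "Ela": "is"},  # odd-syllable case marking
}


def inflect(stem: str, family: str) -> dict[str, str]:
    """Attach the suffixes of one lexicon family directly to the stem (no cg)."""
    return {case: stem + suffix for case, suffix in SUFFIXES[family].items()}


def inflect_londonbern(stem: str) -> dict[str, dict[str, str]]:
    """A LONDONBERN stem is sent to both families, so it gets both paradigms."""
    return {family: inflect(stem, family) for family in ("LONDON", "BERN")}
```

With these tables, `inflect("London", "LONDON")` gives Londonav, Londonij, Londonis, while `inflect("Bern", "BERN")` gives Bernav, Bernaj, Bernas.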

LEXICON LONDON-sur Odd-syllable. Surnames. Final foot structure (X.) and (X..) => Loc:%>is


LEXICON LONDON-org Only singular Organizations

LEXICON LONDON-mal Male names

LEXICON LONDON-malsur Names that can be both male- and surnames. Not used in smj-propernouns

LEXICON LONDON-fem Female names

LEXICON LONDON-malfem Names that can be both male and female names. Not used in smj-propernouns

LEXICON LONDON-malplc Names that can be both male names and placenames. Not used in smj-propernouns

LEXICON LONDON-plc Only singular. Placenames

LEXICON TJIERREK-plc Only singular. Placenames. Without cg. Same as LONDON, but does not get the Sem/Sur tag, since it is not usual for SMJ place names to become surnames.

LEXICON LONDON-orgsur Names that can be both organizations and surnames. Not used in smj-propernouns


LEXICON LONDON-LOAN-obj Objects loan. Not used in smj-propernouns

LEXICON LONDON-LOAN-plc Only nominatives. Placenames loan. Not used in Smj-propernouns

LEXICON LONDON-LOAN-org Only nominative. Organizations loan. Not used in smj-propernouns

JOKULL-plc are placenames. Lexicon added to make the code compile (?)

+N+Prop+Sem/Plc: LONDONDECL-PLC-SUR ; Placenames. NB added to make the code compile, needs revision. Gets an odd syllable case marking. Non-assimilated stems.

Words in lexicon ANAR end on a consonant, no cg, case marking with ill -ij, ela -is. Gets an odd syllable case marking. Lule Sami stems.

LEXICON ANAR-mal Male names.

LEXICON ANAR-plc Place names

Words in PIPPI lexicons are i-final, have no cg, no second syllable vowel change, and get odd syllable case marking with acc -hav, ill -hij, elat -his, etc. Works as “riebij”, but without the -j in nominative (it should maybe be Sirij and Pippij in nom?) and without cg. The last syllable is unstressed. Non-assimilated stems.

LEXICON PIPPI-ani Vowel-final names where case endings are added directly, no cg. Animals.

LEXICON PIPPI-obj Vowel-final names where case endings are added directly, no cg. Object names

LEXICON PIPPI-org Vowel-final names where case endings are added directly, no cg. Organizations

LEXICON PIPPI-mal Vowel-final names where case endings are added directly, no cg. Male names

LEXICON PIPPI-fem Vowel-final names where case endings are added directly, no cg. Female names

LEXICON PIPPI-femsur Vowel-final names where case endings are added directly, no cg. Female names also used as surnames

LEXICON PIPPI-malfem Vowel-final names where case endings are added directly, no cg. Names that can be both female and male names

LEXICON PIPPI-sur Vowel-final names where case endings are added directly, no cg. Surnames

LEXICON PIPPI-plc Vowel-final names where case endings are added directly, no cg. Place names

LEXICON PIPPI-LOAN-plc Only nominatives. Vowel-final names where case endings are added directly, no cg. Place names

Words in lexicon DUORTNUS end on a consonant, have cg and second syllable vowel change o:u, e:á. Odd syllable case marking. Real Lule Sami stems, or one non-assimilated stem.



LEXICON DUORTNUS-org Odd-syllable ending on consonant, with cg. Organizations

LEXICON DUORTNUS-plc Odd-syllable ending on consonant, with cg. Placenames

LEXICON TIEMPEL-obj Same as DUORTNUS, only without second syll vowel change. Odd syllable case marking. Lexicon presently only for two -tiempel-final words. Lule Sami stems.

LEXICON TIEMPEL-org Same as DUORTNUS, only without second syll vowel change. Odd syllable case marking. Lexicon presently only for two -tiempel-final words. Lule Sami stems.

Lexicon HEANDARAT is not in use in smj

+Pl+Nom:aQ1 K ;
+Pl+Gen:aQ1j K ;
+Pl+Gen:aQ1j RHyph ;
+Pl+Acc:aQ1jt K ;
+Pl+Ill:aQ1jda K ;
+Pl+Ine:aQ1jn K ;
+Pl+Ela:aQ1js K ;
+Pl+Com:aQ1j K ;

Words in lexicon EATNAMAT are odd-syllable plurals. Lule sami stems and non-assimilated stems.

LEXICON EATNAMAT-plc Place names. Presently only for Vuolleednama

LEXICON EATNAMAT-org Organizations

Contracted proper nouns

Words in lexicon DAVVISUOLLU are contracted proper nouns ending on -åj/-oj. Lule Sami stems

LEXICON DAVVISUOLU-plc Contracted stems ending on -oj. Place names.

Words in lexicon GEAVNNIS are contracted proper nouns ending on -s.

LEXICON GEAVNNIS-plc Contracted stems ending on -es. Place names. Lule sami stems.

Words in lexicon SUOLLOT are contracted plurals. Lule sami stems.

LEXICON SULLOT-plc Plural names, only names ending on -suollu.

Lexicons only used in sme/sma and that are sent to other lexicons in smj

ERVASTI is only used in smi-propernouns. Ervasti names are 3-syllable and are needed as a separate lexicon because of sma. ERVASTI is the same as ACCRA in smj and gets even syllable case marking.

MAKI and NIEMI are only used in smi-propernouns. Maki names are even-syllable Finnish names and are needed as a separate lexicon because of sma. MÄKI is the same as ACCRA in smj and gets even syllable case marking.

HANNOLA is the same as ACCRA

This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc

Symbol affixes

This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc

Sublexica for Verb

Table of content:

IV means intransitive verbs, TV means transitive verbs.

Auxiliary verbs

Negation verb







LEXICON GALGGAT_IV even-syllable modal verbs.

LEXICON VIERTTIT_IV Contracted modal verbs.

Ordinary main verbs

Even-syllable stems


LEXICON GALSSJOT_IV Impersonal o-verbs


LEXICON VILSSJOT_IV o-verbs as BÅRSSJOT but without derivations -stit, -stallat, -stahttet, -stasstet. With dim -astit forms that are hardcoded


LEXICON BOAHTET_IV e-verbs like BUOLLET_IV without passive


LEXICON ASSTAT_IV only for asstat, no passive

LEXICON RAVGGAT_IV a- and å-verbs only Sg3 passive.


LEXICON RAVGGALASSTET_IV Like RAVGGAT for already derived words (except words ending -uššat) - no actio as first part of compounds, but reintroduced

LEXICON BIEKKASTALLAT_IV Already derived impersonals

LEXICON GUOTTEDALLAT_IV passives on -allat - no actio as first part of compounds, but reintroduced

LEXICON HIEBADUVVAT_IV passives on -uvvat - no actio as first part of compounds, but reintroduced

Transitives

LEXICON MÁHTTET_TV verbs without personal passive

LEXICON BASSAT_TV a- and å-verbs. Three passives

LEXICON BASSALASSTET_TV Like BASSAT for already derived words (except words ending -uššat) - no actio as first part of compounds, but reintroduced. Three passives

LEXICON HIEJTEDAHTTET_TV Like BASSALASSTET_TV, but for words ending on -ahttet. The difference is Use/NG and Use/-Spell for NomAg “hiejedahttijn”, since this is rarely used and is mixed up with the gerund “hiejtedattijn”. Like BASSAT for already derived words (except words ending -uššat) - no actio as first part of compounds, but reintroduced. Three passives

LEXICON JUHKAT_TV a-verbs like BASSAT_TV but without derivations -stit, -stallat, -stahttet, -stasstet. Dim -istit forms that are hardcoded. Three passives

LEXICON LÁHPPET_TV e-verbs. Three passives


LEXICON DIEHTET_TV Only this one word, unusual diphthong behavior. No passive

LEXICON GÁDJOT_TV o-verbs. Only duvvat passive.

LEXICON JÅRGGOT_TV o-verbs with dim -astit that are hardcoded. Duvvat and dallat passive.

Odd-syllable stems

This is just awaiting a manual classification


LEXICON BIEKKASTIT_IV Impersonals, only Sg3

LEXICON JÅRGESTIT_IV ONLY FOR -STIT verbs. Makes Err/Orth jårgest; other verbs in this lexicon would get an Err/Orth Prs Sg3 even if it is the same as the correct Prs Sg3

LEXICON MÅRIJDIT_IV ONLY FOR words ending -IJDIT. Same as BEGATJIT, but a common error is to write “jårgidit”, so both -ijdit and the Err/Orth -idit are handled in this lexicon

LEXICON BEGATJIT_IV Words ending -tjit, -jdit, reciprocals on -dit, momentatives on -dit, -edit, continuatives on -ldit, -nit, essives on -hit and 5-syllables - no actio cmps, but only Sg3 passive reintroduced


LEXICON BALÁDIT_IV continuatives on -dit, frequentatives on -odit, reciprocals, momentatives and frequentatives ending -alit - actio cmps, only Sg3 passive

LEXICON SUOGNALIT_IV Trisyllabic Verbs ending -lit. only Sg3 passive

LEXICON LASSÁNIT_IV verbs ending -nit, -sit, no passive

LEXICON BÁHTARIT_IV verbs ending -rit. only Sg3 passive


LEXICON FÁRMASTIT_TV ONLY FOR verbs ending on -stit. Makes Err/Orth jårgest; other verbs in this lexicon would get an Err/Orth Prs Sg3 even if it is the same as the correct Prs Sg3. All -uvvat passives.

LEXICON HÁLIJDIT_TV ONLY FOR words ending -IJDIT. Same as MUJTATJIT, but a common error is to write “hálidit”, so both -ijdit and the Err/Orth -idit are handled in this lexicon
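The relation between the normative -ijdit lemma and the error-tagged -idit spelling described for MÅRIJDIT_IV and HÁLIJDIT_TV can be sketched as a simple suffix rewrite. This is a hypothetical helper for illustration only; the real lexicons build the Err/Orth forms inside the FST.

```python
from typing import Optional


def err_orth_variant(lemma: str) -> Optional[str]:
    """Return the commonly misspelled -idit variant of an -ijdit verb, else None."""
    correct, wrong = "ijdit", "idit"
    if not lemma.endswith(correct):
        return None  # not an -ijdit verb, so this error pattern does not apply
    return lemma[: -len(correct)] + wrong


# The pair quoted in the HÁLIJDIT_TV description.
variant = err_orth_variant("hálijdit")
```

Keeping both spellings in one lexicon, as the descriptions above explain, means the misspelled form is analysed with an Err/Orth tag instead of being rejected.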

LEXICON UNNEDIT_TV All -uvvat passives.

LEXICON MUJTATJIT_TV Words ending -tjit, -jdit, reciprocals on -dit, momentatives on -dit, -edit, continuatives on -ldit, -nit, essives on -hit and 5-syllables - no actio cmps, but reintroduced. All -uvvat passives

LEXICON BÅNJÅDIT_TV continuatives on -dit, frequentatives on -odit, reciprocals, momentatives and frequentatives ending -alit - actio cmps. All -uvvat passives.

LEXICON VUORDDELIT_TV Trisyllabic Verbs ending -lit. All -uvvat passives

Contracted stems



LEXICON OADDÁT_IV Inceptive, (doarrut,jåhttåt). Only Sg3 passive. Does not make nouns via -ár derivation.

LEXICON DULLUT_IV Does not make nouns via -ár derivation. Only Sg3 passive.

LEXICON TJUOLLÁT_TV Inceptive. All passives. Does not make nouns via -ár derivation, (gullát, bårråt)

LEXICON STRÁFFUT_TV Does not make nouns via -ár derivation. All duvvat passives.

LEXICON TSIEGGIT_TV Makes nouns via -ár derivation. All duvvat passives.

LEXICON VALLIT_TV Makes nouns via -ár derivation. Gets only passive Sg3

Contracted verbs that are assimilated and fall outside the main pattern.

LEXICON PLÁNIT_TV Transitive two-syll contracted words that are not in third grade, as contracted verbs otherwise have been. Two syllable transitive NEW loan verbs. Makes nouns via -ár derivation. All passives.

LEXICON SLEDUT_IV Intransitive two-syll contracted words that are not in third grade, as contracted verbs otherwise have been. Only Sg3 passive.

LEXICON BADASS_TV NEW badly assimilated two syllable transitive loan verbs. Makes nouns via -ár derivation. All passives. Err/Orth-tagged in the stem file.

LEXICON BADASS_IV NEW badly assimilated two syllable intransitive loan verbs. Makes nouns via -ár derivation. Only Sg3 passive. Err/Orth-tagged in the stem file.

LEXICON ABBONERE_TV Transitive loan words with more than two syllables with -erit/ierit endings. Duvvat passives. Does not make nouns via -ár derivation. Only the two last syllables are assimilated to sami. LONG -e is assimilated in different ways in Norway and Sweden: In Norway, it becomes -ie, and in Sweden -e.

LEXICON BRILJERE_IV Intransitive loan words with more than two syllables with -erit/ierit endings. Does not make nouns via -ár derivation. Only the two last syllables are assimilated to Sami. Long -e is assimilated in different ways in dialects in Norway and Sweden: in Norway it often becomes -ie, while in Sweden it’s usually -e.



This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc

This (part of) documentation was generated from src/fst/morphology/compounding.lexc

Lule Sámi morphophonological rule set

This file documents the phonology.twolc file

The file contains the rule set for the non-segmental Lule Sámi morphophonological rules


The file is modeled upon the corresponding file for North Sámi, but has been revised and differs from it on several issues. The grammatical sources are Spiik 1989: Lulesamisk grammatik and Nystø and Johnsen 2001: Sámásta 2.

The rule file has the sections Alphabet, Sets, Definition and Rules. The rules are ordered thematically, with 3 main sections: Consonant alternations (except CG), vowel alternations, and consonant gradation.

Declarations and definitions

The Alphabet section

The real Lule Sámi Alphabet

All Lule Sámi letters are listed. The Lule Sámi ENG sound is represented as ñ. The Lule Sámi letter repertoire is not fully standardised. In the source code we write (and you shall write!) æ, ø, ŋ, but the parser tolerates input written with the letters ä, ö, ń, ñ (cf. the 4 rules in the file smj/src/orthography/spellrelax.regex).
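The tolerance for alternative letters can be illustrated as a normalisation step. The sketch below is not the content of spellrelax.regex, which is a regular expression compiled into the analyser; it only shows the four letter correspondences named above, for lowercase letters, as a Python mapping. The names `RELAXED_TO_STANDARD` and `normalize` are hypothetical.

```python
import unicodedata

# The four tolerated input letters and the letter used in the source code.
# Escapes are used so the table does not depend on how this file is encoded.
RELAXED_TO_STANDARD = {
    "\u00e4": "\u00e6",  # ä -> æ
    "\u00f6": "\u00f8",  # ö -> ø
    "\u0144": "\u014b",  # ń -> ŋ
    "\u00f1": "\u014b",  # ñ -> ŋ
}
_TABLE = str.maketrans(RELAXED_TO_STANDARD)


def normalize(text: str) -> str:
    """Rewrite tolerated letters to the standard ones (lowercase only in this sketch)."""
    # NFC first, so that a base letter plus combining mark matches the precomposed key.
    return unicodedata.normalize("NFC", text).translate(_TABLE)
```

A real relaxation works in the opposite direction (the analyser accepts either spelling on the surface side), but the letter pairs are the same.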

The 3rd degree mark º is never realized, hence declared as º:0.

º:0 = Gradation mark
%/ = Literal /, not the TWOLC reserved symbol
‘:’ = Apostrophe

Literal quotes and angles must be escaped (cf morpheme boundaries further down):

h2, g2 etc. are consonants deleted in the Nom. m3, d3 etc. (?) are consonants that undergo certain processes word-finally. This issue should be looked into. Perhaps the two sets can be unified. The reason why there are more distinctions than for sme, is that the cns deletion process is more phonological in sme.

The Dummy symbols

The Dummy symbols are taken from the sme file for convenience. Only a small part of them are actually used; they are defined in the Sets section along the way, included there as soon as they are used. The set of actually used Dummy symbols is thus the set declared in “Dummy”. The Dummy symbols trigger morphophonological rules. X is used for nouns and adjectives, Y for verbs and Q for processes common to all. The symbols themselves are used in the following way:

OBS: the definitions are not all correct or sufficiently specific

Morpheme boundaries:

The Sets section

These are the sets:

The Definitions section

In this section, the consonants are defined. This includes consonant clusters in the various grades and consonant alternations.

G3 vs G2

The alternation patterns follow Spiik’s alternation series, here named S4, S5, … for “Spiik alternation series 4, 5, etc.”, as they are presented in his grammar.

Class Alternation Series
S7 kkn:k0n series 1
S8 fºf:f0f series 2
S9 jgg:j0g series 3
S4 hkk:h0k series 4
S5 xy:zy (no zeros) series 5
S6 xx:yy (no zeros) series 6
S7 xy:zy (no zeros) series 7
S8 —– (no cg) series 8
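To make the notation in the table concrete: each pattern pairs a lexical string with its two-level realisation, in which 0 stands for a deleted symbol. The following Python sketch derives the written form from a pattern. The helper name `written_form` is illustrative only and is not part of the rule file; the reading of the patterns as lexical:realisation pairs is an assumption based on the twolc zero convention.

```python
def written_form(pattern):
    """Split a 'lexical:realisation' pattern and drop the zero symbols.

    Assumption: the part after ':' is the two-level realisation, and each
    '0' in it marks a deleted symbol, as in twolc.
    """
    lexical, realisation = pattern.split(":")
    return lexical, realisation.replace("0", "")


for pattern in ["kkn:k0n", "jgg:j0g", "hkk:h0k"]:
    lexical, written = written_form(pattern)
    print(f"{lexical} is written as {written}")
```

The placeholder patterns xy:zy and xx:yy contain no zeros, so for them the realisation is already the written form.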

Definition of gradation symbols:

The Rules section


The Rules section has the following chapters: consonant alternations in certain positions, vowel lengthening, diphthong simplification, stem vowel alternations, and consonant gradation rules.

Consonant alternations in certain positions

All rules deal with word-final position.

**Word Final Devoicing of Certain Single Consonants d9 etc. **

**Word Final Devoicing of Certain Single Consonants m9-v ** ! Split up because of err/orths ending on v, gierkav> we want err/orth gierkkam

**Err/Orths. **

Word final weakening -tj and -ttj to -sj part 1

Word final weakening -tj and -ttj to -sj part 2

Word Final Deletion of n8 m8 g8 h8

Word Final Neutralization of g8, h8, m8

Deleting Final h9 in Short Essive of Uneven Syllables

Deleting Final l9 in Short Essive of Uneven Syllables

Deleting Final m9 in Short Essive of Uneven Syllables

Deleting Final n9 in Short Essive of Uneven Syllables

Deleting Final r9 in Short Essive of Uneven Syllables

Vowel lengthening

The second-syllable vowel a is lengthened to á whenever the stem consonants are in grade 1 and the first-syllable vowel is short. Short vowels cannot both precede and follow a single intervocalic consonant.
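The condition can be written out as a small decision function. This is only a Python sketch of the statement above, not the twolc rule itself: the function name and its parameters are hypothetical, and the caller is assumed to know the grade and whether the first-syllable vowel is short.

```python
def second_syllable_vowel(vowel, grade, first_vowel_is_short):
    """Return the surface second-syllable vowel.

    Only 'a' is affected, and only when the stem consonants are in
    grade 1 and the first-syllable vowel is short.
    """
    if vowel == "a" and grade == 1 and first_vowel_is_short:
        return "á"
    return vowel


print(second_syllable_vowel("a", 1, True))
print(second_syllable_vowel("a", 2, True))
```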

Compulsatory lengthening in grade I even-syllables

Diphthong simplification

The diphthong simplification handles oa:å and æ:e. Phonologically, these are identical processes, but since the diphthong is written with two letters in the former case and with one letter in the latter, the alternations must be handled separately. This section also handles ie:æ, which is in principle the same as oa:å, but the alternation occurs in fewer contexts.
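Two-level rules relate lexical and surface symbols one pair at a time, which is why a two-letter diphthong needs more than one symbol pair. The Python sketch below illustrates this. It assumes (the text above does not confirm it) that oa:å is realised as the two pairs o:å and a:0; the helper name `symbol_pairs` is illustrative.

```python
def symbol_pairs(lexical, surface):
    """Zip equal-length lexical and surface symbol lists into 'x:y' pairs."""
    if len(lexical) != len(surface):
        raise ValueError("pad the shorter side with 0 so both sides align")
    return [f"{lex}:{srf}" for lex, srf in zip(lexical, surface)]


# Two lexical letters, so two pairs have to be licensed (assumed split).
print(symbol_pairs(["o", "a"], ["å", "0"]))
# One lexical letter, so a single pair suffices.
print(symbol_pairs(["æ"], ["e"]))
```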

**oa:å Diphtong Simplification Part I **

oa:å Diphtong Simplification Part II

**æ:e Diphthong Simplification **

**ie:æä Diphthong Simplification Part I **

ie:æä Diphthong Simplification Part II The multichar æä is always the only option

Vowel-change oa:å for verbs part I

Vowel-change oa:å for verbs part II

Stem vowel alternations

This section is divided according to stem vowels: a-, e-, o-, å-stems.

a-stem alternations

For a-stems, there is a:e and a:i. Each alternation is triggered by a combination of phonological content and dummy symbols.

a:e in Present Participle of even-syllable verbs

a:i in Prs Prc of even-syllable verbs

a-stem vowel deletion

e-stem alternations

For e-stems, there is e:i, e:á, e:å, e:u and e:a. Each alternation is triggered by a combination of phonological content and dummy symbols.

e:i in e-stems

The following two rules constitute a <= / => rule pair.

e:á in certain stem types 1

e:á in certain stem types 2

e:å in certain stem types with å as root vowel

e-stem vowel deletion

i-stem alternations

For i-stems, there is i:á. The alternation is triggered by a combination of phonological content and dummy symbols.

i:á in Verb Derivation

o-stem alternations

The duplicates of the three lines of the two following rules are there to resolve the => conflict between the two rules.

o:u in certain stem types 1

o:u in certain stem types 2

u:o in contracted nouns

o-stem vowel deletion

å-stem alternations

For å-stems, there is å:e, å:i and vowel deletion. Each alternation is triggered by a combination of phonological content and dummy symbols.

å:e in Present Participle of even-syllable verbs

å:i in Actor nouns of even-syllable verbs

å-stem vowel deletion

Alternations valid for several stem types

Stem vowel deletion in even-syllable verbs, imp 3sg, 3du, 2pl, 3pl

Consonant gradation rules

The consonant gradation rules differ considerably from the corresponding rules for North Sámi. Instead of generalizing over sets of consonants (Cx:Cy <=> …), each rule contains the alternation for one consonant only, and to the right of the <=> arrow are listed all the contexts where the relevant alternation appears. The disadvantage of this method is that the same context must be written several times: if e.g. p, t and k are all deleted in the same contexts, each of these contexts must be written once for each consonant. The advantage is that there are no conflicts during compilation, and compilation takes 10 seconds rather than 3 minutes. The earlier North-Sámi-style rule set was ordered according to CG pattern. This pattern is still visible in the new rules, via the reference S1-3 etc. (Spiik’s Series 1, 3-letter pattern, etc.) behind each subrule.
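The trade-off can be illustrated by expanding one set-based alternation into per-consonant rules. The Python sketch below only generates twolc-like text; the context strings are placeholders and the function name `expand_rule` is hypothetical, so the output is not meant to be compiled as it stands.

```python
def expand_rule(consonants, target, contexts):
    """Return one twolc-like rule text per consonant, repeating every context."""
    rules = []
    for consonant in consonants:
        lines = [f"{consonant}:{target} <=>"]
        lines += [f"    {context} ;" for context in contexts]
        rules.append("\n".join(lines))
    return rules


# Placeholder contexts; the real ones are listed in the rule file.
contexts = ["LEFT1 _ RIGHT1", "LEFT2 _ RIGHT2"]
rules = expand_rule(["p", "t", "k"], "0", contexts)
print("\n\n".join(rules))
```

Three consonants and two contexts give three rules with six context lines in total, which is the repetition described above.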

This actually opens the way for a migration to an xfst rule file instead of the current twolc format, since what xfst really cannot do is generalize over sets (Cx:Cy etc.). This is an issue for future revisions to decide.

The rules are divided in two subsections, deletion rules and change (alternation) rules.

Deletion rules

The b, d, g deletion rules are similar: via the optional ( b ) etc. in front of the “_” symbol, both bm:m and bbm:bm alternations are covered. The contexts differ to a certain extent. For b and d, the III-I special gradation bbm:m is covered by two separate rules and a special Dummy (X6), which is not part of the ordinary WeG set.

Note that one of the rules for t:0 refers to #: as part of its context. As soon as clitics are added to the word form, this rule will thus not be triggered. Look into this when the clitics are added.

Consonant gradation b:0 deletes b in S7 and S9 contexts

Consonant gradation d:0 … etc.

Consonant gradation g:0

Consonant gradation k:0

Consonant gradation l:0

Consonant gradation m:0

Consonant gradation n:0

Consonant gradation p:0

Consonant gradation s:0

Consonant gradation ŋ:0

Consonant gradation f:0

Consonant gradation r:0

Consonant gradation v:0

Consonant gradation j:0

Consonant gradation t:0

Gradation Series 4, II-I, tj and ts

Change rules

The Cx:Cy format was kept for hk:g, hp:b, ht:d, since the left context h:0 was unique, and no compilation conflict thus arose.

The bb:pp, gg:kk, dd:tt alternations were split into three rules, since keeping them in one Cx:Cy rule created compilation conflicts. Also, d:t contains a rule not found for the other two…

Gradation Series 4, II-I



g:k change for clitic -ge

dd:tt and dtj, dts

Gradation Series 7, III-II, ks(t), kt, ktj, kts

Exceptional II-III inverse gradation in present participles

This gradation is only for II-I syllable verbs that get III as present participles.


Strategy: Do insertion rule for the initial element.

Consonant insertion as II-III strengthening gradation with bm, gŋ

Consonant insertion as II-III strengthening gradation with dn/j + as I-III strengthening gradation with d

Consonant insertion as II-III strengthening gradation with hk, hp,

Consonant insertion as II-III strengthening gradation with htt(j/s)

Debugging of twol-rules

All rule conflicts have been successfully resolved. The rule file should be kept that way. Look out for conflicts in the compilation process, and resolve them as they appear!

This (part of) documentation was generated from src/fst/morphology/phonology.twolc

Lule Sámi morphological analyser

Definitions for Multichar_Symbols

Tags for POS

Tags for sub-POS

Pronoun subtypes

Error tags

All Err-tags must have a normative form as lemma except Err/Lex

Usage restriction tags

Dialect and Area tags

Compounding tags

The tags are of the following form:

Normative/prescriptive compounding tags

These govern compound behaviour for normative tools like the speller, ie. what a compound SHOULD BE.

The first part of the component may be ..

This part of the component can ..

The second part of the compound may require that the previous (left part) is (and thus overrides the regular CmpN tags):

But these tags can again be overridden by the first word in a compound, if this part of the compound is tagged with a def tag:

Descriptive compounding tags

Tags for compound analysis - this is what a compound actually is. Some of these tags are also used in combination with the above normative tags to actually enforce compound restrictions in the fst.

Inflectional Tags

Tags for Case and Number Inflection

Possessive tags

Adjective specific tags

Verbal inflection

Other tags

Lexeme disambiguation = homonym tags

Stem variant tags

Question and Focus particles:

Other tags

Semantic tags to help disambiguation & syntactic analysis

These tags should always be located just before the POS tag.

Multiple Semantic tags:

Not sure which section this goes in: (before POS)

Derivation tags

The following tags are used to describe the dynamic derivational system in Lule Sámi as encoded in this lexical description. The tags are classified according to a positional system, where each tag can be in one and only one position, and can only combine with tags from an earlier / lower position. This is done to avoid possible overgeneration in the derivational system.
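The positional constraint amounts to requiring that the positions of the derivation tags in an analysis increase from left to right. The following Python sketch shows one way to check that. The tag names and the position table are hypothetical placeholders, not the actual tag inventory.

```python
# Hypothetical position table: each tag belongs to exactly one position.
POSITION = {"+Der/A": 1, "+Der/B": 2, "+Der/C": 3}


def valid_derivation_chain(tags):
    """True if every tag sits in a strictly higher position than the one before it."""
    positions = [POSITION[tag] for tag in tags]
    return all(a < b for a, b in zip(positions, positions[1:]))


print(valid_derivation_chain(["+Der/A", "+Der/C"]))
print(valid_derivation_chain(["+Der/C", "+Der/A"]))
```

A chain that repeats a position or steps back to a lower one is rejected, which is how the system limits overgeneration.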

Der#2 tags - tags in second position

There are no such tags in SMJ, but for symmetry and code coherence with SME the class is still kept.

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics, actually), we tag all words that can’t be correctly transcribed using the SME transcriber with source language codes. Once tagged, it is possible to split the lexical transducer into smaller ones according to language, and apply a different IPA conversion to each of them. The principle of tagging is that we only tag to the extent needed, following this priority:

  1. any untagged word is pronounced with SME orthographic conventions
  2. NNO and NOB have identical pronunciation, NNO is only used if different in spelling from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SME (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions.

All in all, we want to get as much correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct - at least some words will get the same pronunciation whether read as SME or NOB/NNO/SWE.
    • +OLang/SME - North Sámi
    • +OLang/SMS - Skolt Sámi
    • +OLang/SMA - South Sámi
    • +OLang/FIN - Finnish
    • +OLang/SWE - Swedish
    • +OLang/NOB - Norw. bokmål
    • +OLang/NNO - Norw. nynorsk
    • +OLang/ENG - English
    • +OLang/RUS - Russian
    • +OLang/UND - Undefined
    • +OLang/PARA - parallel names; the name should not be carried over to other Sámi languages
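Rule 1 of the priority list (untagged words default to SME) can be sketched as a lookup on the analysis string. The Python example below is illustrative only: the function name and the example analysis strings are made up, and it assumes that at most one +OLang tag occurs in an analysis.

```python
import re

OLANG = re.compile(r"\+OLang/([A-Z]+)")


def source_language(analysis):
    """Return the originating-language code, or 'SME' when no tag is present."""
    match = OLANG.search(analysis)
    return match.group(1) if match else "SME"


# Hypothetical analysis strings, for illustration only.
print(source_language("Oslo+N+Prop+OLang/NOB"))
print(source_language("giella+N"))
```

The returned code could then be used to pick which IPA converter a given part of the lexical transducer is sent to.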

Flag diacritics

Tags from SME, coming to smj by propernouns.

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

Flag diacritic Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
Flag diacritic Explanation
@P.Pmatch.Loc@ Used on multi-token analyses; tells hfst-tokenise/pmatch where in the form/analysis the token should be split. Used e.g. in bijladagi to split bijla from dagi, or after abbreviations with full stops before the full stop, to allow an alternate +CLB analysis of it in case of a sentence-final abbreviation. NB! This will give a faulty lemma for the abbreviation, as it will not include the full stop. This can lead to other issues, but presently we have no other solution if we want to keep the full stop as a separate token. We could leave a full stop at the end of the abbreviation lemma as well (but not on the input side - we only have one full stop in the input). That must be tested; it could work, but then requires special attention when generating suggestions in e.g. grammar checkers - it should not generate two full stops.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)
Flag diacritic Explanation
@D.ErrOrth.ON@ To be written
@R.ErrOrth.ON@ To be written
@C.ErrOrth@ To be written
@P.ErrOrth.ON@ To be written
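For readers unfamiliar with flag diacritics, the Python sketch below checks whether a path through the lexicon is allowed, using the conventional meaning of the operators: P sets a feature, D disallows a value, R requires a value, C clears a feature, and U unifies. It is a simplified model and not the HFST implementation; the function name `path_allowed` and the placeholder morphs (VERB, NOMINALISER) are assumptions made for illustration.

```python
import re

FLAG = re.compile(r"@([PDRCU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_allowed(path):
    """Evaluate the flag diacritics in a path from left to right."""
    state = {}
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":                      # positive (re)setting
            state[feature] = value
        elif op == "C":                    # clear the feature
            state.pop(feature, None)
        elif op == "D":                    # disallow the value (or any value)
            if (current == value) if value else (current is not None):
                return False
        elif op == "R":                    # require the value (or any value)
            if (current != value) if value else (current is None):
                return False
        elif op == "U":                    # unify: set if unset, else must match
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# A verb that is never nominalised is blocked by @D.NeedNoun.ON@ at the end.
print(path_allowed("@P.NeedNoun.ON@VERB@D.NeedNoun.ON@"))
# Clearing the flag (as a nominalising step would) lets the path through.
print(path_allowed("@P.NeedNoun.ON@VERB@C.NeedNoun@NOMINALISER@D.NeedNoun.ON@"))
```

This mirrors the NeedNoun rows in the table: the P flag is set when a verb enters a compound, the C flag removes it, and the D flag blocks any path on which it is still set.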

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag diacritic Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@U.CmpNone.TRUE@ Combines with the two previous ones to block compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.
@U.CmpHyph.FALSE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@C.CmpHyph@ Flag to control hyphenated compounds like proper nouns

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag diacritic Explanation
@U.Cap.Obl@ Disallow downcasing of names when not derived: Deatnu
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
@P.Px.add@ Giving possibility for Px-suffixes (all except from Nom 3.p)
@R.Px.add@ Requiring P.Px.add-flag for Px-suffixes (all except from Nom 3.p)
@P.Nom3Px.add@ Giving possibility for Px-suffixes Nom 3.p
@R.Nom3Px.add@ Requiring P.Nom3Px.add flag for Px-suffixes Nom 3.p
Flag diacritic Explanation
@U.number.two@ Flag used to give arabic numerals in smj different cases
@U.number.three@ Flag used to give arabic numerals in smj different cases
@U.number.four@ Flag used to give arabic numerals in smj different cases
@U.number.five@ Flag used to give arabic numerals in smj different cases
@U.number.six@ Flag used to give arabic numerals in smj different cases
@U.number.eight@ Flag used to give arabic numerals in smj different cases
@U.number.nine@ Flag used to give arabic numerals in smj different cases
@P.number.two@ Flag used to give arabic numerals in smj different cases
@P.number.three@ Flag used to give arabic numerals in smj different cases
@P.number.four@ Flag used to give arabic numerals in smj different cases
@P.number.five@ Flag used to give arabic numerals in smj different cases
@P.number.six@ Flag used to give arabic numerals in smj different cases
@P.number.eight@ Flag used to give arabic numerals in smj different cases
@P.number.nine@ Flag used to give arabic numerals in smj different cases
@P.number.ten@ Flag used to give arabic numerals in smj different cases

Lexicon Root

The beginning of everything. Every FST defined in LexC must start with the reserved lexicon name Root.

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.




This (part of) documentation was generated from src/fst/morphology/root.lexc

vájnno vájnno vájnno

This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc

sme mojonjálmmiid

This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc



This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc

Reciprocal pronouns as multiword expression

This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc

File containing North Saami abbreviations

Lexica for adding tags and periods

Splitting in 4 + 1 groups, because of the preprocessor

The abbreviation lexicon itself

This class contains homonyms, which are both intransitive abbreviations and normal words. The abbreviation usage is less common, and thus only the occurrences in the middle of the sentence (when the next word is in lower case) can be considered true cases.

For abbrs for which numerals are complements, but other words not necessarily are. This group treats arabic numerals as if it were transitive but letters as if it were intransitive.

This lexicon is for abbrs that always have a constituent following it.

This class contains homonyms, which are both abbrs for which numerals are complements and normal words. The abbreviation usage is less common, and thus only the occurrences in the middle of the sentence can be considered true cases.

This (part of) documentation was generated from src/fst/morphology/stems/smj-abbreviations.lexc

Phonological ACRO converter for Julev Sámi

Converts ACROS to IPA. Intended for use with TTS. > marks the underlying morpheme boundary between lemma and inflectional suffix; : is the same, but in the surface orthography. The idea is that the pronunciation of the last letter sound (like e: when reading the letter P) can be different when followed by a case ending compared to when not. If that is not true, the system can be simplified.

Default, letter by letter pronunciation

This (part of) documentation was generated from src/fst/phonetics/acro2ipa.xfscript

retroflex plosive, voiceless: t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced: d` ɖ 0256, 598
labiodental nasal: F ɱ 0271, 625
retroflex nasal: n` ɳ 0273, 627
palatal nasal: J ɲ 0272, 626
velar nasal: N ŋ 014B, 331
uvular nasal: N\ ɴ 0274, 628

bilabial trill: B\ ʙ 0299, 665
uvular trill: R\ ʀ 0280, 640
alveolar tap: 4 ɾ 027E, 638
retroflex flap: r` ɽ 027D, 637
bilabial fricative, voiceless: p\ ɸ 0278, 632
bilabial fricative, voiced: B β 03B2, 946
dental fricative, voiceless: T θ 03B8, 952
dental fricative, voiced: D ð 00F0, 240
postalveolar fricative, voiceless: S ʃ 0283, 643
postalveolar fricative, voiced: Z ʒ 0292, 658
retroflex fricative, voiceless: s` ʂ 0282, 642
retroflex fricative, voiced: z` ʐ 0290, 656
palatal fricative, voiceless: C ç 00E7, 231
palatal fricative, voiced: j\ ʝ 029D, 669
velar fricative, voiced: G ɣ 0263, 611
uvular fricative, voiceless: X χ 03C7, 967
uvular fricative, voiced: R ʁ 0281, 641
pharyngeal fricative, voiceless: X\ ħ 0127, 295
pharyngeal fricative, voiced: ?\ ʕ 0295, 661
glottal fricative, voiced: h\ ɦ 0266, 614

alveolar lateral fricative, vl.: K
alveolar lateral fricative, vd.: K\

labiodental approximant: P (or v\)
alveolar approximant: r\
retroflex approximant: r\`
velar approximant: M\

retroflex lateral approximant: l`
palatal lateral approximant: L
velar lateral approximant: L\

Clicks

bilabial: O\ (O = capital letter)
dental: |\
(post)alveolar: !\
palatoalveolar: =\
alveolar lateral: |\|\

Ejectives, implosives

ejective: _> (e.g. ejective p: p_>)
implosive: _< (e.g. implosive b: b_<)

Vowels

close back unrounded: M
close central unrounded: 1
close central rounded: }
lax i: I
lax y: Y
lax u: U

close-mid front rounded: 2
close-mid central unrounded: @\
close-mid central rounded: 8
close-mid back unrounded: 7

schwa: @

open-mid front unrounded: E
open-mid front rounded: 9
open-mid central unrounded: 3
open-mid central rounded: 3\
open-mid back unrounded: V
open-mid back rounded: O

ash (ae digraph): {
open schwa (turned a): 6

open front rounded: &
open back unrounded: A
open back rounded: Q

Other symbols

voiceless labial-velar fricative: W
voiced labial-palatal approx.: H
voiceless epiglottal fricative: H\
voiced epiglottal fricative: <\
epiglottal plosive: >\

alveolo-palatal fricative, vl.: s\
alveolo-palatal fricative, voiced: z\
alveolar lateral flap: l\
simultaneous S and x: x\
tie bar: _

Suprasegmentals

primary stress: "
secondary stress: %
long: :
half-long: :\
extra-short: _X
linking mark: -\

Tones and word accents

level extra high: _T
level high: _H
level mid: _M
level low: _L
level extra low: _B
downstep: !
upstep: ^ (caret, circumflex)

contour, rising: _R
contour, falling: _F
contour, high rising: _H_T
contour, low rising: _B_L
contour, rising-falling: _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise: <R>
global fall: <F>

Diacritics

voiceless: _0 (0 = figure), e.g. n_0
voiced: _v
aspirated: _h
more rounded: _O (O = letter)
less rounded: _c
advanced: _+
retracted: _-
centralized: _"
syllabic: = (or _=), e.g. n= (or n_=)
non-syllabic: _^
rhoticity: `

breathy voiced: _t
creaky voiced: _k
linguolabial: _N
labialized: _w
palatalized: ' (or _j), e.g. t' (or t_j)
velarized: _G
pharyngealized: _?\

dental: _d
apical: _a
laminal: _m
nasalized: ~ (or _~), e.g. A~ (or A_~)
nasal release: _n
lateral release: _l
no audible release: _}

velarized or pharyngealized: _e
velarized l, alternatively: 5
raised: _r
lowered: _o
advanced tongue root: _A
retracted tongue root: _q

This (part of) documentation was generated from src/fst/phonetics/smj2sampa-from-old-infra.xfscript

Phonological converter for Julev Sámi

Converts to IPA. Mainly intended for use with TTS.

This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript

At some points we will need the genitives, for approximate numbers. Here they are.

avta guovte gålmå nielje vidá gudá gietja gávtse avtse låge lågenanavta lågenanguovte

This (part of) documentation was generated from src/fst/transcriptions/clock-from-old-infra.lexc

We describe here how abbreviations in Lule Sami are read out, e.g. for text-to-speech systems.

This class contains homonyms, which are both intransitive abbreviations and normal words. The abbreviation usage is less common, and thus only the occurrences in the middle of the sentence (when the next word is in lower case) can be considered true cases.

For abbrs for which numerals are complements, but other words not necessarily are. This group treats arabic numerals as if it were transitive but letters as if it were intransitive.

This lexicon is for abbrs that always have a constituent following it

This class contains homonyms, which are both abbrs for which numerals are complements and normal words. The abbreviation usage is less common, and thus only the occurrences in the middle of the sentence can be considered true cases.

This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc

We describe here how abbreviations in Lule Sami are read out, e.g. for text-to-speech systems.

This (part of) documentation was generated from src/fst/transcriptions/transcriptor-acro2text.lexc

This is still a dummy file.

This (part of) documentation was generated from src/fst/transcriptions/transcriptor-date-digit2text.lexc

This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc

We describe here how abbreviations in Lule Sami are read out, e.g. for text-to-speech systems.

Miscellaneous symbols



This (part of) documentation was generated from src/fst/transcriptions/transcriptor-symbols2text.lexc





This section lists all the tags inherited from the fst, and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here, those set names are not visible in the output.

Beginning and end of sentence


Parts of speech tags

N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB




Tags for POS sub-categories

Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

Tags for morphosyntactic properties

Nom Abe Acc Gen Ine Ela Ill Loc Com Ess Ess Sg Du Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

Comp, both for adverbs and adjectives Superl, both for adverbs and adjectives Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3

Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess


Semantic tags



Syntactic tags


-OTHERS SYN-V @X

## Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the sourcefile itself to inspect the sets, what follows here is an overview of the set types.

### Sets for Single-word sets

INITIAL

### Sets for word or not

WORD NOT-COMMA

### Case sets

ADLVCASE CASE-AGREEMENT CASE NOT-NOM NOT-GEN NOT-ACC

### Verb sets

NOT-V

### Sets for finiteness and mood

REAL-NEG NOT-PRFPRC

### Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

### Pronoun sets

### Adjectival sets and their complements

### Adverbial sets and their complements

### Relations

### Sets of elements with common syntactic behaviour

### NP sets defined according to their morphosyntactic features

### The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression **WORD - premodifiers**.

### Border sets and their complements

Error tags

This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3

### Semantic tags

* Rules for removing some Props which are identical to common nouns
* **IfonlyVerb** selects the FMAINV reading in the cohort

# Removing Err/Orth

This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3

The CLBfinal reading is only possible if directly followed by a full stop. Needs a rewrite, as the CLB reading is still within the same cohort, not the next, if present, since we haven't done the mwe-rewrite yet.

This (part of) documentation was generated from tools/tokenisers/mwe-dis.cg3

# Tokeniser for smj

Usage:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200d
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`

Unknowns are made of:

* lower-case ASCII
* upper-case ASCII
* select extended latin symbols
* ASCII digits
* select symbols
* Combining diacritics as individual symbols,
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

## Unknown handling

Unknowns are tagged ?? and treated specially with `hfst-tokenise`

hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript

# Grammar checker tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200d
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript

# TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200d
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript
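The Unicode whitespace characters named in the tokeniser descriptions above can be enumerated with a few lines of Python. This is a sketch for inspection only: it assumes that "U+2000 to U+200d" denotes an inclusive range, and the names `WHITESPACE_RANGES` and `listed_whitespace` are illustrative and do not occur in the pmscript files.

```python
import unicodedata

# Inclusive code point ranges, as listed in the text (range reading assumed).
WHITESPACE_RANGES = [
    (0x2000, 0x200D),  # En Quad to Zero-Width Joiner
    (0x202F, 0x202F),  # Narrow No-Break Space
    (0x205F, 0x205F),  # Medium Mathematical Space
    (0x2060, 0x2060),  # Word Joiner
]


def listed_whitespace():
    """Return the listed characters in code point order."""
    return [chr(cp) for lo, hi in WHITESPACE_RANGES for cp in range(lo, hi + 1)]


for ch in listed_whitespace():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```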