South Sámi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sma

South Sámi language model documentation

All doc-comment documentation in one large file.


src-cg3-disambiguator.cg3.md

S O U T H   S Á M I   D I S A M B I G U A T O R

Delimiters, tags and sets

"<.>" "<!>" "<?>" "<...>" "<¶>" sent

Tags

BOS/EOS:

Morphological tags

Derivation tags

Error usage tags

We define two lists for Err/xxx tags:

Other tags

Semantic tags

Secondary tags

Syntactic tags

Titles

REAL-TITLE OFFICE TITLE

Sets

Sets of morphological tags for syntactic use

CASES ADVLCASE NUMBER

Noun sets

INSTITUTION ORGANIZATION EDUCATION CURRENCY CURRENCY LESSON

Verb sets

REALCOPULAS

COPULAS

V-NOT-COP

MOD-ASP

Adjective sets

Adverb sets

GUKTIEGOSSE

DAESTIE

ILLADV

INEADV1

ELAADV1

INEADV

ELAADV

DV-MOD-ADV

Postposition sets

ILLPO

BOUNDARY SETS

REALCLB

SV-BOUNDARY

NP-BOUNDARY

Derivation sets

V-DER

V-DER-SUF

N-DER N-DER-SUF

A-DER A-DER-SUF

PASS

LEX-V LEX-N LEX-A LEX-ADV

VERB-FORMS 2-PERS

Disambiguation rules

BEFORE-SECTIONS

Rule for adding Sem/Date as a tag to readings which look like dates (to be removed when we get a common numeral file from shared)

Guessing: Rule for adding Adv Sem/Adr as a tag to readings which look like addresses

Rules for adding to verbs denoting verbal actions like: ... jeahta Aili Kestkitalo.

SECTION

Cycle 0 (Early rules)

Removing non-lexicalised forms when lexicalised

Numerals and ACR

Numerals in QPs

CC or not (specific rules further down)

Interj

Possessive suffix

Remove Px if not family

Pronouns

Proper nouns

INITIAL

Verbs

Postpositions

Selecting postpositions when preceded by genitives, etc.

Particles and adverbs

Adjective or Indef

Demonstratives

Genitive

Adjective or not

Rel or Interr OR Indef

Adverbs

Selecting adverbs in local contexts

Verbs

Selecting verbs in local contexts, based upon agreement patterns

Selecting imperative sentence-initially with appropriate right context

Remove verb readings

Select Inf

Mapping rules

CC- and CS-Mapping

CNP mapping

Mapping CNP to CC and CS.

CVP Mapping

Mapping @CVP to all CS

Attributes or not

PrfPrc

Select PrfPrc if DerNomAct

Mapping verbs

killifVinCohort

This rule removes all other readings if there is a mapped V reading in the same cohort. Every case in which this goes wrong should be fixed in the mapping rules or in earlier disambiguation rules.
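The effect of this rule can be sketched in Python. This is only a model of the behaviour described here, not the CG3 rule itself. It assumes that a cohort is a list of readings, that each reading is a tuple of tags, and that a mapped reading is one carrying a tag starting with `@`. The function name, the placeholder lemma `word` and the example tag `@+FMAINV` are illustrative assumptions.

```python
def kill_if_mapped_v_in_cohort(cohort):
    """Keep only mapped verb readings when the cohort contains at least one.

    A reading is a tuple of tags; a "mapped" reading is assumed to be one
    that carries a syntactic tag, i.e. a tag starting with "@".
    """
    def is_mapped_verb(reading):
        return "V" in reading and any(tag.startswith("@") for tag in reading)

    mapped = [reading for reading in cohort if is_mapped_verb(reading)]
    # Without a mapped V reading the cohort is left untouched.
    return mapped if mapped else list(cohort)


# Hypothetical cohort: one mapped verb reading and one competing noun reading.
cohort = [
    ("word", "V", "Ind", "Prs", "Sg3", "@+FMAINV"),
    ("word", "N", "Sg", "Nom"),
]
print(kill_if_mapped_v_in_cohort(cohort))
```

Because every competing reading is dropped, a wrongly mapped verb reading cannot be recovered afterwards, which is why such errors have to be fixed in the earlier rules.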

Person

leah Prs Sg2 = Pl3

Select Inf If Infv

Span sentences

Nomen

Remove Prop Attr if not 1 Prop

Verb or Noun

CC and CS or Adv

Adj or Adv

Grammatical word or N or A

N or V

Ger or Der/NomAct

Adj or Indef

Num

Adv or Po/Pr

Illative or genitive

Essive

Comitative

Accusative or illative

Indef or Adv

special lemmas

Adverb context prefers Adv

Verb person vs. Inf – moved here in order to have the pronouns disambiguated first.

Proper nouns

Rule set taken from sme

gellie as numeral, not pronoun


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-valency.cg3.md

S O U T H   S Á M I   V A L E N C Y   A N N O T A T O R

Delimiters

"<.>" "<!>" "<?>" "<...>" "<¶>" sent

Tags and sets

BOS/EOS:

Morphological tags

Number and person tags:

Derivation tags

Error usage tags

We define two lists for Err/xxx tags:

Other tags

Semantic tags

Syntactic tags

Titles

REAL-TITLE OFFICE TITLE

Sets of morphological tags for syntactic use

CASES ADVLCASE NUMBER

Noun sets

INSTITUTION ORGANIZATION EDUCATION CURRENCY CURRENCY LESSON

Verb sets

REALCOPULAS

COPULAS

V-NOT-COP

MOD-ASP

Adjective sets

Adverb sets

GUKTIEGOSSE

DAESTIE

ILLADV

INEADV1

ELAADV1

INEADV

ELAADV

DV-MOD-ADV

Postposition sets

ILLPO

BOUNDARY SETS

REALCLB

SV-BOUNDARY

NP-BOUNDARY

Derivation sets

V-DER

V-DER-SUF

N-DER N-DER-SUF

A-DER A-DER-SUF

PASS

LEX-V LEX-N LEX-A

VERB-FORMS 2-PERS

Valency rules


This (part of) documentation was generated from src/cg3/valency.cg3


src-fst-morphology-affixes-abbreviations.lexc.md

Continuation lexicons for abbreviations

Lexica for adding tags and periods

The sublexica

Continuation lexicons for abbreviations both with and without final period

Lexicons without final period

Lexicons with final period


This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


src-fst-morphology-affixes-adjectives.lexc.md

Adjective affixes

This is one of two parallel files containing adjective affixes. The files represent two alternative interpretations of the same data (South Saami adjectives). This file is used for spellchecking; the alternative file adjectives-oahpa.lexc is used for dictionary and ICALL applications. This file is compiled by default; the other one is compiled by running ./configure --with-oahpa in langs/sma before compiling.

LEXICON PRED_S

The PRED_S lexicon is used for adjective predicatives.

 +Sg+Nom:%>s FINAL1 ; 

LEXICON PRED_0

The PRED_0 lexicon is used for adjective predicatives.

 +Sg+Nom: FINAL1 ; 

LEXICON PRED_H

The PRED_H lexicon is used for adjective predicatives.

 +Sg+Nom:%>h FINAL1 ; 

LEXICON PRED_NE_ODD

The PRED_NE_ODD lexicon is used for adjective predicatives.

 +Sg+Nom:%>ne FINAL1     ; 
         :n     ODDCASEOBL ; 
         :n     ODDCOMP    ; 

LEXICON PRED_N

The PRED_N lexicon is used for adjective predicatives.

 +Sg+Nom:%>n FINAL1 ; 

LEXICON e_E_EVEN

The e_E_EVEN lexicon is used for adjectives in –e and –e, in attributes and predicatives. With EVEN-COMP.

    :e ATTR_0      ; 
    :e PRED_0      ; 
 +Sg:  NIEJTESGOBL ; 
 +Pl:  NIEJTE_PL   ; 
       NIEJTEREST  ; 
    :e EVENCOMP    ; 

LEXICON e_E_EVENNOCOMP1

The e_E_EVENNOCOMP1 lexicon is used for adjectives in –e with an –e stem, in attributes and predicatives. With EVEN-NOCOMP.

     :e ATTR_0      ; 
     :e PRED_0      ; 
  +Sg:  NIEJTESGOBL ; 
  +Pl:  NIEJTE_PL   ; 
        NIEJTEREST  ; 

LEXICON a_A_EVEN1

The a_A_EVEN1 lexicon is used for adjectives in –a and –a, in attributes and predicatives. With EVEN-COMP.

    :a ATTR_0     ; 
 +Sg:  MAANASGNOM ; 
       MAANAOBL   ; 
    :a EVENCOMP   ; 

LEXICON as_AS_EVEN1

The as_AS_EVEN1 lexicon is used for adjectives in –as and –as, in attributes and predicatives. With EVEN-COMP.

            :a  ATTR_S   ; 
     +Sg+Nom:as FINAL1   ; 
  +Cmp/SgNom:as R        ; 

LEXICON ie_IE_EVEN1

The ie_IE_EVEN1 lexicon is used for adjectives in –ie and –ie, in attributes and predicatives. With EVEN-COMP.

 :ie ATTR_0     ; 
     N_IE_FORMS ; 
 :ie EVENCOMP   ; 

LEXICON ie_IE_EVENNOCOMP

The ie_IE_EVENNOCOMP lexicon is used for adjectives in –ie and –ie, in attributes and predicatives. With EVEN-NOCOMP.

 :ie ATTR_0     ; 
     N_IE_FORMS ; 

LEXICON a_A_EVEN1_NOCOMP

The a_A_EVEN1_NOCOMP lexicon is used for adjectives in –a and –a, in attributes and predicatives. With EVEN-NOCOMP.

    :a ATTR_0     ; 
 +Sg:  MAANASGNOM ; 
       MAANAOBL   ; 

LEXICON es_ES_EVEN

The es_ES_EVEN lexicon is used for adjectives in –es and –es, in attributes and predicatives. With EVEN-COMP.

           :e      ATTR_S       ; 
           :e      PRED_S       ; 
           :e      EVENCOMP     ; 

LEXICON es_ES_EVENNOCOMP1

The es_ES_EVENNOCOMP1 lexicon is used for adjectives in –es and –es, in attributes and predicatives. With EVEN-NOCOMP.

 :e  ATTR_S     ; 
 :e  PRED_S     ; 
 :es ODDCASEOBL ; 

LEXICON ies_IES_EVEN1

The ies_IES_EVEN1 lexicon is used for adjectives in –ies and –ies, in attributes and predicatives. With EVEN-COMP.

     ies_IES_EVENNOCOMP1 ; 
 :ie EVENCOMP            ; 

LEXICON ies_IES_EVENNOCOMP1

The ies_IES_EVENNOCOMP1 lexicon is used for adjectives in –ies and –ies, in attributes and predicatives. With EVEN-NOCOMP.

  :ie ATTR_S    ; 
  :ie PRED_S    ; 

LEXICON eh_EH_ODDNOCOMP1

LEXICON BAERIES (BÅERIES)

UNEVEN adjective, attr = pred. Comparison: uneven syllable. Presently only used for the adjective båeries.

  :båerie ATTR_S   ; 
  :båerie PRED_S   ; 
  :båaras ODDCOMP  ; 

LEXICON ÅEHPIES

ODD adjective, attr = pred. Comparison: uneven syllable.

LEXICON GIERIES

Umlaut from attr to pred. Comparison: uneven syllable. Presently only used for the “gieries-gearehke” adjective. This lexicon covers the ies - ehke + umlaut change.

        :gierie   ATTR_S  ; 
        :gearahk  ODDCASE ; 
        :gearahk  ODDCOMP ; 
 +Use/NG:gearahtj ODDCOMP ; 
 +Use/NG:gearahg  ODDCOMP ; 

LEXICON BUERIE_UMLAUT_IE_STAMME

EVEN adjective with EVEN-UMLAUT comparison for -ie stems.

                          :buer    ie_IE_EVENNOCOMP ; 
                          :buerie  EVENCOMPONLY     ; 
                          :bööre   MES              ; 
         +Der1+Der/Dimin+A:buaratj diminODDCOMP          ; 
         +Der1+Der/Dimin+A:bööretj diminODDCOMP          ; 

Check this one!

LEXICON ihks_IHKS_igs_IGS_EVENNOCOMP

Adjective with no comp.

        +Use/NG:ihk%>s ATTRCONT   ; 
               :ig     ATTR_S     ; 
      +Err/Orth:igks   ATTR_H     ; , cf onterligksh
 +Sg+Nom+Use/NG:ihk%>s FINAL1     ; 
 +Sg+Nom+Use/NG:ig%>s  FINAL1     ; 
               :ihk    X_NIEJTE   ; 
        +Use/NG:igk    X_NIEJTE   ; 
        +Use/NG:igke   PRED_0     ; 
               :ihke   PRED_0     ; 
        +Use/NG:ig     N_IE_FORMS ; 

LEXICON e_ES_EVENNOCOMP2

This is for the adjective “jaame”

 :e ATTR_0   ; 
 :e PRED_S   ; 
    eCASEOBL ; 

LEXICON ODDEVEN2

This one gives EVEN and ODD comparison.

           :es ODDCASEOBL ; 
           :e  EVENCOMP   ; 
 +Cmp/SgNom:es R          ; 
    +Use/NG:es ODDCOMP    ; 

LEXICON es_E_EVEN3

This one gives EVEN comparison, with -s in the attribute and a vowel in the predicative, which gives EVEN-COMP.

        :e ATTR_S      ; 
        :e EVENCOMP    ; 

LEXICON as_oes_A_OE_EVEN3

This one gives EVEN comparison, with -s in the attribute and a vowel in the predicative, which gives EVEN-COMP.

      +Use/NG:a  ATTR_S      ; 
             :oe ATTR_S      ; 
             :oe EVENCOMP_oe ; 
      +Use/NG:a  EVENCOMP    ; 

LEXICON oeh_ah_OE_A_EVEN3

This one gives EVEN comparison, with -s in the attribute and a vowel in the predicative, which gives EVEN-COMP.

        :oe ATTR_H      ; 
 +Use/NG:a  ATTR_H      ; 
            N_OE        ; 
 +Use/NG:   MAANA       ; 
        :oe EVENCOMP_oe ; 
 +Use/NG:e  EVENCOMP    ; 

LEXICON ies_IE_EVEN3

This one gives EVEN comparison, with -s in the attribute and a vowel in the predicative, which gives EVEN-COMP.

   :ie ATTR_S     ; 
    N_IE_FORMS ; 
   :ie EVENCOMP   ; 

LEXICON ies_IE_EVEN3NOCOMP

This one gives EVEN comparison, with -s in the attribute and a vowel in the predicative.

   :ie ATTR_S     ; 
    N_IE_FORMS ; 

UMLAUT LEXICON asATTR_anADVERB

These 6 adjectives belong to the 4th group of South Sámi adjectives, the group that contains all umlaut adjectives. These adjectives, which have -as as the attributive form and -an as the predicative form, are southern South Sámi adjectives, and they have no comparison. This group, which covers the ies -> an, as -> an and oes -> an + umlaut changes, is a small subgroup of the 4th group.

  +A:a  ATTR_S   ; 

UMLAUT LEXICON oesATTR

These 5 adjectives belong to the 4th group of South Sámi adjectives, the group that contains all umlaut adjectives. These adjectives, which have -oes as the attributive form and -an as the predicative form, are northern South Sámi adjectives, and they have no comparison. This group, which covers the ies -> an, as -> an and oes -> an + umlaut changes, is a small subgroup of the 4th group.

  +A:oe     ATTR_S   ; 
  +A:       N_OE_OBL ; 
  +A:oe  ATTR_H     ; 
         +A:oe        ATTR_H    ; 

LEXICON MAST

The MAST lexicon is used for adjectives in –masten and –masth, used with the stem masten.

                ATTR_S      ; 
      +Use/NG:e ATTR_S      ; 
      +Use/NG:  ATTR_H      ; 
      +Use/NG:e ATTR_N      ; 
             :e PRED_N      ; 

IJVE_LOAN_ADJ LEXICON IJVEadj

EVEN adjective, EVEN comparison. Used for all loan adjectives in “ijve”.

          :ijv e_E_EVEN ; 
   +Use/NG:ïjv e_E_EVEN ; 
 +Err/Orth:iv  e_E_EVEN ; 

LEXICON JELLE

The JELLE lexicon is used for loan adjectives in –jelle, used with the stem jelle. Should this one be ‘jeelle’? SGM?

 +Err/Orth:^ell e_ES_LOAN ; 
         :jell e_ES_LOAN ; 

LEXICON UELLE

         :^ell e_ES_LOAN ; 
 +Err/Orth:vell e_ES_LOAN ; 

:ijl e_E_EVEN ;


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-nouns.lexc.md

Nominal inflection sublexica

Inflection for odd-syllable nouns

The default inflectional lexicon for odd-syllable nouns is N_ODD. Words like gierehtse are inflected using this lexicon. Other words inflected like this are: iehkede (evening), guehpere (nail), tjaeleme (writing). Many words in this class have vowel changes in the second syllable, between a reduced vowel in odd-syllable forms and a full vowel or diphthong in even-syllable forms, as displayed in the paradigm below. This alternation is regulated by two-level rules, but the rules require that the full vowel is spelled out in the lexical entry as follows:

gierehtse+N+Sem/Veh:gieriehts N_ODD "pulk" ; ! gieriehtsisnie

That is, in the stem of the entry it says -rieht-, where ie is the diphthong that is realised in even-syllable word forms. Another example word is darjome:

darjome+N+Sem/Feat:darjoem N_ODD ;

with -oe- as the stem vowel to get a vowel change o => oe in even-syllable word forms.
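To make the parts of such an entry explicit, here is a minimal Python sketch that splits a lexc entry line into lemma, tags, stem, continuation lexicon, optional gloss and optional comment. It is not part of the lexc toolchain: the regular expression covers only the simple `upper:lower CONTLEX "gloss" ; ! comment` shape of the two entries above and ignores escapes such as `%>`. The names `ENTRY` and `parse_entry` are hypothetical.

```python
import re

# Simplified pattern for one lexc entry line (an assumption, not the full syntax).
ENTRY = re.compile(
    r'^(?P<upper>[^:\s]+):(?P<lower>\S*)\s+'  # lemma+tags : stem
    r'(?P<contlex>\S+)'                        # continuation lexicon
    r'(?:\s+"(?P<gloss>[^"]*)")?'              # optional gloss in quotes
    r'\s*;'                                    # entry terminator
    r'(?:\s*!\s*(?P<comment>.*))?$'            # optional trailing comment
)


def parse_entry(line):
    """Split one simple lexc entry into its parts."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple lexc entry: {line!r}")
    lemma, *tags = match.group("upper").split("+")
    return {
        "lemma": lemma,
        "tags": tags,
        "stem": match.group("lower"),
        "contlex": match.group("contlex"),
        "gloss": match.group("gloss"),
        "comment": match.group("comment"),
    }


entry = parse_entry('gierehtse+N+Sem/Veh:gieriehts N_ODD "pulk" ; ! gieriehtsisnie')
print(entry["lemma"], entry["stem"], entry["contlex"])
```

The point of interest is that the lemma (gierehtse) and the stem (gieriehts) differ exactly in the vowel that the two-level rules later reduce in odd-syllable forms.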

LEXICON TJE_LASSJE_RESIPR

Inflection for nouns ending in oe

The oe nouns with umlaut generate the umlauted forms and have the non-umlauted ones as +Use/NG.

The oe nouns without umlaut generate the non-umlauted forms only, naturally without +Use/NG.

Lexicon N_OE_OBL is for the -oe nouns without umlaut. Illative is lifted out in order to allow for Use/NG for the umlauted ones.

LEXICON EETE_LOAN loanwords with -eete -

Inflection for NIEJTE_SG nouns: lexicon NIEJTE_SG

Short description of this lexicon, and its purpose.

LEXICON KONTO

Lexicon for vowel-final words with invariant stems

               +Sg:     KONTO_SG ; 
               +Pl:     KONTO_PL ; 
                        EVEN_ESS ; 
        +Cmp/SgNom:     R        ; 
        +Cmp/SgGen:%>n  R        ; 
        +Cmp/PlGen:%>j  R        ; 
 +Der1+Der/Dimin+N:%»tj GÅATETJE ; 

This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-possessive-suffixes.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

South Saami Possessive suffixes

Px lexica


This (part of) documentation was generated from src/fst/morphology/affixes/possessive-suffixes.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper nouns morphology

Table of content

LEXICON LONDON-obj Objects. ODD-syllable

NB! Mâki and Järvi can actually be merged! - MAJA

LEXICON ACCRA-femplc

Propernoun

The sne / snie business remains to be sorted out.

 +Pl+Nom:e%>h       FINAL1 ;
 +Pl+Acc:e%>ide     FINAL1 ;
 +Pl+Gen:e%>i       FINAL1 ;
 +Pl+Ill:e%>ide     FINAL1 ;
 +Pl+Ine:e%>ine     FINAL1 ;
 +Pl+Ela:e%>iste    FINAL1 ;
 +Pl+Com:e%>igujmie FINAL1 ;

+Pl: N_ODD_PL ; ! normal noun

LEXICON NIEMI

Propernoun

+N+Prop+Sem/Plc+Sg+Ill:%>an FINAL1 ; !SUB - is this possible? IllSg without Uml in -ie?

+N+Prop+Sem/Plc+Pl: NIEJTE_PL ;

+N+Prop+Sem/Plc+Pl+Com+Err/Orth:%>igyjmie FINAL1 ; !

+N+Prop+Sem/Plc+Pl: CNAME_ODD_PL ; ! name special


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

South Saami verbal inflection sublexica

This is the file for the South Saami verb inflection and derivation.

The auxiliaries

First we just list the auxiliaries and their inflection.

The negative verb

Other auxiliaries

Odd-syllable verbs

Odd syllable verbs differ in Prt Sg3. This form is treated separately, and the rest of the paradigm is conflated.

Inflection common to all odd verbs

Even-syllable verbs

Received feedback on this one that “jarkah” is +Ind+Prs+Sg2, and “Jarkh!” is +Imprt. For now this has been entered as Err/Orth.

Infinite forms

 +V+IV+Act:%>eme     FINAL1 ;
 +V+IV+PrsPrc:%>ije  FINAL1 ;
 +V+IV+PrsPrc:%>ijes FINAL1 ;

Derivations

Nominal derivation sublexica

Verbal affixes

Finite forms

Even

Present
Imperative

Odd-syllable verbs - ODD

Present
Past
Imperative

Common even-odd

Present
Past

Flag diacritica

LEXICON V-EVEN-PRS

 V-PRS-SG-12     ;
 V-PRS-SG-3      ;
 V-EVEN-PRS-DUPL ;


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-compounding.lexc.md

South Sámi morphological analyser

Prefixing and compounding

Lexicon Prefixes

It contains only one entry:

Lexicon R

This lexicon is the main entry for regular compounding. All entries NOT requiring a hyphen should point to it.

The whole content of it is a list of flag diacritics to control compounding.

After the flags, we continue to the Rreal lexicon, for the real compounding action.

It should be noted that some of the flags above require a corresponding flag in the lexicon ENDLEX to work properly.

Lexicon Rreal

This is where the actual compounding happens.

Lexicon RNum

For compounds of the type Num+Noun. We can’t allow Num+Num, thus we use a separate compounding lexicon, since the regular RHyph lexicon below contains a continuation pointing back to the numerals.

Lexicon RHyph

This lexicon is used for compounds requiring a hyphen before the next part. As for the regular compounds, we first add a number of flag diacritics to restrict certain combinations, before we continue to the real compounding lexicon.

Lexicon RHyphReal

This is where the actual hyphen compounding happens. The hyphen is added here.


This (part of) documentation was generated from src/fst/morphology/compounding.lexc


src-fst-morphology-phonology.twolc.md

South Sámi morphophonological rule set

This file documents the phonology.twolc file

Rules

e deletion before i-initial suffix

Diphthong simplification ie:e

Diphthong simplification oe:o

a/e alternation

a/i alternation

a/0 alternation

Even syllabic verbs Du3 e/i alternation V

Proper PlGen, PlCom

Even syllabic verbs Du2, Du3, Pl1, Pl2 e/i class V

Special rule for ‘soptsesovvedh’ < soptsestidh. No other verbs have st > s before passive derivation.


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

South Sámi morphological analyser

Multichar_Symbols definitions

Tags for POS (Part-Of-Speech, Word class)

Tags for sub-POS

Proper nouns

Pronoun subtypes

Numeral subtypes

Error (non-standard language) tags

Error tag Explanation
+Err/Orth Substandard, non-normative form of a word
+Err/Hyph Substandard, non-normative
+Err/SpaceCmp Substandard, non-normative
+Err/Attr Substandard, non-normative Attr form of a word
+Err/Lex The lemma and its word forms are outside the norm. No normative lemma, but it is grammatically correct.
+Err/Der Errors in derivations
+Err/Spellrelax Used to tag spellrelaxed typos (tag is inserted via flag diacritics)
+Err/MissingSpace in use in smi lexc

Usage tags

Usage tag Explanation
+Use/Marg Marginal: correct, existing forms that are rare. We can remove these words from, for example, the speller, because they are so rare and so little used that they may be confused with closely similar lemmas.
+Use/-Spell Excluded from speller
+Use/-PLX Excluded in PLX speller
+Use/SpellNoSugg Recognized but not suggested in speller
+Use/Circ Circular path
+Use/CircN Circular number path?
+Use/Ped Remove from pedagogical speller
+Use/NG Do not generate, for isme-ped.fst and apertium
+Use/MT Generate for apertium only
+Use/NotDNorm For (spellings of) words that do not follow the orthographic principles of sma. Divvun suggests that this shouldn’t be normative, even though they are decided upon by GG. Included in speller.
+Use/DNorm For words without formal normalization. Divvun suggests that this should be normative. Included in speller. Based on 2010 normative decision & Ove Lorentz’ suggestions for the norm.
+Use/PMatch Include only in FSTs for hfst-pmatch
+Use/-PMatch Do not include in FSTs made for hfst-pmatch
+Use/GC only retained in the HFST Grammar Checker disambiguation analyser
+Use/-GC never retained in the HFST Grammar Checker disambiguation analyser
+Use/TTS only retained in the HFST Text-To-Speech disambiguation tokeniser
+Use/-TTS never retained in the HFST Text-To-Speech disambiguation tokeniser
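The speller-related rows of this table can be read as simple filters over analysis strings. The following Python sketch only illustrates that reading; in the actual build the filtering is done on the transducers, not on strings. The helper names and the example analysis are hypothetical, and only three of the tags are modelled.

```python
import re


def tags_of(analysis):
    """Return the set of +Tag symbols found in an analysis string."""
    return set(re.findall(r"\+[^+\s]+", analysis))


def accepted_by_speller(analysis):
    """+Use/-Spell: excluded from the speller."""
    return "+Use/-Spell" not in tags_of(analysis)


def suggested_by_speller(analysis):
    """+Use/SpellNoSugg: recognized but never offered as a suggestion."""
    return accepted_by_speller(analysis) and "+Use/SpellNoSugg" not in tags_of(analysis)


def generated(analysis):
    """+Use/NG: the form is analysed but not generated."""
    return "+Use/NG" not in tags_of(analysis)


# Hypothetical analysis string, used only to exercise the helpers.
example = "lemma+N+Sg+Nom+Use/SpellNoSugg"
print(accepted_by_speller(example), suggested_by_speller(example), generated(example))
```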

Dialect tags

Dialect tag Explanation
+Dial/-S Not in the South
+Dial/S Only in the South
+Dial/-N Not in the North
+Dial/N Only in the North
+Dial/-NOR Words not in Norway
+Dial/NOR Words only in Norway
+Dial/-SW Words not in Sweden
+Dial/SW Words only in Sweden
+Dial/SH Short forms
+Dial/L Long forms

Normative/prescriptive compounding tags

(to govern compound behaviour for the speller, ie what a compound SHOULD BE)

The left part of a compound should be …

The default is +CmpN/SgN, so when nothing is specified, that will be used. To override that one, specify one or more of the following tags. +CmpN/SgN must be specified if also other tags are listed - unless +CmpN/SgN should not be used, of course.

Normative compounding tag Explanation
+CmpN/Sg Singular
+CmpN/SgN Singular Nominative
+CmpN/SgG Singular Genitive
+CmpN/PlG Plural Genitive

The right part of a compound requires to the left …

These tags overrule the regular tags defined above. One or more can be specified.

Normative left-governing tag Explanation
+CmpN/SgLeft Sg to the left
+CmpN/SgNomLeft etc.
+CmpN/SgGenLeft
+CmpN/PlGenLeft

This part of the compound can …

Normative position tag Explanation
+CmpNP/All … be in all positions, default, this tag does not have to be written
+CmpNP/First … only be first part in a compound or alone
+CmpNP/Pref … only be first part in a compound, NEVER alone
+CmpNP/Last … only be last part in a compound or alone
+CmpNP/Suff … only be last part in a compound, NEVER alone
+CmpNP/None … not take part in compounds
+CmpNP/Only … only be part of a compound, i.e. can never be used alone, but can appear in any position
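The position tags in this table can be summarised as a lookup from tag to permitted positions. The Python sketch below is one possible reading of the table, not the implementation (which uses flag diacritics in the lexicon). The position labels `alone`, `first`, `middle` and `last` and the function name are assumptions made for the example; in particular, treating `middle` as a separate position is an interpretation of "all positions".

```python
# Permitted positions per normative position tag, as read from the table.
ALLOWED_POSITIONS = {
    "+CmpNP/All":   {"alone", "first", "middle", "last"},
    "+CmpNP/First": {"alone", "first"},
    "+CmpNP/Pref":  {"first"},
    "+CmpNP/Last":  {"alone", "last"},
    "+CmpNP/Suff":  {"last"},
    "+CmpNP/None":  {"alone"},
    "+CmpNP/Only":  {"first", "middle", "last"},
}


def may_occupy(position, tag="+CmpNP/All"):
    """True if a word with this position tag may stand in the given position.

    +CmpNP/All is the default, so the tag may be left out.
    """
    return position in ALLOWED_POSITIONS[tag]


print(may_occupy("alone", "+CmpNP/Pref"))  # a pure prefix never stands alone
```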

Descriptive compounding tags

Tags for compound analysis - this is what a compound actually is. We use this to research compounding patterns in the corpus.

Descriptive compounding tag Explanation
+Cmp/Sg Compounding using an unspecified singular stem
+Cmp/SgNom Compounding using nominative singular
+Cmp/SgGen Compounding using genitive singular
+Cmp/PlGen Compounding using genitive plural
+Cmp/Attr Compounding using attribute form
+Cmp/eh Compound stem in –eh, as in gaameh-gåaroje, from gaamege
+Cmp/ege Compound stem in –ege, as in gaamege-gåaroje
+Cmp/FinEDel Deletion of final e, as in voelem-gaaroeh, from voeleme
+Cmp/ShH Compounding using a short stem + h: –biejjh– (from biejjie), cf reakedsbiejjhvadtese
+Cmp/Sh Compounding using a short stem: –biejj– (from biejjie)
+Cmp/SplitR This is a split compound with the other part to the right: “Arbeids- og inkluderingsdepartementet” => Arbeids– = +Cmp/SplitR
+Cmp/SplitL This is a split compound with the other part to the left; this is the opposite of the previous case
+Cmp Dynamic compound - this tag should always be part of a dynamic compound. It is important for Apertium and the speller (to give extra weights to compounds), and useful in other cases as well.
+Cmp/XForm All Cmp without a clear classification
+Cmp/AttrH All Cmp that have an attr-h

Tags for Inflection

Tags for Case, Number & Possessive Inflection

Case and number

Possessive

Tense, Person & Number

Tense tag Explanation
+Prs Present
+Prt Preterite

Person & Number tag Explanation
+Sg1 Singular, 1st person
+Sg2 Singular, 2nd person
+Sg3 Singular, 3rd person
+Du1 Dual, 1st person
+Du2 Dual, 2nd person
+Du3 Dual, 3rd person
+Pl1 Plural, 1st person
+Pl2 Plural, 2nd person
+Pl3 Plural, 3rd person

Other verbal tags

Verbal tag Explanation
+Neg negation verb ij
+ConNeg main verb complement to Neg, form identical to Imp
+VAbess Verb Abessive
+Inf Infinitive
+PrfPrc Perfect participle
+PrsPrc Present participle
+Ger Gerund
+VGen Verb genitive
+Ind Indicative
+Imprt Imperative
+ImprtII Imperative, for Neg: ollem ollh …
+Cond Conditional, for one form: lidtjie. To be looked at. Also lidtjim, lidtjih
+Act -eme, could be changed to +Actio

Tags for adjectives

Other tags

Tags for testing the frequency of certain phenomena in our corpora

Tags for punctuation

Different focus particles

Tags for adverbs and compared adjectives

Semantic tags

Semantic tags help disambiguation and syntactic analysis. All tags used are defined and listed below.

Multiple Semantic tags

Multiple semantic tags are written as one tag, with the different semantic values separated by an underscore _.

All used combinations must be declared below, and the list must be manually maintained. The tags are ordered alphabetically, both the list and the semantic values within one tag.
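As an illustration of the naming convention, the Python sketch below builds a combined tag from individual semantic values, sorted alphabetically and joined by underscores. It assumes the `+Sem/` prefix seen in entries such as `+Sem/Veh`; the function name is hypothetical, the combination `Plc_Veh` is invented for the example, and any real combination still has to be declared by hand in the tag list.

```python
def combined_sem_tag(values):
    """Join semantic values into one tag, sorted alphabetically.

    Sorting uses Python's default string order, which matches alphabetical
    order for values that follow the same capitalisation pattern.
    """
    unique = sorted(set(values))
    if not unique:
        raise ValueError("at least one semantic value is required")
    return "+Sem/" + "_".join(unique)


print(combined_sem_tag(["Veh", "Plc"]))
```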

Tag Explanation
+MWE multi word expressions, goes to abbr

Flag Explanation
@P.Px.add@ Giving possibility for Px-suffixes (all except from Nom 3.p)
@R.Px.add@ Requiring P.Px.add-flag for Px-suffixes (all except from Nom 3.p)
@P.Nom3Px.add@ Giving possibility for Px-suffixes Nom 3.p
@R.Nom3Px.add@ Requiring P.Nom3Px.add flag for Px-suffixes Nom 3.p
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)
@D.ErrOrth.ON@ asdf
@C.ErrOrth@ asdf
@P.ErrOrth.ON@ asdf

Derivation tags and derivation position tags in a derivation row

Derivations in the same position are mutually exclusive (can not be combined), whereas tags in different positions can be combined, so that position 1 derivations must precede position 2 derivations, and so on.

Pos1 Pos2 Pos3 POS switches (from-to) Explanation
+Der1       Position tag, required
  +Der2     Position tag, required
    +Der3   Position tag, required
+Der/htalle     VV Passive, frequentative
+Der/lg     VV Passive
+Der/ijes     NA Nomen agentis
+Der/ihks     VA (Agent noun: inclined to perform the action denoted by the base word)
+Der/les     VA Intensive
+Der/ldihkie     VA  
+Der/ldahke     VA Result noun (?)
+Der/ldh     VA Attribute
+Der/ht     VV Causative
+Der/l     VV Subitive
+Der/st     VV Diminutive, Subitive
+Der/d     VV Continuative, Conative, Frequentative, Reflexive, Momentaneous
+Der/Car       -hts, Caritive, was Der/heapmi in sme
+Der/htj     NN Dim-cont, Frequentative
+Der/Dimin     NN Diminutive
+Der/Rec     NN Relational forms
+Der/laakan     AAdv adverb
+Der/laaketje     AA adjective
+Der/Comp     AA adjective
+Der/Superl     AA adjective
  +Der/vuota   AN Noun
  +Der/adte   VV Frequentative, Continuative
  +Der/alla   VV Frequentative
  +Der/eds   NA Attribute
    +Der/PassL VV long only
    +Der/NomAg VN Nomen Agentis
    +Der/NomAct VN Nomen Actionis
    +Der/ahtje VV Inchoative
    +Der/InchL VV Inchoative
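The ordering constraint stated before the table (derivations in the same position exclude each other, and lower positions must precede higher ones) amounts to requiring strictly increasing positions along a derivation row. The Python sketch below checks this for a small subset of the tags. The position assigned to each tag follows the indentation of the table rows as rendered here, which is an interpretation of the flattened table; the dictionary and function names are hypothetical.

```python
# Position of a few derivation tags, read off the column layout of the table.
DER_POSITION = {
    "+Der/ht": 1, "+Der/l": 1, "+Der/st": 1, "+Der/d": 1, "+Der/Dimin": 1,
    "+Der/vuota": 2, "+Der/adte": 2, "+Der/alla": 2, "+Der/eds": 2,
    "+Der/PassL": 3, "+Der/NomAg": 3, "+Der/NomAct": 3, "+Der/InchL": 3,
}


def valid_derivation_row(der_tags):
    """True if the positions of the tags are strictly increasing.

    Equal positions violate mutual exclusivity; decreasing positions violate
    the rule that position 1 precedes position 2, and so on.
    """
    positions = [DER_POSITION[tag] for tag in der_tags]
    return all(a < b for a, b in zip(positions, positions[1:]))


print(valid_derivation_row(["+Der/ht", "+Der/NomAct"]))
```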

Other, non-positional derivations

All non-positional derivations should be preceded by the following tag, to make it possible to target regular expressions in all derivations in a language-independent way: just specify [+Der|+Der1 .. +Der5] and you are set.

Derivation tag POS switch Explanation
+Der/PassS VV short passive only
+Der/A NA comparison of nouns

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics actually), we tag all words that can’t be correctly transcribed using the SMA transcriber with source language codes. Once tagged, it is possible to apply different IPA conversions to each of them. The principle of tagging is that we only tag to the extent needed, and following a priority:

  1. any untagged word is pronounced with native orthographic conventions
  2. NNO and NOB have identical pronunciation, NNO is only used if different in spelling from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SMA (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions

All in all, we want to get as much correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct - at least some words will get the same pronunciation whether read as SMA or NOB/NNO/SWE.

Originating language tag Originating language
+OLang/SME North Sámi
+OLang/SMA South Sámi
+OLang/SMJ Lule Sámi
+OLang/FIN Finnish
+OLang/SWE Swedish
+OLang/NOB Norwegian Bokmål
+OLang/NNO Norwegian Nynorsk
+OLang/ENG English
+OLang/RUS Russian
+OLang/UND Undefined
+OLang/PARA parallel names; the name should not be carried over to other Sámi languages
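The tagging priority above implies a simple lookup when choosing IPA conversion rules: use the originating-language tag if present, otherwise fall back to SMA. A minimal Python sketch follows. It is not the project's transcriber; the function name, the example analysis strings and the choice to fold NNO into NOB (because the two are described as having identical pronunciation) are assumptions.

```python
import re

# NNO is described as pronounced identically to NOB, so it can share NOB rules.
PRONOUNCED_AS = {"NNO": "NOB"}


def transcription_language(analysis, default="SMA"):
    """Pick the language whose IPA rules apply to an analysis string."""
    match = re.search(r"\+OLang/([A-Z]+)", analysis)
    if match is None:
        return default  # untagged words follow native orthographic conventions
    code = match.group(1)
    return PRONOUNCED_AS.get(code, code)


# Hypothetical analysis strings.
print(transcription_language("Name+N+Prop+OLang/NNO+Sg+Nom"))
print(transcription_language("gierehtse+N+Sg+Nom"))
```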

Area tags

Triggers for morphophonological rules

Morphophonemes and Sámi letters

Symbols that need to be escaped on the lower side (towards twolc):

Lexeme disambiguation tags

Stem variant tags

The clitic boundary mark

A multichar that usually just goes to zero:

Umlaut and diphthong simplification triggers

Trigger Explanation
%^DISIMP diphthong simplification
%^COMPDISIMP diphthong simplification in comparatives
%^COMPDISIMP2 diphthong simplification in comparatives, type 2
%^COMPDISIMP3 diphthong simplification
%^PLCDISIMP diphthong simplification in ACCRA-names
%^NOMAGieDISIMP diphthong simplification for NomAg ie stems
%^1UML a-uml, like 1sg prs, perf.part of båetedh/V-I, and ill sg of -ie nouns
%^2UML dark e, as 3sg prs & perf.part of tjearodh/V-II, and ill sg of -oe nouns
%^3UML adj Umlaut oeh:an
%^3sUML a-uml in 3sg prs of V-IV (roehtedh - ruahta)
%^3dUML ie-uml in 1du & 3pl prs of V-IV (roehtedh - ruehtien)
%^iæUML not used
%^iUML i-uml in pret of V-I (båetedh - böötim)
%^PASSUML Short passive Umlaut Rx->R5
%^didhUML Der/d Umlaut for GUARKEDH-words
%^htjidhUML Umlaut for der/htjidh derivations
%^adteUML Umlaut for Der/adte and Der/alla derivations
%^aLATUS Latus-Umlaut for -ie stems
%^uLATUS Latus-Umlaut for -oe stems
%^ConsDel Stem consonant deletion in front of Der/PassL
%^ILLELA Stem vowel changes in Illative and Elative
%^PLGENPLCOM Stem vowel changes in final from e -> i, and without -j-
%^COMESS Stem vowel changes in ACCRA-names

Symbol used before # and - in dynamic compounds, and only there. Used to block optional conversion of word boundaries to spaces for error detection in grammar checkers. That is, dynamic compounds are not allowed to be written apart for error detection, only lexicalised ones. This is done to reduce the amount of ambiguity in the raw analyses to an amount we can cope with.

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

Flag Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
@R.ErrOrth.ON@  
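How such flags filter paths can be sketched with a small Python evaluator for the P (set), C (clear), D (disallow), R (require) and U (unify) operations, following the usual flag diacritic semantics. This is a model for illustration, not the HFST implementation. The placement of the NeedNoun flags in the example (set at the verb, cleared by a nominalising derivation, disallowed before compounding) is an assumption about how they might be arranged, and the function name is hypothetical.

```python
import re

FLAG = re.compile(r"@([PDCRU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols):
    """Return True if the flag diacritics along one path are all satisfied."""
    state = {}
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol, no effect on flag state
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":    # positive set
            state[feature] = value
        elif op == "C":  # clear
            state.pop(feature, None)
        elif op == "D":  # disallow: fail if set (to this value, if one is given)
            if current is not None and (value is None or current == value):
                return False
        elif op == "R":  # require: fail unless set (to this value, if given)
            if current is None or (value is not None and current != value):
                return False
        elif op == "U":  # unify: set if unset, otherwise the values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# A verb that is nominalised before compounding passes; a bare verb does not.
print(path_is_valid(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
print(path_is_valid(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
```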

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@U.CmpNone.TRUE@ Combines with the two previous ones to block compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.
@U.CmpHyph.FALSE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@C.CmpHyph@ Flag to control hyphenated compounds like proper nouns

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

The following flag diacritics are used by the grammar checker.

Flag Explanation
@R.SpellRlx.ON@ Flag used to tag spell-relax-analysed strings (and only those).
@D.SpellRlx.ON@ Flag used to tag spell-relax-analysed strings (and only those).
@C.SpellRlx@ Flag used to tag spell-relax-analysed strings (and only those).
@P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)

Flag diacritic Explanation
@U.number.one@ Flag used to give arabic numerals in smj different cases
@U.number.two@ Flag used to give arabic numerals in smj different cases
@U.number.three@ Flag used to give arabic numerals in smj different cases
@U.number.four@ Flag used to give arabic numerals in smj different cases
@U.number.five@ Flag used to give arabic numerals in smj different cases
@U.number.six@ Flag used to give arabic numerals in smj different cases
@U.number.seven@ Flag used to give arabic numerals in smj different cases
@U.number.eight@ Flag used to give arabic numerals in smj different cases
@U.number.nine@ Flag used to give arabic numerals in smj different cases
@U.number.zero@ Flag used to give arabic numerals in smj different cases

Lexicon Root

This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.

Here is the list of top-level lexica in the South Sámi analyser:

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-adjectives.lexc.md

Adjective stems

This is one of two parallel files containing adjective stems. The files represent two alternative interpretations of the same data (South Saami adjectives). This file is used for spellchecking; the alternative file adjectives-oahpa.lexc is used for dictionary and icall applications. This file is compiled by default; the other one is compiled by giving the command ./configure --with-oahpa in langs/sma before compiling.

etnihke+A+OLang/NOB:etnihke IHKE_IHKELES_LOAN ; !samediggediedahus 2012 - etnisk - etnisiteete+les


This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


src-fst-morphology-stems-adverbs.lexc.md

These were actually placed in the noun-adv lexicon; should they be adverbs?

<== why no case?

laakte bïejedh - to put too close together


This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


src-fst-morphology-stems-nouns.lexc.md

South Sámi nouns

The lexicon NounRoot

This lexicon is the start of all noun lemmas. It splits the nouns in three classes as follows:

NounRoot –> FirstComponent
NounRoot –> HyphNouns
NounRoot –> Noun

All nouns with possessives

LEXICON Noun

Splitting nouns in NounNoPx, NounPx (with a P.Px.add flag) and NounPxKin (with a P.Nom3Px.add flag)

https://satni.uit.no/termwiki/index.php?title=Huksenteknihkka:borettslag

https://satni.uit.no/termwiki/index.php?title=Huksenteknihkka:frittstående_borettslag

https://satni.uit.no/termwiki/index.php?title=Education:embetsstudium

Not according to umlautsystem

—Ije-

Lemma from GG: note DNorm

pp, tt, kk –> hp, ht, hk or bp, dt, gk? bp, dt, gk conflicts with the orthographic principles, cf.

6 koreen 5 tyrkijen 20 Bottleneck-hypotesen —- dynamisk sammensetning - how? 17 direkte


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


src-fst-morphology-stems-numerals.lexc.md

what about

NAMAT ; ! duhatjienat, logigielat, etc. NAMAT derivs are SAS ; !viđajahkásaš


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-morphology-stems-pronouns.lexc.md

South Saami pronouns

The Pronoun lexicon points to all the subgroups, presented below in this order:

The Reciprocal pronoun

Personal pronouns

Splitting in 1st, 2nd, 3rd

New lemma form, now number as baseform, due to Oahpa

Lexica for sg

the firstperspronsg for first pers has special consonantism

for nonfirstperspronsg the 2nd and 3rd are identical

Lexica for du

Lexica for pl

Common case lexica

dïhte

DIHTE is a personal pronoun, demonstrative dïhte is treated below.

Demonstrative pronouns

This is for:

    • the attributive forms of dïhte
    • all forms of the other pronouns

The initial demonstrative lexica

Interrogative and relative pronouns

Indefinite pronouns

kongruensContlex

muvhtiecase

naaken

Inflecting even indefinites

Reflexive pronouns


This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


src-fst-morphology-stems-sma-propernouns.lexc.md

(Söderhamn. Gävleb))


This (part of) documentation was generated from src/fst/morphology/stems/sma-propernouns.lexc


src-fst-morphology-stems-verbs.lexc.md

Verb stems

Preamble: Documenting the classes

contlex stem umlaut dict class

Even stems

Verb class I

Verb class II

Verb class III

Verb class IV

Verb class V

Verb class VI

Verb class odd-syllable

The actual continuation lexica

LEXICON Verb splits to AUX and Regular_verbs

LEXICON AUX lemma for edtjedh, ij and lea, each with their own contlex in affixes.

LEXICON Regular_verbs here comes the whole list, appr. 11000.


This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


src-fst-oahpa-filer-aff-adjectives-oahpa.lexc.md

Adjective affixes

This is one of two parallel files containing adjective affixes. The files represent two alternative interpretations of the same data (South Saami adjectives). This file is used for dictionary and icall applications; the alternative file adjectives.lexc is used for spellchecking. This file is compiled by giving the command ./configure --with-oahpa in langs/sma before compiling. The other file (adjectives.lexc) is compiled by default.

Adjectives: Adjectival inflection sublexica

Basic adjectival lexica, infl types

even stems

Lexical exceptions

Regular even stem types

type 2

ODD-stems (odd stem declension)

type 0 (attr only)

type 1

type 2

type 3

type 3

type 4

Attribute lexica

Predicative lexica

Odd syllabic stems - adjectives

Even syllabic stems - adjectives

Comparative forms


This (part of) documentation was generated from src/fst/oahpa-filer/aff-adjectives-oahpa.lexc


src-fst-oahpa-filer-stems-adjectives-oahpa.lexc.md

Adjective stems

This is one of two parallel files containing adjective stems. The files represent two alternative interpretations of the same data (South Saami adjectives). This file is used for dictionary and icall applications; the alternative file adjectives.lexc is used for spellchecking. This file is compiled by giving the command ./configure --with-oahpa in langs/sma before compiling. The other file (adjectives.lexc) is compiled by default.


The file starts as follows:

LEXICON Adjective

TG-grammatihkeles:TG-grammatihkel LES ;
aajmoes:aajmoe s_S_ODD ;
aajne:aajne ATTR_0 ; \ … \


This (part of) documentation was generated from src/fst/oahpa-filer/stems-adjectives-oahpa.lexc


src-fst-phonetics-txt2ipa.xfscript.md

retroflex plosive, voiceless t ʈ 0288, 648 ( = ASCII 096)
retroflex plosive, voiced d ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v)
alveolar approximant r\
retroflex approximant r`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L
Clicks

bilabial O\ (O = capital letter)
dental |
(post)alveolar !\
palatoalveolar =\
alveolar lateral ||

Ejectives, implosives

ejective > e.g. ejective p p>
implosive < e.g. implosive b b<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa ə @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress “
secondary stress %
long :
half-long :\
extra-short _X
linking mark -
Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L

contour, rising-falling _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise
global fall

Diacritics

voiceless 0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _”
syllabic = (or _=) e.g. n= (or n=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ‘ (or _j) e.g. t’ (or t_j)
velarized _G
pharyngealized _?\

dental d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
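
As a rough illustration of what a conversion based on this table involves, the Python sketch below maps a handful of the ASCII symbols listed above to their IPA characters using the hexadecimal code points from the table. It is not the xfscript implementation, which is not shown in this documentation. The function name, the small subset of symbols, and the longest-match strategy are assumptions made for the example.

```python
# A small subset of the table: ASCII symbol -> IPA character (by code point).
SAMPA_TO_IPA = {
    "F": "\u0271",    # labiodental nasal
    "J": "\u0272",    # palatal nasal
    "N": "\u014B",    # velar nasal
    "N\\": "\u0274",  # uvular nasal (written N\ in the table)
    "T": "\u03B8",    # dental fricative, voiceless
    "D": "\u00F0",    # dental fricative, voiced
    "S": "\u0283",    # postalveolar fricative, voiceless
    "Z": "\u0292",    # postalveolar fricative, voiced
    "C": "\u00E7",    # palatal fricative, voiceless
}

def sampa_to_ipa(text):
    """Replace known symbols with IPA, trying longer symbols first."""
    keys = sorted(SAMPA_TO_IPA, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(SAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            # Unknown character: copy it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(sampa_to_ipa("SaN"))  # ʃaŋ
print(sampa_to_ipa("N\\"))  # ɴ
```

Trying longer symbols first matters because several two-character symbols (such as N\) begin with a character that is itself a valid symbol.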


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in South Sámi are read out, e.g. for text-to-speech systems.

For example:

Copy from smj: same name as this file:

SMJ NOAB ! Abbreviations that are not treated as abbreviations at the end of the sentence = * **esim.:esimerkiksi # ; ** contains abbreviations that are transitive in front of numerals = * **esim.:esimerkiksi # ; ** contains transitive abbreviations = * **esim.:esimerkiksi # ; ** su, dii ============ SMI abbreviations: ============ smi_ITRAB smi_TRAB smi_TRNUMAB


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

S O U T H   S A A M I   G R A M M A R   C H E C K E R

DELIMITERS

TAGS AND SETS

Tags

This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT PUNCT MWE

COMMA ¶

Tags for POS sub-categories

Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

Tags for morphosyntactic properties

Nom Acc Gen Ill Ela Ine Loc Com Ess Ess Sg Du Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

Derivation tags

Sets for explicit error analysis from the morphological analyser:

Other secondary tags

Semantic tags

Other semantic sets:

Syntactic tags

Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

Sets for Single-word sets

INITIAL

Sets for word or not

WORD NOT-COMMA

Case sets

ADLVCASE

CASE-AGREEMENT CASE

NOT-NOM NOT-GEN NOT-ACC

Verb sets

NOT-V

Sets for finiteness and mood

REAL-NEG

MOOD-V

NOT-PRFPRC

Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

Pronoun sets

Adjectival sets and their complements

Adverbial sets and their complements

Sets of elements with common syntactic behaviour

NP sets defined according to their morphosyntactic features

The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

Border sets and their complements

Grammarchecker sets

Naming convention &errorclass-errortype-wrong-correct: So far only one errorclass: msyn.
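
The naming convention can be unpacked mechanically. The Python sketch below splits a tag into its four parts. The tag `&msyn-agr-sg-pl` is a made-up example, not necessarily a tag used in the grammar checker, and the sketch assumes that none of the first three parts contains a hyphen.

```python
from typing import NamedTuple

class ErrorTag(NamedTuple):
    errorclass: str
    errortype: str
    wrong: str
    correct: str

def parse_error_tag(tag):
    """Split '&errorclass-errortype-wrong-correct' into its four parts."""
    if not tag.startswith("&"):
        raise ValueError("error tags are expected to start with '&'")
    # Split at most three times, so any extra hyphens stay in the last part.
    parts = tag[1:].split("-", 3)
    if len(parts) != 4:
        raise ValueError("expected four hyphen-separated parts: " + tag)
    return ErrorTag(*parts)

# Hypothetical tag, for illustration only.
tag = parse_error_tag("&msyn-agr-sg-pl")
print(tag.errorclass, tag.wrong, tag.correct)  # msyn sg pl
```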

RULE SECTION

VERB agreement

Ensure preceding nominal agrees with the verb


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


tools-grammarcheckers-grc-disambiguator.cg3.md

S O U T H   S Á M I   D I S A M B I G U A T O R

Delimiters, tags and sets

"<.>" "<!>" "<?>" "<...>" "<¶>" sent

Tags

BOS/EOS:

Morphological tags

Derivation tags

Error usage tags

We define two lists for Err/xxx tags:

Other tags

Other secondary tags

Semantic tags

Secondary tags

Syntactic tags

Titles

REAL-TITLE OFFICE TITLE

Sets

Sets of morphological tags for syntactic use

CASES ADVLCASE NUMBER

Noun sets

INSTITUTION ORGANIZATION EDUCATION CURRENCY CURRENCY LESSON

Verb sets

REALCOPULAS

COPULAS

V-NOT-COP

MOD-ASP

Adjective sets

Adverb sets

GUKTIEGOSSE

DAESTIE

ILLADV

INEADV1

ELAADV1

INEADV

ELAADV

DV-MOD-ADV

Postposition sets

ILLPO

BOUNDARY SETS

REALCLB

SV-BOUNDARY

NP-BOUNDARY

Derivation sets

V-DER

V-DER-SUF

N-DER N-DER-SUF

A-DER A-DER-SUF

PASS

LEX-V LEX-N LEX-A LEX-ADV

VERB-FORMS 2-PERS

Disambiguation rules

BEFORE-SECTIONS

Rule for adding Sem/Date as a tag to readings which look like dates (to be removed once we get a common numeral file from shared)

Guessing: Rule for adding Adv Sem/Adr as a tag to readings which look like addresses

Guessing: Rule for adding Adv Sem/Adr as a tag to readings which look like addresses

Rules for adding to verbs denoting verbal actions like: ... jeahta Aili Kestkitalo.

SECTION

Cycle 0 (Early rules)

Removing non-lexicalised forms when lexicalised

Numerals and ACR

Numerals in QPs

CC or not (specific rules further down)

Interj

Possessive suffix

Remove Px if not family

Pronouns

Proper nouns

INITIAL

Verbs

Postpositions

Selecting postpositions when preceded by genitives, etc.

Particles and adverbs

Adjective or Indef

Demonstratives

Genitive

Adjective or not

Rel or Interr OR Indef

Adverbs

Selecting adverbs in local contexts

Verbs

Selecting verbs in local contexts, based upon agreement patterns

Selecting imperative sentence-initially with appropriate right context

Remove verb readings

Select Inf

Mapping rules

CC- and CS-Mapping

CNP mapping

Mapping CNP to CC and CS.

CVP Mapping

Mapping @CVP to all CS

Attributes or not

PrfPrc

Select PrfPrc if DerNomAct

Mapping verbs

killifVinCohort

This rule removes all other readings if there is a mapped V reading in the same cohort. Every case in which this goes wrong should be fixed in the mapping rules or in earlier disambiguation rules.
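
The effect of the rule can be sketched outside CG as follows. This is a Python illustration under stated assumptions, not the CG-3 rule itself: a cohort is modelled as a list of readings, each reading as a tuple of a lemma followed by tags, and "mapped" is taken to mean that the reading carries a syntactic tag starting with `@`. The lemmas and the tag `@+FMAINV` in the example are placeholders.

```python
def is_mapped_verb(reading):
    """True if the reading has the V tag and at least one syntactic (@...) tag."""
    tags = reading[1:]  # skip the lemma
    return "V" in tags and any(tag.startswith("@") for tag in tags)

def kill_if_mapped_verb(cohort):
    """Keep only mapped V readings if there are any; otherwise keep everything."""
    mapped = [reading for reading in cohort if is_mapped_verb(reading)]
    return mapped if mapped else list(cohort)

# Placeholder cohort: one mapped verb reading and one noun reading.
cohort = [
    ("lemma1", "V", "Ind", "Prs", "Sg3", "@+FMAINV"),
    ("lemma2", "N", "Sg", "Nom"),
]
print(kill_if_mapped_verb(cohort))
```

Because the rule discards every competing reading, a wrongly mapped verb reading cannot be recovered later, which is why errors are to be fixed upstream in the mapping rules.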

Person

leah Prs Sg2 = Pl3

Select Inf If Infv

Span sentences

Nominals

Remove Prop Attr if not 1 Prop

Verb or Noun

CC and CS or Adv

Adj or Adv

Grammatical word or N or A

N or V

Ger or Der/NomAct

Adj or Indef

Num

Adv or Po/Pr

Illative or genitive

Essive

Comitative

Accusative or illative

Indef or Adv

special lemmas

Adverb context prefers Adv

Verb person vs. Inf – moved here in order to have the pronouns disambiguated first.

Proper nouns

Rule set taken from sme

gellie as numeral, not pronoun


This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for sma

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

Unknowns are made of:

    • lower-case ASCII
    • upper-case ASCII
    • select extended latin symbols
    • ASCII digits
    • select symbols
    • Combining diacritics as individual symbols,
    • various symbols from Private area (probably Microsoft), so far:
        • U+F0B7 for “x in box”

Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
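
The behaviour described here can be pictured with a small Python sketch. It only mimics the described logic and says nothing about how hfst-tokenise implements it or formats its output. Analyses are modelled as plain strings, the `??` marker follows the wording above, and the analysis string `ja+CC` is a made-up example.

```python
UNKNOWN = "??"  # marker for unknown tokens, following the description above

def clean_readings(analyses):
    """Drop empty analyses; if nothing is left, mark the token as unknown."""
    non_empty = [analysis for analysis in analyses if analysis.strip()]
    return non_empty if non_empty else [UNKNOWN]

# A token with one real analysis and one empty one keeps only the real one.
print(clean_readings(["ja+CC", ""]))  # ['ja+CC']

# A token with only an empty analysis is treated as unknown.
print(clean_readings([""]))           # ['??']
```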

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for sma

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript