North Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sme

North Sami language model documentation

All doc-comment documentation in one large file.


src-cg3-disambiguator.cg3.md

DELIMITERS

Sentence delimiters are the following: <.> <!> <?> <…> <¶>
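
As a rough illustration of what the delimiters do, here is a minimal Python sketch that cuts a token stream into sentence windows. It assumes the input is a flat list of word forms and writes the delimiters without the angle brackets of the wordform notation; the function name `split_into_windows` is hypothetical and not part of the grammar.

```python
# Toy sentence splitter: every delimiter token closes the current window.
# Assumption: input is a flat list of word forms; real CG-3 input consists of
# cohorts with readings, which this sketch leaves out.
DELIMITERS = {".", "!", "?", "…", "¶"}


def split_into_windows(tokens):
    windows, current = [], []
    for token in tokens:
        current.append(token)
        if token in DELIMITERS:
            windows.append(current)
            current = []
    if current:  # tokens after the last delimiter still form a window
        windows.append(current)
    return windows


print(split_into_windows(["a", "b", ".", "c", "?", "d"]))
# [['a', 'b', '.'], ['c', '?'], ['d']]
```

Rules with contextual tests only see the cohorts of one such window, which is why the choice of delimiters matters for the rest of the file.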

TAGS AND SETS

Tags

This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

Semantic tags

Syntactic tags

Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

Single-word sets

The sets OKTA and go, and the set INITIAL for initial letters:

OKTA go INITIAL

Sets for word or not

WORD REAL-WORD WORD-NOT-de NOT-COMMA

Derivational affixes

DER-V

DER-N

DER-A1

DER-A

A-V

A-NOT-V

Case sets

ADLVCASE

CASE-HALFAGREEMENT CASE-AGREEMENT CASE

NOT-NOM NOT-GEN NOT-ACC

Verb sets

NOT-V

Sets for finiteness and mood

REAL-NEG

MOOD-V

GC

VFIN

VFIN-POS

VFIN-NOT-IMPRT

VFIN-NOT-NEG

NOT-PRFPRC

Sets for person

Sets consisting of forms of “leat” (these need to be rewritten)

Pronoun sets

Adjectival sets and their complements

Adverbial sets and their complements

Sets for coordinators

Sets for adverbs that have lookalikes

The following adverbs have homonyms in other parts of speech. If they are found in Adv contexts, we treat them as adverbs.

Sets of elements with common syntactic behaviour

Sets for verbs

V covers all readings containing a V tag; REAL-V should cover only those without an N tag following the V.
The REAL-V set thus awaits a fix to the preprocess V … N bug.

TRANS-V is the set of verbs that really take objects.

STRICT-TRANS-V is the set of verbs which do not let a GenAcc be a modifier of anything other than an object, e.g. in Mun organiseren eatni gievkkanis, eatni wants to be the object.

Valency sets

Adverb sets

Adjective sets

NP sets defined according to their morphosyntactic features

The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) …, meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
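
The barrier scan can be sketched in Python. This is a toy model only: each cohort is reduced to one set of tags, the premodifier set is an illustrative stand-in for the real PRE-NP-HEAD family, and `scan_to_np_head` is a hypothetical name.

```python
# Toy model of a context test of the form (*1 N BARRIER NOT-NPMOD).
# Assumption: each cohort is reduced to a set of tags, and anything that is
# not a premodifier counts as NOT-NPMOD (hypothetical simplification).
PREMODIFIERS = {"A", "Num", "Dem", "Gen"}  # illustrative, not the real PRE-NP-HEAD set


def is_not_npmod(tags):
    """True if the cohort cannot be part of the NP in front of its head."""
    return not (tags & PREMODIFIERS)


def scan_to_np_head(cohorts, start):
    """Scan rightwards from start + 1; return the index of the first noun,
    or None if a NOT-NPMOD cohort (the barrier) is met first."""
    for i in range(start + 1, len(cohorts)):
        if "N" in cohorts[i]:
            return i
        if is_not_npmod(cohorts[i]):
            return None
    return None


sentence = [{"V"}, {"Dem"}, {"A"}, {"N"}, {"Adv"}, {"N"}]
print(scan_to_np_head(sentence, 0))  # 3
print(scan_to_np_head(sentence, 3))  # None
```

From the verb, the scan passes the demonstrative and the adjective and lands on the noun; from that noun, the adverb blocks the scan before the next noun is reached.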

Other negatively defined morphosyntactic noun sets

Noun sets

Nominal sets defined according to their morphophonological properties

Sets for lexeme homonymy (most of them have been moved to where the actual rules are)

The words in the set N-PO can be both N and Po; the set takes that into account.

The LAHKA set family

Nominal sets defined according to their semantic properties

Miscellaneous sets

Border sets and their complements

Syntactic sets

ALLSYNTAG NON-APP

These were the set types.

Guessing: Rule for adding Sem/Date as a tag to readings that look like dates

Guessing: Rule for adding Adv Sem/Adr as a tag to readings that look like addresses

Removing or selecting proper nouns that are lookalikes

We don’t want a proper noun analysis of these words sentence-initially.

*Removes PropPl, but there are problems with names such as Davviriikkaid Ráđi, where we want Prop Pl.

*Select PlcSur (Sem/Plc) (Sem/Sur)

Some proper nouns have two parts, and the first is not a genitive. We still have problems with abbr when these proper nouns are inflected or are part of a cmp. The copy rule adds an Attr reading to names which do not get it in the fst (Soria). The select rule selects Attr when the next word is e.g. Moria.

Rules for giving Attr to names, e.g. Ole Attr Kåven.

Remove unwanted analyses

Southern Locative vs. Essive

Numerals

Lexicalised derivations

Particular verbs

Propernouns

Some adjectives are never derived as Adv

Rules for Prop Attr, Sem/Sur and Plc

MISC

ONE-COHORT DISAMBIGUATION - CYCLE 0

The idea behind “cycle 0” is to have safe rules without context first. These rules typically choose lexicalisations over derivations, Saami words instead of marginal names, etc.

Lexicalised derivations

*Removes derN if lexicalised.

*Removes derNEss if lexicalised, and both nouns are essive.

*Removes derA or PrsPrc or VGen if lexicalised. VGen is a chance.

*Removes derAdv when Adv is lexicalised.

*Removes VAbess when Adv is lexicalised.
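
The cycle-0 idea of preferring lexicalisations, with no context consulted, can be sketched as follows. This is a minimal sketch, assuming a reading is a tuple of tags and that any tag starting with `Der/` marks a derivation; the function names and the example cohort are invented for illustration.

```python
# Cycle-0 style rule: look only at the readings of one cohort, no context.
# Assumption: a reading is a tuple of tags, and a tag starting with "Der/"
# marks a derivation; the cohort below is invented.
def is_derived(tags):
    return any(tag.startswith("Der/") for tag in tags)


def remove_derived_if_lexicalised(readings, pos):
    """Drop derived readings of part of speech `pos` when the cohort also
    has a lexicalised (non-derived) reading of the same part of speech."""
    lexicalised = [r for r in readings if pos in r and not is_derived(r)]
    if not lexicalised:
        return list(readings)
    return [r for r in readings if not (pos in r and is_derived(r))]


cohort = [
    ("N", "Sg", "Nom"),                     # lexicalised noun
    ("V", "Der/NomAct", "N", "Sg", "Nom"),  # noun derived from a verb
    ("V", "Inf"),                           # untouched: not a noun reading
]
print(remove_derived_if_lexicalised(cohort, "N"))
# [('N', 'Sg', 'Nom'), ('V', 'Inf')]
```

If the cohort has no lexicalised reading, nothing is removed, so the derived analysis survives as the only noun reading.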

Fragments and headliners

Adjectives or nouns, not adverbs

Adjective plural, not comparative

Adverbs

Lexicalised adverbs

It is useful to select the adverbial reading early for potential nouns or verbs.

*aloGen removes állu Gen, álo Adv vs. N Gen

*bealisAdv

*bearreAdv beare vs bearri

*ilusAdv

*rámisA

Pronouns

Nouns, not verbs

Lexical selection - nouns

mánnu vs mánus

Not noun

Adposition or not

Not Qst

Interjections

Px-rules for special nouns

Some verb rules

Particular CS

Verb or Noun?

Adpositions

Adpositions, not verbs

Section 2: LOCAL DISAMBIGUATION - CYCLE 1

FAMILY pronouns

Pron Pers 1. p.

Pron Pers 2. p.

Pron Pers 3. p.

An early rule for “eanaš”/“eanas”

Px constraints

First select Px, then remove all remaining Px

We end section 2 by removing all remaining Px

Section 3: Certain verb readings

verb or adv

All imperatives

For imperative disambiguation we need the following: pick the imperative contexts, and thereafter remove the imperative elsewhere. Such contexts are: an imperative verb sentence-initially, with an exclamation mark.

Sg1 - early cycle, safe rules

Sg2 - early cycle, safe rules

Sg3 - early cycle, safe rules

Negative verb, not abbreviation or Roman numeral Ii.

Du1 - early cycle, safe rules

These Du1, Du2 rules are (almost) not in use in our corpus, but we keep them for completeness.

Du2 - early cycle, safe rules

The next two rules are not found in the corpus, but logically they belong, to cover the whole paradigm. There is no verb-internal homonymy here, but there is homonymy with e.g. Illative for certain verbs.

Du3 - early cycle, safe rules

The competitor to Du3 is -ba Foc.

Pl1 - early cycle, safe rules

The competitor here is obviously Inf, but also Pl3 and Prt Sg2.

Pl2 - early cycle, safe rules

These rules are not used when disambiguating the corpus

Pl3 - early cycle, safe rules

Select…

The following two may be joined:

Remove…

The following two may be joined:

PrsPrc

NB: this one is not quite good

*listInf in lists

Section 4: CYCLE 1B: REMOVING THE READINGS THAT WERE LEFT FROM THE 1A RULES

We don’t need more Px sections; it’s done already.

Noun, adjective, PrsPrc or not?

Adjectives and adverbs

Adv or not?

maid has many readings, and as Rel it is a member of S-BOUNDARY. Therefore we need to disambiguate it early in this file. The most important thing is to select Adv. Because A and N can still have Vfin readings, it is difficult to make very general rules.

matPcle

The following two rules are omitted. They only affect the disambiguation of mat pcle, a Wackernagel particle, which is done in the rule above, I think.

Disambiguating abbreviations

Disambiguating particles

Disambiguating rom attr

Disambiguating clitics

Disambiguating numerals

Disambiguating adpositions

čađa

Commented out some adp-rules we don’t need anymore:

geahčai

guovddaš

mađe

miehta

LIST LG-MATERIAL = Inf Adv Nom ;

Disambiguation Noun vs. Po or Pr:

Some particular subjunctions and Neg Sup

go as CS and Qst Pcle

First select all “go” Qst Pcle, then remove them so the rest will be “go” CS
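
The two-step strategy for “go” can be sketched in Python. This is a minimal sketch under stated assumptions: a cohort is a dict with a word form and a list of readings, and `looks_like_question` is a made-up stand-in for the real contextual conditions, which are in the source file.

```python
# Two-step strategy: SELECT in safe contexts first, then REMOVE everywhere.
# Assumption: a cohort is {"form": ..., "readings": [tuple of tags, ...]};
# `looks_like_question` is a hypothetical stand-in for the real contexts.
def select(cohort, wanted):
    """SELECT: keep only readings containing all `wanted` tags, if any do."""
    kept = [r for r in cohort["readings"] if wanted <= set(r)]
    if kept:
        cohort["readings"] = kept


def remove(cohort, unwanted):
    """REMOVE: drop readings containing all `unwanted` tags, but never
    delete the last remaining reading."""
    kept = [r for r in cohort["readings"] if not unwanted <= set(r)]
    if kept:
        cohort["readings"] = kept


def disambiguate_go(sentence, looks_like_question):
    for i, cohort in enumerate(sentence):
        if cohort["form"] != "go":
            continue
        if looks_like_question(sentence, i):
            select(cohort, {"Pcle", "Qst"})  # step 1: question contexts
        remove(cohort, {"Pcle", "Qst"})      # step 2: the rest becomes CS


def ambiguous_go():
    return {"form": "go", "readings": [("Pcle", "Qst"), ("CS",)]}


sentence = [ambiguous_go(), ambiguous_go()]
disambiguate_go(sentence, lambda sent, i: i == 0)
print([c["readings"] for c in sentence])
# [[('Pcle', 'Qst')], [('CS',)]]
```

Because a removal never deletes the last reading of a cohort, the second step leaves the already selected Qst Pcle readings alone and only affects the cohorts that are still ambiguous.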

Section 9 WORD-SPECIFIC RULES

Some particular subjunctions

Adverb rules

MAPPING OF COMP-CS< , COMPLEMENTS OF PARTICLES IN COMPARISON

First map all COMP-CS<, then remove the other readings

MAPPING OF CC AND CS

Mostly we map both @CNP and @CVP, then select @CNP, and after that remove the remaining @CNP so that @CVP remains.

*CVPoppramsing Lásse, Iŋgá ja mun

*CVPCmp/SplitR Cmp/SplitR @CNP

PRONOUNS

Plural?

Interrogative and relative pronouns

Emphatic ieš

Numerals

Indefinite pronouns

The rules are not documented yet

Demonstrative pronouns - should have a look at these

Disambiguating adjectives

Attribute disambiguation

Rules for Attr between Dem and N

Other attribute rules

Special rules for ‘buorre’ (the only adjective showing case agreement)

This block of rules is there to ensure case agreement for comparatives.

alit vs. allat Comp Attr

And now some rules for adverbs that modify adjectives

Proper nouns

VERBS

Disambiguating verbs - part 1

First the ConNeg forms, which are dependent upon Neg verbs. Then the imperative (with its special syntax), the infinitive, and other non-finite forms. Person comes later (in part 2).

ConNeg forms

Numbers following the rule headers below refer to the number of hits in a 13 053 859 word corpus.

Imperative

See also Imprt or Ind some sections down.

Infinitive

Rules that prevent later selection of Inf for a finite verb in the frame

INF-V…CC…

Verbgenitive

Supinum vs. potential – no example found in large corpus

Perfect Participle

Topicalized version

It should be possible to unify the following chapter.

Actio

Present participle

*orrut vs. orrot

Rules for “addit” (which is an adjective, but more often a verb)

Actio Loc = N Loc

Actio Nom = Ess

Imprt or Ind

Nouns or verbs

The rules are not documented yet

Demonstrative pronouns, agreement in DP - should it be moved to after verbmappings?

The rules are not documented yet

VERB MAPPINGS

Verbs as predicatives (@SPRED>) and (@<OPRED)

The tags (@SPRED>) and (@<OPRED) target PrfPrc

The rules are not documented yet

Passive verbs often have

Verbs as prenominal participles (@>N):

(@+FAUXV) and (@+FMAINV) target Neg, orrut

(@A<) target Inf

(@<SUBJ) target Inf

(@<SPRED) target Inf

(@<ADVL) target Inf, Actio Ess

@-F<OBJ target Inf

(@N<) target Inf, Actio Ess

(@<ADVL) target Inf, Actio Ess

(@<OBJ) target Inf, Actio Ess, PrfPrc

(@+FMAINV) and (@+FAUXV) and (@-FAUXV)

(@-FMAINV) and (@-FAUXV)

And then we remove the verbs which didn’t get any syntactic tag, in favour of verbs with syntactic tags.

killifVinCohort This rule removes all other readings if there is a mapped V reading in the same cohort. Every case where this goes wrong should be fixed in the mapping rules or in previous disambiguation rules.
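
The effect of killifVinCohort can be sketched as below. This is a toy model, assuming a reading is a tuple of tags and that syntactic tags are the ones starting with `@`; the helper names and the example cohort are invented.

```python
# Toy model of "keep only mapped verb readings in a cohort".
# Assumption: a reading is a tuple of tags, and syntactic tags start with "@";
# the cohort below is invented for illustration.
def is_mapped_verb(reading):
    return "V" in reading and any(tag.startswith("@") for tag in reading)


def keep_mapped_verbs(readings):
    """If some V reading carries a syntactic tag, drop every other reading
    in the cohort; otherwise leave the cohort untouched."""
    mapped = [r for r in readings if is_mapped_verb(r)]
    return mapped if mapped else list(readings)


cohort = [
    ("V", "Ind", "Prs", "Sg3", "@+FMAINV"),  # verb reading that was mapped
    ("V", "Inf"),                            # verb reading without a tag
    ("N", "Sg", "Nom"),                      # non-verb reading
]
print(keep_mapped_verbs(cohort))
# [('V', 'Ind', 'Prs', 'Sg3', '@+FMAINV')]
```

The sketch shows why errors have to be fixed upstream: once a verb reading has been mapped, every competing reading in the cohort is gone.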

NOUNS

CASE DISAMBIGUATION

Num as subject, tricky cases - the rule should be here because of the verbdisambiguation

ACCUSATIVE-GENITIVE DISAMBIGUATION

Secure rules for choosing Acc

Semantihkka: Choosing accusative or genitive semantically

Other genitive rules

Genlassin Selects Gen if the first word to the right is lassin, e.g. bargostipeanddaid lassin

lassinIll Selects Ill if the first word to the left is lassin, e.g. lassin Sarai

*GenAhkásaš Selects Gen
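
The two lassin rules above depend only on the neighbouring word form, which makes them easy to model. The sketch below is a simplification: the readings given for bargostipeanddaid and Sarai are assumed for the sake of the example, and `apply_lassin_rules` is a hypothetical name.

```python
# Toy model of Genlassin and lassinIll: the neighbour "lassin" decides the case.
# Assumption: a cohort is {"form": ..., "readings": [tuple of tags, ...]};
# the readings listed below are illustrative, not taken from the analyser.
def select_tag(cohort, tag):
    kept = [r for r in cohort["readings"] if tag in r]
    if kept:  # never leave a cohort without readings
        cohort["readings"] = kept


def apply_lassin_rules(sentence):
    for i, cohort in enumerate(sentence):
        if i + 1 < len(sentence) and sentence[i + 1]["form"] == "lassin":
            select_tag(cohort, "Gen")  # Genlassin: lassin to the right
        elif i > 0 and sentence[i - 1]["form"] == "lassin":
            select_tag(cohort, "Ill")  # lassinIll: lassin to the left


phrase = [
    {"form": "bargostipeanddaid", "readings": [("N", "Pl", "Acc"), ("N", "Pl", "Gen")]},
    {"form": "lassin", "readings": [("Po",)]},
]
apply_lassin_rules(phrase)
print(phrase[0]["readings"])
# [('N', 'Pl', 'Gen')]
```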

Gen and preposition/postposition

Genitive in place adverbials ROUTE

Adjectives take object

Temporal adverbials: Choosing accusative or genitive TIME

Reflexive pronouns: acc or gen

Accusative object

*topOBJPers Removes Gen if the target is Acc and to its right is a Pron followed by a transitive verb. The target has to be sentence-initial.

*AccVAbess Selects Gen if to the right is abessive

Gen modifiers inside NP

Accusative in coordination

Intransitive verbs can sometimes be transitive

Accusative or genitive in front of ALU and in front of adjectives

Exceptional accusative attributes in front of ALU nouns.

Numerals

NumGenMeasure Genitive numerals in front of ruvdnosaš with friends

Leftover accusatives

*COMPInfAcc Selects Acc if you are Gen and to the left is an Inf TV @COMP-CS<

Accusative before @COMP-CS<

Accusative before some A

Accusative sentence-finally

Genitive

Nominative and accusative

*NomIFInitialThenSg3 Selects Nom if -1 BOS and 1 oblique / Sg3 lookalike. Works in fragments.

Nominative

Miscellaneous rules

Vocatives, subjects of sentence fragments

Nominative in titles and sentence fragments

Nominative after “go”, “dego”, “dugo” and “nugo”

Preverbal subjects

Postverbal subjects

Nominative predicatives

Nominative as objects in existential clauses

Nominative in coordination and apposition

Nominative in parallel constructions

Not nominative

Comitative rules

NP internal disambiguation of Com

Disambiguation based upon verb valency

Disambiguation of Com depending on Adv or certain verb or N

Animate nouns

HAB-ACTOR in habitive-constructions

váldit vára + Loc

dahkat earrodearvvuođat geainna nu

eallit mainna nu

Disambiguation based upon verb valency

COM-V

tools (concrete and abstract)

BODY as an instrument

Dynamic-verbs

Event-tool-actio

Most actio can be both tool and event.

PLACE-V

STATE-V (eallit)

Movement-verbs

The superset Dynamic-verb: choosing between (Pl Loc) and (Sg Com)

The idea is that the verbs of the superset DYNAMIC-V are not connected to TOOL, ABSTR-TOOL or CONCEPT in (Pl Loc). This is the “least common multiple”. The subsets differ: for instance, many of them (but not all) are not connected to HUMAN in (Pl Loc), and one is not connected to ABSTR-ENTITY and ACTOR in (Pl Loc). We work with negation so that the rules don’t destroy analyses because of insufficient sets.

First come the general rules for selecting (Sg Com), then the more special rules for selecting (Sg Com), and then we select (Pl Loc) for the rest of them under # Another round of locative rules.

HUMAN-LOC-V

Locative and comitative - Disambiguation based upon coordination

And then we remove the remaining Sg Com analyses

Essive OBS

Late case rules (after other case rules have worked).

VERBS PART 2, Section #22

Finite or not

Finite

Not Finite

Indicative Negative

Infinitive

Indicative or imperative

Verbs according to person and number

Sg1 - First person singular

Sg2 - Second person singular

Sg3 - Third person singular

Infinitive and clausal subject

Rules that look backwards for a subject across a relative clause:

Rules that look backwards for a subject across a subordinate clause (CP boundary):

Extension possibilities: Coordination

Son oaidná du ja mu ovdal go boahtit…

Coordinated Sg3 verbs

Not V + Sg3

Du1 - First person dual

The previous two rules look marginal.

Du2 - Second person dual

Rules for leahppi = (“leahppi” N Sg Nom)

Du3 - Third person dual

Pl1 - First person plural

Pl2 - Second person plural

Pl3 - Third person plural

Rules for a special infinitive construction

More finite verbs

Passive

Infinitive

Present Participle

Actio/Perfect Participle

Actio

Selecting some more finite verbs

Lexical disambiguation of verbs

NOMEN

Case rules

Other rules for nouns and pronouns

Determiners

Adverbs and adjectives

NOUNS

Variant lemmas

VERBS

Test: Go for minimal weight.

Final removing rules

Removing Err/Orth


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-semanticroles.cg3.md


This (part of) documentation was generated from src/cg3/semanticroles.cg3


src-cg3-speech_disambiguator.cg3.md

DELIMITERS

Sentence delimiters are the following: <.> <!> <?> <…> <¶>

TAGS AND SETS

Tags

This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

N A Adv V Pron CS CC Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB LEFT RIGHT because of apertium

Tags for POS sub-categories

Pers Dem Interr Indef Recipr Refl Rel Coll NomAg G3 Prop Allegro Arab Romertall

Tags for morphosyntactic properties

Nom Acc Gen Ill Loc Com Ess Sg Du Pl Cmp/SplitR Cmp/Attr Cmp/Cit Cmpnd Cmp/SgNom Cmp/SgGen Cmp/SgGen Cmp/PlGen Cmp/Sh Cmp PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio

Tags for clitic particles

Foc/ge Foc/ge Foc/ge Foc/gen Foc/ges Foc/gis Foc/naj Foc/ba Foc/be Foc/hal Foc/han Foc/bat Foc/son Foc/mis Foc/mat

Derivation tags

Der/PassL Der/PassS Der/NomAg NomAg Der/adda Der/alla Der/easti Der/d Der/eamoš Der/amoš Der/geahtes Der/h Der/Car Der/Car Der/huhtti Der/huvva Der/halla Der/l Der/lasj Der/las Der/meahttun Der/muš Der/NomAct Der/sasj Der/st Der/stuvva Der/upmi Der/supmi Der/vuota Der/InchL Der/laakan Der/laagasj Der/jagáš Der/A Der/A* (because of a bug in lookup2cg) Der/Dimin Der/viđá Der/viđi Der/veara Der/AAdv Der/Adv Der/dáfot Der/keahtta Der/nuolus Der/náittot Der/seagat Der/suttat Der/ár <vdic> Cmp/Hyph <subqst> <ind>

Semantic tags

The semantic tags are included from a generated file.

Syntactic tags

Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

Single-word sets

The sets OKTA and go, and the set INITIAL for initial letters:

OKTA go INITIAL

Sets for word or not

WORD REAL-WORD WORD-NOT-de NOT-COMMA

Derivational affixes

DER-V

DER-N

DER-A1

DER-A

A-V

A-NOT-V

Case sets

ADLVCASE

CASE-HALFAGREEMENT CASE-AGREEMENT CASE

NOT-NOM NOT-GEN NOT-ACC

Verb sets

NOT-V

Sets for finiteness and mood

REAL-NEG

MOOD-V

GC

VFIN

VFIN-POS

VFIN-NOT-IMPRT

VFIN-NOT-NEG

NOT-PRFPRC

Sets for person

Sets consisting of forms of “leat” (these need to be rewritten)

Pronoun sets

Adjectival sets and their complements

Adverbial sets and their complements

Sets for coordinators

Sets for adverbs that have lookalikes

The following adverbs have homonyms in other parts of speech. If they are found in Adv contexts, we treat them as adverbs.

Sets of elements with common syntactic behaviour

Sets for verbs

V covers all readings containing a V tag; REAL-V should cover only those without an N tag following the V.
The REAL-V set thus awaits a fix to the preprocess V … N bug.

TRANS-V is the set of verbs that really take objects.

STRICT-TRANS-V is the set of verbs which do not let a GenAcc be a modifier of anything other than an object, e.g. in Mun organiseren eatni gievkkanis, eatni wants to be the object.

Valency sets

Adverb sets

Adjective sets

NP sets defined according to their morphosyntactic features

The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) …, meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).

Other negatively defined morphosyntactic noun sets

Noun sets

Nominal sets defined according to their morphophonological properties

Sets for lexeme homonymy (most of them have been moved to where the actual rules are)

The words in the set N-PO can be both N and Po; the set takes that into account.

The LAHKA set family

Nominal sets defined according to their semantic properties

Miscellaneous sets

Border sets and their complements

Syntactic sets

ALLSYNTAG NON-APP

These were the set types.

Guessing: Rule for adding Sem/Date as a tag to readings that look like dates

Guessing: Rule for adding Adv Sem/Adr as a tag to readings that look like addresses

Removing or selecting proper nouns that are lookalikes

We don’t want a proper noun analysis of these words sentence-initially.

*Removes PropPl, but there are problems with names such as Davviriikkaid Ráđi, where we want Prop Pl.

*Select PlcSur (Sem/Plc) (Sem/Sur)

Some proper nouns have two parts, and the first is not a genitive. We still have problems with abbr when these proper nouns are inflected or are part of a cmp. The copy rule adds an Attr reading to names which do not get it in the fst (Soria). The select rule selects Attr when the next word is e.g. Moria.

Rules for giving Attr to names, e.g. Ole Attr Kåven.

Remove unwanted analyses

Southern Locative vs. Essive

Numerals

Lexicalised derivations

Particular verbs

Propernouns

Some adjectives are never derived as Adv

Rules for Prop Attr, Sem/Sur and Plc

MISC

ONE-COHORT DISAMBIGUATION - CYCLE 0

The idea behind “cycle 0” is to have safe rules without context first. These rules typically choose lexicalisations over derivations, Saami words instead of marginal names, etc.

Lexicalised derivations

*Removes derN if lexicalised.

*Removes derNEss if lexicalised, and both nouns are essive.

*Removes derA or PrsPrc or VGen if lexicalised. VGen is a chance.

*Removes derAdv when Adv is lexicalised.

*Removes VAbess when Adv is lexicalised.

Fragments and headliners

Adjectives or nouns, not adverbs

Adjective plural, not comparative

Adverbs

Lexicalised adverbs

It is useful to select the adverbial reading early for potential nouns or verbs.

*aloGen removes állu Gen, álo Adv vs. N Gen

*bealisAdv

*bearreAdv beare vs bearri

*ilusAdv

*rámisA

Pronouns

Nouns, not verbs

Lexical selection - nouns

mánnu vs mánus

Not noun

Adposition or not

Not Qst

Interjections

Px-rules for special nouns

Some verb rules

Particular CS

Verb or Noun?

Adpositions

Adpositions, not verbs

Section 2: LOCAL DISAMBIGUATION - CYCLE 1

FAMILY pronouns

Pron Pers 1. p.

Pron Pers 2. p.

Pron Pers 3. p.

An early rule for “eanaš”/“eanas”

Px constraints

First select Px, then remove all remaining Px

We end section 2 by removing all remaining Px

Section 3: Certain verb readings

verb or adv

All imperatives

For imperative disambiguation we need the following: pick the imperative contexts, and thereafter remove the imperative elsewhere. Such contexts are: an imperative verb sentence-initially, with an exclamation mark.

Sg1 - early cycle, safe rules

Sg2 - early cycle, safe rules

Sg3 - early cycle, safe rules

Negative verb, not abbreviation or Roman numeral Ii.

Du1 - early cycle, safe rules

These Du1, Du2 rules are (almost) not in use in our corpus, but we keep them for completeness.

Du2 - early cycle, safe rules

The next two rules are not found in the corpus, but logically they belong, to cover the whole paradigm. There is no verb-internal homonymy here, but there is homonymy with e.g. Illative for certain verbs.

Du3 - early cycle, safe rules

The competitor to Du3 is -ba Foc.

Pl1 - early cycle, safe rules

The competitor here is obviously Inf, but also Pl3 and Prt Sg2.

Pl2 - early cycle, safe rules

These rules are not used when disambiguating the corpus

Pl3 - early cycle, safe rules

Select…

The following two may be joined:

Remove…

The following two may be joined:

PrsPrc

NB: this one is not quite good

*listInf in lists

Section 4: CYCLE 1B: REMOVING THE READINGS THAT WERE LEFT FROM THE 1A RULES

We don’t need more Px sections; it’s done already.

Noun, adjective, PrsPrc or not?

Adjectives and adverbs

Adv or not?

maid has many readings, and as Rel it is a member of S-BOUNDARY. Therefore we need to disambiguate it early in this file. The most important thing is to select Adv. Because A and N can still have Vfin readings, it is difficult to make very general rules.

matPcle

The following two rules are omitted. They only affect the disambiguation of mat pcle, a Wackernagel particle, which is done in the rule above, I think.

Disambiguating abbreviations

Disambiguating particles

Disambiguating rom attr

Disambiguating clitics

Disambiguating numerals

Disambiguating adpositions

čađa

Commented out some adp-rules we don’t need anymore:

geahčai

guovddaš

mađe

miehta

LIST LG-MATERIAL = Inf Adv Nom ;

Disambiguation Noun vs. Po or Pr:

Some particular subjunctions and Neg Sup

go as CS and Qst Pcle

First select all “go” Qst Pcle, then remove them so the rest will be “go” CS

Section 9 WORD-SPECIFIC RULES

Some particular subjunctions

Adverb rules

MAPPING OF COMP-CS< , COMPLEMENTS OF PARTICLES IN COMPARISON

First map all COMP-CS<, then remove the other readings

MAPPING OF CC AND CS

Mostly we map both @CNP and @CVP, then select @CNP, and after that remove the remaining @CNP so that @CVP remains.

*CVPoppramsing Lásse, Iŋgá ja mun

*CVPCmp/SplitR Cmp/SplitR @CNP

PRONOUNS

Plural?

Interrogative and relative pronouns

Emphatic ieš

Numerals

Indefinite pronouns

The rules are not documented yet

Demonstrative pronouns - should have a look at these

Disambiguating adjectives

Attribute disambiguation

Rules for Attr between Dem and N

Other attribute rules

Special rules for ‘buorre’ (the only adjective showing case agreement)

This block of rules is there to ensure case agreement for comparatives.

alit vs. allat Comp Attr

And now some rules for adverbs that modify adjectives

Proper nouns

VERBS

Disambiguating verbs - part 1

First the ConNeg forms, which are dependent upon Neg verbs. Then the imperative (with its special syntax), the infinitive, and other non-finite forms. Person comes later (in part 2).

ConNeg forms

Numbers following the rule headers below refer to the number of hits in a 13 053 859 word corpus.

Imperative

See also Imprt or Ind some sections down.

Infinitive

Rules that prevent later selection of Inf for a finite verb in the frame

INF-V…CC…

Verbgenitive

Supinum vs. potential – no example found in large corpus

Perfect Participle

Topicalized version

It should be possible to unify the following chapter.

Actio

Present participle

*orrut vs. orrot

Rules for “addit” (which is an adjective, but more often a verb)

Actio Loc = N Loc

Actio Nom = Ess

Imprt or Ind

Nouns or verbs

The rules are not documented yet

Demonstrative pronouns, agreement in DP - should it be moved to after verbmappings?

The rules are not documented yet

VERB MAPPINGS

Verbs as predicatives (@SPRED>) and (@<OPRED)

The tags (@SPRED>) and (@<OPRED) target PrfPrc

The rules are not documented yet

Passive verbs often have

Verbs as prenominal participles (@>N):

(@+FAUXV) and (@+FMAINV) target Neg, orrut

(@A<) target Inf

(@<SUBJ) target Inf

(@<SPRED) target Inf

(@<ADVL) target Inf, Actio Ess

@-F<OBJ target Inf

(@N<) target Inf, Actio Ess

(@<ADVL) target Inf, Actio Ess

(@<OBJ) target Inf, Actio Ess, PrfPrc

(@+FMAINV) and (@+FAUXV) and (@-FAUXV)

(@-FMAINV) and (@-FAUXV)

And then we remove the verbs which didn’t get any syntactic tag, in favour of verbs with syntactic tags.

killifVinCohort This rule removes all other readings if there is a mapped V reading in the same cohort. Every case where this goes wrong should be fixed in the mapping rules or in previous disambiguation rules.

NOUNS

CASE DISAMBIGUATION

Num as subject, tricky cases - the rule should be here because of the verbdisambiguation

ACCUSATIVE-GENITIVE DISAMBIGUATION

Secure rules for choosing Acc

Semantihkka: Choosing accusative or genitive semantically

Other genitive rules

Genlassin Selects Gen if the first word to the right is lassin, e.g. bargostipeanddaid lassin

lassinIll Selects Ill if the first word to the left is lassin, e.g. lassin Sarai

Gen and preposition/postposition

Genitive in place adverbials ROUTE

Adjectives take object

Temporal adverbials: Choosing accusative or genitive TIME

Reflexive pronouns: acc or gen

Accusative object

*topOBJPers Removes Gen if the target is Acc and to its right is a Pron followed by a transitive verb. The target has to be sentence-initial.

*AccVAbess Selects Gen if to the right is abessive

Gen modifiers inside NP

Accusative in coordination

Intransitive verbs can sometimes be transitive

Accusative or genitive in front of ALU and in front of adjectives

Exceptional accusative attributes in front of ALU nouns.

Numerals

NumGenMeasure Genitive numerals in front of ruvdnosaš with friends

Leftover accusatives

*COMPInfAcc Selects Acc if you are Gen and to the left is an Inf TV @COMP-CS<

Accusative before @COMP-CS<

Accusative before some A

Accusative sentence-finally

Genitive

Nominative and accusative

*NomIFInitialThenSg3 Selects Nom if -1 BOS and 1 oblique / Sg3 lookalike. Works in fragments.

Nominative

Miscellaneous rules

Vocatives, subjects of sentence fragments

Nominative in titles and sentence fragments

Nominative after “go”, “dego”, “dugo” and “nugo”

Preverbal subjects

Postverbal subjects

Nominative predicatives

Nominative as objects in existential clauses

Nominative in coordination and apposition

Nominative in parallel constructions

Not nominative

Comitative rules

NP internal disambiguation of Com

Disambiguation based upon verb valency

Disambiguation of Com depending on Adv or certain verb or N

Animate nouns

HAB-ACTOR in habitive-constructions

váldit vára + Loc

dahkat earrodearvvuođat geainna nu

eallit mainna nu

Disambiguation based upon verb valency

COM-V

tools (concrete and abstract)

BODY as an instrument

Dynamic-verbs

Event-tool-actio

Most actio can be both tool and event.

PLACE-V

STATE-V (eallit)

Movement-verbs

The superset Dynamic-verb: choosing between (Pl Loc) and (Sg Com)

The idea is that the verbs of the superset DYNAMIC-V are not connected to TOOL, ABSTR-TOOL or CONCEPT in (Pl Loc). This is the “least common multiple”. The subsets differ: for instance, many of them (but not all) are not connected to HUMAN in (Pl Loc), and one is not connected to ABSTR-ENTITY and ACTOR in (Pl Loc). We work with negation so that the rules don’t destroy analyses because of insufficient sets.

First come the general rules for selecting (Sg Com), then the more special rules for selecting (Sg Com), and then we select (Pl Loc) for the rest of them under # Another round of locative rules.

HUMAN-LOC-V

Locative and comitative - Disambiguation based upon coordination

And then we remove the remaining Sg Com analyses

Essive OBS

Late case rules (after other case rules have worked).

VERBS PART 2, Section #22

Finite or not

Finite

Not Finite

Indicative Negative

Infinitive

Indicative or imperative

Verbs according to person and number

Sg1 - First person singular

Sg2 - Second person singular

Sg3 - Third person singular

Infinitive and clausal subject

Rules that look backwards for a subject across a relative clause:

Rules that look backwards for a subject across a subordinate clause (CP boundary):

Extension possibilities: Coordination

Son oaidná du ja mu ovdal go boahtit…

Coordinated Sg3 verbs

Not V + Sg3

Du1 - First person dual

The previous two rules look marginal.

Du2 - Second person dual

Rules for leahppi = (“leahppi” N Sg Nom)

Du3 - Third person dual

Pl1 - First person plural

Pl2 - Second person plural

Pl3 - Third person plural

Rules for a special infinitive construction

More finite verbs

Passive

Infinitive

Present Participle

Actio/Perfect Participle

Actio

Selecting some more finite verbs

Lexical disambiguation of verbs

NOMEN

Case rules

Other rules for nouns and pronouns

Determiners

Adverbs and adjectives

NOUNS

Variant lemmas

VERBS

Final removing rules

Removing Err/Orth


This (part of) documentation was generated from src/cg3/speech_disambiguator.cg3


src-cg3-valency.cg3.md


This (part of) documentation was generated from src/cg3/valency.cg3


src-fst-morphology-affixes-abbreviations.lexc.md

Continuation lexicons for abbreviations

Lexica for adding tags and periods

The sublexica

Continuation lexicons for abbrs both with and without final period

Lexicons without final period

Lexicons with final period


This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


src-fst-morphology-affixes-acronyms.lexc.md

North Saami acronyms - affix part

The lexica giving tags and suffixes to the acronyms


This (part of) documentation was generated from src/fst/morphology/affixes/acronyms.lexc


src-fst-morphology-affixes-adjectives.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

North Saami adjective declension file

Bisyllabic adjectives

Consonant-final even-syllabic adjectives

Trisyllabic adjectives

Contracted adjectives

Special cases

Final note on the adjective sublexica

todo: Rewrite the adj lexica so that the attr variation is kept separate from the otherwise uniform declension.

Adjective declension

GOAL: Keep GAPPUS- and MALLAS- apart, because of the Px(1)V issue, but unify the rest. GAPPUS- and MALLAS- differ in the A and N treatment of Pl Nom Px (only 1st p. for A and all persons for N). Now that MALLASI- is deleted, GAPPUS- and MALLAS- are identical. We check this by pointing GAPPUS- to MALLAS-. Look into this, and eventually remove GAPPUS- in favour of MALLAS-.

Nominal derivation

Noun derivation

Adjective derivation

Adverbs from adjectives

Adjectives from nouns


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-nouns.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

North Saami noun declension

Bisyllabic nouns

it does not have the Prop tag.

Bisyllabic nouns 2f. Actor lexica

2f. Actor lexica

+Use/NG:%> GOAHTAI ; ! Ill sublexicon no dipth simpl

Trisyllabic nouns

Trisyllabic nouns

Contracted nouns

Contracted nouns

Sublexica for nominal stems

Declension

Noun declension

Px lexica

Some GOAHTE-type lexica…

Other lexica

+Use/NG: GOAHTAI ; ! Ill sublexicon


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-numerals.lexc.md

North Saami numerals


This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


src-fst-morphology-affixes-possessive-suffixes.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

North Saami Possessive suffixes


This (part of) documentation was generated from src/fst/morphology/affixes/possessive-suffixes.lexc


src-fst-morphology-affixes-pronouns.lexc.md

some multiword prons, according to Nickel


This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Different lexicon for female persons and place names.

Different lexicon for personal surnames. Blind


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

Verb conjugation

Basic lexica for bisyllabic verbs

Modals

These are treated separately because modals do not participate in derivation

Ordinary bisyllabic verbs

Bisyllabic verbs

Intermediate lexica for even-syllable verbs

Basic lexica for contracted verbs

Basic lexica for contracted verbs

Basic lexica for trisyllabic verbs

Basic lexica for trisyllabic verbs

Finite declension

Present tense

Vocalic stems

Consonantal stems

Past tense

Vocalic stems

Consonantal stems

Imperative mood

Infinite forms

V- and C-final

Continuation lex

Derivation


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-clitics.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

Clitics

The lexicon K_only is for paths not going to the K-less ENDLEX

The following lexicons are not referenced by the K lexicon, but directly in specific cases.


This (part of) documentation was generated from src/fst/morphology/clitics.lexc


src-fst-morphology-compounding.lexc.md

Divvun & Giellatekno - open source grammars for Sámi and other languages

North Sámi compounding

This file governs prefixing and compounding, with the following lexica and pointers. All lexica and lexicon entries are documented.

LEXICON Prefixes = lexicon for adding *eahpe and pointing to N, A, V

LEXICON R = lexicon which is pointed to from affixes files. Here the strings get flags to control compounding (@P.CmpFrst.FALSE@ etc.) and are redirected to RAlmostReal.

LEXICON RAlmostReal = lexicon pointed to from R (where flags are added) and pointing to RrealAfterCmpNFlags and (with +Cmp tag) to MiddleNouns, lexicalising the 3-part compounds with the tag ShCmp. It has two entries:

LEXICON Rreal = This is the former R lexicon, renamed to avoid the MiddleNouns loop. The string gets flags like for R, and directed to RrealAfterCmpNFlags.

LEXICON RrealAfterCmpNFlags = This was also part of the former R lexicon, here renamed to avoid the MiddleNouns loop. Here it gets flags ensuring the result is N+N.

LEXICON RHyph = Recursive lexicon from all classes REQUIRING a hyphen to follow.

LEXICON RHyphTags = adds +Cmp/Hyph and +Cmp, and then - on lower side.

LEXICON RNum = For Num Cmp Noun; we do not want Num Cmp Num

LEXICON Rnoun = the lexicon has two entries:

LEXICON RProp = lexicon pointed to from propernouns, and containing 3 entries

LEXICON RPropTags = A special lexicon for handling proper noun compounding without hyphens. Two entries:

LEXICON flagON-R = turns NeedsVowRed on:

LEXICON flagOFF-R = turns NeedsVowRed off:


This (part of) documentation was generated from src/fst/morphology/compounding.lexc


src-fst-morphology-phonology.bergslan.twolc.md

North Sámi morphophonological rule set

This file documents the phonology.twolc file

The file contains the rule set for the non-segmental North Sámi morphophonological rules

Note that when copied over to newinfra, this file will be labeled sme-phon-L1.twolc. The file sme-phon-L1.twolc will not be the source file to edit, rather, the source file will be this file, gt/sme/src/twol-sme.txt. This file (in the old infra) is the ordinary sme fst file to be edited. The L2 sme fst, on the other hand, will have lags/sme/src/phonology/sme-phon-L2.twolc as its sourcefile, the file to be edited.

º is for CnsGrad of the lg:lgg and lºl:ll type

¤:0 prevents ConsGrad in certain words

' is the real apostrophe

boahºtiY4t ! It seems it should be Q3. … both?!

čuorºvuY4t ! Q2, it seems.

Changed because we get almmáj- and not almmái-. Postvocalic j surfaces as i. Is this what we want, without right context? postvoc j:i <=> Vow: ( :0 ) (Dummy: ) _ ;


This (part of) documentation was generated from src/fst/morphology/phonology.bergslan.twolc


src-fst-morphology-phonology.twolc.md

North Sámi morphophonological rule set

This file documents the phonology.twolc file

The file contains the rule set for the non-segmental North Sámi morphophonological rules

Note that when copied over to newinfra, this file will be labeled sme-phon-L1.twolc. The file sme-phon-L1.twolc will not be the source file to edit, rather, the source file will be this file, gt/sme/src/twol-sme.txt. This file (in the old infra) is the ordinary sme fst file to be edited. The L2 sme fst, on the other hand, will have lags/sme/src/phonology/sme-phon-L2.twolc as its sourcefile, the file to be edited.

º is for CnsGrad of the lg:lgg and lºl:ll type

¤:0 prevents ConsGrad in certain words

' is the real apostrophe

boahºtiY4t ! It seems it should be Q3. … both?!

čuorºvuY4t ! Q2, it seems.

Changed because we get almmáj- and not almmái-. Postvocalic j surfaces as i. Is this what we want, without right context? postvoc j:i <=> Vow: ( :0 ) (Dummy: ) _ ;


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Divvun & Giellatekno - open source grammars for North Sámi.

North Sámi morphological analyser

Multicharacter symbols

Tags for POS

Tags for sub-POS

Tags for Inflection

Tags for Case and Number Inflection

Possessive tags

Adjectival tags

Moods

Tenses

Verb person-number

Infinite verb forms

Other tags

Question and Focus particles:

Tags distinguishing different versions of the same lemma (before POS)

Note: These high +v… numbers are in use for one word only: doavttergrádakursa

Escaped chars

Error (non-standard language) tags

Usage tags

Dialect tags:

Tags for indicating the orthography used

+Orth/Strd - Standard orthography

+Orth/IPA - IPA transcription

The above should either be used in pairs, or not at all. That is, if a word doesn’t need an IPA stem (because the word in all its inflection can be converted to IPA by the standard IPA conversion rules), then none of these tags should be used. On the other hand, if the word has a spelling that doesn’t follow the orthographic rules, and thus needs an exceptional IPA stem to get it right, then the exceptional stem must be marked with +Orth/IPA, and the regular orthography stem must be marked with +Orth/Strd. This lets us exclude one or the other from different FSTs, but only when the opposite stem variant is present.
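The pairing rule can be checked mechanically. The following is a minimal Python sketch, not part of the build: it assumes the lexicon entries have already been reduced to (lemma, tag string) pairs, and the function name and the sample entries are invented for illustration.

```python
ORTH_TAGS = ("+Orth/Strd", "+Orth/IPA")


def unpaired_orth_lemmas(entries):
    """Return lemmas that carry only one of +Orth/Strd and +Orth/IPA."""
    seen = {}
    for lemma, tags in entries:
        found = seen.setdefault(lemma, set())
        for tag in ORTH_TAGS:
            if tag in tags:
                found.add(tag)
    # Zero tags (regular word) and two tags (a proper pair) are both fine.
    return sorted(lemma for lemma, found in seen.items() if len(found) == 1)


# Hypothetical entries: "foo" is paired, "bar" is untagged, "baz" lacks its partner.
entries = [
    ("foo", "+N+Orth/Strd"),
    ("foo", "+N+Orth/IPA"),
    ("bar", "+N"),
    ("baz", "+N+Orth/IPA"),
]
print(unpaired_orth_lemmas(entries))  # ['baz']
```

Any lemma reported by such a check would need either the missing partner stem or the removal of the lone tag.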

Tags for indicating alternative orthographies, cf configure.ac

+AltOrth/standard - Standard orthography

+AltOrth/bergslan - Bergsland-Ruong orthography

+AltOrth/-standard - NOT Standard orthography

+AltOrth/-bergslan - NOT Bergsland-Ruong orthography

Multichars for marking start and end of IPA sequences

Compounding tags

The tags are of the following form:

This entry / word should be in the following position(s):

If unmarked, any position goes.

The tagged part of the compound should make a compound using:

Unmarked = Default, ie +CmpN/SgN for SME.

The second part of the compound may require that the previous (left part) is:

Tags for descriptive compound analysis - this is what a compound actually is:

Compounding tag ordering

To ease writing and maintaining regexes etc for manipulating and enforcing compounding, it is important to keep the tags in a certain order. The order is:

  1. +CmpN/ tags
  2. +CmpNP/ tags
  3. +Cmp/ tags - this is always true since the descriptive tags are always part of the continuation lexicons, and will be located after the POS tag.
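The ordering requirement lends itself to a simple automated check. Below is a hedged Python sketch that assumes analyses are available as plain tag strings; the function name and the sample strings are hypothetical, and only the tag prefixes come from the list above.

```python
import re

# Rank of each compounding tag family in the required order.
RANK = {"CmpN": 0, "CmpNP": 1, "Cmp": 2}
CMP_TAG = re.compile(r"\+(CmpNP|CmpN|Cmp)/")


def compounding_tags_in_order(analysis):
    """True if +CmpN/ tags precede +CmpNP/ tags, which precede +Cmp/ tags."""
    ranks = [RANK[m.group(1)] for m in CMP_TAG.finditer(analysis)]
    return ranks == sorted(ranks)


# Hypothetical tag strings, used only to exercise the check.
print(compounding_tags_in_order("x+CmpN/SgN+CmpNP/Only+N+Cmp/Hyph"))  # True
print(compounding_tags_in_order("x+CmpNP/Only+CmpN/SgN+N"))           # False
```

Tags without a slash after the prefix are ignored by this sketch, since the ordering rule above only concerns the three slash-prefixed families.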

Semantic tags to help disambiguation & synt. analysis: (before POS)

Multiple Semantic tags:

Tags for derivation

Explanation:

Positional derivational tags

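| +Der1 | +Der2 | +Der3 | +Der4 | POS transition | Comments |
| --- | --- | --- | --- | --- | --- |
| +Der/Dimin | | | | NN | (was: Der/aš & Der/š) |
| +Der/lasj | | | | NA | |
| +Der/meahttun | | | | VA | |
| +Der/d | | | | VV | |
| +Der/h | | | | VV | -hit/Causative |
| +Der/Caus | | | | VV | -ahtti/Causative |
| +Der/huhtti | | | | VV | |
| +Der/l | | | | VV | |
| +Der/st | | | | VV | |
| +Der/las | | | | VA | * +Der1+Der2 - can only combine with Der3 |
| +Der/Car | | | | NA | * +Der1+Der2 - can only combine with Der3 |
| +Der/laakan | | | | AA | * +Der1+Der2 - can only combine with Der3 |
| +Der/halla | | | | VV | * +Der1+Der2 - can only combine with Der3 |
| +Der/huvva | | | | VV | * +Der1+Der2 - can only combine with Der3 |
| +Der/stuvva | | | | VV | * +Der1+Der2 - can only combine with Der3 |
| +Der/PassS | | | | VV | short passive |
| | +Der/t | | | NA | |
| | +Der/ár | | | ACRO>N | |
| | +Der/NomAg | | | VN | |
| | +Der/NomAct | | | VN | Der/NomAct has two realisations, with different restrictions; this is the former Der/eapmi |
| | +Der/sasj | | | NA | |
| | +Der/adda | | | VV | |
| | +Der/alla | | | VV | |
| | +Der/AAdv | | | QA | check this! |
| | +Der/easti | | | VV | |
| | +Der/laagasj | | | QA | |
| | +Der/Comp | | | AA | |
| | +Der/Superl | | | AA | |
| | | +Der/PassL | | VV | long passive |
| | | +Der/vuota | | AN | |
| | | | +Der/InchL | VV | |
| | | | +Der/amoš | VN | |
| | | | +Der/eamoš | VN | |
| | | | +Der/geahtes | VA | |
| | | | +Der/keahtta | VA | |
| | | | +Der/muš | VN | |
| | | | +Der/supmi | VN | |
| | | | +Der/upmi | VN | |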

Non-positional derivations

All non-positional derivations should be preceded by the following tag, to make it possible to target regular expressions at all derivations in a language-independent way: just specify +Der|+Der1 .. +Der4 and you are set.

Tag POS transition Comment
+Der n/a generic derivation tag used in front of all non-positional derivations.
+Der/veara NA#  
+Der/viđá NA#  
+Der/viđi NA#  
+Der/has ? only one in the code
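To illustrate why the generic +Der tag is useful, here is a small Python sketch of a regular expression that targets the marker tags +Der and +Der1 .. +Der4 without matching the specific +Der/xxx tags that follow them. The pattern, the function name and the sample analysis strings are assumptions made for illustration; they are not taken from the source files.

```python
import re

# "+Der" optionally followed by 1-4, and not followed by "/" or another
# word character, so "+Der/veara" itself is not matched.
DER_MARKER = re.compile(r"\+Der[1-4]?(?![\w/])")


def derivation_markers(analysis):
    """Return the positional and generic derivation markers in an analysis."""
    return DER_MARKER.findall(analysis)


# Hypothetical analysis strings.
print(derivation_markers("x+N+Der1+Der/Dimin+N+Sg+Nom"))  # ['+Der1']
print(derivation_markers("x+N+Der+Der/veara+A"))          # ['+Der']
print(derivation_markers("x+N+Sg+Nom"))                   # []
```

Because every derivation is introduced by one of these five markers, a single pattern of this kind finds all derived analyses regardless of which concrete derivation tag is involved.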

Miscellaneous list

See lexicons NAMAT and SAS for these:

Tags for originating language

The following tags are used to guide conversion to IPA: loan words and foreign names are usually pronounced (approximately) as in the originating (majority) language. Instead of trying to identify the correct pronunciation based on phonotactics (orthotactics actually), we tag all words that can’t be correctly transcribed using the SME transcriber with source language codes. Once tagged, it is possible to split the lexical transducer in smaller ones according to language, and apply different IPA conversion to each of them. The principle of tagging is that we only tag to the extent needed, and following a priority:

  1. any untagged word is pronounced with SME orthographic conventions
  2. NNO and NOB have identical pronunciation, NNO is only used if different in spelling from NOB
  3. SWE has mostly the same pronunciation as NOB, and is only used if different in spelling from NOB
  4. Occasionally even SME (the default) may be tagged, to block other languages from being specified, mainly during semi-automatic language tagging sessions

All in all, we want to get as much correctly transcribed to IPA with as little work as possible. On the other hand, if more words are tagged than strictly needed, this should pose no problem as long as the IPA conversion is correct - at least some words will get the same pronunciation whether read as SME or NOB/NNO/SWE.

Triggers for morphophonological rules

Morphophonemes and Sámi letters

= a symbol used in front of # to block backtracking and mwe reanalysis in hfst-tokenise (e.g. in dynamic compounds). It makes it possible to distinguish lexical and dynamic compounds in rules. It is converted to zero together with #.

Symbols that need to be escaped on the lower side (towards twolc):

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

Flag Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
@P.Vgen.add@ (Dis)allow VGen
@R.Vgen.add@ (Dis)allow VGen
@P.12p.add@ (Dis)allow 1. and 2. pers forms
@R.12p.add@ (Dis)allow 1. and 2. pers forms
@P.Pmatch.Loc@ Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split.
@P.Pmatch.Backtrack@ Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed)
Flag Explanation
@D.ErrOrth.ON@  
@C.ErrOrth@  
@P.ErrOrth.ON@  
@R.ErrOrth.ON@  

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@U.CmpNone.TRUE@ Combines with the two previous ones to block compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.
@D.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.FALSE@ Flag to control hyphenated compounds like proper nouns
@U.CmpHyph.TRUE@ Flag to control hyphenated compounds like proper nouns
@C.CmpHyph@ Flag to control hyphenated compounds like proper nouns

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
Flag diacritic Explanation
@U.number.one@ Flag used to give arabic numerals in smj different cases ;
@U.number.two@ Flag used to give arabic numerals in smj different cases ;
@U.number.three@ Flag used to give arabic numerals in smj different cases ;
@U.number.four@ Flag used to give arabic numerals in smj different cases ;
@U.number.five@ Flag used to give arabic numerals in smj different cases ;
@U.number.six@ Flag used to give arabic numerals in smj different cases ;
@U.number.seven@ Flag used to give arabic numerals in smj different cases ;
@U.number.eight@ Flag used to give arabic numerals in smj different cases ;
@U.number.nine@ Flag used to give arabic numerals in smj different cases ;
@U.number.ten@ Flag used to give arabic numerals in smj different cases ;
@U.number.zero@ Flag used to give arabic numerals in smj different cases ;
@P.number.one@ Flag used to give arabic numerals in smj different cases ;
@P.number.two@ Flag used to give arabic numerals in smj different cases ;
@P.number.three@ Flag used to give arabic numerals in smj different cases ;
@P.number.four@ Flag used to give arabic numerals in smj different cases ;
@P.number.five@ Flag used to give arabic numerals in smj different cases ;
@P.number.six@ Flag used to give arabic numerals in smj different cases ;
@P.number.seven@ Flag used to give arabic numerals in smj different cases ;
@P.number.eight@ Flag used to give arabic numerals in smj different cases ;
@P.number.nine@ Flag used to give arabic numerals in smj different cases ;
@P.number.ten@ Flag used to give arabic numerals in smj different cases ;
@P.number.zero@ Flag used to give arabic numerals in smj different cases ;

Basic lexica, pointing to the other lexicon files

Abbreviation

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged with +CmpNP/Only to end here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
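How the ENDLEX flags filter paths can be sketched with a toy interpreter for flag diacritics. The Python below is a simplified illustration of the usual P (set), D (disallow), R (require), C (clear) and U (unify) semantics, not the HFST implementation. The function name is hypothetical, and the example paths are made up by combining flags listed in the tables above.

```python
import re

FLAG = re.compile(r"@([PDRCU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(path):
    """Walk the flag diacritics in a path and report whether all checks pass."""
    state = {}
    for op, feature, value in FLAG.findall(path):
        value = value or None          # "@C.X@" and "@D.X@" carry no value
        current = state.get(feature)
        if op == "P":                  # set feature to value
            state[feature] = value
        elif op == "C":                # clear feature
            state.pop(feature, None)
        elif op == "D":                # fail if feature is set (to this value)
            if current is not None and (value is None or current == value):
                return False
        elif op == "R":                # fail unless feature is set (to this value)
            if current is None or (value is not None and current != value):
                return False
        elif op == "U":                # set if unset, otherwise values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


ENDLEX = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@"

print(path_is_allowed(ENDLEX))                                   # True
print(path_is_allowed("@P.NeedNoun.ON@" + ENDLEX))               # False
print(path_is_allowed("@P.NeedNoun.ON@@C.NeedNoun@" + ENDLEX))   # True
```

In this sketch a path that has set NeedNoun to ON is stopped at ENDLEX, while a path that has cleared the flag again (as one would expect after nominalisation) passes through.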

ENDLEX2

ENDLEX3

ENDLEX4


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-adjectives.lexc.md

North Sámi adjective lexicon


This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


src-fst-morphology-stems-adpositions.lexc.md

North Saami adposition lexicon

First come the 3 continuation lexica; the division is based on Nickel and should probably be revised. Then come the adpositions themselves. The uninflecting ones are pointed to the 3 tag lexica, the Px ones to the Px lexica in sme-lex.txt and closed-sme-lex.txt.


This (part of) documentation was generated from src/fst/morphology/stems/adpositions.lexc


src-fst-morphology-stems-adverbs.lexc.md

North Saami adverbs

First come some multiword adverbs, declared as MWE in tok.txt. Of these, the ones going to adv are not treated as MWE in abbr.txt and preprocess, whereas the ones going to multiadv are treated as one unit in the syntax. There are only a handful of words in the multiadv lexicon; they are the ones that are mentioned in sme-dis.rle. Goal: have MWE adverbs with the syntactic behaviour of single words go to multiadv.

Thereafter comes the ordinary adverb list.

Then come the gradating adverbs

Lexica for adverb subtypes

The main adverb lexicon


This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


src-fst-morphology-stems-conjunctions.lexc.md

North Saami Conjunctions


This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


src-fst-morphology-stems-interjections.lexc.md

North Saami Interjections


This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


src-fst-morphology-stems-nouns.lexc.md

North Sámi noun lexicon


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


src-fst-morphology-stems-numerals.lexc.md

North Saami numerals

The initial lexica

The LEXICON CmpNumeral lexicon is the entrance for compounds with numbers. Introduced to restrict such compounding to a subgroup of numerals only, mainly to exclude roman numerals, that turned out to be too problematic. With this change, roman numerals are only recognised on their own.

Arabic numerals

Arabic numeral expressions can be classified in at least the following categories:

And for sure more than these. Previously everything has been more or less lumped together, but to avoid noise and to get better input for grammar checking the ARABICS section should be rewritten such that each category gets its own lexicon. That way it is easier to restrict the syntax of numerical expressions in each category.


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-morphology-stems-particles.lexc.md

This file contains the Particles

Perhaps this should be opened up with a pointer to K, and all the ge versions removed (i.e. only goit, not goitge). That would, however, erroneously permit gege, goge, etc., so we leave things as they are.


This (part of) documentation was generated from src/fst/morphology/stems/particles.lexc


src-fst-morphology-stems-pronouns.lexc.md

This file contains the Pronouns

Interrogative pronouns

Giving the idiosyncratic Sg Nom of gii, mii lexically. Sending the oblique forms of gii, mii to an oblique sublexicon. Giving the stem of guhte, guhtemuš, goabbá.

Relative pronouns

Demonstrative pronouns

Giving baseform + all demonstrative stems

Pointing to case paradigms

Reflexive pronouns

Two nominative reflexives, and a pointer to the rest. The Pl one is used for Du as well, here given as two entries. Should one of them be removed?

Reciprocal pronouns

The first 4 entries handle the first element of the recipr. The next 12 handle the 2nd part of the non-Px recipr. The members of the third section point to Px lexica.

Indefinite pronouns

Dividing the indefinites in three groups

Declinable indefinite pronouns with case + clitic

Declinable indefinites with normal case paradigms

Separate lexica for exceptional entries

The indeclinable indefinites


This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


src-fst-morphology-stems-sme-abbreviations.lexc.md

File containing North Saami abbreviations

Lexica for adding tags and periods

Splitting in 4 + 1 groups, because of the preprocessor

The abbreviation lexicon itself

This class contains homonyms, which are both intransitive abbreviations and normal words. The abbreviation usage is less common, and thus only the occurrences in the middle of the sentence (when the next word is in lower case) can be considered true cases.

For abbrs for which numerals are complements, but other words not necessarily are. This group is treated as transitive when followed by Arabic numerals but as intransitive when followed by letters.

This lexicon is for abbrs that always have a constituent following them.

This class contains homonyms, which are both abbrs for which numerals are complements and normal words. The abbreviation usage is less common, and thus only the occurrences in the middle of the sentence can be considered true cases.


This (part of) documentation was generated from src/fst/morphology/stems/sme-abbreviations.lexc


src-fst-morphology-stems-sme-propernouns.lexc.md

The North Saami proper noun lexicon


This (part of) documentation was generated from src/fst/morphology/stems/sme-propernouns.lexc


src-fst-morphology-stems-sme-punctuation.lexc.md

Punctuation symbols

They are all tagged +RIGHT even though the correct quotation mark is supposed to be used on both sides. This is done to simplify generation, by keeping the same tagging as the standard analysis.


This (part of) documentation was generated from src/fst/morphology/stems/sme-punctuation.lexc


src-fst-morphology-stems-subjunctions.lexc.md

The North Saami Subjunctions


This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


src-fst-morphology-stems-verbs.lexc.md

North Saami verbs

Negative verbs

Copula

Stray forms

Main verbs

Here comes the main list of verbs.


This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


src-fst-phonetics-text2tts-fin.xfscript.md

retroflex plosive, voiceless t ʈ 0288, 648 ( = ASCII 096)
retroflex plosive, voiced d ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v)
alveolar approximant r\
retroflex approximant r`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L

Clicks

bilabial O\ (O = capital letter)
dental |
(post)alveolar !\
palatoalveolar =\
alveolar lateral ||

Ejectives, implosives

ejective > e.g. ejective p p>
implosive < e.g. implosive b b<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress “
secondary stress %
long :
half-long :\
extra-short _X
linking mark -

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L

contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
global rise
global fall

Diacritics

voiceless 0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _”
syllabic = (or _=) e.g. n= (or n=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ‘ (or _j) e.g. t’ (or t_j)
velarized _G
pharyngealized _?\

dental d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
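As an illustration of how the chart can be used, the Python sketch below converts the ASCII notation (which appears to follow X-SAMPA) to IPA by greedy longest match. It covers only those entries of the chart above that are listed with a code point; the dictionary and function names are hypothetical, and this is not how the xfscript itself is implemented.

```python
# ASCII symbol -> Unicode code point, copied from the hex column of the chart.
XSAMPA_TO_CODEPOINT = {
    "F": 0x0271, "J": 0x0272, "N": 0x014B, "N\\": 0x0274,
    "B\\": 0x0299, "R\\": 0x0280, "4": 0x027E, "p\\": 0x0278,
    "B": 0x03B2, "T": 0x03B8, "D": 0x00F0, "S": 0x0283, "Z": 0x0292,
    "C": 0x00E7, "j\\": 0x029D, "G": 0x0263, "X": 0x03C7,
    "X\\": 0x0127, "?\\": 0x0295, "h\\": 0x0266,
}

# Try longer symbols first so that "N\" wins over "N".
SYMBOLS = sorted(XSAMPA_TO_CODEPOINT, key=len, reverse=True)


def xsampa_to_ipa(text):
    """Replace known ASCII symbols by IPA characters; copy everything else."""
    out = []
    i = 0
    while i < len(text):
        for sym in SYMBOLS:
            if text.startswith(sym, i):
                out.append(chr(XSAMPA_TO_CODEPOINT[sym]))
                i += len(sym)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(xsampa_to_ipa("SaN"))  # ʃaŋ
```

Sorting the symbols by length before matching is the important design choice here: without it, a two-character symbol such as `N\` would be read as `N` followed by a stray backslash.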


This (part of) documentation was generated from src/fst/phonetics/text2tts-fin.xfscript


src-fst-phonetics-text2tts-nob.xfscript.md

retroflex plosive, voiceless t ʈ 0288, 648 ( = ASCII 096)
retroflex plosive, voiced d ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v)
alveolar approximant r\
retroflex approximant r`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L

Clicks

bilabial O\ (O = capital letter)
dental |
(post)alveolar !\
palatoalveolar =\
alveolar lateral ||

Ejectives, implosives

ejective > e.g. ejective p p>
implosive < e.g. implosive b b<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress “
secondary stress %
long :
half-long :\
extra-short _X
linking mark -

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L

contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
global rise
global fall

Diacritics

voiceless 0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _”
syllabic = (or _=) e.g. n= (or n=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ‘ (or _j) e.g. t’ (or t_j)
velarized _G
pharyngealized _?\

dental d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q


This (part of) documentation was generated from src/fst/phonetics/text2tts-nob.xfscript


src-fst-phonetics-text2tts-sme.xfscript.md

retroflex plosive, voiceless t ʈ 0288, 648 ( = ASCII 096) retroflex plosive, voiced d ɖ 0256, 598 labiodental nasal F ɱ 0271, 625 retroflex nasal n ɳ 0273, 627 palatal nasal J ɲ 0272, 626 velar nasal N ŋ 014B, 331 uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665 uvular trill R\ ʀ 0280, 640 alveolar tap 4 ɾ 027E, 638 retroflex flap r ɽ 027D, 637 bilabial fricative, voiceless p\ ɸ 0278, 632 bilabial fricative, voiced B β 03B2, 946 dental fricative, voiceless T θ 03B8, 952 dental fricative, voiced D ð 00F0, 240 postalveolar fricative, voiceless S ʃ 0283, 643 postalveolar fricative, voiced Z ʒ 0292, 658 retroflex fricative, voiceless s ʂ 0282, 642 retroflex fricative, voiced z` ʐ 0290, 656 palatal fricative, voiceless C ç 00E7, 231 palatal fricative, voiced j\ ʝ 029D, 669 velar fricative, voiced G ɣ 0263, 611 uvular fricative, voiceless X χ 03C7, 967 uvular fricative, voiced R ʁ 0281, 641 pharyngeal fricative, voiceless X\ ħ 0127, 295 pharyngeal fricative, voiced ?\ ʕ 0295, 661 glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v\)
alveolar approximant r\
retroflex approximant r\`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L\

Clicks

bilabial O\ (O = capital letter)
dental |\
(post)alveolar !\
palatoalveolar =\
alveolar lateral |\|\

Ejectives, implosives

ejective _>, e.g. ejective p p_>
implosive _<, e.g. implosive b b_<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -\

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising _R
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise <R>
global fall <F>

Diacritics

voiceless _0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=), e.g. n= (or n_=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j), e.g. t' (or t_j)
velarized _G
pharyngealized _?\

dental _d
apical _a
laminal _m
nasalized ~ (or _~), e.g. A~ (or A_~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q


This (part of) documentation was generated from src/fst/phonetics/text2tts-sme.xfscript


src-fst-phonetics-txt2ipa.xfscript.md

retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v\)
alveolar approximant r\
retroflex approximant r\`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L\

Clicks

bilabial O\ (O = capital letter)
dental |\
(post)alveolar !\
palatoalveolar =\
alveolar lateral |\|\

Ejectives, implosives

ejective _>, e.g. ejective p p_>
implosive _<, e.g. implosive b b_<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -\

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising _R
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise <R>
global fall <F>

Diacritics

voiceless _0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=), e.g. n= (or n_=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j), e.g. t' (or t_j)
velarized _G
pharyngealized _?\

dental _d
apical _a
laminal _m
nasalized ~ (or _~), e.g. A~ (or A_~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
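
Several ASCII symbols in the list share a prefix (N and N\, B and B\), so a converter has to try the longer symbol first. The Python sketch below illustrates that idea with a handful of pairs from the list. It is only an illustration under stated assumptions: this documentation does not show how txt2ipa.xfscript applies the mapping, and the names `XSAMPA_TO_IPA` and `xsampa_to_ipa` are hypothetical.

```python
# Small sample of the ASCII-to-IPA pairs listed above; not the full inventory.
XSAMPA_TO_IPA = {
    "N\\": "ɴ",
    "N": "ŋ",
    "J": "ɲ",
    "S": "ʃ",
    "Z": "ʒ",
    "T": "θ",
    "D": "ð",
    "B\\": "ʙ",
    "B": "β",
    "?\\": "ʕ",
}


def xsampa_to_ipa(text):
    """Convert text symbol by symbol, trying the longest ASCII symbol first."""
    symbols = sorted(XSAMPA_TO_IPA, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                out.append(XSAMPA_TO_IPA[sym])
                i += len(sym)
                break
        else:
            # Characters without an entry in the sample are copied unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(xsampa_to_ipa("aN"))
print(xsampa_to_ipa("N\\a"))
```

Sorting by length is the simplest way to get longest-match behaviour; a finite-state implementation would typically get the same effect from its own matching strategy.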


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-phonology-L2-from-branch.twolc.md

º is for CnsGrad of the lg:lgg and lºl:ll type
¤:0 prevents ConsGrad in certain words
' is the real apostrophe

Reminder! Change all # to (Hyph) # in order to account for ealáhus- ja …

boah’tiY4t ! It seems it should be Q3. … both?!

čuor’vuY4t ! Q2, it seems.

Changed because we get almmáj- and not almmái-.

Postvocalic j surfaces as i. Is this what we want?? Without right context???

postvoc j:i <=> Vow: ( :0 ) (Dummy: ) _ ;
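
The effect of a postvocalic j:i rule with no right context can be approximated on plain strings. The Python sketch below is only an approximation: it ignores the optional ( :0 ) and (Dummy: ) parts of the left context, works on surface strings rather than lexical/surface pairs, and uses an assumed vowel list (`VOWELS`) that may differ from the Vow set in the twolc file. The function name is hypothetical.

```python
import re

# Assumed vowel inventory for this sketch; the Vow set in the twolc file may differ.
VOWELS = "aáeiou"


def postvocalic_j_to_i(form):
    """Rewrite every j that directly follows a vowel as i, whatever comes after it."""
    return re.sub(rf"(?<=[{VOWELS}])j", "i", form)


print(postvocalic_j_to_i("almmáj-"))
print(postvocalic_j_to_i("aja"))
```

Because nothing restricts what follows the j, the second call also rewrites a j between two vowels, which is the kind of over-application the question about a missing right context points at.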


This (part of) documentation was generated from src/fst/phonology-L2-from-branch.twolc


src-fst-phonology-L2.twolc.md

º is for CnsGrad of the lg:lgg and lºl:ll type
¤:0 prevents ConsGrad in certain words
' is the real apostrophe

Reminder! Change all # to (Hyph) # in order to account for ealáhus- ja …

boah’tiY4t ! It seems it should be Q3. … both?!

čuor’vuY4t ! Q2, it seems.

Changed because we get almmáj- and not almmái-.

Postvocalic j surfaces as i. Is this what we want?? Without right context???

postvoc j:i <=> Vow: ( :0 ) (Dummy: ) _ ;


This (part of) documentation was generated from src/fst/phonology-L2.twolc


src-fst-transcriptions-transcriptor-symbols2text.lexc.md

We describe here how abbreviations in Lule Sami are read out, e.g. for text-to-speech systems.

Miscellaneous symbols

Smileys


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-symbols2text.lexc


tools-grammarcheckers-grammarchecker-resource.cg3.md


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker-resource.cg3


tools-grammarcheckers-grammarchecker.cg3.md

Comp, both for adverbs and adjectives
Superl, both for adverbs and adjectives

moadde kerdi > moddii


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


tools-grammarcheckers-grc-disambiguator.cg3.md

Comp, both for adverbs and adjectives
Superl, both for adverbs and adjectives

Der/A Der/A* because of a bug in lookup2cg

V is all readings with a V tag in them, REAL-V should be the ones without an N tag following the V.
The REAL-V set thus awaits a fix to the preprocess V … N bug.

Guessing: Rule for adding Sem/Date as a tag to readings which look like dates
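
The idea of tagging date-like readings can be sketched outside CG. The Python example below is not the rule from grc-disambiguator.cg3, which is not reproduced in this documentation. The date pattern (day.month.year with dots), the function names and the `Num` tag in the usage lines are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: day.month.year separated by dots, e.g. 24.6.2021.
DATE_RE = re.compile(r"^(\d{1,2})\.(\d{1,2})\.(\d{4})$")


def looks_like_date(wordform):
    """True if the wordform has the assumed day.month.year shape with plausible values."""
    match = DATE_RE.match(wordform)
    if not match:
        return False
    day, month = int(match.group(1)), int(match.group(2))
    return 1 <= day <= 31 and 1 <= month <= 12


def add_sem_date(tags, wordform):
    """Return the tag list with Sem/Date appended when the wordform looks like a date."""
    if looks_like_date(wordform) and "Sem/Date" not in tags:
        return tags + ["Sem/Date"]
    return tags


print(add_sem_date(["Num"], "24.6.2021"))
print(add_sem_date(["N"], "gáfe"))
```

A real guessing rule would probably accept more formats; the sketch only shows the shape of the check.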

Guessing: Rule for adding Adv Sem/Adr as a tag to readings which look like addresses

*Substitute PlcSur (Sem/Plc) (Sem/Sur)

Some proper nouns have two parts, and the first is not a genitive. We still have problems with abbreviations when these proper nouns are inflected or are part of a compound. The copy rule adds an Attr reading to names which do not get it in the fst (Soria). The select rule selects Attr when the next word is e.g. Moria.

Rules for giving Attr to names, e.g. Ole Attr Kåven.

Remove unwanted analyses

Demonstrative pronouns - should have a look at these

Disambiguating adjectives

Attribute disambiguation

Rules for Attr between Dem and N

Other attribute rules

Special rules for ‘buorre’ (the only adjective showing case agreement)


This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3


tools-grammarcheckers-spellchecker.cg3.md


This (part of) documentation was generated from tools/grammarcheckers/spellchecker.cg3


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for sme

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

Unknowns are made of:
    • lower-case ASCII
    • upper-case ASCII
    • select extended latin symbols
    • ASCII digits
    • select symbols
    • combining diacritics as individual symbols
    • various symbols from the Private Use Area (probably Microsoft), so far:
        • U+F0B7 for “x in box”
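
The character classes in the list above can be approximated by a membership test. The Python sketch below is an illustration, not the pmscript definition. The "select" sets are not enumerated in this documentation, so `EXTENDED_LATIN` and `SELECT_SYMBOLS` are stand-ins chosen here, and both function names are hypothetical.

```python
import string
import unicodedata

# Assumed stand-ins for the "select" sets, which this documentation does not enumerate.
EXTENDED_LATIN = set("áčđŋšŧžÁČĐŊŠŦŽ")
SELECT_SYMBOLS = set("-")
PRIVATE_USE = {"\uf0b7"}  # U+F0B7, the one private-area symbol named in the list


def is_unknown_char(ch):
    """True if a single character belongs to one of the classes in the list."""
    return (
        ch in string.ascii_lowercase
        or ch in string.ascii_uppercase
        or ch in string.digits
        or ch in EXTENDED_LATIN
        or ch in SELECT_SYMBOLS
        or unicodedata.combining(ch) != 0  # combining diacritics
        or ch in PRIVATE_USE
    )


def is_unknown_wordlike(token):
    """True if the token is non-empty and built only from the characters above."""
    return bool(token) and all(is_unknown_char(ch) for ch in token)


print(is_unknown_wordlike("ukjend"))
print(is_unknown_wordlike("ц"))
```

With these stand-in sets, a Latin-script string counts as word-like while the Cyrillic letter does not, so the latter would fall into the "unmatched strings" group.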

Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
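
Downstream tools can pick out the unknowns from the CG-format stream. The Python sketch below assumes a cohort layout in which a wordform line such as "<ukjend>" is followed by tab-indented reading lines, and that an unknown has readings carrying only a single ? tag. Both are assumptions about the output rather than something this documentation specifies, and the sample text, the CC tag and the function names are made up for illustration.

```python
def parse_cohorts(cg_text):
    """Parse CG-style text into a list of (wordform, list of tag lists)."""
    cohorts = []
    for line in cg_text.splitlines():
        if line.startswith('"<') and line.endswith('>"'):
            # Wordform line: strip the surrounding "< and >".
            cohorts.append((line[2:-2], []))
        elif line.startswith("\t") and cohorts:
            # Reading line: quoted baseform, then space-separated tags.
            parts = line.strip().split('" ', 1)
            tags = parts[1].split() if len(parts) == 2 else []
            cohorts[-1][1].append(tags)
    return cohorts


def unknown_wordforms(cg_text):
    """Collect wordforms whose readings all carry only the assumed unknown tag."""
    return [
        wordform
        for wordform, readings in parse_cohorts(cg_text)
        if readings and all(tags == ["?"] for tags in readings)
    ]


# Made-up sample in the assumed layout.
SAMPLE = '"<ja>"\n\t"ja" CC\n"<ukjend>"\n\t"ukjend" ?\n'
print(unknown_wordforms(SAMPLE))
```

If the unknown tag in the actual output differs, only the comparison in `unknown_wordforms` needs to change.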

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for sme

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript
