Norwegian Bokmål morphological analyser

This file documents the symbols and the introductory lexicon of Norwegian Bokmål.

Multichar_Symbols

Here we declare the tags and all other multicharacter symbols.

Grammatical tags

Part of speech

+N +A +Adv +V = Open parts of speech
+CS +CC +Interj +Pcle +Pr +IM = Closed parts of speech (IM = the infinitive marker å)
+Pers +Dem +Interr +Refl +Recipr +Rel +Qnt

Subtags

+ING = ing-derivation
+Indef +Def +Poss +Indcl =
+Sg +Pl = Number
+Sg1 +Sg2 +Sg3 +Pl1 +Pl2 +Pl3 = Person and number
+Pron +Nom +Acc +Dat +Det =
+Msc +Fem +Neu +MF = Gender. MF = Masc or Fem (used for adjs, not nouns)
+Pos +Comp +Superl = For adjectives
+Clt = the so-called “genitive s”
+Dat = for fixed expressions such as i live
+Pass +Ind +Prs +Prt +Imp = for verb voice, mood, tense
+Inf +PrsPrc +PrfPrc = for non-finite verb forms
+Prop = Proper nouns are tagged +N+Prop
+Qnt = quantifiers such as noen, begge
+Intens = undocumented; purpose unclear
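
To illustrate how the part-of-speech tags and subtags above combine on the analysis side, here is a minimal Python sketch that splits an analysis string into a lemma and its tags. The example string `stol+N+Msc+Sg+Def` and the helper name `split_analysis` are assumptions made for illustration; the tag order the analyser actually emits may differ, and a lemma that itself contains `+` would need special handling.

```python
def split_analysis(analysis: str) -> tuple[str, list[str]]:
    """Split 'lemma+Tag1+Tag2' into the lemma and a list of '+Tag' symbols."""
    lemma, *tags = analysis.split("+")
    # Re-attach the '+' so each tag matches its declared multichar symbol.
    return lemma, ["+" + tag for tag in tags]


# Hypothetical analysis string, used only as an illustration.
lemma, tags = split_analysis("stol+N+Msc+Sg+Def")
print(lemma, tags)
```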

Other tags

+CLB +PUNCT +HYPH +LEFT +RIGHT
+CLBfinal Sentence final abbreviated expression ending in full stop, so that the full stop is ambiguous
+Cmp
+Cmp/e declaring both awaiting cleanup
+Cmp/s
+Cmp/null declaring both awaiting cleanup
+Symbol = independent symbols in the text stream, like £, €, ©
+Ex/V for derivation
+Ord
+Prdt
+Qst

NDS analyser tags

+Nynorsk For dictionary use, Nynorsk only
+Radical For dictionary testing, Radical Bokmål
+X denoting not-checked.
+1 +2 +3 not in use??

Morphophonology

Triggers

X1 X2 X3 X4 X5 X6 = Nominal stems
Q1 Q2 Q3 = Verbal stems
Z1 Z2 = Both verbal and nominal stems
%^NYNAG = Nynorsk agent noun (lærar / lærer)

Special symbols

e7 = always e (ide - ideen)
l7 = always l
+Use/Circ = circular string

Derivation

+Der/AAdv = Adjectives are also adverbs
+Der/NomAct = verb +ing
+Der1 = derivation position
+Der = mark derivation

Normativity and other usage tags

+Err/Orth For speller use
+Err/Hyph
+Err/Lex
+Err/SpaceCmp
+Err/MissingSpace
+Use/NG
+Use/-Spell
+Use/-PLX
+Use/SpellNoSugg
+Use/NG Do not generate; for ped generation (isme-ped.fst) and MT
+Use/PMatch means that the following is only used in the analyser feeding the disambiguator
+Use/-PMatch Do not include in FSTs made for hfst-pmatch
+Use/GC only retained in the HFST Grammar Checker disambiguation analyser
+Use/-GC never retained in the HFST Grammar Checker disambiguation analyser
+Use/TTS – only retained in the HFST Text-To-Speech disambiguation tokeniser
+Use/-TTS – never retained in the HFST Text-To-Speech disambiguation tokeniser
+MWESplit Split point for MWE
+Span - used for numerical expressions denoting spans or intervals, like 5-10, 2012-2015, etc

Paradigm generation

+v1
+v2

Tags for abbreviation handling

+Gram/IAbbr
+Gram/TAbbr
+Gram/TNumAbbr
+Gram/NumNoAbbr

Semantic tags

Semtags

+Sem/Amount
+Sem/Ani
+Sem/Atr
+Sem/Build
+Sem/Build-room
+Sem/Cat
+Sem/Curr
+Sem/Date
+Sem/Domain
+Sem/Domain_Hum
+Sem/Dummytag
+Sem/Edu
+Sem/Edu_Hum
+Sem/Event
+Sem/Fem
+Sem/Food-med
+Sem/Group_Hum
+Sem/Hum
+Sem/ID
+Sem/Lang
+Sem/Mal
+Sem/Mat
+Sem/Measr
+Sem/Money
+Sem/Obj
+Sem/Obj-el
+Sem/Obj-ling
+Sem/Org
+Sem/Org_Prod-audio
+Sem/Org_Prod-vis
+Sem/Part
+Sem/Plc
+Sem/Prod-vis
+Sem/Route
+Sem/Rule
+Sem/Sign
+Sem/State
+Sem/State-sick
+Sem/Substnc
+Sem/Sur
+Sem/Time
+Sem/Time-clock
+Sem/Tool-it
+Sem/Txt
+Sem/Veh
+Sem/Year

Preprocessing

+Use/PMatch
+Use/-PMatch

Symbols that need to be escaped on the lower side (towards twolc):

»7: Literal »
«7: Literal «
```
%[%>%] - Literal >
%[%<%] - Literal <
```
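
The placeholder symbols above exist because bare `>` and `<` are reserved on the lower side (for example, `>` appears as a boundary marker in `+Clt:%>s`). As a rough sketch of the final step, the Python below maps each placeholder back to the literal it stands for. The function name `restore_literals` is hypothetical, and the real conversion is assumed to happen in the phonology or a later rewrite step, not in Python.

```python
# Placeholder symbols from the list above, mapped to the literal they stand for.
# "[>]" and "[<]" are the unescaped forms of the lexc symbols %[%>%] and %[%<%].
PLACEHOLDERS = {"»7": "»", "«7": "«", "[>]": ">", "[<]": "<"}


def restore_literals(lower: str) -> str:
    """Replace every placeholder symbol in a lower-side string with its literal."""
    for symbol, literal in PLACEHOLDERS.items():
        lower = lower.replace(symbol, literal)
    return lower
```

Under these assumptions, `restore_literals("a[>]b")` yields `a>b`.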

Compounding

+Cmp/Hyph -

Language codes

+OLang/SME - North Sámi
+OLang/SMJ - Lule Sámi
+OLang/SMA - South Sámi
+OLang/FIN - Finnish
+OLang/SWE - Swedish
+OLang/NOB - Norwegian Bokmål
+OLang/NNO - Norwegian Nynorsk
+OLang/ENG - English
+OLang/RUS - Russian
+OLang/UND - Undefined

Flag diacritics

Flags for ErrOrth

@C.ErrOrth@ -
@D.ErrOrth.ON@ -
@P.ErrOrth.ON@ -
@R.ErrOrth.ON@ -

Flags for compounding

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

Flag	Comment
@P.NeedNoun.ON@	(Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@	(Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@	(Dis)allow compounds with verbs unless nominalised
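
The interplay of the three NeedNoun flags can be sketched with a small Python checker for the standard flag diacritic operations P (set), D (disallow), C (clear), R (require) and U (unify). This is an illustration of the general semantics only, not the HFST implementation; the function name `path_is_allowed` and the example paths (a verb stem `les` followed by a nominalising `ing`) are assumptions, and which lexicons actually carry the P and C flags is not stated in this document.

```python
import re

# Matches @OP.Feature@ or @OP.Feature.Value@ for the five operations handled here.
FLAG = re.compile(r"@([PDCRU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(path: str) -> bool:
    """Return True if the flag diacritics along `path` are compatible."""
    values: dict[str, str] = {}
    for op, feature, value in FLAG.findall(path):
        current = values.get(feature)
        if op == "P":    # positive set: feature := value
            values[feature] = value
        elif op == "C":  # clear: feature becomes unset
            values.pop(feature, None)
        elif op == "D":  # disallow: fail if set to value (or set at all, if no value)
            if current is not None and (value == "" or current == value):
                return False
        elif op == "R":  # require: fail unless set to value (or set at all)
            if current is None or (value != "" and current != value):
                return False
        elif op == "U":  # unify: set if unset, otherwise values must agree
            if current is None:
                values[feature] = value
            elif current != value:
                return False
    return True


# Hypothetical paths: a verb in a compound is blocked unless the flag is cleared.
blocked = path_is_allowed("@P.NeedNoun.ON@les@D.NeedNoun.ON@")
allowed = path_is_allowed("@P.NeedNoun.ON@les@C.NeedNoun@ing@D.NeedNoun.ON@")
print(blocked, allowed)
```

In this reading, the P flag marks a path as needing a noun, the C flag lifts that requirement once a nominalisation has applied, and the D flag rejects any path where the requirement is still pending.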

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag	Comment
@P.CmpFrst.FALSE@	Require that words tagged as such only appear first
@D.CmpPref.TRUE@	Block such words from entering ENDLEX
@P.CmpPref.FALSE@	Block these words from making further compounds
@D.CmpLast.TRUE@	Block such words from entering R
@D.CmpNone.TRUE@	Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@	Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@	Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@	Disallow words coming directly from root.

The tags are of the following form:

+CmpNP/xxx - Normative (N), Position (P), i.e. the tag describes what position the tagged word can occupy in a compound
+CmpN/xxx - Normative (N) form, i.e. the tag describes what form the tagged word should use when making compounds
+Cmp/xxx - Descriptive compounding tags, i.e. tags that describe what form a word actually uses in a compound

This entry / word should be in the following position(s):

+CmpNP/All - … in all positions, default, this tag does not have to be written
+CmpNP/First - … only be first part in a compound or alone
+CmpNP/Pref - … only first part in a compound, NEVER alone
+CmpNP/Last - … only be last part in a compound or alone
+CmpNP/Suff - … only last part in a compound, NEVER alone
+CmpNP/None - … does not take part in compounds
+CmpNP/Only - … only be part of a compound, i.e. can never be used alone, but can appear in any position
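
The position tags above can be read as a lookup from tag to permitted positions. The Python sketch below encodes that reading; the position names (`alone`, `first`, `middle`, `last`) and the function names are hypothetical, and treating `middle` as allowed only for `+CmpNP/All` and `+CmpNP/Only` is an interpretation of the list, not something the source states explicitly.

```python
# Positions each tag permits, following the list above.
ALLOWED = {
    "+CmpNP/All": {"alone", "first", "middle", "last"},
    "+CmpNP/First": {"alone", "first"},
    "+CmpNP/Pref": {"first"},
    "+CmpNP/Last": {"alone", "last"},
    "+CmpNP/Suff": {"last"},
    "+CmpNP/None": {"alone"},
    "+CmpNP/Only": {"first", "middle", "last"},
}


def position_of(index: int, length: int) -> str:
    """Name the position of part `index` in a word made of `length` parts."""
    if length == 1:
        return "alone"
    if index == 0:
        return "first"
    if index == length - 1:
        return "last"
    return "middle"


def compound_ok(tags: list[str]) -> bool:
    """Check that every part's +CmpNP tag permits the position it occupies."""
    return all(
        position_of(i, len(tags)) in ALLOWED[tag] for i, tag in enumerate(tags)
    )
```

For example, `compound_ok(["+CmpNP/Pref", "+CmpNP/All"])` is true, while `compound_ok(["+CmpNP/Pref"])` is false because a Pref word may never stand alone.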

Flags for governing initial capital

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag	Comment
@U.Cap.Obl@	Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@	Allowing downcasing of derived names: deatnulasj.

Flag diacritic	Explanation
@U.number.one@	Flag used to give Arabic numerals in smj different cases
@U.number.two@	Flag used to give Arabic numerals in smj different cases
@U.number.three@	Flag used to give Arabic numerals in smj different cases
@U.number.four@	Flag used to give Arabic numerals in smj different cases
@U.number.five@	Flag used to give Arabic numerals in smj different cases
@U.number.six@	Flag used to give Arabic numerals in smj different cases
@U.number.seven@	Flag used to give Arabic numerals in smj different cases
@U.number.eight@	Flag used to give Arabic numerals in smj different cases
@U.number.nine@	Flag used to give Arabic numerals in smj different cases
@U.number.zero@	Flag used to give Arabic numerals in smj different cases

Flags for preprocessing

@P.Pmatch.Loc@ -
@P.Pmatch.Backtrack@ -
@PMATCH_BACKTRACK@ -

Basic lexica, pointing to the other lexicon files

LEXICON Root

FinalNoun ; for -skap etc. that is affix rather than compound
ShortNounRoot ; 2- and 3-letter words
NounRoot ; The rest
ProperNoun ;
AdjectivePrefix ; = lexicon for forms like kjempeinteressant, see below.
VerbRoot ;
Adverb ;
Subjunction ;
Conjunction ;
Preposition ;
Interjection ;
Pronoun ;
Numeral ;
Punctuation ;
Symbols ;
Abbreviation ;
Acronym-smi ;
Nynorsk ; Accepts nno forms, does not generate, changed from Use/NG to have speller work.

Other lexica

LEXICON AdjectivePrefix pointing to:

kjempe AdjectiveRoot ; -
super AdjectiveRoot ; -
AdjectiveRoot ; -

LEXICON Abbreviation pointing to:

Abbreviation-nob ; -
Abbreviation-smi ; -

LEXICON ProperNoun pointing to:

@U.CmpHyph.TRUE@ ProperNoun-smi-nocomp ; = Lexicon for short names - always require hyphen
ProperNoun-smi ; = SMI proper nouns
ProperNoun-nob ; = contains the full nob name list

Sublexica for NounRoot

This table shows the codes for nominal, adjectival and verbal inflection. Irregular inflection has separate codes:

code	sg.ind.	sg.def.	pl.ind.	pl.def.
f1	bru	brua	bruer	bruene
f2	pumpe	pumpa	pumper	pumpene
m1	stol	stolen	stoler	stolene
	bakke	bakken	bakker	bakkene
	pumpe	pumpen	pumper	pumpene
m2	lærer	læreren	lærere	lærerne
m3	bever	beveren	bevere	beverne
			bevre(r)	bevrene
n1	slott	slottet	slott	slotta/slottene
n2	eple	eplet	epler	epla/eplene
	salt	saltet	salter	salta/saltene
n3	kontor	kontoret	kontor	kontora
			kontorer	kontorene
	høve	høvet	høve/høver	høva/høvene

a1	god	god	godt	gode
a2	norsk	norsk	norsk	norske
a3	ekte	ekte	ekte	ekte
a4	oppskjørtet	oppskjørtet	oppskjørtet	oppskjørtede/oppskjørtete
a5	makaber	makaber	makabert	makabre
	lunken	lunken	lunkent	lunkne

v1	kaste	kaster	kasta	kasta
			kastet	kastet
v2	lyse	lyser	lyste	lyst
v3	leve	lever	levde	levd
v4	nå	når	nådde	nådd
v4	bie	bier	bidde	bidd
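
As a worked example of the noun rows in the table, the Python sketch below generates the four forms for the codes f1, f2 and m1 by stripping a final unstressed -e and adding the suffixes shown. It is a simplification for illustration only: the function name `noun_forms` is hypothetical, the real paradigms are built by continuation lexicons and two-level rules, and this naive e-stripping would mishandle words such as ide (ideen), which is presumably the case the e7 symbol covers.

```python
# Suffixes for sg.def., pl.ind., pl.def., read off the table rows above.
SUFFIXES = {
    "f1": ("a", "er", "ene"),   # bru, brua, bruer, bruene
    "f2": ("a", "er", "ene"),   # pumpe, pumpa, pumper, pumpene
    "m1": ("en", "er", "ene"),  # stol, stolen, stoler, stolene
}


def noun_forms(lemma: str, code: str) -> list[str]:
    """Return [sg.ind., sg.def., pl.ind., pl.def.] for a regular noun."""
    # Drop a final -e so that bakke + en gives bakken, not bakkeen.
    stem = lemma[:-1] if lemma.endswith("e") else lemma
    return [lemma] + [stem + suffix for suffix in SUFFIXES[code]]


print(noun_forms("bakke", "m1"))
```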

Clitics

K: nouns are pointed here to get the “genitive” -s

+Clt:%>s ENDLEX ;
ENDLEX ;

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.

This (part of) documentation was generated from src/fst/morphology/root.lexc
