Norwegian Bokmål NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources


Norwegian Bokmål morphological analyser

This document describes the symbols and the introductory lexicon of Norwegian Bokmål.

Multichar_Symbols

Here we declare the tags and all other multicharacter symbols.

Grammatical tags

Part of speech

Subtags

Other tags

NDS analyser tags

Morphophonology

Triggers

Special symbols

Derivation

Normativity and other usage tags

Other tags

Paradigm generation

Tags for abbreviation handling

Semantic tags

Semtags

Preprocessing

Symbols that need to be escaped on the lower side (towards twolc):

Compounding

Language codes

Flag diacritics

Flags for ErrOrth

Flags for compounding

We have manually optimised the structure of our lexicon, using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are allowed only if the verb is further derived into a noun.

Flag Comment
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
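
The interplay of these three flags can be illustrated outside the FST with a minimal Python sketch. It models only the P (set), D (disallow) and C (clear) operations of flag diacritics. The function name `path_is_allowed`, the example segmentations and the positions of the flags on the path are assumptions made for illustration; they are not taken from root.lexc.

```python
import re

# Matches @P.Feature.Value@, @D.Feature.Value@, @D.Feature@ and @C.Feature@.
FLAG = re.compile(r"@([PDC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(path):
    """Return True if the flag diacritics along `path` never block it.

    Simplified model: only P, D and C are handled.
    """
    state = {}
    for op, feature, value in FLAG.findall(path):
        if op == "P":
            # Positive set: remember the value for this feature.
            state[feature] = value
        elif op == "C":
            # Clear: forget whatever value the feature had.
            state.pop(feature, None)
        elif op == "D":
            # Disallow: fail if the feature currently has this value
            # (or any value at all, when no value is given).
            if feature in state and (value == "" or state[feature] == value):
                return False
    return True


# Hypothetical paths: a verb as compound part sets NeedNoun,
# a nominalising suffix clears it, and the final lexicon checks it.
verb_compound = "bil@P.NeedNoun.ON@kjøre@D.NeedNoun.ON@"
nominalised = "bil@P.NeedNoun.ON@kjør@C.NeedNoun@ing@D.NeedNoun.ON@"

print(path_is_allowed(verb_compound))
print(path_is_allowed(nominalised))
```

In this model the first path is rejected because NeedNoun is still ON when the D flag is reached, while the second passes because the nominalisation has cleared the flag.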

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Comment
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

The tags are of the following form:

This entry / word should be in the following position(s):

Flags for governing initial capital

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Comment
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Flag diacritic Explanation
@U.number.one@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.two@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.three@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.four@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.five@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.six@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.seven@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.eight@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.nine@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
@U.number.zero@ Flag used to give Arabic numerals different cases in smj (Lule Sámi)
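
The U flags listed above are unification flags: the first occurrence of a feature fixes its value, and any later U flag for the same feature must carry the same value or the path fails. A minimal Python sketch of that behaviour follows; the function name `unify_flags` is hypothetical, and negative (N) settings are deliberately left out.

```python
def unify_flags(flags):
    """Apply (feature, value) U flags in order.

    Returns the resulting feature state, or None if two flags clash.
    Simplified model: negatively set features are not handled.
    """
    state = {}
    for feature, value in flags:
        # setdefault stores the value on first sight and otherwise
        # returns the value already recorded for this feature.
        if state.setdefault(feature, value) != value:
            return None
    return state


# Two flags with the same value unify; different values clash.
same = unify_flags([("number", "one"), ("number", "one")])
clash = unify_flags([("number", "one"), ("number", "two")])
cap_clash = unify_flags([("Cap", "Obl"), ("Cap", "Opt")])

print(same, clash, cap_clash)
```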

Flags for preprocessing

Basic lexica, pointing to the other lexicon files

LEXICON Root

Other lexica

LEXICON AdjectivePrefix pointing to:

LEXICON Abbreviation pointing to:

LEXICON ProperNoun pointing to:

Sublexica for NounRoot

This table shows the codes for nominal, adjectival and verbal inflection. Irregular inflection has separate codes:

code  sg.ind.      sg.def.      pl.ind.      pl.def.
f1    bru          brua         bruer        bruene
f2    pumpe        pumpa        pumper       pumpene
m1    stol         stolen       stoler       stolene
      bakke        bakken       bakker       bakkene
      pumpe        pumpen       pumper       pumpene
m2    lærer        læreren      lærere       lærerne
m3    bever        beveren      bevere       beverne
                                bevre(r)     bevrene
n1    slott        slottet      slott        slotta/slottene
n2    eple         eplet        epler        epla/eplene
      salt         saltet       salter       salta/saltene
n3    kontor       kontoret     kontor       kontora
                                kontorer     kontorene
      høve         høvet        høve/høver   høva/høvene

a1    god          god          godt         gode
a2    norsk        norsk        norsk        norske
a3    ekte         ekte         ekte         ekte
a4    oppskjørtet  oppskjørtet  oppskjørtet  oppskjørtede/oppskjørtete
a5    makaber      makaber      makabert     makabre
      lunken       lunken       lunkent      lunkne

v1    kaste        kaster       kasta        kasta
                                kastet       kastet
v2    lyse         lyser        lyste        lyst
v3    leve         lever        levde        levd
v4    nå           når          nådde        nådd
v4    bie          bier         bidde        bidd
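
To make the noun codes concrete, the following Python sketch reproduces the f1, f2 and m1 rows of the table by plain suffixation. It is only an illustration: the analyser itself does this with lexc continuation lexica, and the function name `inflect_noun` and the rule of dropping a final -e before the endings are assumptions read off the example rows, not rules quoted from the source.

```python
def inflect_noun(lemma, code):
    """Return (sg.ind., sg.def., pl.ind., pl.def.) for codes f1, f2 and m1.

    Sketch only: other codes and irregular nouns are not covered.
    """
    # A final -e is dropped before the endings (bakke -> bakk-en).
    stem = lemma[:-1] if lemma.endswith("e") else lemma
    if code in ("f1", "f2"):
        return (lemma, stem + "a", stem + "er", stem + "ene")
    if code == "m1":
        return (lemma, stem + "en", stem + "er", stem + "ene")
    raise ValueError(f"code {code!r} is not covered by this sketch")


for lemma, code in [("bru", "f1"), ("pumpe", "f2"), ("stol", "m1"), ("bakke", "m1")]:
    print(code, *inflect_noun(lemma, code))
```

The same lemma can take more than one code: pumpe appears in the table both as f2 (pumpa) and as m1 (pumpen).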

Clitics

K: nouns are pointed here to get the “genitive” -s.

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
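
As a rough illustration of what the ENDLEX entry checks, the Python sketch below extracts the three D flags from the line quoted above and tests them against a feature state. The helper names `disallowed_pairs` and `may_end` and the example states are hypothetical; the real check is performed by the FST runtime, not by code like this.

```python
import re

# The ENDLEX entry exactly as quoted in this documentation.
ENDLEX_ENTRY = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;"


def disallowed_pairs(entry):
    """Return the (feature, value) pairs named by D flags in `entry`."""
    return re.findall(r"@D\.([^.@]+)\.([^.@]+)@", entry)


def may_end(state, entry=ENDLEX_ENTRY):
    """Return True if no D flag in `entry` matches the current `state`."""
    return all(state.get(feature) != value
               for feature, value in disallowed_pairs(entry))


print(disallowed_pairs(ENDLEX_ENTRY))
print(may_end({"CmpOnly": "FALSE"}))  # word has not passed through compounding
print(may_end({"CmpOnly": "TRUE"}))   # word has passed through compounding
```

A state with no flags set passes all three checks, which matches the note above that unused flags do no harm.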


This (part of) documentation was generated from src/fst/morphology/root.lexc