Norwegian Bokmål NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-nob

Page Content

  • Basic tags
  • Tags for special use
  • Sublexica for regular verbs
  • Conjugation sublexica
  • Sets and definitions
  • Rule section
  • Grammatical tags
  • Morphophonology
  • Other tags
  • Paradigm generation
  • Preprocessing
  • Symbols that need to be escaped on the lower side (towards twolc):
  • Compounding
  • Language codes
  • Flag diacritics
  • Basic lexica, pointing to the other lexicon files
  • Other lexica
  • Sublexica for NounRoot
  • Clitics
  • Overview of the declension classes
  • Intransitive abbreviations
  • Overview of the declension classes
  • The lexica themselves
  • Overview of the declension classes
  • The entries
  • Delimiters
  • Grammatical tags
  • Syntactic tags
  • Sets
  • Sets of tags
  • Sets of elements with common syntactic behaviour
  • For ADDRELATION rules (perhaps not in use)
  • Speller rules
  • NP internal agreement rules
  • Quantifier phrases
  • Predicative gender agreement
  • Case errors
  • Finite verb errors
  • Infinitive
  • Adverb errors
  • Word order errors
  • og/å errors
  • Punctuation rules
  • Delimiters and sets
  • Rule section
  • Mostly OBT Rules
  • Giellatekno late rules
  • Unknown handling
  • Norwegian Bokmål language model documentation

    All doc-comment documentation in one large file.


    src-cg3-disambiguator.cg3.md

    The OBT-Giellatekno Bokmål Norwegian disambiguator

    This disambiguator is based upon the disambiguator from OBT (Oslo-Bergen-taggeren), hereafter OBT-cg. It is adjusted to the GiellaLT FST and extended with several rules. It contains the morphological rules only.

    The original OBT disambiguator was written in CG-1 by Kristin Hagen and Anders Nøklestad at UiO. It was translated to CG-2 by Lars Nygård. The conversion to CG-3 and the Tromsø format was done by Trond Trosterud.

    Delimiters and sets

    The tagsets are a superset of the OBT and GiellaLT tags, so that the labels are kept from OBT-cg, but GiellaLT content is added when needed.

    Rule section

    Giellatekno early rules

    NotAbbr removes abbreviations whenever there are alternative readings

    AbbrBeforePara removes CLB before CLB

    Nynorsk removes all +Nynorsk forms (they are in use only for the dictionary interface, and that does not use disambiguation).

    aa

    aaIM selects +IM for å

    Numerals

    Compounds

    Mostly OBT Rules

    The bulk of the file contains rules from the original OBT file.

    Giellatekno late rules

    Neuter sg pl

    Pronouns

    Det rules

    V and not N

    Prepositions

    Late rules, Gt

    Rules with weights

    minweight selects reading with lowest weight.
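The effect of minweight can be illustrated outside CG-3. The following is a minimal Python sketch, not the rule itself: the `Reading` class, the example readings and the weights are all hypothetical, and only the "keep the lowest weight" behaviour comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    lemma: str
    tags: tuple
    weight: float  # hypothetical weight; lower means more likely


def select_min_weight(cohort):
    """Keep only the reading(s) whose weight equals the cohort minimum."""
    if not cohort:
        return []
    lowest = min(reading.weight for reading in cohort)
    return [reading for reading in cohort if reading.weight == lowest]


# Hypothetical cohort for one word form with two competing readings.
cohort = [
    Reading("hus", ("N", "Neu", "Sg", "Indef"), 0.0),
    Reading("huse", ("V", "Imp"), 12.0),
]
selected = select_min_weight(cohort)
```

Readings that tie on the lowest weight are all kept in this sketch, so later rules would still have to choose between them.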


    This (part of) documentation was generated from src/cg3/disambiguator.cg3


    src-cg3-functions.cg3.md

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).

    Syntactic tags

    These were the set types.

    Numeral outside the sentence

    Finite verbs

    Numeral outside the sentence

    HNOUN MAPPING

    Complex sentences

    missingX adds @X to all cohorts that are still missing a syntactic tag

    therestX adds @X to everything that is left, often erroneously disambiguated forms

    For Apertium:

    The analyser gives two analyses because semantic tags are optional. We go for the one with the semtag.


    This (part of) documentation was generated from src/cg3/functions.cg3


    src-cg3-nob-functions.cg3.md

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
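The barrier scan can be pictured with a small Python sketch. This is an illustration under assumptions, not the CG-3 implementation: cohorts are reduced to plain tag sets, the premodifier tags `Det`, `A` and `Num` are a hypothetical stand-in for the real premodifier sets, and checking the target before the barrier at each position is an assumption of the sketch.

```python
# Hypothetical premodifier tags; the real sets are defined in the CG-3 file.
NP_PREMODIFIER_TAGS = {"Det", "A", "Num"}


def not_npmod(cohort):
    """WORD - premodifiers: true for anything that cannot premodify a noun."""
    return not (cohort & NP_PREMODIFIER_TAGS)


def scan_right(cohorts, start, target, barrier):
    """Emulate (*1 target BARRIER barrier), scanning rightwards from start.

    Returns the index of the first cohort containing the target tag,
    or None if a barrier cohort is met first.
    """
    for i in range(start + 1, len(cohorts)):
        if target in cohorts[i]:
            return i
        if barrier(cohorts[i]):
            return None
    return None


# "den store gamle bilen": Det A A N, so the scan reaches the noun.
np_sentence = [{"Det"}, {"A"}, {"A"}, {"N"}]
# Det V N: the verb is not an NP modifier, so it blocks the scan.
blocked_sentence = [{"Det"}, {"V"}, {"N"}]
```

Calling `scan_right(np_sentence, 0, "N", not_npmod)` finds the noun at index 3, while the same call on `blocked_sentence` returns `None`.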

    These were the set types.

    Some particular subjunctions

    Adverb rules

    MAPPING OF COMP-CS< , COMPLEMENTS OF PARTICLES IN COMPARISON

    First map all COMP-CS<, then remove the other readings

    MAPPING OF CC AND CS

    Mostly we map both @CNP and @CVP. Then we select @CNP, and after that we remove them so that @CVP remains.

    VERB MAPPINGS

    Verbs as predicatives (@SPRED>) and (@<OPRED)

    The tags (@SPRED>) and (@<OPRED) target PrfPrc

    The rules are not documented yet

    Passive verbs often have

    Verbs as prenominal participles (@>N):

    (@<SUBJ) target Inf

    (@+FMAINV) and (@+FAUXV) and (@-FAUXV)

    (@-FMAINV) and (@-FAUXV)

    And then we remove the verbs which didn’t get any syntactic tag, in favour of verbs with syntactic tags.

    killifVinCohort This rule removes all other readings if there is a mapped V reading in the same cohort. Every case where this goes wrong should be fixed in the mapping rules or in previous disambiguation rules.
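The intent of killifVinCohort can be sketched in Python. The representation is an assumption: each reading is a set of tags, and a reading counts as "mapped" when it carries at least one tag starting with `@`. The example cohort is hypothetical.

```python
def is_mapped_verb(reading):
    """A verb reading that has received a syntactic (@-prefixed) tag."""
    return "V" in reading and any(tag.startswith("@") for tag in reading)


def kill_if_mapped_verb(cohort):
    """Drop every other reading when the cohort has a mapped verb reading."""
    if any(is_mapped_verb(reading) for reading in cohort):
        return [reading for reading in cohort if is_mapped_verb(reading)]
    return cohort


# Hypothetical cohort: a mapped finite verb competing with a noun reading.
cohort = [{"V", "Prs", "@+FMAINV"}, {"N", "Pl", "Indef"}]
unmapped = [{"V", "Prs"}, {"N", "Pl", "Indef"}]
```

A cohort without any mapped verb reading is returned unchanged, which mirrors the condition in the rule description.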

    NOUNS

    CASE DISAMBIGUATION

    HNOUN MAPPING


    This (part of) documentation was generated from src/cg3/nob-functions.cg3


    src-fst-morphology-affixes-abbreviations.lexc.md

    Continuation lexicons for abbreviations

    Lexica for adding tags and periods

    The sublexica

    Continuation lexicons for abbreviations both with and without final period

    Lexicons without final period

    Lexicons with final period


    This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


    src-fst-morphology-affixes-adjectives.lexc.md

    Sublexica for adjective roots

    Basic paradigms

    a23

    Sublexica


    This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


    src-fst-morphology-affixes-nouns.lexc.md

    Bokmål noun morphology


    Declension classes Main types, from Bokmålsordboka

    f1 bru brua bruer bruene
    f2 pumpe pumpa pumper pumpene
    f3 søster søstera søstre/søstrer søstrene
    m1 stol stolen stoler stolene
      bakke bakken bakker bakkene
      pumpe pumpen pumper pumpene
    m2 lærer læreren lærere lærerne
    m3 bever beveren bevere beverne
          bevrer bevrene
          bevre bevrene
    m4 longs longsen longs/longser longsene
    n1 slott slottet slott slotta/slottene
    n2 eple eplet epler epla/eplene
      salt saltet salter salta/saltene
    n3 kontor kontoret kontor/kontorer kontora/kontorene
      høve høvet HØVE/høver høva/høvene
      middel midlet MIDDEL/midler midla/midlene
    n4 salt saltet salter salta/saltene ??
    n5 middel midlet midler midla/midlene ??
    n6 kammer kammeret kamre/kammer kamra/kamrene
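A few of the main types above are regular enough to generate by suffixation. The following Python sketch covers only f1, f2, m1, n1 and n2, and only for stems that behave like the example words. The function name and the "drop a final -e before the ending" shortcut are assumptions of the sketch; the real paradigms are built by the lexc continuation lexica together with the twolc rules.

```python
def decline(noun, cls):
    """Return (sg.ind., sg.def., pl.ind., pl.def.) for a few regular classes."""
    # Assumption: a stem-final -e is dropped before a vowel-initial ending.
    stem = noun[:-1] if noun.endswith("e") else noun
    if cls in ("f1", "f2"):
        return (noun, stem + "a", stem + "er", stem + "ene")
    if cls == "m1":
        return (noun, stem + "en", stem + "er", stem + "ene")
    if cls == "n1":
        return (noun, stem + "et", noun, stem + "a/" + stem + "ene")
    if cls == "n2":
        return (noun, stem + "et", stem + "er", stem + "a/" + stem + "ene")
    raise ValueError("class not covered by this sketch: " + cls)


examples = [("bru", "f1"), ("pumpe", "f2"), ("bakke", "m1"), ("slott", "n1"), ("eple", "n2")]
paradigms = [decline(noun, cls) for noun, cls in examples]
```

Classes with stem changes (m3, n6 and the irregular types) are deliberately left out, since they need the morphophonological rules.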

    Subtypes, mainly from Finsk-norsk ordbok, also system-specific

    x unclassified, to m1 by default
    mX indecl
    m1sg sg only
    m1pl pl only
    m1b dam
    m1b fe, komite
    m1V sko pl. sko, skoa/skoene
    m1Vb byte, pl. byte/byter, bytene
    m1Vc glipp, pl. glipp, glippene
    m3V meter pl. meter
    m3b finger pl. fingrer/fingre
    m3c forelder pl. foreldre
    ma alliert, alierte, allierte, allierte
    KOLLEGA kollegaer, kolleger
    KONTO kontoer, konti
    RADIUS radiuser, radii
    BROR brødre
    FAR fedre
    MANN menn
    mD gårde, garde, dage (av gårde)
    fD tide (i tide)
    nD live (i live)
    DATTER døtre
    f1b skam
    f1X bok, pl. bøker
    f1V mus, pl. mus
    fGLO glo, pl. glør
    f3b lever. def. levra
    n1b rom, def. rommet
    n1n1b publikum, def. publikumet/publikummet
    n1s sg only
    n2b program, pl. programmer
    n2c kontor, pl. kontor, kontorer
    n2s mørke, not pl.
    n3b lager, def. lageret
    n3c fe, feet
    n3d søppel, søppelet/søplet, søppel/søpler, søpla/søplene
    n4b faktum, pl. fakta
    FORUM forum, forumet, fora/forumer, foraene/forumene
    LEKSIKON leksikon, pl. leksika
    nMUSEUM museum, museet, museer
    nØYE

    Basic paradigms

    Irregulars

    +N+Fem+Sg+Def+Radical:datra K ; +N: R ;

    NO CODE for nynorsk only.


    This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


    src-fst-morphology-affixes-numerals.lexc.md

    Tags for numerals (number words)

    Basic tags

    numtag for all numerals

    numtagsg for en

    Tags for special use


    This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


    src-fst-morphology-affixes-propernouns.lexc.md

    Propernoun morphology

    FirstTag

    PROP

    PROP-surmal

    PROP-malfem

    … one lexicon for each combined tag, to split them.


    This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


    src-fst-morphology-affixes-symbols.lexc.md

    Symbol affixes

    Noun_symbols_possibly_inflected

    Noun_symbols_never_inflected

    SYMBOL_connector

    SYMBOL_NO_suff


    This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


    src-fst-morphology-affixes-verbs.lexc.md

    Sublexica for verb roots


    Main types, from Bokmålsordboka
    v1 kaste kaster kasta kasta
          kastet kastet
    v2 lyse lyser lyste lyst
      reparere reparerer reparerte reparert
    v3 leve lever levde levd
    v4 nå når nådde nådd
    v4 bie bier bidde bidd

    Subtypes
    v12 v1 or v2
    v13 v1 or v3
    v14 v1 or v4
    v1-s passive v1 verbs
    v2-s passive v2 verbs
    v3-s passive v3 verbs
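The four main types differ only in their past-tense and participle endings, which a short Python sketch can make explicit. The function and its return format are hypothetical, and treating the stem as "infinitive minus final -e" is an assumption that holds for the example verbs listed here. The real forms come from the conjugation sublexica and the twolc rules.

```python
# Past tense and perfect participle endings per main type.
ENDINGS = {
    "v1": (["a", "et"], ["a", "et"]),
    "v2": (["te"], ["t"]),
    "v3": (["de"], ["d"]),
    "v4": (["dde"], ["dd"]),
}


def conjugate(infinitive, cls):
    """Return infinitive, present, preterite(s) and participle(s)."""
    # Assumption: the stem is the infinitive without a final -e (nå has none).
    stem = infinitive[:-1] if infinitive.endswith("e") else infinitive
    pret_endings, ptc_endings = ENDINGS[cls]
    return {
        "inf": infinitive,
        "pres": infinitive + "r",
        "pret": [stem + e for e in pret_endings],
        "ptc": [stem + e for e in ptc_endings],
    }


kaste = conjugate("kaste", "v1")
bie = conjugate("bie", "v4")
```

A combined subtype such as v12 would, under the same assumptions, simply be the union of the v1 and v2 forms, minus whatever is marked NG in the lexicon.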

    Sublexica for regular verbs

    Preliminary lexica

    LEXICON vx points to v1.

    LEXICON v12 for both v1 and v2 past forms, or: score -> scoret, scorte (NG = do not generate)

    LEXICON v12et for verbs with v2 and the -et forms of v1, like “skynde” (but not “tilskynde”, “framskynde” etc.)

    LEXICON v13 also here: v1, v3: sveve -> svevet (NG), svevde.

    LEXICON v13et for verbs with v3 and the -et forms of v1, like “tygge”

    LEXICON v23

    LEXICON v14 where v4 is NG

    LEXICON v1 = kaste

    LEXICON v2 = blåse, studere

    LEXICON v3 = leve

    LEXICON v4 = ro, bie

    LEXICON v1-s = undres

    LEXICON v2-s = føles, synes

    LEXICON v3-s = trives

    Conjugation sublexica

    LEXICON inf-prsptc =

    LEXICON regpres =

    LEXICON r-pres =

    LEXICON a-et-pret =

    LEXICON et-pret =

    LEXICON te-pret =

    LEXICON de-pret =

    LEXICON dde-pret =

    LEXICON prsptcsuff =

    Sublexica for irregular verbs


    This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


    src-fst-morphology-compounding.lexc.md

    Norwegian Bokmål compounding


    This (part of) documentation was generated from src/fst/morphology/compounding.lexc


    src-fst-morphology-phonology.twolc.md

    Morphophonological rules for Bokmål

    This file documents the phonology.twolc file

    Sets and definitions

    Alphabet

    We declare both the a-å letters and all other possible letters.

    Boundary symbols

    Morpheme boundaries and escaped quotes - do not delete in twolc, they will be converted to zero/the real thing at a later stage.

    Morphophonological triggers

    These symbols cause the twolc rules to work.

    Triggers for nominal rules

    Triggers for verbal rules

    Triggers for common rules (both for N and V)

    Nynorsk trigger

    Sets

    Rule section

    This section shows the twolc rules and the tests used to check whether they work

    Umlaut

    Umlaut Rule for bok : bøker etc. It shifts the vowels u, o, a, å to y, ø, e, e, respectively when Z1 is found after the stem.
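The vowel mapping of the umlaut rule can be written out as a small Python sketch. It only mirrors the mapping stated above. Applying it to the last matching vowel of the stem is an assumption of the sketch, and the Z1 trigger is modelled as nothing more than the decision to call the function.

```python
# Vowel shifts named in the rule description.
UMLAUT = {"u": "y", "o": "ø", "a": "e", "å": "e"}


def umlaut(stem):
    """Shift the last umlautable vowel of the stem (assumed rule domain)."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            return stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
    return stem


plural_bok = umlaut("bok") + "er"   # bok : bøker
plural_mann = umlaut("mann")        # mann : menn
```

The twolc rule is a two-level constraint over the whole lexical string, so it does not need the explicit right-to-left search used here.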

    Vowel deletions rules

    Epenthetic Deletion Rule is actually 3 rules in one: 1) it deletes -e- in moden : modne etc., 2) it deletes the stem -e in hare + -er, and 3) it deletes the suffix -e in ærlig + est > ærligst

    Tests: (a star denotes a negative test, i.e. a test that is supposed to fail)

    Delete foreign vowel Rule for deleting final a or o in words like kollega : kolleger. Trigger symbol to the right is X2.

    Tests:

    Consonant deletion

    Consonant shortening before deletion Rule

    Tests:

    Geminate deletion in front of -t and -d Rule deletes: 1) before Q3 and d or t (kaller:kalte) 2) before passive Q1 t (lykkes:lyktes) and 3) before epenthetic -e- and l, n or r (sikker:sikre)

    Tests:

    Delete r Rule deletes r in plural -er to get -er + -ne = plural -ene

    Delete m Rule for kam:kammen, here we delete the second m when word-final.

    um Deletion 1 Rule (um Deletion 2 is now part of the Delete m Rule)

    Tests:

    t weakening Rule

    Tests:

    Double t deletion Rule

    Tests:

    Insertion rules

    Insert t in passives Rule

    Compound rule

    Tests:

    Clitics

    Clitic after s-final Rule for changing the so-called genitive -s to an apostrophe for s-final stems: huss -> hus’
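As a plain-string illustration of the clitic rule, the following Python sketch attaches the genitive marker. The function name is hypothetical, only s-final stems are handled because that is all the rule description mentions, and the ASCII apostrophe is an assumption about which character the transducer emits.

```python
def add_genitive(stem):
    """Attach the "genitive" clitic, using an apostrophe after s-final stems."""
    if stem.endswith("s"):
        return stem + "'"
    return stem + "s"


hus_gen = add_genitive("hus")
stol_gen = add_genitive("stol")
```

In the FST the clitic -s is always added on the lexical side and the twolc rule rewrites it, whereas this sketch chooses the surface form directly.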

    Nynorsk dictionary rules

    Change -er stem to -ar in Nynorsk

    This rule is for dictionary use only. The idea is to be able to click on words in a Nynorsk text and get translation to North Sámi. Therefore, the Bokmål analyser is able to give an analysis to Nynorsk words as well. The Nynorsk-only forms are removed from all other transducers than the -dict transducer.

    Test to have an error


    This (part of) documentation was generated from src/fst/morphology/phonology.twolc


    src-fst-morphology-root.lexc.md

    Norwegian Bokmål morphological analyser

    This file documents the symbols and intro lexicon of Norwegian Bokmål.

    Multichar_Symbols

    Here we declare the tags and all other multicharacter symbols.

    Grammatical tags

    Part of speech

    Subtags

    Other tags

    NDS analyser tags

    Morphophonology

    Triggers

    Special symbols

    Derivation

    Normativity and other usage tags

    Other tags

    Paradigm generation

    Tags for abbreviation handling

    Semantic tags

    Semtags

    Preprocessing

    Symbols that need to be escaped on the lower side (towards twolc):

    Compounding

    Language codes

    Flag diacritics

    Flags for ErrOrth

    Flags for compounding

    We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: only allow compounds with verbs if the verb is further derived into a noun again.

    Flag Comment
    @P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised

    For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

    Flag Comment
    @P.CmpFrst.FALSE@ Require that words tagged as such only appear first
    @D.CmpPref.TRUE@ Block such words from entering ENDLEX
    @P.CmpPref.FALSE@ Block these words from making further compounds
    @D.CmpLast.TRUE@ Block such words from entering R
    @D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
    @U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
    @P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
    @D.CmpOnly.FALSE@ Disallow words coming directly from root.

    The tags are of the following form:

    This entry / word should be in the following position(s):

    Flags for governing initial capital

    Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

    Flag Comment
    @U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
    @U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
    Flag diacritic Explanation
    @U.number.one@ Flag used to give arabic numerals in smj different cases ;
    @U.number.two@ Flag used to give arabic numerals in smj different cases ;
    @U.number.three@ Flag used to give arabic numerals in smj different cases ;
    @U.number.four@ Flag used to give arabic numerals in smj different cases ;
    @U.number.five@ Flag used to give arabic numerals in smj different cases ;
    @U.number.six@ Flag used to give arabic numerals in smj different cases ;
    @U.number.seven@ Flag used to give arabic numerals in smj different cases ;
    @U.number.eight@ Flag used to give arabic numerals in smj different cases ;
    @U.number.nine@ Flag used to give arabic numerals in smj different cases ;
    @U.number.zero@ Flag used to give arabic numerals in smj different cases ;

    Flags for preprocessing

    Basic lexica, pointing to the other lexicon files

    LEXICON Root

    Other lexica

    LEXICON AdjectivePrefix pointing to:

    LEXICON Abbreviation pointing to:

    LEXICON ProperNoun pointing to:

    Sublexica for NounRoot

    This table shows the codes for nominal and verbal inflection. Irregular inflection has separate codes:

    code sg.ind. sg.def. pl.ind. pl.def.
    f1 bru brua bruer bruene
    f2 pumpe pumpa pumper pumpene
    m1 stol stolen stoler stolene
      bakke bakken bakker bakkene
      pumpe pumpen pumper pumpene
    m2 lærer læreren lærere lærerne
    m3 bever beveren bevere beverne
          bevre(r) bevrene
    n1 slott slottet slott slotta/slottene
    n2 eple eplet epler epla/eplene
      salt saltet salter salta/saltene
    n3 kontor kontoret kontor kontora
          kontorer kontorene
      høve høvet høve/høver høva/høvene
             
    a1 god god godt gode
    a2 norsk norsk norsk norske
    a3 ekte ekte ekte ekte
    a4 oppskjørtet oppskjørtet oppskjørtet oppskjørtede/oppskjørtete
    a5 makaber makaber makabert makabre
      lunken lunken lunkent lunkne
             
    v1 kaste kaster kasta kasta
          kastet kastet
    v2 lyse lyser lyste lyst
    v3 leve lever levde levd
    v4 nå når nådde nådd
    v4 bie bier bidde bidd

    Clitics

    K pointing nouns here to get “genitive” -s

    Lexicon ENDLEX

    And this is the ENDLEX of everything:

    @D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;
    

    The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
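How the ENDLEX flags filter paths can be traced with a small interpreter for flag diacritics. The Python sketch below implements only the P (set), D (disallow), U (unify) and C (clear) operations with their usual finite-state semantics. The function name and the example paths are hypothetical, and the sketch ignores everything on a path except its flags.

```python
import re

FLAG = re.compile(r"@([PDUC])\.([^.@]+)(?:\.([^.@]+))?@")

# The flags on the ENDLEX entry quoted above.
ENDLEX = ["@D.CmpOnly.FALSE@", "@D.CmpPref.TRUE@", "@D.NeedNoun.ON@"]


def path_is_valid(flags):
    """Return True if the flag sequence along a path is consistent."""
    state = {}
    for flag in flags:
        match = FLAG.fullmatch(flag)
        if match is None:
            raise ValueError("not a supported flag diacritic: " + flag)
        op, feature, value = match.groups()
        if op == "P":                      # set feature to value
            state[feature] = value
        elif op == "C":                    # clear feature
            state.pop(feature, None)
        elif op == "D":                    # fail if feature is set (to value)
            if feature in state and (value is None or state[feature] == value):
                return False
        elif op == "U":                    # fail on conflict, otherwise set
            if feature in state and state[feature] != value:
                return False
            state[feature] = value
    return True


plain_word = ENDLEX
verb_compound = ["@P.NeedNoun.ON@"] + ENDLEX
nominalised = ["@P.NeedNoun.ON@", "@C.NeedNoun@"] + ENDLEX
```

Under these semantics a path that sets NeedNoun and never clears it is rejected at ENDLEX, while a path that clears it again, as a nominalising derivation would, passes.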


    This (part of) documentation was generated from src/fst/morphology/root.lexc


    src-fst-morphology-stems-adjectives.lexc.md

    Norwegian Bokmål Adjectives

    This file documents the Bokmål adjective stem file stems/adjectives.lexc.

    Overview of the declension classes


    Main types, from Bokmålsordboka

    a1 god god godt gode
    a2 billig billig billig billige
    a3 ekte ekte ekte ekte
    a4 oppskjørtet oppskjørtet oppskjørtet oppskjørtede/oppskjørtete
    a5 makaber makaber makabert makabre
    a5 lunken lunken lunkent lunkne
    aV blå blå blått blå
    … and some irregular ones

    AdjectiveRoot is the list of adjectives (some 5500 stems)


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


    src-fst-morphology-stems-adverbs.lexc.md

    Bokmål adverbs

    This file documents the Bokmål adverb stem file stems/adverbs.lexc.

    LEXICON adv adds the tag +Adv

    LEXICON dt also adds +Adv. Perhaps unify, perhaps not.

    Adverb lists some 600 Norwegian adverbs, including MWE such as “i live”


    This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


    src-fst-morphology-stems-conjunctions.lexc.md

    Bokmål conjunctions

    This file documents the Bokmål conjunctions stem file stems/conjunctions.lexc.

    conj for the tag +CC

    Conjunction både, og, ..


    This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


    src-fst-morphology-stems-interjections.lexc.md

    Bokmål interjections

    This file documents the Bokmål interjections stem file stems/interjections.lexc.

    LEXICON ij adds the tag +Interj

    LEXICON Interjection lists folkens, heisann, pokker and some 60 more interjections.


    This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


    src-fst-morphology-stems-nob-abbreviations.lexc.md

    File containing abbreviations for Norwegian Bokmål

    This file documents the Bokmål abbreviations stem file stems/nob-abbreviations.lexc.

    Abbreviation-nob

    Intransitive abbreviations

    These give clause boundaries before capital letters and numbers, but not elsewhere.


    Vi bor i Sth. CLB 10 av oss er innflyttere.
    Vi bor i Sth. CLB Saara er også innflytter.
    Vi vet at Sth. er en fin by.
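The decision behind the examples above can be stated as a one-line test on the following token. This Python sketch is only an illustration: the function name is hypothetical, tokens are assumed to be already split from punctuation, and the end of the text is not handled.

```python
def clb_after_intransitive(next_token):
    """True if a clause boundary follows an intransitive abbreviation."""
    first = next_token[:1]
    return first.isupper() or first.isdigit()


before_number = clb_after_intransitive("10")     # Sth. CLB 10 av oss ...
before_name = clb_after_intransitive("Saara")    # Sth. CLB Saara er ...
before_lowercase = clb_after_intransitive("er")  # Sth. er en fin by.
```

In the analyser the same decision is made inside the tokeniser and disambiguator rather than by inspecting raw strings.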

    ITRAB

    Transitive number-related abbreviations

    These ones are transitive when followed by numbers or singleton letters, and intransitive elsewhere.


    Gården har Gnr. 10.
    Gården har Gnr. 5. a.
    Alle gårder har ikke Gnr. CLB Det er et problem.
    Alle gårder har ikke Gnr. og det er et problem.
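For number-related abbreviations the test on the following token has one more branch. The Python sketch below is a hypothetical illustration of the behaviour described above: a following number or single letter makes the abbreviation transitive (no boundary), and otherwise a boundary is assumed only before a capitalised word.

```python
def clb_after_trnumab(next_token):
    """True if a clause boundary follows a number-related abbreviation."""
    bare = next_token.rstrip(".")
    is_number = bare.isdigit()
    is_single_letter = len(bare) == 1 and bare.isalpha()
    if is_number or is_single_letter:
        return False  # transitive use: the abbreviation governs what follows
    return next_token[:1].isupper()


results = {token: clb_after_trnumab(token) for token in ["10", "a", "Det", "og"]}
```

Stripping a trailing period before the checks is an assumption about tokenisation, made so that both "10" and "10." are treated as numbers.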

    TRNUMAB

    Transitive abbreviations

    TRAB

    dot% noStb.db Abbreviations that never induce sentence boundaries. The file is too large and should be shrunk.


    This (part of) documentation was generated from src/fst/morphology/stems/nob-abbreviations.lexc


    src-fst-morphology-stems-nob-propernouns.lexc.md

    Bokmål propernouns

    This file documents the Bokmål proper nouns stem file stems/nob-propernouns.lexc.

    LEXICON ProperNoun-nob-nocomp contains some acronyms

    LEXICON ProperNoun-nob contains the list of 2200 or so names. The rest come from common files.

    Adjectives

    Nouns


    This (part of) documentation was generated from src/fst/morphology/stems/nob-propernouns.lexc


    src-fst-morphology-stems-nouns.lexc.md

    Bokmål noun lexicon

    This file documents the Bokmål noun stem file stems/nouns.lexc.

    Overview of the declension classes


    Main types, from Bokmålsordboka

    f1 bru brua bruer bruene
    f2 pumpe pumpa pumper pumpene
    f3 søster søstera søstre/søstrer søstrene
    m1 stol stolen stoler stolene
      bakke bakken bakker bakkene
      pumpe pumpen pumper pumpene
    m2 lærer læreren lærere lærerne
    m3 bever beveren bevere beverne
          bevrer bevrene
          bevre bevrene
    m4 longs longsen longs/longser longsene
    m5 handelsreisende …
    n1 slott slottet slott slotta/slottene
    n2 eple eplet epler epla/eplene
      salt saltet salter salta/saltene
    n3 kontor kontoret kontor/kontorer kontora/kontorene
      høve høvet HØVE/høver høva/høvene
    n4 salt saltet salter salta/saltene ??
    n5 middel midlet MIDDEL/midler midla/midlene ??
    n6 kammer kammeret kamre/kammer kamra/kamrene

    Subtypes

    mx unclassified, to m1 by default
    mX indecl
    m1sg sg only
    m1pl pl only
    m1b dam
    m1b fe, komité
    m1V sko pl. sko, skoa/skoene
    m1Vb byte, pl. byte/byter, bytene
    m1Vc glipp, pl. glipp, glippene
    m3V meter pl. meter
    m3r sykkel, vinkel vinkelen, vinkler, vinklene
    ma alliert, alierte, allierte, allierte
    KOLLEGA kollegaer, kolleger
    mKONTO kontoer, konti
    mRADIUS radiuser, radii
    mBROR brødre
    mFAR fedre
    mMANN menn
    mD gårde, garde, dage (av gårde)
    fD tide (i tide)
    nD live (i live)

    fDATTER døtre
    f1b skam
    f1X bok pl. bøker
    f1V mus, pl. mus

    nX styrbord, zoo. indecl.
    n1b rom pl. rom
    n1sg sg only
    n2b program pl. programmer
    n2c kontor pl. kontor, kontorer
    n2s mørke, not pl.
    n3b lager def. lageret
    n3c fe, feet
    n4b faktum, faktumet, fakta, faktaene
    FORUM forum, forumet, fora/forumer, foraene/forumene
    nLEKSIKON leksikon, pl. leksika
    nMUSEUM museum, museet, museer
    n1pl odds, oddsene

    The lexica themselves

    LEXICON FinalNoun is a separate lexicon to point to. For now it contains only -skap.

    LEXICON NounRoot is the lexicon pointed to from root.lexc It points to Noun ; HyphNouns ;

    LEXICON HyphNouns contains forms used only as the first part of compounds, like barne. TODO: Perhaps these should not be listed.

    LEXICON ShortNounRoot The lexicon points to two lexica which are kept separate in order not to allow them in compounding (rusle = rus + le) 2_letter ; 3_letter ;
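Why short stems are kept out of compounding can be shown with a naive two-part compound splitter. This Python sketch is an illustration only: the splitter and the tiny stem sets are hypothetical, and the real restriction is implemented through separate lexica rather than by filtering at run time.

```python
def split_compound(word, first_parts, last_parts):
    """Return every two-part split whose halves are both known stems."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in first_parts and word[i:] in last_parts
    ]


# Hypothetical miniature lexica.
long_stems = {"kontor", "stol", "slott", "lærer"}
short_stems = {"rus", "le"}
all_stems = long_stems | short_stems

spurious = split_compound("rusle", all_stems, all_stems)
blocked = split_compound("rusle", long_stems, long_stems)
genuine = split_compound("kontorstol", long_stems, long_stems)
```

With the short stems available for compounding, "rusle" receives the spurious analysis rus + le. Without them, only genuine compounds such as kontor + stol are found.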

    LEXICON 2_letter contains stems with two letters.

    LEXICON 3_letter contains stems with three letters.

    LEXICON Noun here comes the long list of stems (tens of thousands)

    TODO: Go through mx.


    This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


    src-fst-morphology-stems-numerals.lexc.md

    Bokmål numerals (number words)

    This file documents the Bokmål numerals stem file stems/numerals.lexc.

    LEXICON Numeral

    LEXICON Textual

    LEXICON TEXTTHOUSANDS

    LEXICON 1000CONT

    LEXICON TEXTHUNDREDS

    LEXICON 100CONT

    LEXICON TEXTTENS

    LEXICON TEXTTENSCONT

    LEXICON TEXTTEENS

    LEXICON TEXTONES

    LEXICON 2-9

    LEXICON ORDTEXT


    This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


    src-fst-morphology-stems-nynorsk-stems.lexc.md

    Nynorsk stems

    for tolerant dictionary reading

    This file documents the nynorsk stem file for the bokmål analyzer stems/nynorsk-stems.lexc.

    LEXICON Prnyn

    LEXICON Advnyn

    LEXICON Anyn

    LEXICON Vnyn

    LEXICON Propnyn

    LEXICON Pronnyn

    LEXICON nnnb

    LEXICON Nynorsk here come all the words


    This (part of) documentation was generated from src/fst/morphology/stems/nynorsk-stems.lexc


    src-fst-morphology-stems-prepositions.lexc.md

    Bokmål prepositions

    This file documents the Bokmål prepositions stem file stems/prepositions.lexc.

    LEXICON p gives tag +Pr

    LEXICON Preposition list (appr 90 prepositions)


    This (part of) documentation was generated from src/fst/morphology/stems/prepositions.lexc


    src-fst-morphology-stems-pronouns.lexc.md

    Bokmål pronoun stems

    This file documents the Bokmål pronouns stem file stems/pronouns.lexc.

    LEXICON Pronoun

    LEXICON Personal

    LEXICON Reflexive

    LEXICON Reciprocal

    LEXICON Interrogative

    LEXICON Possessive

    LEXICON Other_Pronouns


    This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


    src-fst-morphology-stems-subjunctions.lexc.md

    Bokmål subjunctions

    This file documents the Bokmål subjunctions stem file stems/subjunctions.lexc.

    LEXICON Subjunction

    LEXICON subj gives tag +CS


    This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


    src-fst-morphology-stems-verbs.lexc.md

    Norwegian Bokmål verb stems

    This file documents the Bokmål verb stem file stems/verbs.lexc.

    Overview of the declension classes


    Main types, from Bokmålsordboka

    v1 kaste kaster kasta kasta
          kastet kastet
    v2 lyse lyser lyste lyst
      reparere reparerer reparerte reparert
    v3 leve lever levde levd
    v4 nå når nådde nådd
    v4 bie bier bidde bidd

    Subtypes
    v12 v1 or v2
    v13 v1 or v3
    v1-s passive v1 verbs
    v2-s passive v2 verbs
    v3-s passive v3 verbs
    Strong verbs have verb-specific lexica

    The entries

    LEXICON VerbRoot contains the 5700 or so verbs


    This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


    src-fst-phonetics-txt2ipa.xfscript.md

    retroflex plosive, voiceless t ʈ 0288, 648 ( = ASCII 096)
    retroflex plosive, voiced d ɖ 0256, 598
    labiodental nasal F ɱ 0271, 625
    retroflex nasal n ɳ 0273, 627
    palatal nasal J ɲ 0272, 626
    velar nasal N ŋ 014B, 331
    uvular nasal N\ ɴ 0274, 628

    bilabial trill B\ ʙ 0299, 665
    uvular trill R\ ʀ 0280, 640
    alveolar tap 4 ɾ 027E, 638
    retroflex flap r ɽ 027D, 637
    bilabial fricative, voiceless p\ ɸ 0278, 632
    bilabial fricative, voiced B β 03B2, 946
    dental fricative, voiceless T θ 03B8, 952
    dental fricative, voiced D ð 00F0, 240
    postalveolar fricative, voiceless S ʃ 0283, 643
    postalveolar fricative, voiced Z ʒ 0292, 658
    retroflex fricative, voiceless s ʂ 0282, 642
    retroflex fricative, voiced z` ʐ 0290, 656
    palatal fricative, voiceless C ç 00E7, 231
    palatal fricative, voiced j\ ʝ 029D, 669
    velar fricative, voiced G ɣ 0263, 611
    uvular fricative, voiceless X χ 03C7, 967
    uvular fricative, voiced R ʁ 0281, 641
    pharyngeal fricative, voiceless X\ ħ 0127, 295
    pharyngeal fricative, voiced ?\ ʕ 0295, 661
    glottal fricative, voiced h\ ɦ 0266, 614

    alveolar lateral fricative, vl. K
    alveolar lateral fricative, vd. K\

    labiodental approximant P (or v)
    alveolar approximant r\
    retroflex approximant r`
    velar approximant M\

    retroflex lateral approximant l`
    palatal lateral approximant L
    velar lateral approximant L
    Clicks

    bilabial O\ (O = capital letter)
    dental |
    (post)alveolar !\
    palatoalveolar =\
    alveolar lateral ||

    Ejectives, implosives

    ejective > e.g. ejective p p>
    implosive < e.g. implosive b b<

    Vowels

    close back unrounded M
    close central unrounded 1
    close central rounded }
    lax i I
    lax y Y
    lax u U

    close-mid front rounded 2
    close-mid central unrounded @\
    close-mid central rounded 8
    close-mid back unrounded 7

    schwa ə @

    open-mid front unrounded E
    open-mid front rounded 9
    open-mid central unrounded 3
    open-mid central rounded 3\
    open-mid back unrounded V
    open-mid back rounded O

    ash (ae digraph) {
    open schwa (turned a) 6

    open front rounded &
    open back unrounded A
    open back rounded Q

    Other symbols

    voiceless labial-velar fricative W
    voiced labial-palatal approx. H
    voiceless epiglottal fricative H\
    voiced epiglottal fricative <\
    epiglottal plosive >\

    alveolo-palatal fricative, vl. s\
    alveolo-palatal fricative, voiced z\
    alveolar lateral flap l\
    simultaneous S and x x\
    tie bar _

    Suprasegmentals

    primary stress "
    secondary stress %
    long :
    half-long :\
    extra-short _X
    linking mark -
    Tones and word accents

    level extra high _T
    level high _H
    level mid _M
    level low _L
    level extra low _B
    downstep !
    upstep ^ (caret, circumflex)

    contour, rising
    contour, falling _F
    contour, high rising _H_T
    contour, low rising _B_L

    contour, rising-falling _R_F
    (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
    global rise
    global fall

    Diacritics

    voiceless 0 (0 = figure), e.g. n_0
    voiced _v
    aspirated _h
    more rounded _O (O = letter)
    less rounded _c
    advanced _+
    retracted _-
    centralized _"
    syllabic = (or _=) e.g. n= (or n=)
    non-syllabic _^
    rhoticity `

    breathy voiced _t
    creaky voiced _k
    linguolabial _N
    labialized _w
    palatalized ' (or _j) e.g. t' (or t_j)
    velarized _G
    pharyngealized _?\

    dental d
    apical _a
    laminal _m
    nasalized ~ (or _~) e.g. A~ (or A~)
    nasal release _n
    lateral release _l
    no audible release _}

    velarized or pharyngealized _e
    velarized l, alternatively 5
    raised _r
    lowered _o
    advanced tongue root _A
    retracted tongue root _q


    This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


    src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

    We describe here how abbreviations in Norwegian Bokmål are read out, e.g. for text-to-speech systems.

    For example:


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


    src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

    Numbers to digits for Norwegian Bokmål


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


    tools-grammarcheckers-grammarchecker.cg3.md

    Bokmål Norwegian Grammar Checker

    This file contains two parts: Definitions and rules

    Definition section

    Delimiters

    DELIMITERS = "<.>" "<!>" "<?>" "<…>" "<¶>" ;
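The delimiters determine where the grammar checker cuts the token stream into windows. The Python sketch below is a rough illustration of that segmentation: the function name is hypothetical, tokens are plain strings rather than cohorts, and keeping the delimiter as the last token of its window is an assumption of the sketch.

```python
# Word forms corresponding to the DELIMITERS line above.
DELIMITERS = {".", "!", "?", "…", "¶"}


def split_windows(tokens):
    """Cut a token list into windows, each ending at a delimiter."""
    windows, current = [], []
    for token in tokens:
        current.append(token)
        if token in DELIMITERS:
            windows.append(current)
            current = []
    if current:  # trailing tokens without a final delimiter
        windows.append(current)
    return windows


windows = split_windows(["Vi", "bor", "her", ".", "Hvor", "?", "ja"])
```

Rules normally see one window at a time, so a context test cannot look past a delimiter unless the rule explicitly spans windows.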

    Grammatical tags

    Here we declare all grammatical tags

    Parts of speech tags

    Sets for POS sub-categories

    Boundary tags

    Sets for Semantic tags

    Sets for Morphosyntactic properties

    Syntactic tags

    Initials

    INITIAL = small letters, CAP-INITIAL = capital letters

    Sets

    Sets of tags

    Word or not

    Noun sets

    Verb sets

    Pronoun sets

    Numeral sets

    Adjectival sets and their complements

    Adverbial sets and their complements

    Introduce finite clauses.

    Coordinators

    Sets of elements with common syntactic behaviour

    Sets for verbs

    All active verbs with a TV tag, including V:

    NP sets defined according to their morphosyntactic features

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The strict version of items that can only be premodifiers, not parts of the predicate

    to be used together with PRE-NP-HEAD before @>N is disambiguated

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”)
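    The typical usage above can be read as a small scanning loop. The following Python sketch is only an illustration: the tag names in NPMOD_TAGS, the function names and the example sentence are assumptions, and each cohort is simplified to a single set of tags. It also assumes that the target is tested before the barrier at each position.

```python
# Hypothetical stand-in for the grammar's premodifier sets; the real
# NOT-NPMOD is defined in the .cg3 file as WORD minus premodifiers.
NPMOD_TAGS = {"Det", "A", "Num"}


def is_not_npmod(tags):
    """True if the cohort carries no tag that can premodify a noun."""
    return not (tags & NPMOD_TAGS)


def scan_to_np_head(cohorts, start):
    """Mimic (*1 N BARRIER NOT-NPMOD): index of the next noun, or None."""
    for i in range(start + 1, len(cohorts)):
        _wordform, tags = cohorts[i]
        if "N" in tags:          # target found
            return i
        if is_not_npmod(tags):   # barrier reached before any noun
            return None
    return None


sentence = [
    ("et", {"Det"}), ("fint", {"A"}), ("hus", {"N"}),
    ("står", {"V"}), ("der", {"Adv"}),
]
head_from_det = scan_to_np_head(sentence, 0)   # skips "fint", stops at "hus"
head_from_noun = scan_to_np_head(sentence, 2)  # blocked by the verb "står"
```

    Starting from the determiner, the scan passes over the adjective and lands on the noun. Starting from the noun, it hits the verb first and gives up.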

    Miscellaneous sets

    Border sets and their complements

    Syntactic sets

    These were the set types.

    Grammarchecker sets

    There are 20 or so different rule tags, see the rule section below.

    For ADDRELATION rules (perhaps not in use)

    Rule section

    Speller rules

    Speller suggestions rule – add &SUGGESTWF to any spelling suggestion that we actually want to suggest to the user.

    Speller rule: Add typo to misspelled words. The simplest is to just add it to all misspelled words:

    Speller rule: Do not mark misspelled words in quotes. But perhaps you want to only suggest spellings of words that are not inside “quotes”:

    NP internal agreement rules

    Ensure preceding adjective agrees with noun

    Agreement rule: masculine adjectives should be neuter (msyn-agr-adjmsc-adjneu). Context: Et fin/fint hus.

    Agreement rule: Singular adjectives should be plural (msyn-agr-adjsg-adjpl). Context: mange organisert/organiserte fritidsaktiviteter.

    Agreement rule: Neuter adjectives should be masculine (msyn-agr-adjneu-adjmsc). Context: En fint/fin båt.
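    The adjective rules above boil down to comparing the gender (or number) of the adjective with that of the noun it modifies, and suggesting the agreeing form. A minimal sketch in Python, assuming a hypothetical lookup table ADJ_FORMS. The real rules are written in CG-3 and get their forms from the FST, not from a table like this.

```python
# Hypothetical mini-lexicon: indefinite singular forms per gender.
ADJ_FORMS = {"fin": {"Msc": "fin", "Fem": "fin", "Neu": "fint"}}


def check_adj_noun(adj_lemma, adj_form, noun_gender):
    """Return the suggested form if the adjective disagrees with the noun, else None."""
    expected = ADJ_FORMS[adj_lemma][noun_gender]
    return expected if adj_form != expected else None


suggest_neuter = check_adj_noun("fin", "fin", "Neu")      # "Et fin hus"
suggest_masculine = check_adj_noun("fin", "fint", "Msc")  # "En fint båt"
no_error = check_adj_noun("fin", "fint", "Neu")           # "Et fint hus"
```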

    Agreement rule: Masculine definite determiners should be neuter (msyn-agr-detmsc-detneu). Context: den/det huset.

    Agreement rule: Masculine indefinite determiners should be neuter (msyn-agr-detmsc-detneu). Context: en/et land.

    Agreement rule: Neuter definite determiners should be feminine (msyn-agr-detneu-detfem). Context: det/den boka.

    Agreement rule: Neuter indefinite determiners should be feminine (msyn-agr-detneu-detfem). Context: et/ei bok.

    Agreement rule: Neuter indefinite determiners should be feminine (msyn-agr-detneu-detfem). Context: et/ei realitetens kvinne.

    Agreement rule: Neuter indefinite determiners should be masculine (msyn-agr-detneu-detmsc). Context: et/en studie.

    Agreement rule: Neuter determiners should be masculine (msyn-agr-detneu-detmsc). Context: et/en … båt.

    Agreement rule: same rule but for Pron

    Definiteness rule: Double definiteness. Context: disse grunner/grunnene

    Definiteness rule: Double definiteness. Context: de sosiale aspekter/aspektene. The rule gave too many false alarms, so we skip it.

    Definite adjectives

    Quantifier phrases

    Agreement rule: Indef after quantifier (msyn-qucompl-def-indef). Context: Vi har mange bøkene/bøker.

    Agreement rule: Pl instead of Sg after quantifier (msyn-qucompl-sg-pl). Context: Vi har mange ulike utfordring

    Comparative rule: Quantifier in superlative. Context: de flere/fleste ulike kulturene.

    Predicative gender agreement

    Predicative: neuter adjective should be masculine (msyn-pred-adjneu-adjmsc). Context: Båten var fint/fin.

    Predicative: masculine adjective should be neuter (msyn-pred-adjmsc-adjneu). Context: Eplet var god/godt.

    Agreement rule: Context: Eplet var god/godt.

    Agreement rule: Context: Eplene var god/gode.

    Agreement rule: Context: Jeg spiste et eple som var god/godt.

    Agreement rule: Context: Jeg har en bil som er rødt/rød.

    Agreement rule: Context: Jeg har ei hytte som er rødt/rød.

    Agreement rule: Context: Jeg har biler som er fin

    Agreement rule: Context: Eplet som jeg spiste var grønn/grønt

    Agreement rule: Context: Bilen som jeg kjørte var grønt.

    Agreement rule: Context: Hytta som jeg eier er fint.

    Agreement rule: with relative clause. Context: Bilene som jeg kjørte var grønt/grønn

    Case errors

    Case rules so far: Nominative pronouns should be accusative

    Agreement rule: The context is P-complement (msyn-pron-nom-acc). Context: Vi snakker om du.

    Finite verb errors

    Verb rule: Infinitive and no finite form in the sentence (msyn-v-inf-pres). Context: Jeg like/liker peanøtter.

    Infinitive

    Verb rule: Present tense should be infinitive (msyn-v-pres-inf). Context: Jeg vil skriver et brev.

    Adverb errors

    Word order errors

    V3 -> V2 in main clause

    V2 to V3 in embedded clauses

    og/å errors

    The og -> å rules

    Realword rule: og should be å (real-og-aa). Context: Det er ikke til og holde ut.

    Realword rule: og should be å between Ind and Inf (real-og-aa). Context: Vi prøver og gå.

    The å -> og rules

    Realword rule: å should be og between nouns (real-aa-og). Context: Det var Trond å Kari.

    Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Context: Vi må lese å skrive lyrikk.

    Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Not: Det er ikke så lett som man skulle tro å skrive lyrikk.

    Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Context: Vi vil hoppe å/og sprette.

    Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Context: Vi hopper å/og spretter.

    Punctuation rules

    Simple punctuation rules showing how to change the lemma in the suggestions:

    Quotation mark rule: Use correct quotation mark.

    Ellipsis rule: Use the ellipsis character … for three full stops ... (use-ellipsis)


    This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


    tools-grammarcheckers-grc-disambiguator.cg3.md

    The grammarchecker disambiguator for Norwegian Bokmål

    This disambiguator is based upon the disambiguator from OBT (Oslo-Bergen-taggeren), hereafter OBT-cg. It is adjusted to the GiellaLT FST and extended with several rules. It contains the morphological rules only.

    The original OBT disambiguator was written in CG-1 by Kristin Hagen and Anders Nøklestad at UiO. It was translated to CG-2 by Lars Nygård. The conversion to CG-3 and the Tromsø format was done by Trond Trosterud.

    This particular file (grc-disambiguator.cg3) is a version of the above adjusted to grammar checker needs. Mainly, disambiguation rules are relaxed or even commented out.

    NOTE! For reference, removed rules should be marked with the searchable tag grcremoval

    Delimiters and sets

    The tagsets are a superset of the OBT and GiellaLT tags, so that the labels are kept from OBT-cg, but GiellaLT content is added when needed.

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The strict version of items that can only be premodifiers, not parts of the predicate

    to be used together with PRE-NP-HEAD before @>N is disambiguated

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”)

    GRADE-ADV

    Rule section

    Giellatekno early rules

    NotAbbr removes abbreviation readings whenever there are alternatives

    AbbrBeforePara removes CLB before CLB

    Nynorsk removes all +Nynorsk forms (they are in use only for the dictionary interface, and that does not use disambiguation).

    aa

    aaIM selects +IM for å

    Numerals

    Compounds

    Mostly OBT Rules

    The bulk of the file contains rules from the original OBT file.

    Giellatekno late rules

    Neuter sg pl

    Pronouns

    Det rules

    V and not N

    Prepositions

    Late rules, Gt

    Rules with weights

    minweight selects the reading with the lowest weight.
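    The effect of such a weight-based rule can be pictured as follows. This is a hedged Python sketch, not the CG-3 rule itself: the function name, the analysis strings and the weights are invented for the example, and ties are assumed to be kept.

```python
def select_min_weight(readings):
    """Keep only the reading(s) with the lowest weight; ties are all kept."""
    lowest = min(weight for _, weight in readings)
    return [(analysis, weight) for analysis, weight in readings if weight == lowest]


# Made-up analyses and weights for the wordform "hus".
readings = [
    ("hus+N+Neu+Sg+Indef", 0.0),
    ("hus+N+Neu+Pl+Indef", 1.5),
    ("huse+V+Imp", 10.0),
]
selected = select_min_weight(readings)
```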


    This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3


    tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

    Tokeniser for nob

    Usage:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are

    1. unknown word-like forms, and
    2. unmatched strings

    We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    Unknowns are made of:
      • lower-case ASCII
      • upper-case ASCII
      • select extended latin symbols
      • ASCII digits
      • select symbols
      • combining diacritics as individual symbols
      • various symbols from Private area (probably Microsoft), so far:
        • U+F0B7 for “x in box”

    Unknown handling

    Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
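    The behaviour described above can be summarised in a few lines. The Python sketch below is an assumption-laden illustration, not the actual hfst-tokenise logic: analyses are modelled as plain strings, an empty string stands for an empty analysis, and the function name and output format are invented.

```python
def resolve_readings(wordform, analyses):
    """Drop empty analyses; if nothing is left, return one unknown reading."""
    non_empty = [a for a in analyses if a]
    if non_empty:
        return non_empty
    # Unknown: the baseform defaults to the wordform, tagged ??.
    return [f'"{wordform}" ??']


unknown = resolve_readings("ukjend", [""])
known = resolve_readings("ja", ["", '"ja" Interj'])
```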

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


    tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

    Grammar checker tokenisation for nob

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


    tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

    TTS tokenisation for nob

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    make
    echo "ja, ja" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
    boasttu olmmoš, man mielde lahtuid." \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "márffibiillagáffe" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Needs hfst-tokenise to output things differently depending on the tag they get


    This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript