Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-nob
All doc-comment documentation in one large file.
This disambiguator is based upon the disambiguator from OBT (Oslo-Bergen-taggeren), hereafter OBT-cg. It is adjusted to the GiellaLT FST and extended with several rules. It contains the morphological rules only.
The original OBT disambiguator was written in CG-1 by Kristin Hagen and Anders Nøklestad at UiO. It was translated to CG-2 by Lars Nygård. The conversion to CG-3 and the Tromsø format was done by Trond Trosterud.
The tagsets are a superset of the OBT and GiellaLT tags, so that the labels are kept from OBT-cg, but GiellaLT content is added when needed.
NotAbbr removes abbreviation readings whenever alternative readings exist
AbbrBeforePara removes CLB before CLB
Nynorsk removes all +Nynorsk forms (they are in use only for the dictionary interface, and that does not use disambiguation).
aa
aaIM selects +IM for å
The bulk of the file contains rules from the original OBT file.
minweight selects reading with lowest weight.
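As a rough illustration of what a minimum-weight selection does, the Python sketch below keeps only the reading(s) with the lowest weight in a cohort. The `(analysis, weight)` tuple layout and the example values are assumptions made for illustration; they are not taken from disambiguator.cg3.

```python
from typing import List, Tuple

# A reading is modelled as (analysis string, weight). This layout is hypothetical.
Reading = Tuple[str, float]

def select_min_weight(readings: List[Reading]) -> List[Reading]:
    """Keep only the readings whose weight equals the lowest weight in the cohort."""
    if not readings:
        return []
    lowest = min(weight for _, weight in readings)
    return [reading for reading in readings if reading[1] == lowest]

# Invented cohort: two competing analyses with different weights.
cohort = [("analysis-A", 10.0), ("analysis-B", 0.0)]
print(select_min_weight(cohort))
```

Ties are kept on purpose in this sketch, so later rules could still choose between equally weighted readings.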
This (part of) documentation was generated from src/cg3/disambiguator.cg3
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”)
@X : The function is unknown, e.g. because the word is unknown
These were the set types.
Border sets and their complements
Syntactic sets
These were the set types.
Finite verbs
n<titel1 (@N<) for (“jr”) or (“sr”); if first one to the left is Prop
n<titel2 (@N<) for INITIAL; if first one to the left is a noun, or if to the left of you is a single letter which is part of a noun conjunction, as in North Sámi “bustávas e ja f gáibiduvvo”
advlPoPr> (@<ADVL) for Po or Pr; if mainverb to the right.
BOSPo> (@ADVL>) for Po; if trapped between BOS to the right and S-BOUNDARY OR COMMA to the left, because the main verb will then automatically be on your right side.
advl>inbetween (@ADVL>) for Adv; if in between two sentence boundaries where no mainverb is present.
The analyser gives a double analysis because semantic tags are optional. We go for the reading with the semantic tag.
This (part of) documentation was generated from src/cg3/functions.cg3
Sets for POS sub-categories
Sets for Semantic tags
Sets for Morphosyntactic properties
Sets for verbs
V is all readings with a V tag in them, REAL-V should
be the ones without an N tag following the V.
The REAL-V set thus awaits a fix to the preprocess V … N bug.
The set COPULAS is for predicative constructions
NP sets defined according to their morphosyntactic features
The PRE-NP-HEAD family of sets
These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”)
Miscellaneous sets
Border sets and their complements
Syntactic sets
These were the set types.
First map all COMP-CS<, then remove the other readings
killAllnotComp Removes analyses which are not @COMP-CS<
This was the kill all not Comp rule!!
Mostly we map both @CNP and @CVP, then we select @CNP, after that we remove them so @CVP remains
cnpCompSC Map @CNP if @COMP-CS< or COMPAR ahte
cnpCompSpec special rule because of PrfPrc = VFIN
CSasCVP Map @CVP to CS
CCasCNPCVP Map (@CNP @CVP) to CC
killAllCNP removes all remaining @CNP
XCC-CS removes CC and CS with no synttag
The rules are not documented yet
-FAUXVaux AUX verbs
+FMAINVCop copulas even if PrfPrc coming after
+FAUXVCop copulas coming before the mainverb
+FMAINVAux1
+FAUXVCop copulas coming after the mainverb
+FMAINVCop copulas
+FMAINV to the remaining finite verbs which are not AUX
+FMAINV to finite verb after mainverb
-FAUXVPrfPrcAux to PrfPrc AUX before Inf or Actio Ess
-FMAINVPrfPrc to PrfPrc
-FMAINVPrfPrccoord to PrfPrc coordination
-FMAINVPrfPrccoord to PrfPrc coordination
-FMAINVPrfbeforeAux to PrfPrc before the Aux
-FMAINVPrfafterMan to PrfPrc before the Aux
-FMAINVInf to Inf
+FAUXV to Aux
PrfPrcEllipsis being verbal head when finite verb is missing
subj>Sgnr2 (@SUBJ>) for Nom Sg; if VFIN + Sg3 to the right.
<subjSg (@<SUBJ) for Nom Sg; if VFIN Sg3 or Du2 to the left (no HAB allowed to the left).
This (part of) documentation was generated from src/cg3/nob-functions.cg3
**LEXICON ab-noun **
**LEXICON ab-adj **
**LEXICON ab-adv **
**LEXICON ab-num **
**LEXICON ab-nodot-noun ** The bulk
**LEXICON ab-nodot-adj **
**LEXICON ab-nodot-adv **
**LEXICON ab-nodot-num **
**LEXICON ab-dot-noun ** This is the lexicon for abbrs that must have a period.
**LEXICON ab-dot-adj ** This is the lexicon for abbrs that must have a period.
**LEXICON ab-dot-adv ** This is the lexicon for abbrs that must have a period.
**LEXICON ab-dot-num ** This is the lexicon for abbrs that must have a period.
**LEXICON ab-dot-cc **
**LEXICON ab-dot-verb **
**LEXICON ab-dot-IVprfprc **
**LEXICON DOT ** - Adds the dot to dotted abbreviations.
This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc
a23
This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc
Declension classes Main types, from Bokmålsordboka
| code | sg.ind. | sg.def. | pl.ind. | pl.def. |
|---|---|---|---|---|
| f1 | bru | brua | bruer | bruene |
| f2 | pumpe | pumpa | pumper | pumpene |
| f3 | søster | søstera | søstre/søstrer | søstrene |
| m1 | stol | stolen | stoler | stolene |
| | bakke | bakken | bakker | bakkene |
| | pumpe | pumpen | pumper | pumpene |
| m2 | lærer | læreren | lærere | lærerne |
| m3 | bever | beveren | bevere | beverne |
| | | | bevrer | bevrene |
| | | | bevre | bevrene |
| m4 | longs | longsen | longs/longser | longsene |
| n1 | slott | slottet | slott | slotta/slottene |
| n2 | eple | eplet | epler | epla/eplene |
| | salt | saltet | salter | salta/saltene |
| n3 | kontor | kontoret | kontor/kontorer | kontora/kontorene |
| | høve | høvet | HØVE/høver | høva/høvene |
| | middel | midlet | MIDDEL/midler | midla/midlene |
| n4 | salt | saltet | salter | salta/saltene ?? |
| n5 | middel | midlet | midler | midla/midlene ?? |
| n6 | kammer | kammeret | kamre/kammer | kamra/kamrene |
Subtypes, mainly from Finsk-norsk ordbok, also system-specific
code | comment |
---|---|
x | unclassified, to m1 by default |
mX | indecl |
m1sg | sg only |
m1pl | pl only |
m1b | dam |
m1b | fe, komite |
m1V | sko pl. sko, skoa/skoene |
m1Vb | byte, pl. byte/byter, bytene |
m1Vc | glipp, pl. glipp, glippene |
m3V | meter pl. meter |
m3b | finger pl. fingrer/fingre |
m3c | forelder pl. foreldre |
ma | alliert, allierte, allierte, allierte |
KOLLEGA | kollegaer, kolleger |
KONTO | kontoer, konti |
RADIUS | radiuser, radii |
BROR | brødre |
FAR | fedre |
MANN | menn |
mD | gårde, garde, dage (av gårde) |
fD | tide (i tide) |
nD | live (i live) |
DATTER | døtre |
f1b | skam |
f1X | bok, pl. bøker |
f1V | mus, pl. mus |
fGLO | glo, pl. glør |
f3b | lever, def. levra |
n1b | rom, def. rommet |
n1n1b | publikum, def. publikumet/publikummet |
n1s | sg only |
n2b | program, pl. programmer |
n2c | kontor, pl. kontor, kontorer |
n2s | mørke, not pl. |
n3b | lager, def. lageret |
n3c | fe, feet |
n3d | søppel, søppelet/søplet, søppel/søpler, søpla/søplene |
n4b | faktum, pl. fakta |
FORUM | forum, forumet, fora/forumer, foraene/forumene |
LEKSIKON | leksikon, pl. leksika |
nMUSEUM | museum, museet, museer |
nØYE | |
+N+Fem+Sg+Def+Radical:datra K ; +N: R ;
NO CODE
for nynorsk only.
NO CODE
for nynorsk only.
This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc
numtag for all numerals
numtagsg for en
LEXICON ARABICCASEORD adds +Arab
LEXICON ARABICCASE adds +Arab
LEXICON ARABICCASES adds +Arab
LEXICON ARABICCOMPOUNDS ! arabic as first part,
LEXICON ARABICCASECOLL adds +Arab
LEXICON ARABICCASE0 adds +Arab
This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc
FirstTag
PROP
PROP-surmal
PROP-malfem
… one lexicon for each combined tag, to split them.
This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc
Noun_symbols_possibly_inflected
Noun_symbols_never_inflected
SYMBOL_connector
SYMBOL_NO_suff
This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc
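Main types, from Bokmålsordboka

| code | inf. | pres. | pret. | perf.part. |
|---|---|---|---|---|
| v1 | kaste | kaster | kasta | kasta |
| | | | kastet | kastet |
| v2 | lyse | lyser | lyste | lyst |
| | reparere | reparerer | reparerte | reparert |
| v3 | leve | lever | levde | levd |
| v4 | nå | når | nådde | nådd |
| v4 | bie | bier | bidde | bidd |

Subtypes

code | comment |
---|---|
v12 | v1 or v2 |
v13 | v1 or v3 |
v14 | v1 or v4 |
v1-s | passive v1 verbs |
v2-s | passive v2 verbs |
v3-s | passive v3 verbs |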
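The relationship between the main classes and the combined subtypes can be sketched in Python as follows. The ending lists are read off the example paradigms above (kaste, lyse, leve, nå, bie); the stem handling is a deliberate simplification, and the names `ENDINGS`, `COMBINED` and `past_forms` are hypothetical, not part of verbs.lexc.

```python
# Preterite endings per main class, read off the example paradigms.
ENDINGS = {
    "v1": ["a", "et"],   # kasta, kastet
    "v2": ["te"],        # lyste
    "v3": ["de"],        # levde
    "v4": ["dde"],       # nådde, bidde
}

# Combined subtypes are modelled as unions of main classes.
COMBINED = {"v12": ["v1", "v2"], "v13": ["v1", "v3"], "v14": ["v1", "v4"]}

def stem_of(infinitive: str) -> str:
    """Strip a final -e; vowel-final infinitives such as 'nå' are left unchanged."""
    return infinitive[:-1] if infinitive.endswith("e") else infinitive

def past_forms(infinitive: str, verb_class: str) -> list:
    """Return all preterite forms licensed by the class (no NG filtering)."""
    stem = stem_of(infinitive)
    classes = COMBINED.get(verb_class, [verb_class])
    return [stem + ending for cls in classes for ending in ENDINGS[cls]]

print(past_forms("kaste", "v1"))   # forms for a plain v1 verb
print(past_forms("score", "v12"))  # forms for a verb taking both v1 and v2 endings
```

The sketch ignores the NG ("do not generate") distinction mentioned for the individual lexica, so it lists every form a class combination allows.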
LEXICON vx points to v1.
LEXICON v12 for both v1 and v2 past forms, or: score -> scoret, scorte (NG = do not generate)
LEXICON v12et for verbs with v2 and the -et forms of v1, like “skynde” (but not “tilskynde”, “framskynde” etc.)
LEXICON v13 also here: v1, v3: sveve -> svevet (NG), svevde.
LEXICON v13et for verbs with v3 and the -et forms of v1, like “tygge”
LEXICON v23
LEXICON v14 where v4 is NG
LEXICON v1 = kaste
LEXICON v2 = blåse, studere
LEXICON v3 = leve
LEXICON v4 = ro, bie
LEXICON v1-s = undres
LEXICON v2-s = føles, synes
LEXICON v3-s = trives
LEXICON inf-prsptc =
LEXICON regpres =
LEXICON r-pres =
LEXICON a-et-pret =
LEXICON et-pret =
LEXICON te-pret =
LEXICON de-pret =
LEXICON dde-pret =
LEXICON prsptcsuff =
This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc
This (part of) documentation was generated from src/fst/morphology/compounding.lexc
This file documents the phonology.twolc file
We declare both the a-å letters and all other possible letters.
Morpheme boundaries and escaped quotes - do not delete in twolc, they will be converted to zero/the real thing at a later stage.
These symbols cause the twolc rules to work.
This section shows the twolc rules and the tests used to check whether they work
Umlaut Rule for bok : bøker etc. It shifts the vowels u, o, a, å to y, ø, e, e, respectively when Z1 is found after the stem.
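The effect of the umlaut rule can be approximated with a small Python sketch. It assumes a lexical string in which the trigger `Z1` sits directly after the stem (the string `bokZ1er` is a hypothetical representation), and it assumes that the last umlautable vowel of the stem is the one that shifts; the real twolc contexts may be stricter.

```python
# Vowel shifts named in the rule description: u, o, a, å -> y, ø, e, e.
UMLAUT = {"u": "y", "o": "ø", "a": "e", "å": "e"}
TRIGGER = "Z1"  # trigger symbol found after the stem

def apply_umlaut(lexical: str) -> str:
    """Shift the last umlautable stem vowel if the trigger is present, then drop the trigger."""
    if TRIGGER not in lexical:
        return lexical
    stem, suffix = lexical.split(TRIGGER, 1)
    # Walk backwards through the stem to find the last vowel that can shift.
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            stem = stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
            break
    return stem + suffix

print(apply_umlaut("bokZ1er"))
```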
Epenthetic Deletion Rule is actually 3 rules in one: 1) it deletes -e- in moden : modne etc, 2) it deletes the stem -e in hare + -er and 3) it deletes suffix -e in ærlig + est > ærligst
Tests: (a star denotes a negative test, i.e. a test that is supposed to fail)
Delete foreign vowel Rule for deleting final a or o in words like kollega : kolleger. Trigger symbol to the right is X2.
Tests:
Consonant shortening before deletion Rule
Tests:
Geminate deletion in front of -t and -d Rule deletes: 1) before Q3 and d or t (kaller:kalte) 2) before passive Q1 t (lykkes:lyktes) and 3) before epenthetic -e- and l, n or r (sikker:sikre)
Tests:
Delete r Rule deletes r in plural -er to get -er + -ne = plural -ene
Delete m Rule for kam:kammen, here we delete the second m when word-final.
um Deletion 1 Rule (um Deletion 2 is now part of the Delete m Rule)
Tests:
t weakening Rule
Tests:
Double t deletion Rule
Tests:
Insert t in passives Rule
Tests:
Clitic after s-final Rule for changing the so-called genitive -s to ‘ for s-final stems: huss -> hus’
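A minimal Python sketch of the surface effect of this rule, assuming only the behaviour stated above (an s-final stem takes an apostrophe instead of a second s). The function name is hypothetical and a plain ASCII apostrophe is used.

```python
def add_genitive(stem: str) -> str:
    """Attach the genitive clitic: an apostrophe after s-final stems, -s otherwise."""
    if stem.endswith("s"):
        return stem + "'"   # huss -> hus'
    return stem + "s"

print(add_genitive("hus"))
print(add_genitive("bil"))
```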
Change -er stem to -ar in Nynorsk
This rule is for dictionary use only. The idea is to be able to click on words in a Nynorsk text and get translation to North Sámi. Therefore, the Bokmål analyser is able to give an analysis to Nynorsk words as well. The Nynorsk-only forms are removed from all other transducers than the -dict transducer.
Test to have an error
This (part of) documentation was generated from src/fst/morphology/phonology.twolc
This documents the symbols and intro lexicon of Norwegian Bokmål.
Multichar_Symbols
Here we declare the tags and all other multicharacter symbols.
NDS analyser tags
+Radical For dictionary testing, Radical Bokmål
%[%>%] - Literal >
%[%<%] - Literal <
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are allowed only if the verb is further derived into a noun again:
Flag | Comment |
---|---|
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
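How the P, D, C and U flag operators interact along one path can be sketched in Python. This follows the general flag diacritic semantics (P sets a feature, D blocks a path when the feature has the given value, C clears it, U unifies); where exactly the lexicon places each flag is not shown here, so the example paths are hypothetical.

```python
import re

# Matches @OP.Feature.Value@ and @OP.Feature@ for the four operators used here.
FLAG = re.compile(r"@([PDCU])\.([^.@]+)(?:\.([^.@]+))?@")

def path_allowed(flags) -> bool:
    """Return True if the flag diacritics met along a path are mutually consistent."""
    state = {}
    for flag in flags:
        match = FLAG.fullmatch(flag)
        if match is None:
            raise ValueError(f"not a supported flag diacritic: {flag}")
        op, feature, value = match.groups()
        if op == "P":                       # positive set
            state[feature] = value
        elif op == "C":                     # clear
            state.pop(feature, None)
        elif op == "D":                     # disallow
            if value is None and feature in state:
                return False
            if value is not None and state.get(feature) == value:
                return False
        elif op == "U":                     # unify
            if feature in state and state[feature] != value:
                return False
            state[feature] = value
    return True

# Hypothetical path: a verb sets NeedNoun, nothing clears it, the end lexicon blocks it.
print(path_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
# Hypothetical path: a nominalising suffix clears the flag before the end lexicon.
print(path_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```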
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
Flag | Comment |
---|---|
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
The tags are of the following form:
This entry / word should be in the following position(s):
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
Flag | Comment |
---|---|
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
Flag diacritic | Explanation |
---|---|
@U.number.one@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.two@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.three@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.four@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.five@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.six@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.seven@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.eight@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.nine@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.zero@ | Flag used to give arabic numerals in smj different cases ; |
This table shows the codes for nominal and verbal inflection. Irregular inflection has separate codes:
| code | sg.ind. | sg.def. | pl.ind. | pl.def. |
|---|---|---|---|---|
| f1 | bru | brua | bruer | bruene |
| f2 | pumpe | pumpa | pumper | pumpene |
| m1 | stol | stolen | stoler | stolene |
| | bakke | bakken | bakker | bakkene |
| | pumpe | pumpen | pumper | pumpene |
| m2 | lærer | læreren | lærere | lærerne |
| m3 | bever | beveren | bevere | beverne |
| | | | bevre(r) | bevrene |
| n1 | slott | slottet | slott | slotta/slottene |
| n2 | eple | eplet | epler | epla/eplene |
| | salt | saltet | salter | salta/saltene |
| n3 | kontor | kontoret | kontor | kontora |
| | | | kontorer | kontorene |
| | høve | høvet | høve/høver | høva/høvene |
| a1 | god | god | godt | gode |
| a2 | norsk | norsk | norsk | norske |
| a3 | ekte | ekte | ekte | ekte |
| a4 | oppskjørtet | oppskjørtet | oppskjørtet | oppskjørtede/oppskjørtete |
| a5 | makaber | makaber | makabert | makabre |
| | lunken | lunken | lunkent | lunkne |
| v1 | kaste | kaster | kasta | kasta |
| | | | kastet | kastet |
| v2 | lyse | lyser | lyste | lyst |
| v3 | leve | lever | levde | levd |
| v4 | nå | når | nådde | nådd |
| v4 | bie | bier | bidde | bidd |
And this is the ENDLEX of everything:
@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;
The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged with +CmpNP/Only to end here.
The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
This (part of) documentation was generated from src/fst/morphology/root.lexc
This file documents the Bokmål adjective stem file stems/adjectives.lexc.
Main types, from Bokmålsordboka
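code | masc. sg. | fem. sg. | neut. sg. | pl. |
---|---|---|---|---|
a1 | god | god | godt | gode |
a2 | billig | billig | billig | billige |
a3 | ekte | ekte | ekte | ekte |
a4 | oppskjørtet | oppskjørtet | oppskjørtet | oppskjørtede/oppskjørtete |
a5 | makaber | makaber | makabert | makabre |
a5 | lunken | lunken | lunkent | lunkne |
aV | blå | blå | blått | blå |

… and some irregular ones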
AdjectiveRoot is the list of adjectives (some 5500 stems)
vond: VOND ;
bundet a4 ;
This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc
This file documents the Bokmål adverb stem file stems/adverbs.lexc.
LEXICON adv adds the tag +Adv
LEXICON dt also adds +Adv; perhaps unify the two, perhaps not.
Adverb lists some 600 Norwegian adverbs, including multiword expressions (MWEs) such as “i live”
This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc
This file documents the Bokmål conjunctions stem file stems/conjunctions.lexc.
conj for the tag +CC
Conjunction både, og, ..
This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc
This file documents the Bokmål interjections stem file stems/interjections.lexc.
LEXICON ij adds the tag +Interj
LEXICON Interjection lists folkens, heisann, pokker and some 60 more interjections.
This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc
This file documents the Bokmål abbreviations stem file stems/nob-abbreviations.lexc.
Abbreviation-nob
These give clause boundaries before capital letters and numbers, but not elsewhere.
Vi bor i Sth. CLB 10 av oss er innflyttere. Vi bor i Sth. CLB Saara er også innflytter. Vi vet at Sth. er en fin by.
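The decision illustrated by the three example sentences can be sketched in Python. The function name is hypothetical, and the check is reduced to the first character of the following token, which is an assumption about how "capital letters and numbers" is to be read.

```python
def clb_after_abbreviation(next_token: str) -> bool:
    """True if a clause boundary (CLB) is assumed after the abbreviation."""
    if not next_token:
        return False
    first = next_token[0]
    return first.isupper() or first.isdigit()

# Tokens following "Sth." in the three example sentences.
for token in ["10", "Saara", "er"]:
    print(token, clb_after_abbreviation(token))
```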
ITRAB
Transitive number-related abbreviations
These ones are transitive when followed by numbers or singleton letters, and intransitive elsewhere.
Gården har Gnr. 10. Gården har Gnr. 5. a. Alle gårder har ikke Gnr. CLB Det er et problem. Alle gårder har ikke Gnr. og det er et problem.
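For the number-related abbreviations, the transitive/intransitive choice shown in the examples can be sketched like this in Python. Stripping a trailing period before testing the token is an assumption made so that "10." and "a." from the examples are recognised; the function name is hypothetical.

```python
def is_transitive_context(next_token: str) -> bool:
    """True if the abbreviation is read as transitive: next token is a number or a single letter."""
    core = next_token.rstrip(".")
    if not core:
        return False
    return core.isdigit() or (len(core) == 1 and core.isalpha())

# Tokens following "Gnr." in the example sentences.
for token in ["10.", "5.", "a.", "Det", "og"]:
    print(token, is_transitive_context(token))
```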
TRNUMAB
TRAB
dot% noStb.db Abbreviations that never induce sentence boundaries. The file is too large and should be shrunk.
This (part of) documentation was generated from src/fst/morphology/stems/nob-abbreviations.lexc
This file documents the Bokmål proper nouns stem file stems/nob-propernouns.lexc.
LEXICON ProperNoun-nob-nocomp contains some acronyms
LEXICON ProperNoun-nob contains the list of 2200 or so names. The rest come from common files.
Adjectives
Nouns
This (part of) documentation was generated from src/fst/morphology/stems/nob-propernouns.lexc
This file documents the Bokmål noun stem file stems/nouns.lexc.
Main types, from Bokmålsordboka
| code | sg.ind. | sg.def. | pl.ind. | pl.def. |
|---|---|---|---|---|
| f1 | bru | brua | bruer | bruene |
| f2 | pumpe | pumpa | pumper | pumpene |
| f3 | søster | søstera | søstre/søstrer | søstrene |
| m1 | stol | stolen | stoler | stolene |
| | bakke | bakken | bakker | bakkene |
| | pumpe | pumpen | pumper | pumpene |
| m2 | lærer | læreren | lærere | lærerne |
| m3 | bever | beveren | bevere | beverne |
| | | | bevrer | bevrene |
| | | | bevre | bevrene |
| m4 | longs | longsen | longs/longser | longsene |
| m5 | handelsreisende | … | | |
| n1 | slott | slottet | slott | slotta/slottene |
| n2 | eple | eplet | epler | epla/eplene |
| | salt | saltet | salter | salta/saltene |
| n3 | kontor | kontoret | kontor/kontorer | kontora/kontorene |
| | høve | høvet | HØVE/høver | høva/høvene |
| n4 | salt | saltet | salter | salta/saltene ?? |
| n5 | middel | midlet | MIDDEL/midler | midla/midlene ?? |
| n6 | kammer | kammeret | kamre/kammer | kamra/kamrene |
Subtypes
code | comment |
---|---|
mx | unclassified, to m1 by default |
mX | indecl |
m1sg | sg only |
m1pl | pl only |
m1b | dam |
m1b | fe, komité |
m1V | sko pl. sko, skoa/skoene |
m1Vb | byte, pl. byte/byter, bytene |
m1Vc | glipp, pl. glipp, glippene |
m3V | meter pl. meter |
m3r | sykkel, vinkel vinkelen, vinkler, vinklene |
ma | alliert, allierte, allierte, allierte |
KOLLEGA | kollegaer, kolleger |
mKONTO | kontoer, konti |
mRADIUS | radiuser, radii |
mBROR | brødre |
mFAR | fedre |
mMANN | menn |
mD | gårde, garde, dage (av gårde) |
fD | tide (i tide) |
nD | live (i live) |
fDATTER | døtre |
f1b | skam |
f1X | bok pl. bøker |
f1V | mus, pl. mus |
nX | styrbord, zoo. indecl. |
n1b | rom pl. rom |
n1sg | sg only |
n2b | program pl. programmer |
n2c | kontor pl. kontor, kontorer |
n2s | mørke, not pl. |
n3b | lager def. lageret |
n3c | fe, feet |
n4b | faktum, faktumet, fakta, faktaene |
FORUM | forum, forumet, fora/forumer, foraene/forumene |
nLEKSIKON | leksikon, pl. leksika |
nMUSEUM | museum, museet, museer |
n1pl | odds, oddsene |
LEXICON FinalNoun is a separate lexicon to point to. For now it contains only -skap.
LEXICON NounRoot is the lexicon pointed to from root.lexc
It points to
Noun ;
HyphNouns ;
LEXICON HyphNouns contains forms that are used only as the first part of compounds, like barne. TODO: Perhaps these should not be listed.
LEXICON ShortNounRoot The lexicon points to two lexica which are kept separate in order not to allow them in compounding (rusle = rus + le) 2_letter ; 3_letter ;
LEXICON 2_letter is stems with two letters.
LEXICON 3_letter is stems with 3 letters
LEXICON Noun here comes the long list of stems (tens of thousands)
TODO: Go through mx.
This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc
This file documents the Bokmål numerals stem file stems/numerals.lexc.
LEXICON Numeral
LEXICON Textual
LEXICON TEXTTHOUSANDS
LEXICON 1000CONT
LEXICON TEXTHUNDREDS
LEXICON 100CONT
LEXICON TEXTTENS
LEXICON TEXTTENSCONT
LEXICON TEXTTEENS
LEXICON TEXTONES
LEXICON 2-9
LEXICON ORDTEXT
This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc
for tolerant dictionary reading
This file documents the nynorsk stem file for the bokmål analyzer stems/nynorsk-stems.lexc.
LEXICON Prnyn
LEXICON Advnyn
LEXICON Anyn
LEXICON Vnyn
LEXICON Propnyn
LEXICON Pronnyn
LEXICON nnnb
LEXICON Nynorsk here come all the words
This (part of) documentation was generated from src/fst/morphology/stems/nynorsk-stems.lexc
This file documents the Bokmål prepositions stem file stems/prepositions.lexc.
LEXICON p gives tag +Pr
LEXICON Preposition list (appr 90 prepositions)
This (part of) documentation was generated from src/fst/morphology/stems/prepositions.lexc
This file documents the Bokmål pronouns stem file stems/pronouns.lexc.
LEXICON Pronoun
LEXICON Personal
LEXICON Reflexive
LEXICON Reciprocal
LEXICON Interrogative
LEXICON Possessive
LEXICON Other_Pronouns
This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc
This file documents the Bokmål subjunctions stem file stems/subjunctions.lexc.
LEXICON Subjunction
LEXICON subj gives tag +CS
This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc
This file documents the Bokmål verb stem file stems/verbs.lexc.
Main types, from Bokmålsordboka
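| code | inf. | pres. | pret. | perf.part. |
|---|---|---|---|---|
| v1 | kaste | kaster | kasta | kasta |
| | | | kastet | kastet |
| v2 | lyse | lyser | lyste | lyst |
| | reparere | reparerer | reparerte | reparert |
| v3 | leve | lever | levde | levd |
| v4 | nå | når | nådde | nådd |
| v4 | bie | bier | bidde | bidd |

Subtypes

code | comment |
---|---|
v12 | v1 or v2 |
v13 | v1 or v3 |
v1-s | passive v1 verbs |
v2-s | passive v2 verbs |
v3-s | passive v3 verbs |

Strong verbs have verb-specific lexica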
LEXICON VerbRoot contains the 5700 or so verbs
tilslutte v1 ;
This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc
retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628
bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614
alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\
labiodental approximant P (or v\)
alveolar approximant r\
retroflex approximant r\`
velar approximant M\
retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L\
Clicks
bilabial O\ (O = capital letter)
dental |\
(post)alveolar !\
palatoalveolar =\
alveolar lateral |\|\
Ejectives, implosives
ejective _> e.g. ejective p p_>
implosive _< e.g. implosive b b_<
Vowels
close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U
close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7
schwa @ ə
open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O
ash (ae digraph) {
open schwa (turned a) 6
open front rounded &
open back unrounded A
open back rounded Q
Other symbols
voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\
alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _
Suprasegmentals
primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -\
Tones and word accents
level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)
contour, rising _R
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
global rise <R>
global fall <F>
voiceless _0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=) e.g. n= (or n_=)
non-syllabic _^
rhoticity `
breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j) e.g. t' (or t_j)
velarized _G
pharyngealized _?\
dental _d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A_~)
nasal release _n
lateral release _l
no audible release _}
velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
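The chart above pairs ASCII (X-SAMPA style) symbols with IPA characters and their Unicode code points. The Python sketch below shows one way such a mapping could be applied with greedy longest-match lookup. It covers only a handful of rows from the chart, and the function `to_ipa` is a hypothetical helper; it does not claim to mirror how txt2ipa.xfscript is implemented.

```python
# A few ASCII -> IPA pairs taken from the chart (not the full inventory).
ASCII_TO_IPA = {
    "F": "ɱ", "J": "ɲ", "N": "ŋ", "N\\": "ɴ", "B\\": "ʙ", "R\\": "ʀ",
    "4": "ɾ", "S": "ʃ", "Z": "ʒ", "z`": "ʐ", "T": "θ", "D": "ð",
}

def to_ipa(text: str) -> str:
    """Replace known ASCII symbols by IPA, trying longer symbols first."""
    symbols = sorted(ASCII_TO_IPA, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for symbol in symbols:
            if text.startswith(symbol, i):
                out.append(ASCII_TO_IPA[symbol])
                i += len(symbol)
                break
        else:
            out.append(text[i])  # unknown characters pass through unchanged
            i += 1
    return "".join(out)

print(to_ipa("SaN"))
# The hexadecimal and decimal columns of the chart denote the same code point:
print(ord("ŋ") == 0x014B == 331)
```

Trying two-character symbols before one-character ones keeps `N\` from being read as `N` followed by a stray backslash.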
This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript
We describe here how abbreviations in Norwegian Bokmål are read out, e.g. for text-to-speech systems.
For example:
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc
This file contains two parts: Definitions and rules
DELIMITERS = "<.>" "<!>" "<?>" "<…>" "<¶>";
Here we declare all grammatical tags
SET FAUXV = @-FAUXV OR @+FAUXV ;
INITIAL = small letters, CAP-INITIAL = capital letters
LIST hj-tv-V = "ha" "få" ;
LIST WORD = N A Adv V Pron CS CC Po Pr Interj Pcle Num ABBR ACR ? ;
SET NOT-VERB = WORD - V ;
SET VFIN = V-MOOD ;
LIST QUASIAUX = "akte" "anbefale" "begynne" "behøve" "bli" "forsøke" "fortsette" "forvente" "gidde" "glemme" "huske" "klare" "like" "lære" "nekte" "orke" "prøve" "risikere" "slippe" "slutte" "synes" "søke" "tenke" "trenge" "tørre" "unngå" "velge" "vurdere" "være" "ønske" ;
SET NOT-PRFPRC = WORD - PrfPrc ;
All active verbs with a TV tag, including V:
SET NP-HEAD-ACC = (Pron Acc) OR N OR A - RCmpnd ;
These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.
The strict version of items that can only be premodifiers, not parts of the predicate
to be used together with PRE-NP-HEAD before @>N is disambiguated
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”)
SET NOT-NPMOD = WORD - PRE-NP-HEAD ;
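To make the "scan to the next NP head" idea concrete, here is a small Python emulation of a `*1 N BARRIER NOT-NPMOD` test. It is a sketch under several assumptions: each token carries a single reading represented as a set of tags, `PRE_NP_HEAD` is a simplified subset of the real set, and the target is tested before the barrier on each cohort.

```python
# Simplified stand-in for PRE-NP-HEAD (the real set contains more members).
PRE_NP_HEAD = {"A", "Num", "CC", "ABBR"}

def is_noun(tags: set) -> bool:
    return "N" in tags

def not_npmod(tags: set) -> bool:
    """Emulates NOT-NPMOD = WORD - PRE-NP-HEAD for single-reading cohorts."""
    return not (tags & PRE_NP_HEAD)

def scan_right(cohorts, start, target, barrier):
    """Scan rightwards from position start+1; return the index of the first
    cohort matching target, or None if a barrier cohort is met first."""
    for i in range(start + 1, len(cohorts)):
        tags = cohorts[i]
        if target(tags):      # assumption: target is checked before barrier
            return i
        if barrier(tags):
            return None
    return None

# verb, adjective, adjective, noun: the scan reaches the noun at index 3
print(scan_right([{"V"}, {"A"}, {"A"}, {"N"}], 0, is_noun, not_npmod))
# verb, adverb, noun: the adverb is not a premodifier, so the scan stops
print(scan_right([{"V"}, {"Adv"}, {"N"}], 0, is_noun, not_npmod))
```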
SET NOT-NPMOD-ACC-ADV = NOT-NPMOD - Acc - Adv OR ABBR ;
SET S-BOUNDARY = CP OR BOUNDARYSYMBOLS OR @CVP ;
SET APP-BOUNDARY = REAL-CLB OR VFIN OR Inf OR Recipr OR Pr OR Pcle OR Interj OR CS OR CP OR PrfPrc - @>N ; A special barrier used with mapping of appositions.
SET SVF-BOUNDARY = S-BOUNDARY OR VFIN ; This set is meant to be used in rules that disambiguate on the basis of verbs or verb sets. Here we search for either an S-BOUNDARY or a finite verb, either aux or main.
These were the set types.
There are 20 or so different rule tags, see the rule section below.
Speller suggestions rule – add &SUGGESTWF to any spelling suggestion that we actually want to suggest to the user.
Speller rule: Add typo to misspelled words The simplest is to just add it to all spelled words:
Speller rule: Do not mark misspelled words in quotes But perhaps you want to only suggest spellings of words that are not inside “quotes”:
Ensure preceding adjective agrees with noun
Agreement rule: masculine adjectives should be neuter (msyn-agr-adjmsc-adjneu). Context: Et fin/fint hus.
Agreement rule: Singular adjectives should be plural (msyn-agr-adjsg-adjpl). Context: mange organisert/organiserte fritidsaktiviteter.
Agreement rule: Neuter adjectives should be masculine (msyn-agr-adjneu-adjmsc). Context: En fint/fin båt.
Agreement rule: Masculine definite determiners should be neuter (msyn-agr-detmsc-detneu). Context: den/det huset.
Agreement rule: Masculine indefinite determiners should be neuter (msyn-agr-detmsc-detneu). Context: en/et land.
Agreement rule: Neuter definite determiners should be feminine (msyn-agr-detneu-detfem). Context: det/den boka.
Agreement rule: Neuter indefinite determiners should be feminine (msyn-agr-detneu-detfem). Context: et/ei bok.
Agreement rule: Neuter indefinite determiners should be feminine (msyn-agr-detneu-detfem). Context: et/ei realitetens kvinne.
Agreement rule: Neuter indefinite determiners should be feminine (msyn-agr-detneu-detfem). Context: et/ei realitetens kvinne.
Agreement rule: Neuter indefinite determiners should be masculine (msyn-agr-detneu-detmsc). Context: et/en studie.
Agreement rule: Neuter indefinite determiners should be masculine (msyn-agr-detneu-detmsc). Context: et/en studie.
Agreement rule: Neuter adjectives should be masculine (msyn-agr-detneu-detmsc). Context: et/en … båt.
Agreement rule: same rule but for Pron
Definiteness rule: Double definiteness. Context: disse grunner/grunnene
Definiteness rule: Double definiteness. Context: de sosiale aspekter/aspektene The rule gave too many false alarms, we skip it.
Agreement rule: Indef after quantifier. (msyn-qucompl-def-indef). Context: Vi har mange bøkene/bøker.
Agreement rule: Pl instead of Sg after quantifier. (msyn-qucompl-sg-pl). Context: Vi har mange ulike utfordring
Comparative rule: Quantor in superlative: de flere/fleste ulike kulturene
Predicative: neuter adjective should be masculine (msyn-pred-adjneu-adjmsc). Context: Båten var fint/fin.
Predicative: masculine adjective should be neuter (msyn-pred-adjmsc-adjneu). Context: Eplet var god/godt.
Agreement rule: Context: Eplet var god/godt.
Agreement rule: Context: Eplet var god/godt.
Agreement rule: Context: Eplene var god/gode.
Agreement rule: Context: Jeg spiste et eple som var god/godt.
Agreement rule: Context: Jeg har en bil som er rødt/rød.
Agreement rule: Context: Jeg har ei hytte som er rødt/rød.
Agreement rule: Context: Jeg har biler som er fin
Agreement rule: Context: Eplet som jeg spiste var grønn/grønt
Agreement rule: Context: Bilen som jeg kjørte var grønt.
Agreement rule: Context: Hytta som jeg eier er fint.
Agreement rule: with relative clause Context: Bilene som jeg kjørte var grønt/grønn
Case rules so far: Nominative pronouns should be accusative
Agreement rule: The context is P-complement. (msyn-pron-nom-acc). Context: Vi snakker om du.
Verb rule: Infinitive and no finite form in the sentence (msyn-v-inf-pres). Context: Jeg like/liker peanøtter.
Verb rule: Verb error: Present tense should be infinitive (msyn-v-pres-inf). Context: Jeg vil skriver et brev.
Realword rule: og should be å (real-og-aa). Context: Det er ikke til og holde ut.
Realword rule: og should be aa between Ind and Inf (real-og-aa). Context: Vi prøver og gå.
Realword rule: å should be og between nouns (real-aa-og). Context: Det var Trond å Kari.
Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Context: Vi må lese å skrive lyrikk.
Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Not: Det er ikke så lett som man skulle tro å skrive lyrikk.
Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Context: Vi vil hoppe å/og sprette.
Realword rule: å should be og between similar verbforms except 2nd V = obj (real-aa-og). Context: Vi hopper å/og spretter.
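The "similar verb forms" pattern behind real-aa-og can be illustrated outside CG. The following is a minimal Python sketch, not the grammar checker's actual rule: the token layout, the tag names `Inf` and `Prs`, and the function `flag_aa_between_verbs` are assumptions made for this example, and the exception for a second verb that is an object is deliberately left out.

```python
from typing import List, Tuple

# (wordform, verb-form tag, or "" for non-verbs); hypothetical layout for this sketch
Token = Tuple[str, str]

# Assumed tag names for infinitive and present tense
VERB_FORMS = {"Inf", "Prs"}


def flag_aa_between_verbs(tokens: List[Token]) -> List[int]:
    """Return indices of 'å' tokens flanked by two verbs of the same form."""
    hits = []
    for i in range(1, len(tokens) - 1):
        word = tokens[i][0]
        left_form = tokens[i - 1][1]
        right_form = tokens[i + 1][1]
        # Both neighbours must be verbs and must share the same form
        if word.lower() == "å" and left_form in VERB_FORMS and left_form == right_form:
            hits.append(i)
    return hits


# "Vi hopper å spretter." has å between two present-tense verbs
sentence = [("Vi", ""), ("hopper", "Prs"), ("å", ""), ("spretter", "Prs"), (".", "")]
flagged = flag_aa_between_verbs(sentence)
```

A sequence such as present tense + å + infinitive ("Vi prøver å gå") is not flagged by this sketch, because the two verb forms differ.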
Simple punctuation rules showing how to change the lemma in the suggestions:
Quotation mark rule: Use correct quotation mark.
Ellipsis rule: Use the ellipsis character … instead of three full stops (use-ellipsis)
This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3
This disambiguator is based upon the disambiguator from OBT (Oslo-Bergen-taggeren), hereafter OBT-cg. It is adjusted to the GiellaLT FST and extended with several rules. It contains the morphological rules only.
The original OBT disambiguator was written in CG-1 by Kristin Hagen and Anders Nøklestad at UiO. It was translated to CG-2 by Lars Nygård. The conversion to CG-3 and the Tromsø format was done by Trond Trosterud.
This particular file (grc-disambiguator.cg3) is a version of the above adjusted to grammar checker needs. Mainly, disambiguation rules are relaxed or even commented out.
NOTE! For reference, removed rules should be marked with the searchable tag grcremoval
The tagsets are a superset of the OBT and GiellaLT tags, so that the labels are kept from OBT-cg, but GiellaLT content is added when needed.
Amount sets
The PRE-NP-HEAD family of sets
These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.
LIST RCmpnd = RCmpnd ;
SET PRE-NP-HEAD = (Prop Attr) OR A OR ABBR OR Num OR RCmpnd OR CC OR (Pron Dem) OR (Pron Ref) OR Indef OR
The strict version of items that can only be premodifiers, not parts of the predicate
to be used together with PRE-NP-HEAD before @>N is disambiguated
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”)
SET NOT-NPMOD = WORD - PRE-NP-HEAD OR ABBR ;
SET NOT-NPMOD-ACC-ADV = NOT-NPMOD - Acc - Adv OR ABBR ;
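The scan `(*1 N BARRIER NOT-NPMOD)` can be emulated in a few lines of Python to show what "scan to the next NP head" means. This is only a sketch under stated assumptions: cohorts are modelled as plain sets of tag strings, `PRE_NP_HEAD_TAGS` is a simplified stand-in for the full PRE-NP-HEAD set, `WORD - PRE-NP-HEAD OR ABBR` is read as `(WORD - PRE-NP-HEAD) OR ABBR`, and the target is assumed to be tested before the barrier at each position.

```python
from typing import List, Optional, Set

# Simplified stand-in for PRE-NP-HEAD; the real set has more members
PRE_NP_HEAD_TAGS = {"A", "ABBR", "Num", "CC", "Indef"}


def is_pre_np_head(tags: Set[str]) -> bool:
    return bool(tags & PRE_NP_HEAD_TAGS)


def is_not_npmod(tags: Set[str]) -> bool:
    # Mirrors "WORD - PRE-NP-HEAD OR ABBR", read as (WORD - PRE-NP-HEAD) OR ABBR
    return (not is_pre_np_head(tags)) or "ABBR" in tags


def next_np_head(cohorts: List[Set[str]], start: int) -> Optional[int]:
    """Emulate (*1 N BARRIER NOT-NPMOD): scan rightwards from start + 1."""
    for i in range(start + 1, len(cohorts)):
        if "N" in cohorts[i]:         # assumption: target is checked before the barrier
            return i
        if is_not_npmod(cohorts[i]):  # anything that cannot premodify blocks the scan
            return None
    return None


# premodifier, adjective, noun: the scan reaches the noun
noun_phrase = [{"Indef"}, {"A"}, {"N"}]
# premodifier, verb, noun: the verb acts as a barrier
interrupted = [{"Indef"}, {"V"}, {"N"}]
```

Because ABBR is added back after the subtraction, an abbreviation blocks the scan in this model even though it is also listed in PRE-NP-HEAD.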
GRADE-ADV
NotAbbr removes abbreviation readings whenever there are alternative readings
AbbrBeforePara removes CLB before CLB
Nynorsk removes all +Nynorsk forms (they are in use only for the dictionary interface, and that does not use disambiguation).
aa
aaIM selects +IM for å
The bulk of the file contains rules from the original OBT file.
minweight selects reading with lowest weight.
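Selecting by weight can be pictured as follows. This is a hedged Python sketch, not the CG rule itself: the `(analysis, weight)` pair layout, the invented example readings and the choice to keep all readings that tie for the lowest weight are assumptions for illustration.

```python
from typing import List, Tuple

# (analysis string, weight); hypothetical layout for this sketch
Reading = Tuple[str, float]


def select_min_weight(readings: List[Reading]) -> List[Reading]:
    """Keep only the readings whose weight equals the lowest weight present."""
    if not readings:
        return []
    lowest = min(weight for _, weight in readings)
    # Ties are kept here; the actual rule may behave differently
    return [reading for reading in readings if reading[1] == lowest]


# Invented readings for one cohort, with invented weights
cohort = [("reading-a", 0.0), ("reading-b", 2.5), ("reading-c", 0.0)]
kept = select_min_weight(cohort)
```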
This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3
Usage:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
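Both code points are invisible in running text, which is why they are easy to overlook. The short Python sketch below looks up their Unicode names and locates them in a string; the helper `invisible_positions` is hypothetical and only meant for inspecting input, it does not reproduce anything the tokeniser does.

```python
import unicodedata
from typing import List, Tuple

# The two characters listed above
ADJACENT_CHARS = ["\u00ad", "\ufeff"]

# Map "U+XXXX" labels to the official Unicode character names
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in ADJACENT_CHARS}


def invisible_positions(text: str) -> List[Tuple[int, str]]:
    """Return (index, code point label) for each listed character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in ADJACENT_CHARS
    ]


found = invisible_positions("ja\u00adja")
```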
.Whitespace contains ASCII white space and the List contains some unicode white space characters
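Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a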
Unknowns are made of:
Unknowns are tagged ?? and treated specially with hfst-tokenise
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it is safer to let hfst-tokenise handle them.
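The handling of empty analyses described here can be sketched in Python. This is an illustration of the described behaviour only, not hfst-tokenise's implementation: the function `clean_readings`, the reading strings and the `UNKNOWN_MARK` placeholder are assumptions made for this example.

```python
from typing import List

# Placeholder marker for this sketch only
UNKNOWN_MARK = "?"


def clean_readings(wordform: str, analyses: List[str]) -> List[str]:
    """Drop empty analyses; if nothing is left, emit a single unknown reading."""
    non_empty = [analysis for analysis in analyses if analysis.strip()]
    if non_empty:
        # Real analyses exist, so the empty ones are simply removed
        return non_empty
    # Only empty analyses: treat the word as unknown, baseform = wordform
    return [f'"{wordform}" {UNKNOWN_MARK}']


known = clean_readings("ja", ['"ja" CC', ""])
unknown = clean_readings("ukjend", [""])
```

Doing this before the CG step means no rule ever has to test a reading that carries no tags.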
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
TODO: Could use something like this, but built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it is safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a
TODO: Could use something like this, but built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it is safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get
This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript