Finite state and Constraint Grammar based analysers, proofing tools and other resources
+CLBfinal Sentence final abbreviated expression ending in full stop, so that the full stop is ambiguous
+Sg3 : This is inherited from common files, should be changed to +3Sg.
+Arab sub-pos
+Coll sub-pos
+Ine Sámi case, to be removed
+MWE multiword expression
+Rom check these XXX
+Der/Adv derivation to Adverb
+Sem/Fem
+Sem/Year - year (i.e. 1000 - 2999), used only for numerals
+Sem/Txt
a3 This is for a special a-umlaut case a3:ø (normal: a:o)
%^PASS : todo ,
%> : Suffix boundary ,
Language tags
The tags are of the following form:
This entry / word should be in the following position(s):
+Use/Circ = for compound restrictions
+Use/PMatch means that the following is only used in the analyser feeding the disambiguator. This is missing.
+Use/-TTS – never retained in the HFST Text-To-Speech disambiguation tokeniser
Todo: Check whether these can be removed. They are probably obsolete.
%[%>%] - Literal >
%[%<%] - Literal <
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.
Flag | Explanation |
---|---|
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
@D.ErrOrth.ON@ | |
@C.ErrOrth@ | |
@P.ErrOrth.ON@ | |
@R.ErrOrth.ON@ | |
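The interplay of the NeedNoun flags can be illustrated with a minimal Python sketch of the standard flag diacritic operators P (set), C (clear), D (disallow) and R (require). This is not the HFST implementation. The function name `path_is_allowed` and the placeholder strings `VERB` and `NOUN` are hypothetical, and the assumption is that the P flag is set when a verb enters a compound, the C flag sits on the nominalising derivation, and the D flag guards the end of the word.

```python
import re

# Matches @OP.Feature@ and @OP.Feature.Value@ for the four operators used here.
FLAG_RE = re.compile(r"@([PDCR])\.([^.@]+)(?:\.([^.@]+))?@")

def path_is_allowed(path: str) -> bool:
    """Return True if no flag diacritic on this lexicon path blocks it."""
    state = {}
    for op, feature, value in FLAG_RE.findall(path):
        current = state.get(feature)
        if op == "P":      # positive set: store the value
            state[feature] = value
        elif op == "C":    # clear: forget the feature
            state.pop(feature, None)
        elif op == "D":    # disallow: fail if the feature has this value
            if current is not None and (value == "" or current == value):
                return False
        elif op == "R":    # require: fail unless the feature has this value
            if current is None or (value != "" and current != value):
                return False
    return True

# Hypothetical paths: a verb inside a compound, with and without nominalisation.
print(path_is_allowed("VERB@P.NeedNoun.ON@NOUN@D.NeedNoun.ON@"))
print(path_is_allowed("VERB@P.NeedNoun.ON@@C.NeedNoun@NOUN@D.NeedNoun.ON@"))
```

Under these assumptions the first path is blocked, because NeedNoun is still ON when the D flag is reached, while the second passes, because the C flag has cleared it.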
Set flag for compounds
Flag | Example word |
---|---|
@P.Case.MscNom@ | fyrstiflokkur |
@P.Case.MscObl@ | fyrstaflokk |
@P.Case.FemNom@ | lítlasystir |
@P.Case.FemObl@ | lítluusystur |
@P.Case.Neu@ | breiðaskarð |
@P.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds (R = require)
Flag | Example word |
---|---|
@R.Case.MscNom@ | fyrstiflokkur |
@R.Case.MscObl@ | fyrstaflokk |
@R.Case.FemNom@ | lítlasystir |
@R.Case.FemObl@ | lítluusystur |
@R.Case.Neu@ | breiðaskarð |
@R.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds (U = unify)
Flag | Example word |
---|---|
@U.Case.MscNom@ | fyrstiflokkur |
@U.Case.MscObl@ | fyrstaflokk |
@U.Case.FemNom@ | lítlasystir |
@U.Case.FemObl@ | lítluusystur |
@U.Case.Neu@ | breiðaskarð |
@U.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
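How the Case flags in the three tables above enforce agreement between the parts of a compound can be sketched in Python as follows. The sketch models only the general semantics of the P, R and U operators. Which lexicon entries actually carry which flag is an assumption here, and `apply_flag` and `accepts` are hypothetical helper names.

```python
def apply_flag(state, op, feature, value):
    """Apply one P, R or U flag diacritic; return False if the path is blocked."""
    if op == "P":   # positive set: overwrite the feature
        state[feature] = value
        return True
    if op == "R":   # require: the feature must already have this value
        return state.get(feature) == value
    if op == "U":   # unify: set if unset, otherwise demand the same value
        return state.setdefault(feature, value) == value
    raise ValueError(f"unsupported operator: {op}")

def accepts(flags):
    """Check a sequence of (operator, feature, value) triples along one path."""
    state = {}
    return all(apply_flag(state, *flag) for flag in flags)

# Assumed flag placement: the first part sets the case, the last part checks it.
print(accepts([("P", "Case", "MscNom"), ("R", "Case", "MscNom")]))
print(accepts([("P", "Case", "MscNom"), ("R", "Case", "MscObl")]))
print(accepts([("U", "Case", "FemObl"), ("U", "Case", "FemObl")]))
```

With this placement, a path where both parts agree on MscNom (as in fyrstiflokkur) is accepted, whereas a path mixing MscNom and MscObl is rejected. The U variant gives the same agreement check without fixing which part sets the value first.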
Flag diacritic look-alikes for grammar checker & tokenisation purposes
Flag | Explanation |
---|---|
@P.Pmatch.Loc@ | Location in string used or parsed by hfst-pmatch |
@P.Pmatch.Backtrack@ | Also for hfst-pmatch |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
Flag | Explanation |
---|---|
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
Flag | Explanation |
---|---|
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
Flag diacritic | Explanation |
---|---|
@U.number.one@ | Flag used to give Arabic numerals different cases in smj |
@U.number.two@ | Flag used to give Arabic numerals different cases in smj |
@U.number.three@ | Flag used to give Arabic numerals different cases in smj |
@U.number.four@ | Flag used to give Arabic numerals different cases in smj |
@U.number.five@ | Flag used to give Arabic numerals different cases in smj |
@U.number.six@ | Flag used to give Arabic numerals different cases in smj |
@U.number.seven@ | Flag used to give Arabic numerals different cases in smj |
@U.number.eight@ | Flag used to give Arabic numerals different cases in smj |
@U.number.nine@ | Flag used to give Arabic numerals different cases in smj |
@U.number.zero@ | Flag used to give Arabic numerals different cases in smj |
This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.
Lexicon Acronyms is split in two:
And this is the ENDLEX of everything:
@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;
The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged with +CmpNP/Only from ending here.
The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
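As a rough illustration of the ENDLEX line above, the following Python sketch reads the three D flags from that line and reports which flag states may continue to ENDLEX2. It models only the usual meaning of a D flag with a value (fail if the feature currently has that value). The function name `may_reach_endlex2` and the example states are hypothetical.

```python
import re

ENDLEX_LINE = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;"

# Collect the (feature, value) pairs that the D flags on the line disallow.
DISALLOWED = re.findall(r"@D\.(\w+)\.(\w+)@", ENDLEX_LINE)

def may_reach_endlex2(state):
    """Return True if none of the disallowed feature values is currently set."""
    return all(state.get(feature) != value for feature, value in DISALLOWED)

# Hypothetical flag states at the end of a word.
print(may_reach_endlex2({}))                     # no flags set
print(may_reach_endlex2({"CmpOnly": "FALSE"}))   # compound-only word used alone
print(may_reach_endlex2({"CmpOnly": "TRUE"}))    # same word after compounding
print(may_reach_endlex2({"NeedNoun": "ON"}))     # verb not yet nominalised
```

In this sketch only the second and fourth states are blocked. The other two may continue to ENDLEX2.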
This (part of) documentation was generated from src/fst/morphology/root.lexc