Faroese NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-fao

Faroese morphological analyser

Definitions for Multichar_Symbols

Tags for POS

Semantic tags

Non-changing letters

Triggers for Morphophonology

Language tags

Non-ascii letters, perhaps needed as multichar symbols

Compounding tags

The tags are of the following form:

This entry / word should be in the following position(s):

Usage tags

Symbols that need to be escaped on the lower side (towards twolc):

Todo: Check whether these can be removed. They are probably obsolete.

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

@P.NeedNoun.ON@ Sets the flag: (dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ Blocks the path while the flag is ON: (dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ Clears the flag: (dis)allow compounds with verbs unless nominalised
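
The interplay of these three flags can be illustrated with a small simulation. The Python sketch below is not part of the lexc source. It assumes the usual flag-diacritic semantics (P sets a feature to a value, D blocks the path when the feature has that value, C clears the feature). The helper name `path_allowed` and the two example paths are hypothetical, as is the assumption that a verb stem sets the flag and a nominalising suffix clears it.

```python
import re

# Matches flag diacritics such as @P.NeedNoun.ON@ or @C.NeedNoun@.
FLAG = re.compile(r"@([PDC])\.([^.@]+)(?:\.([^.@]+))?@")

def path_allowed(path):
    """Return True if the flag diacritics on a path are mutually consistent.

    Only the P (set), D (disallow) and C (clear) operators are modelled.
    """
    state = {}  # feature name -> current value
    for op, feature, value in FLAG.findall(path):
        value = value or None  # findall yields "" when the flag has no value
        if op == "P":
            state[feature] = value
        elif op == "C":
            state.pop(feature, None)
        elif op == "D":
            if value is None:
                # @D.Feature@ blocks whenever the feature is set at all.
                if feature in state:
                    return False
            elif state.get(feature) == value:
                return False
    return True

# Hypothetical paths (assumption): a verb stem sets NeedNoun,
# a nominalising suffix clears it, and the end of the word checks it.
verb_in_compound = "@P.NeedNoun.ON@ verb-stem @D.NeedNoun.ON@"
nominalised_verb = "@P.NeedNoun.ON@ verb-stem @C.NeedNoun@ noun-suffix @D.NeedNoun.ON@"

print(path_allowed(verb_in_compound))  # False
print(path_allowed(nominalised_verb))  # True
```

In this reading, the D flag at the end of the word rejects any path on which NeedNoun is still ON, so only the nominalised verb survives.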

Flags for speller suggestions

@D.ErrOrth.ON@
@C.ErrOrth@
@P.ErrOrth.ON@
@R.ErrOrth.ON@

Flag for case harmony in compounds

Set flag for compounds

Flag Example word
@P.Case.MscNom@ fyrstiflokkur
@P.Case.MscObl@ fyrstaflokk
@P.Case.FemNom@ lítlasystir
@P.Case.FemObl@ lítlusystur
@P.Case.Neu@ breiðaskarð
@P.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

Control flag values for compounds

Flag Example word
@R.Case.MscNom@ fyrstiflokkur
@R.Case.MscObl@ fyrstaflokk
@R.Case.FemNom@ lítlasystir
@R.Case.FemObl@ lítlusystur
@R.Case.Neu@ breiðaskarð
@R.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

Control flag values for compounds

Flag Example word
@U.Case.MscNom@ fyrstiflokkur
@U.Case.MscObl@ fyrstaflokk
@U.Case.FemNom@ lítlasystir
@U.Case.FemObl@ lítlusystur
@U.Case.Neu@ breiðaskarð
@U.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð
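
How the set, require and unify variants of the Case flag could enforce harmony between the two parts of a compound can be sketched in Python. This is an illustration under assumptions, not the analyser's implementation: it assumes the standard flag-diacritic semantics (P sets a value, R requires that value, U sets the value if the feature is unset and otherwise requires equality). The function name `harmonises` and the way the example words are split into flag sequences are hypothetical.

```python
def harmonises(flags):
    """Check a sequence of (operator, feature, value) flag triples.

    Models the P (set), R (require) and U (unify) operators only.
    """
    state = {}  # feature name -> current value
    for op, feature, value in flags:
        current = state.get(feature)
        if op == "P":
            state[feature] = value
        elif op == "R":
            # Require fails if the feature is unset or has another value.
            if current != value:
                return False
        elif op == "U":
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True

# Hypothetical flag sequences (assumption): the first part of the
# compound contributes the first flag, the last part the second.
fyrsti_flokkur = [("P", "Case", "MscNom"), ("R", "Case", "MscNom")]
fyrsti_flokk = [("P", "Case", "MscNom"), ("R", "Case", "MscObl")]
fyrsta_flokk = [("U", "Case", "MscObl"), ("U", "Case", "MscObl")]

print(harmonises(fyrsti_flokkur))  # True
print(harmonises(fyrsti_flokk))    # False
print(harmonises(fyrsta_flokk))    # True
```

The mismatched pair is rejected because the last part requires a case value that the first part did not set.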

Flag diacritic look-alikes for grammar checker & tokenisation purposes

Flag Explanation
@P.Pmatch.Loc@ Location in string used or parsed by hfst-pmatch
@P.Pmatch.Backtrack@ Also for hfst-pmatch

Flags for compound restriction

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Flag diacritic Explanation
@U.number.one@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.two@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.three@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.four@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.five@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.six@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.seven@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.eight@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.nine@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases
@U.number.zero@ Flag used to give Arabic numerals in smj (Lule Sámi) different cases

Lexicon Root

This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.

Lexicon Acronyms is split in two:

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
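
The filtering done by the ENDLEX entry can be pictured as a check of the accumulated flag state against each D flag in turn. The Python sketch below is an illustration only. It assumes the standard meaning of D (the path is blocked when the feature currently has the given value), and the function name `blocking_flags` and the example states are hypothetical. In particular, it is an assumption that a +CmpNP/Only word carries CmpOnly=FALSE until it passes through R, where @P.CmpOnly.TRUE@ overwrites the value.

```python
import re

# The flag sequence from the ENDLEX entry shown above.
ENDLEX_ENTRY = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@"

def blocking_flags(state, entry=ENDLEX_ENTRY):
    """Return the D flags in `entry` that reject the given flag state.

    `state` maps feature names to their current values; an empty
    result means the path may continue to ENDLEX2.
    """
    blocked = []
    for feature, value in re.findall(r"@D\.([^.@]+)\.([^.@]+)@", entry):
        if state.get(feature) == value:
            blocked.append(f"@D.{feature}.{value}@")
    return blocked

# Hypothetical flag states at the point where a path reaches ENDLEX.
print(blocking_flags({}))                      # []
print(blocking_flags({"CmpOnly": "FALSE"}))    # ['@D.CmpOnly.FALSE@']
print(blocking_flags({"CmpOnly": "TRUE"}))     # []
print(blocking_flags({"NeedNoun": "ON", "CmpPref": "TRUE"}))
# ['@D.CmpPref.TRUE@', '@D.NeedNoun.ON@']
```

A path with no relevant features set passes all three checks, which matches the remark that unused flags do no harm.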


This (part of) documentation was generated from src/fst/morphology/root.lexc