Faroese NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-fao

Faroese language model documentation

All doc-comment documentation in one large file.


src-cg3-disambiguator.cg3.md

Faroese disambiguator

Usage, in lang-fao: cat text.txt | hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g src/cg3/disambiguator.cg3

This file documents the Faroese disambiguator file, src/cg3/disambiguator.cg3.

Delimiters, tags and sets

Test: Go for minimal weight. This rule gives priority to lexicalised forms.

MAPPING OF CC AND CS

Mostly we map both @CNP and @CVP. We then select @CNP, and after that we remove the remaining @CNP tags so that @CVP remains.
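The map, select, remove order described above can be illustrated with a small Python sketch. It is not a rendering of the rules in disambiguator.cg3: the cohort representation, the function names and the condition "the next word is a noun" are assumptions made only for this example.

```python
# Toy model of the map / select / remove sequence for conjunction tags.
# The "next word is a noun" condition is a made-up stand-in for the real
# CG-3 contexts, which are not shown in this documentation.

def map_tags(cohorts):
    """Step 1: give every conjunction both candidate tags."""
    for c in cohorts:
        if c["pos"] in ("CC", "CS"):
            c["tags"] = {"@CNP", "@CVP"}
    return cohorts

def select_cnp(cohorts):
    """Step 2: keep only @CNP where the (hypothetical) NP context holds."""
    for i, c in enumerate(cohorts):
        nxt = cohorts[i + 1] if i + 1 < len(cohorts) else None
        if "@CNP" in c.get("tags", set()) and nxt and nxt["pos"] == "N":
            c["tags"] = {"@CNP"}
    return cohorts

def remove_cnp(cohorts):
    """Step 3: drop @CNP wherever it is still ambiguous, leaving @CVP."""
    for c in cohorts:
        if c.get("tags") == {"@CNP", "@CVP"}:
            c["tags"] = {"@CVP"}
    return cohorts

def tag_conjunctions(cohorts):
    return remove_cnp(select_cnp(map_tags(cohorts)))

# Placeholder word forms; only the part-of-speech labels matter here.
sentence = [
    {"form": "w1", "pos": "N"},
    {"form": "w2", "pos": "CC"},
    {"form": "w3", "pos": "N"},
    {"form": "w4", "pos": "CC"},
    {"form": "w5", "pos": "V"},
]
result = tag_conjunctions(sentence)
```

In this toy sentence the first conjunction ends up with @CNP and the second with @CVP, because only the first one is followed by a noun.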


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions.cg3.md

SYNTACTIC FUNCTIONS FOR FAROESE

Sámi language technology project 2003-2014, University of Tromsø

This file adds syntactic functions. It was copied from sme.

Syntactic sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
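The barrier scan can be pictured with the following Python sketch. It is only an illustration of the idea: each cohort is reduced to a set of tags, the premodifier tags are hypothetical, and the sketch assumes that the target (N) is tested before the barrier, so that the noun itself does not block the scan.

```python
def scan_to_np_head(cohorts, start, premodifier_tags):
    """Return the index of the first noun to the right of `start`, or None.

    Mirrors (*1 N BARRIER NOT-NPMOD): the scan passes over premodifiers and
    gives up at the first word that is neither a noun nor a premodifier.
    """
    for i in range(start + 1, len(cohorts)):
        tags = cohorts[i]
        if "N" in tags:                      # target found: the NP head
            return i
        if not (tags & premodifier_tags):    # NOT-NPMOD: barrier, stop
            return None
    return None

PREMODIFIERS = {"Det", "A", "Num"}  # hypothetical premodifier tags

# verb, determiner, adjective, noun, verb, noun
cohorts = [{"V"}, {"Det"}, {"A"}, {"N"}, {"V"}, {"N"}]
head_from_verb = scan_to_np_head(cohorts, 0, PREMODIFIERS)
head_from_noun = scan_to_np_head(cohorts, 3, PREMODIFIERS)
```

From the first verb the scan reaches the noun at index 3, because only premodifiers intervene. From that noun the scan fails, because the following verb acts as a barrier.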

HNOUN MAPPING

The leftovers are tagged @X

missingX adds @X to all words that are missing a tag

therestX adds @X to everything that is left, often erroneously disambiguated forms


This (part of) documentation was generated from src/cg3/functions.cg3


src-cg3-korp.cg3.md

SYNTACTIC FUNCTIONS FOR FAROESE

Sámi language technology project 2003-2014, University of Tromsø

This file adds syntactic functions. It was copied from sme.

Syntactic sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).

HNOUN MAPPING

The leftovers are tagged @X

missingX adds @X to all words that are missing a tag

therestX adds @X to everything that is left, often erroneously disambiguated forms


This (part of) documentation was generated from src/cg3/korp.cg3


src-fst-morphology-affixes-abbreviations.lexc.md

Abbreviation affixes

Now splitting according to POS, and according to dot or not

First collecting POS info, *-noun, *-adv, etc. Also splitting when in doubt: -noun-adj => -noun and -adj. Then pointing to two contlexes, a dot-one and a non-dot-one.

Lexicons without final period

Lexicons with final period


This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


src-fst-morphology-affixes-acronyms.lexc.md

North Saami acronyms - affix part

The lexica giving tags and suffixes to the acronyms


This (part of) documentation was generated from src/fst/morphology/affixes/acronyms.lexc


src-fst-morphology-affixes-adjectives.lexc.md

Adjective morphology

Ad hoc lexica

The lexicons

Irregular adjectives

Irregular comparatives

Intermediate adjectival lexica

Adjectival case lexica

Msc

Neu

Definite declension

Positiv, def, u-umlj Msc

Fem

Neu

Positiv, def, ø-umlj Msc

Fem Neu

Gender tags

Case tags

Compound flags

Comparative

Superlative


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-nouns.lexc.md

Faroese Noun morphology

This file contains the inflection suffixes for the Faroese nouns. The inflection classes are identical to the ones in Føroysk orðabók.

The morphology is ordered in three layers.

Layer 1: Basic noun lexica

The nominal morphology is added in three layers. In this first layer we add gender tags and morphophonological diacritics. The next two layers are for indefinite and definite suffixes, respectively.

Lexicons still to be allocated

We first list 4 lexica for words waiting to be checked.

Irregular nouns

These are lexica with number 0; they have no inflectional morphology.

Lexica for words belonging to two paradigms.

These are simply split (h11/12 to h11 and h12, etc.).

The ordinary lexica

These lexica split into sg and pl lexica, and add +N and gender tags. Thereafter they point to Layer 2, the case suffixes.

Lexica for weak masculines.

Lexica for strong masculines

Lexica for feminines

Lexica for Neuter nouns

Layer 2: Case inflection

This is the second layer. Here we do indefinite forms and compounds.

Lexica for masculine nouns

Lexica for weak case suffixes.

Singular

Plural

Strong case suffixes

Nominative Sg

Accusative Sg

Dative Sg

Genitive Sg

Plural forms

Nominative

Accusative

Dative

Genitive

Feminine forms

Singular case suffixes.

Nominative

Oblique

Plural case suffixes

Neuter forms

Singular

Layer 3: Definite inflection

This is the third layer. Here we do the indefinite and definite forms. These are common to (almost) all different paradigms, hence they are gathered here.

Masculine forms

Masc def sg

Masc def pl

Feminine forms

Fem Sg

Feminine plural forms

Neuter forms

Neuter sg

This concludes the nominal morphology.

Compound flags

The rest of the file contains flags that govern the ways stems may be combined.


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-numerals.lexc.md

Numeral affixes

This lexicon just goes to #, in order to coexist with the number files in giella-shared. Those files are relevant for Sámi, not for Faroese.

Lexica:


This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper nouns

Table of content

The morphological tags

For each group, the maltag etc. lexicon functions as a default lexicon. The other lexica are there for specific subgroups of the names.

Indeclinables

Male first names

Female first names

Surnames

Place names and other names


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Verb morphology

s1 nevna = works

s2 keypa = works

SETA seta = works

s3 leiða = works

s4 frøa = works

s5 senda = works

s6 hirða = works

s7 gista = works

s8 kenna = works

s9 klippa = works

s10 fylgja = works

s11 roykja = works

s12 boyggja = works

s13 søkkja = works

s14 heingja = works

s15 skeinkja = works

s15_2 steikja = works

s16 flekja = works

s17 berja = works

s18 krevja = works

s19 dylja = works

s20 leggja = works

s21 selja = works

s22 ryðja = does not work in sup and prfptc

s22_1 ýðja = works

s23 smyrja = works

s24 flysa = does not work in pass

s25 liva = works

s26 plaga = works (the form plagdur is missing)

s26_1 mála->máldi

s27 spáa = works

s28 skaða = does not work in prfptc

s29 brúka = works

s30 kalla = works

s31 only gera and *gera = works

s32/30 útbúgva = works

s32 búgva = works

s33 rógva

s34 goyggja = works

Strong verbs starting here

s35 bíta = works

s36 svíkja = works

s37 bróta = works

s38 skjóta = works

s39d

s39s

s39

s40 fúka

s41 flúgva

s42 klúgva

s44 finna

s45 binda = works

s46 stinga = works

s47 svimja = works … but perhaps it should not have a passive

s48 drekka = does not work in adj because of dpkons

s48_2 renna = does not work in adj because of dpkons

s49 detta = does not work in adj because of dpkons

s49_2 treffa = does not work in adj because of dpkons

s49_3 sleppa = does not work in adj because of dpkons

s49_4 verpa = works

s50 røkka = does not work in adj because of dpkons

s51 ganga = works

s52 veva = works

s53 leypa = works

s54 bera = works

s55 fara = works

s56 geva = works

s57 sita = does not work + should probably not have a passive

s58 mala

s59 stjala

s60 taka, aka

s61 halda

s62 sova

s63 koma

s64 lata

s64_1 láta

s65 standa

s66 biðja

s67 draga

s68 hvørva

s69 sláa

s70 siga

s71 skerja

s72 eta

s73 læa

Ad hoc, irregular

BLÍVA

EIGA

EITA

GRÁTA

HAVA

KUNNA

MEGA

MUNNA

SKULA

TYKJA

VERA

VERÐA

VILJA

VITA

SÍGGJA

FÁA

NÁA XXX check

LIGGJA

RADA

BURDA

GJALDA

VALDA

FALLA

GJALLA

BREGDA

SYNGJA XXX check

HOGGA høgga

KVODA

FLYGGJA

VAKSA

VEKSA

s30/26_1 dáma

HYGGJA

TYGGJA

MYLA

BLASA

TYSJA

GROA

KVOTTA

GALDA

TAKAST

LOYPAST loypast

sxrefl This is an ad hoc lexicon

s74 grindast

s75 balast

s76 ræðast

s77 skiftast

s78 farast

s79 skjótast

s80 trivast

s81 kíkjast

s82 fýlast

s83 samsinnast

FYRIB copy, s83

Split lexica

s8/48_2 s9/30

Intermediate lexicon groups

standard_ir

standard_ir_t

ir_verb

ir_verb_t

Suffix lexica

Infinitive

jinf

inf

reflinf

Present

pres_ir

pres_ir_j2

pres_jir

pres_ir_sg

pres_ar

pres_ur

pres_iur

pres_ur_j

pres_ur_j2

pres_strong_s1

pres_strong_s23

pres_strong_s23_t

pres_strong_s23_t0

pres_strong_s23_t1

pres_pl

pres_ast

pres_ist

pres_1ist

pres_23st

pres_plast

pret_adist

pret_dist

pret_tist

pret_ist

pret_st

pret_plust

pret_pltust

Preterite

prt_d

prt_ð

prt_t

prt_ði

prt_ti

prt_du

prt_tu

prt_ðu

prt_dd

prt_a

prt_null

prt_null_s

prt_null_s2

prt_null_s2_t

prt_u_p

Passive lexica

Imperative and present participle

imp_prsptc

imp_prsptc_j

imp

imp_j

impsg

imppl

imppl_j

prsptc

Supine and preterite participle

sup

sup_t

sup_tt

sup_a kalla

sup_null stungið

sup_in kalla

sup_ið_in stungið

Middle lexicon

VANDI

Perfect Participles

p18

p26

p26_2

p34_6

p34_7

p32

p39

p5pos

p5

p6

p7

p8


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-compounding.lexc.md

Compounding morphology

Lexicon R gets flags and sends compounds over to RReal

@P.CmpFrst.FALSE@@P.CmpPref.FALSE@@D.CmpLast.TRUE@@D.CmpNone.TRUE@@U.CmpNone.FALSE@@P.CmpOnly.TRUE@ RReal ; are Flags to control compounding

Lexicon RReal is the lexicon for the Cmp tag and resending to N, A

Lexicon R- for compounds with hyphen

+Cmp#:%- Nouns ;
+Cmp#:%- Adjectives ;

Lexicon RNum for compounds numeral + noun

  +Use/SpellNoSugg+Cmp/Hyph+Cmp#:-# Nouns ;    For Num Cmp Noun, we do not want Num Cmp Num

This (part of) documentation was generated from src/fst/morphology/compounding.lexc


src-fst-morphology-phonology.twolc.md

The Faroese morphophonological file

This file documents the phonology.twolc file

Alphabet

Here we declare all symbols.

Sets

Here we define some convenient sets.

Rules

These are the rules. After each rule (or rather: after many of the rules) there are test cases that are there to test whether the rules work.

Verschärfung

Rule: Deleting g

Rule: ng to kk Part 1 changes n to k in ng:kk before ^NGKK trigger

Rule: ng to kk Part 2 changes g to k in ng:kk before ^NGKK trigger
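The combined effect of the two ng to kk rules can be sketched in Python as a plain string rewrite. This is an illustration under assumptions, not the twolc rules themselves: it assumes that the ^NGKK trigger stands directly after the ng sequence, and the input string below is a hypothetical intermediate form chosen for the example.

```python
TRIGGER = "^NGKK"

def apply_ng_to_kk(form: str) -> str:
    """Rewrite ng as kk when it stands directly before the trigger."""
    # Part 1 (n -> k) and Part 2 (g -> k) are applied together here.
    out = form.replace("ng" + TRIGGER, "kk")
    # A trigger that did not follow ng is simply removed from the surface form.
    return out.replace(TRIGGER, "")

surface = apply_ng_to_kk("geng^NGKK")
```

In twolc each rule governs a single symbol pair, which is why the change is split into two parts there; the sketch merges them only for brevity.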

Rule: Deleting v in gv sequences Verschärfung II gives v:0 for gv:00 before ^GVDEL and in some other contexts

Verschärfung tests:

Rule: Deleting r in Genitive of ur stems

Rule: Deleting m in um%>num

Tests:

Rule: Deleting Double Consonant in Front of Consonant

The preceding rule is fishy: the test cases below don’t fit the context requirements, and the >s# in the right context seems to indicate passive. The rule conflicts with the “Cns Deletion in front of Pass” rule at the end of the file, but only when using the Xerox tools! XXX - please have a look!

Tests:

Verbal Sandhi rules

Rule: Geminate Assimilation in Past Tense d

Rule: Geminate Assimilation in Past Tense t

Tests:

Rule: ð Assimilation in Front of Dental Past Suffix -d(i)

Tests:

Rule: Deleting Double Consonant in Front of Epenthesis mark

Tests:

Rule: Deleting stem-final s in s genitive

Tests:

Rule: Double ð Deletion

Rule: ð Assimilation in Front of Supine Suffix -t

Tests:

Rule: Adjusting Dental Past Suffix -d(i)

Tests:

Adjectival sandhi rules

Rule: Adjective neuter after nlr 1

Rule: Adjective neuter after nlr 2

Tests:

Rule: t Deletion in Neuter

j rules

Rule: Deleting j

Tests:

Rule: Realising j in front of vowels

Tests:

Vowel rules

Rule: Realising i2 as i

Tests:

Epenthetic vowel rules

Rule: Epenthetic deletion

Tests:

Rule: U-umlaut of Epenthetic vowel

Tests:

Umlaut rules

Rule: U-umlaut in Front of Nasal

Tests:

Rule: General U-umlaut

Tests:

Rule: U-umlaut for akur

Tests:

Rule: I-umlaut

Tests:

Rule: eI-umlaut for o:e, á:e, i:e

Rule: I-umlaut for bróðir

Rule: Inverted U-umlaut from ø

Tests:

Rule: Inverted U-umlaut from o

Tests:

Rule: o/ei-Umlaut I

Rule: o/ei-Umlaut II

Tests:

Vowel deletion rules

Rule: Vowel deletion in front of na

Verbal vowel alternation rules

Rule: Stem vowel change in Weak Verbs

Tests:

Rule: Stem Vowel Shortening in Supine and Participle

Tests:

Rule: Past tense singular diphthongs I

Rule: Past tense singular diphthongs II

Tests:

Rule: Past tense singular monophthongs

Tests:

Rule: Past tense plural monophthongs

Rule: Past tense plural monophthongs to a

Rule: Supine u

Rule: Supine o

Rule: Supine i

Rule: Present tense ý

Adjectival Sandhi rule

Rule: Vowel shortening in Neuter

Tests:

Other rules

Morphological passive rules

Rule: u in ur Deletion in front of Pass

Rule: r Deletion in front of Pass

Rule: ð Deletion in front of Pass


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Faroese morphological analyser

Definitions for Multichar_Symbols

Tags for POS

Semantic tags

Non-changing letters

Triggers for Morphophonology

Language tags

Non-ascii letters, perhaps needed as multichar symbols

Compounding tags

The tags are of the following form:

This entry / word should be in the following position(s):

Usage tags

Symbols that need to be escaped on the lower side (towards twolc):

Todo: Check whether these can be removed. They are probably obsolete.

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: only allow compounds with verbs if the verb is further derived into a noun again:

@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
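How such flags filter paths can be shown with a small Python evaluator. This is a minimal sketch of the usual flag diacritic operations P (set), D (disallow), R (require), U (unify) and C (clear); the N operation is left out, and the example paths with their stem and suffix placeholders are hypothetical, not taken from the lexc sources.

```python
import re

# Matches @P.Feature.Value@, @D.Feature@, etc. (N flags are left out here).
FLAG = re.compile(r"@([PDRUC])\.([^.@]+)(?:\.([^.@]+))?@")

def path_is_allowed(path: str) -> bool:
    """Check the flag diacritics in `path` from left to right."""
    state = {}                                   # feature -> current value
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":                            # set the feature
            state[feature] = value
        elif op == "C":                          # clear the feature
            state.pop(feature, None)
        elif op == "D":                          # disallow a value (or any value)
            blocked = (current == value) if value else (current is not None)
            if blocked:
                return False
        elif op == "R":                          # require a value (or any value)
            satisfied = (current == value) if value else (current is not None)
            if not satisfied:
                return False
        elif op == "U":                          # unify: set if unset, else compare
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True

# Hypothetical paths: a verb sets NeedNoun, a nominalising suffix clears it.
bare_verb = "verb@P.NeedNoun.ON@@D.NeedNoun.ON@"
nominalised = "verb@P.NeedNoun.ON@noun-suffix@C.NeedNoun@@D.NeedNoun.ON@"
```

With these inputs the bare verb path is rejected at the final @D.NeedNoun.ON@, while the nominalised path passes because @C.NeedNoun@ has cleared the feature first.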

Flags for speller suggestions

@D.ErrOrth.ON@
@C.ErrOrth@
@P.ErrOrth.ON@
@R.ErrOrth.ON@

Flag for case harmony in compounds

Set flag for compounds

Flag Example word
@P.Case.MscNom@ fyrstiflokkur
@P.Case.MscObl@ fyrstaflokk
@P.Case.FemNom@ lítlasystir
@P.Case.FemObl@ lítlusystur
@P.Case.Neu@ breiðaskarð
@P.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

Control flag values for compounds

Flag Example word
@R.Case.MscNom@ fyrstiflokkur
@R.Case.MscObl@ fyrstaflokk
@R.Case.FemNom@ lítlasystir
@R.Case.FemObl@ lítlusystur
@R.Case.Neu@ breiðaskarð
@R.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

Control flag values for compounds

Flag Example word
@U.Case.MscNom@ fyrstiflokkur
@U.Case.MscObl@ fyrstaflokk
@U.Case.FemNom@ lítlasystir
@U.Case.FemObl@ lítlusystur
@U.Case.Neu@ breiðaskarð
@U.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð
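The interplay of the three Case flag types can be sketched in Python as follows. The sketch assumes that the first part of the compound sets the value with P and that the second part checks it with R or U; where exactly the flags sit in the real lexicon entries is not shown in this documentation, so the example strings are hypothetical.

```python
import re

def case_harmony(path: str) -> bool:
    """True if every Case value checked by R or U agrees with what was set."""
    current = None
    for op, value in re.findall(r"@([PRU])\.Case\.(\w+)@", path):
        if op == "P":
            current = value            # the first part sets the value
        elif op == "R" and current != value:
            return False               # required value is not the one set
        elif op == "U":
            if current is None:
                current = value        # unify: adopt the value if unset
            elif current != value:
                return False
    return True

agreeing = "fyrsti@P.Case.MscNom@flokkur@R.Case.MscNom@"
clashing = "fyrsti@P.Case.MscNom@flokk@R.Case.MscObl@"
```

The first string passes, matching the example word fyrstiflokkur. The second is rejected, since a nominative first part is combined with an oblique second part.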

Flag diacritic look-alikes for grammar checker & tokenisation purposes

Flag Explanation
@P.Pmatch.Loc@ Location in string used or parsed by hfst-pmatch
@P.Pmatch.Backtrack@ Also for hfst-pmatch

Flags for compound restriction

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Flag diacritic Explanation
@U.number.one@ Flag used to give arabic numerals in smj different cases ;
@U.number.two@ Flag used to give arabic numerals in smj different cases ;
@U.number.three@ Flag used to give arabic numerals in smj different cases ;
@U.number.four@ Flag used to give arabic numerals in smj different cases ;
@U.number.five@ Flag used to give arabic numerals in smj different cases ;
@U.number.six@ Flag used to give arabic numerals in smj different cases ;
@U.number.seven@ Flag used to give arabic numerals in smj different cases ;
@U.number.eight@ Flag used to give arabic numerals in smj different cases ;
@U.number.nine@ Flag used to give arabic numerals in smj different cases ;
@U.number.zero@ Flag used to give arabic numerals in smj different cases ;

Lexicon Root

This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.

Lexicon Acronyms is split in two:

Lexicon ENDLEX

And this is the ENDLEX of everything:

@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged with +CmpNP/Only to end here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-abbreviations.lexc.md

File containing Faroese abbreviations

Lexica for adding tags and periods

The idea is (or may be) to use both common and language-specific abbreviations.

Splitting in 3 groups, because of the preprocessor

Abbreviation

dot% noStb.db Abbreviations that never induce sentence boundaries. The file is too large and should be shrunk.


This (part of) documentation was generated from src/fst/morphology/stems/abbreviations.lexc


src-fst-morphology-stems-adjectives.lexc.md

Faroese adjectives

The adjectives and their inflectional codes are taken from “Føroysk orðabók”.

The list of adjectives

Adjectives for the list of adjectives

Irregular comparatives and superlatives

Prefixed present participles

Regular adjectives, systematic list


This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


src-fst-morphology-stems-adpositions.lexc.md

Faroese prepositions

We should eventually have syntactic tags here…

Tags

p for the tag +Pr

The list of prepositions

Preposition for the list of prepositions, ordered according to case they select for.

Foreign

Several cases

Accusative or dative

Accusative or genitive

Accusative

Dative


This (part of) documentation was generated from src/fst/morphology/stems/adpositions.lexc


src-fst-morphology-stems-adverbs.lexc.md

Faroese adverbs

adv for the tag +Adv

advcomp for the tag +Adv+Cmp

advsuperl for the tag +Adv+Superl

Adverb for the list of approx. 1000 adverbs


This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


src-fst-morphology-stems-conjunctions.lexc.md

The Faroese conjunctions

The file stems/conjunctions.lexc contains two lexica:

LEXICON CCtag for assigning the +CC tag to all the conjunctions below. It has one entry:

LEXICON Conjunction for the list of 10 or so conjunctions that are found in the file. Here are the first entries:


This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


src-fst-morphology-stems-determiners.lexc.md

Faroese determiners


This (part of) documentation was generated from src/fst/morphology/stems/determiners.lexc


src-fst-morphology-stems-fao-acronyms.lexc.md

Acronyms

This documents the stems/fao-acronyms.lexc file. Most acronyms are taken from a common generated file, this file is for the Faroese-specific acronyms.

LEXICON Acronym-fao pointing to the lexica

LEXICON Acronym-fao-list for the list itself, currently 2:

Akronymnumeralier for 0-9

anl sends numbers to letter loops; this might be too liberal.


This (part of) documentation was generated from src/fst/morphology/stems/fao-acronyms.lexc


src-fst-morphology-stems-interjections.lexc.md

Interjections

The tag +Interj

Interj

The words

Interjection okey, ááá, aj, huff, …


This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


src-fst-morphology-stems-nouns.lexc.md

Faroese noun stem file

The lexicon names are taken from Føroysk orðabók I-II (FO). Reference is made to Thráinsson & al (“fg”).

Note that in some cases, the lexicon names and stems here deviate from FO. In that case the lexica have names ending in wordforms, written in capital letters.

Short lexica

Shortnouns for 1, 2 and 3 letter nouns excluded from compounding

These are now always excluded from last-part compounding and, in norm, from first-part compounding as well.

The main list of nouns

Here come all the nouns. They are sorted in reverse (by word ending). Lexica whose names begin with x have not been checked manually.

Nouns

The file contains just under 50000 lemmas.


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


src-fst-morphology-stems-numerals.lexc.md

Faroese Numerals

Numeral splitting in types

1-9

TRÝsplit

nsplit

TEXTTENS

TEXTTEENS

basic

EITT

TVEY

TRÝ

PAIRNUM

n

Ordinals

ordinals

ord_decl

ANNAR

ANNARMORPH


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-morphology-stems-pronouns.lexc.md

Faroese pronouns

Pronoun splitting into 4 sublexica:

  1. Personal ;
  2. Reflexive ;
  3. Interrogative ;
  4. Indefinite ;

Personal for the personal pronouns

egtu-obl

okkumtykkum

S_okkumtykkum

3obl

Reflexive

Interrogative

EINHVOR

ANNARHVOR

HANNSJALVUR

Indefinite

ONKUR

NAKAR

BADIR

HVORGIN

EINGIN


This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


src-fst-morphology-stems-propernouns.lexc.md

Proper nouns

Table of content

Splitting into name types

Propernouns splitting in 3 lexica: multipartnames, names, guess

multipartnames contains only 3 names for now

names gives the list of names.


This (part of) documentation was generated from src/fst/morphology/stems/propernouns.lexc


src-fst-morphology-stems-subjunctions.lexc.md

Faroese subjunctions

The file stems/subjunctions.lexc contains three lexica:

LEXICON CStag assigns the +CS tag. It has one entry: +CS: # ;

LEXICON IMtag assigns the +IM tag for the infinitive marker. The entry is: +IM: # ;

LEXICON Subjunction contains the list of some 10-20 CSs. Here are the first 4:


This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


src-fst-morphology-stems-verbs.lexc.md

Faroese verb stems

This file documents the file stems/verbs.lexc

The file contains one lexicon:

LEXICON Verbs = the lexicon containing all verb stems

Some irregular verbs

mega, eiga, eita, gráta, liggja, … and 15 more

some irregular passive verbs

The long verb list

The lexica listed here represent the declension patterns presented in Føroysk orðabók. The lexicon names correspond to the declension codes in the dictionary.

Simple declension class verbs

Still to be classified

Double declension class verbs

Finally some candidates to be considered for verb compounding.


This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


src-fst-phonetics-txt2ipa.xfscript.md

Phonological converter for Faroese

Table below taken from:

Standardising Pronunciation for a Grapheme-to-Phoneme Converter for Faroese

FARSAMPA/IPA table

Phoneme class Orthography FARSAMPA IPA
Stops p p
  b b p
  t t
  d d t
  k k
  g g k
Fricatives f f f
  v v v
  ? 4 ð
  ? 5 θ
  s s s
  s S ʃ
  ? z ʂ
  h h h
Affricates b tS tʃʰ
  b dZ
Nasals m m m
  m M
  n n n
  n x
  n N ŋ
  n X ŋ̊
Laterals l l l
  l L
Approximants ð w w
  ð j j
  r r ɹ
Monophthongs i i i
  i? I ɪ
  e e e
  e? E ɛ
  a a a
  y y y
  ? Y ʏ
  ø 2 ø
  ? 9 œ
  ú? u u
  ? U ʊ
  ? o o
  ? O ɔ
  ? 8 ə
Diphthongs æ? EA ɛa
  á OA ɔa
  oy OJ ʊi
  ? UJ ɛi
  ei EJ ai
  ei? aJ ai
  ? aW au
  ? OJ ɔi
  ? OW ɔu
  ? 3W ʉu
  ? EW ɛu
  ? 9W œu
  ? 9J œi
Diacritics ? H ʰ
Others (length) : ː
  (prim. stress) % ˈ
  (sec. stress) ~ ˌ
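As an illustration of how the table can be used, the following Python sketch converts a FARSAMPA string to IPA by longest match. It covers only a subset of the rows above, namely those where the FARSAMPA symbol differs from the IPA symbol and the mapping is unambiguous in the table; all other characters pass through unchanged. The function name and the test strings are made up for the example and are not Faroese words.

```python
# Subset of the FARSAMPA -> IPA pairs listed in the table above.
FARSAMPA_TO_IPA = {
    "EA": "ɛa", "OA": "ɔa",                       # diphthongs
    "S": "ʃ", "N": "ŋ", "r": "ɹ",                 # consonants
    "I": "ɪ", "E": "ɛ", "Y": "ʏ", "2": "ø", "9": "œ",
    "U": "ʊ", "O": "ɔ", "8": "ə",                 # monophthongs
    "H": "ʰ", ":": "ː", "%": "ˈ", "~": "ˌ",       # diacritics, length, stress
}

def farsampa_to_ipa(text: str) -> str:
    """Convert FARSAMPA to IPA, trying two-character symbols first."""
    keys = sorted(FARSAMPA_TO_IPA, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(FARSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            out.append(text[i])        # symbol is the same in both alphabets
            i += 1
    return "".join(out)

example = farsampa_to_ipa("%sOA:")
```

Trying the longer keys first matters for the diphthongs: without it, OA would be read as O followed by a plain a rather than as one symbol.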

For reference: The SAMPA - IPA correspondence

SAMPA IPA Description
p p voiceless bilabial stop
b b voiced bilabial stop
t t voiceless alveolar or dental stop
d d voiced alveolar or dental stop
ts ʦ voiceless alveolar affricate
dz ʣ voiced alveolar affricate
tS ʧ voiceless postalveolar affricate
dZ ʤ voiced postalveolar affricate
c c voiceless palatal stop
J\ ɟ (overstroked j) voiced palatal stop
k k voiceless velar stop
g g voiced velar stop
q q voiceless uvular stop
p\ ɸ (Greek phi) voiceless bilabial fricative
B β (Greek beta) voiced bilabial fricative
  ϐ (Greek beta alt) voiced bilabial approximant
f f voiceless labiodental fricative
v v voiced labiodental fricative
T θ (Greek theta) voiceless dental fricative
  ϑ (Greek theta alt) voiceless dental approximant
D ð (Icelandic eth) voiced dental fricative
  δ (Greek delta) voiced dental approximant
s s voiceless alveolar fricative
z z voiced alveolar fricative
S ʃ voiceless postalveolar fricative
Z ʒ voiced postalveolar fricative
C ç (cedilla) voiceless palatal fricative
j\ (jj) ʝ (j with crossed tail) voiced palatal fricative
x x voiceless velar fricative
G γ (Greek gamma) voiced velar fricative
  ɰ voiced velar approximant
X\ ħ (overstroked h) voiceless pharyngeal fricative
?\ ʕ (Inverted ?) voiced pharyngeal fricative
h h voiceless glottal approximant
h\ ɦ (h with upper tail to the right) voiced glottal approximant
m m bilabial nasal
F ɱ (m with downward right tail) labiodental nasal
n n alveolar or dental nasal
J ɲ (n with downward left tail) palatal nasal
N ŋ (n with downward right tail) velar nasal
l l alveolar lateral
L ʎ turned down y, alt. λ (Greek lambda) palatal lateral
5 ɫ (l with middle tilde) velarized dental lateral
4 (r) ɾ (r without upper-left serif) alveolar flap
r (rr) r alveolar trill
r\ ɹ (r rotated 180°) retroflexed alveolar approximant
R ʀ (small capital R) uvular trill
P ʋ labiodental approximant
w w velo-labial approximant
H ɥ (turned down h) palato-labial approximant
j j palatal approximant

Vowels

.             front   near-front    central   near-back   back
close          i • y               1 • }                 M • u
near-close              I • Y                    U
close-mid      e • 2              @\ • 8                 7 • o
mid                                  @            
open-mid       E • 9               3 • 3\                V • O
near-open        {                    6           
open           a • &                                     A • Q

More SAMPA/IPA documentation

(Some symbols are doubled or escaped with \ in the source to escape Markdown (mis)interpretation, they will appear correct in the rendered HTML.)

Description SAMPA IPA Unicode
retroflex plosive, voiceless t` 1 ʈ 0288, 648
retroflex plosive, voiced d` 1 ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` 1 ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628
bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` 1 ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` 1 ʂ 0282, 642
retroflex fricative, voiced z` 1 ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614
       
alveolar lateral fricative, vl. K    
alveolar lateral fricative, vd. K\    
       
labiodental approximant P (or v\ )    
alveolar approximant r\    
retroflex approximant r\` 1    
velar approximant M\    
       
retroflex lateral approximant l` 1    
palatal lateral approximant L    
velar lateral approximant L\    
       
Clicks      
bilabial O\   (O = capital letter)
dental |\    
(post)alveolar !\    
palatoalveolar =\    
alveolar lateral ||\    
       
Ejectives, implosives      
ejective _>   e.g. ejective p = p_>
implosive _<   e.g. implosive b = b_<
       
Vowels      
close back unrounded M    
close central unrounded 1    
close central rounded }    
lax i I    
lax y Y    
lax u U    
       
close-mid front rounded 2    
close-mid central unrounded @\    
close-mid central rounded 8    
close-mid back unrounded 7    
       
schwa @ ə    
       
open-mid front unrounded E    
open-mid front rounded 9    
open-mid central unrounded 3    
open-mid central rounded 3\    
open-mid back unrounded V    
open-mid back rounded O    
       
ash (ae digraph) {    
open schwa (turned a) 6    
       
open front rounded &    
open back unrounded A    
open back rounded Q    
       
Other symbols      
voiceless labial-velar fricative W    
voiced labial-palatal approx. H    
voiceless epiglottal fricative H\    
voiced epiglottal fricative <\    
epiglottal plosive >\    
       
alveolo-palatal fricative, vl. s\    
alveolo-palatal fricative, voiced z\    
alveolar lateral flap l\    
simultaneous S and x x\    
tie bar _    
       
Suprasegmentals      
primary stress "    
secondary stress %    
long :    
half-long :\    
extra-short _X    
linking mark -\    
       
Tones and word accents      
level extra high _T    
level high _H    
level mid _M    
level low _L    
level extra low _B    
downstep !    
upstep ^   (caret, circumflex)
       
contour, rising _R    
contour, falling _F    
contour, high rising _H_T    
contour, low rising _B_L    
       
contour, rising-falling _R_F   (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
       
global rise <R>    
global fall <F>    
       
Diacritics      
       
voiceless _0   (0 = figure), e.g. n_0
voiced _v    
aspirated _h    
more rounded _O   (O = letter)
less rounded _c    
advanced _+    
retracted _-    
centralized _"    
syllabic = (or _=)   e.g. n= (or n_=)
non-syllabic _^    
rhoticity `    
       
breathy voiced _t    
creaky voiced _k    
linguolabial _N    
labialized _w    
palatalized ' (or _j)   e.g. t' (or t_j)
velarized _G    
pharyngealized _?\    
       
dental _d    
apical _a    
laminal _m    
nasalized ~ (or _~)   e.g. A~ (or A_~)
nasal release _n    
lateral release _l    
no audible release _}    
       
velarized or pharyngealized _e    
velarized l, alternatively 5    
raised _r    
lowered _o    
advanced tongue root _A    
retracted tongue root _q    

This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

Faroese abbreviations

We describe here how abbreviations in Faroese are read out, e.g. for text-to-speech systems.

LEXICON Root

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


src-fst-transcriptions-transcriptor-clock-digit2text.lexc.md

The Faroese clock

Multichar_Symbols defines flags and the tags +Use/NG and +Use/NA.

LEXICON Root where it all begins

LEXICON smallhour giving the 30-day

LEXICON largehour giving the 30-day

LEXICON BEFpunkt before punct

LEXICON AFTpunkt after punct

LEXICON BEF

LEXICON AFT after

LEXICON TOHALF before half

LEXICON OVERHALF after half

LEXICON TO í

LEXICON OVER yvir

LEXICON HOUR split in cases (not in use)

LEXICON NOMHOUR hours 1-12 in nominative


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-clock-digit2text.lexc


src-fst-transcriptions-transcriptor-date-digit2text.lexc.md

Faroese dates

Defining one tag: +Use/NG for do not generate

LEXICON Root starts.

LEXICON DAY splits days 1-9 in nominative and accusative

LEXICON DAY10 splits days 10-31 in nominative and accusative

LEXICON DAY_NOM the nominative ones (fyrsti…)

LEXICON DAY_ACC the accusative ones (fyrsta…)

LEXICON DAY10_NOM nominative tiggjundi…

LEXICON DAY10_ACC accusative tiggjunda…

LEXICON 29MONTH splits into 3 month types

LEXICON 30MONTH giving the 30-day months

LEXICON 31MONTH giving the 31-day months

LEXICON PUNCT gives punctuation
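
The three month lexicons suggest that a day number is only accepted together with a month that is long enough. A minimal sketch of that idea follows, assuming the ordinary calendar grouping (February as the 29-day type); the names `MAX_DAY` and `is_valid_date` are made up for the example and are not part of the lexc source.

```python
# Longest possible day per month, grouped like the three month lexicons.
# The membership is the ordinary calendar, assumed rather than read from lexc.
MAX_DAY = {2: 29}
MAX_DAY.update({m: 30 for m in (4, 6, 9, 11)})
MAX_DAY.update({m: 31 for m in (1, 3, 5, 7, 8, 10, 12)})


def is_valid_date(day, month):
    """Return True if the day can occur in the month in some year."""
    return month in MAX_DAY and 1 <= day <= MAX_DAY[month]
```

With this grouping, `is_valid_date(31, 4)` is rejected while `is_valid_date(29, 2)` is accepted, since the check does not look at the year.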


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-date-digit2text.lexc


src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

Faroese numbers

Digits are translated to text and vice versa.

It starts with LEXICON Root, which splits into thousands, hundreds, tens, ones.
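
The split into thousands, hundreds, tens and ones can be pictured with plain integer arithmetic. This is only a sketch of the decomposition, not of the lexc continuation classes themselves; `split_number` is a hypothetical helper name.

```python
def split_number(n):
    """Split a non-negative integer into thousands, hundreds, tens and ones."""
    if n < 0:
        raise ValueError("only non-negative numbers are handled")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    return {
        "thousands": thousands,
        "hundreds": hundreds,
        "tens": tens,
        "ones": ones,
    }
```

For example, `split_number(4321)` gives 4 thousands, 3 hundreds, 2 tens and 1 one; each part would then be spelled out by its own group of lexicons.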

LEXICON THOUSANDS

LEXICON 2to9T for two to nine thousand, pointing to THOUSAND.

LEXICON 10to99T for 10t and up

LEXICON TEENT for 10-19 thousands

LEXICON TENST

LEXICON TENCOUNTT

LEXICON OLDTENST

LEXICON OLDTEN-1T

LEXICON OLDTEN-2T

LEXICON OLDTEN-3T

LEXICON OLDTEN-4T

LEXICON OLDTEN-5T

LEXICON OLDTEN-6T

LEXICON OLDTEN-7T

LEXICON OLDTEN-8T

LEXICON OLDTEN-9T

LEXICON END1T

LEXICON END2T

LEXICON END3T

LEXICON END4T

LEXICON END5T

LEXICON END6T

LEXICON END7T

LEXICON END8T

LEXICON END9T

LEXICON HUNDREDST

LEXICON HUNDREDT

LEXICON 1to99T

LEXICON THOUSAND

LEXICON HUNDREDS

LEXICON HUNDRED

LEXICON 1to99

LEXICON 1to9

LEXICON 10to99

LEXICON TEEN

LEXICON TENS

LEXICON TENCOUNT

LEXICON ZERO

LEXICON OLDTENS

LEXICON OLDTEN-1

LEXICON OLDTEN-2

LEXICON OLDTEN-3

LEXICON OLDTEN-4

LEXICON OLDTEN-5

LEXICON OLDTEN-6

LEXICON OLDTEN-7

LEXICON OLDTEN-8

LEXICON OLDTEN-9

LEXICON END1

LEXICON END2

LEXICON END3

LEXICON END4

LEXICON END5

LEXICON END6

LEXICON END7

LEXICON END8

LEXICON END9


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

Faroese grammarchecker

This is work in progress. The main focus is on ð errors.

This file contains two parts: Definitions and rules

Definition section

Delimiters

Grammatical tags

Here we declare all grammatical tags

Declaring all the error tags

Rule section

Verbs

Sg1 target forms

RULE: Sup should be 1Sg

RULE: Sup should be 1Sg

RULE: sup > inf

RULE: Neu should be 1Sg

RULE: Imp Pl should be 1Sg

Plural forms

RULE: Sup should be Pl – marginal??

RULE: Sup should be Pl – marginal??

Supine forms

Rules for "Pl should be Sup" are not written yet.

RULE: Inf should be Sup

RULE: Inf should be Sup

RULE: Inf should be Sup

Specific verbs

RULE: Past tense of láta is læt, not lat

Nouns

Definiteness

RULE: Neu Indef should be Neu Def

We turn off this rule for now, it is too hard to avoid false alarms.

Quantor phrases

RULE: Num + N Sg should be Num + N Pl

Num + N Sg should be Num + N Pl (We need an Arabic tag here)
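
The idea behind the quantor phrase rule can be sketched outside CG as follows. This is a rough illustration under stated assumptions, not the actual rule: cohorts are modelled as (wordform, tags) pairs, the numeral is assumed to be written with digits, and any language-specific exceptions are not modelled. The function name `flag_num_plus_singular` is hypothetical.

```python
def flag_num_plus_singular(cohorts):
    """Return indices of singular nouns directly after a digit numeral > 1.

    cohorts is a list of (wordform, tags) pairs, where tags is a set of strings.
    """
    flagged = []
    for i in range(1, len(cohorts)):
        prev_form, prev_tags = cohorts[i - 1]
        _, tags = cohorts[i]
        # Only numerals written with decimal digits and larger than one count.
        is_plural_numeral = (
            "Num" in prev_tags and prev_form.isdecimal() and int(prev_form) > 1
        )
        if is_plural_numeral and {"N", "Sg"} <= tags:
            flagged.append(i)
    return flagged
```

A numeral tag that marks digit numerals directly would remove the need for the string check on the wordform.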

Subjunctives

Nothing here.

ta / tað rules

RULE: ta should be tað

Adjectives

RULE: líti should be lítið


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


tools-grammarcheckers-grc-disambiguator.cg3.md

Faroese disambiguator

Usage, in lang-fao: cat text.txt|hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3

This file documents the Faroese disambiguator file.

Delimiters, tags and sets

Test: Go for minimal weight. This rule gives priority to lexicalised forms.

MAPPING OF CC AND CS

Mostly we map both @CNP and @CVP; then we select @CNP, and after that we remove them so that @CVP remains.
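
One way to read the map, select, remove sequence is sketched below. It is an interpretation of the sentence above, not the CG rules themselves: the boolean `coordinates_nps` stands in for whatever context conditions the real SELECT rule uses, and `tag_conjunction` is a hypothetical name.

```python
def tag_conjunction(coordinates_nps):
    """Return the function tags left on a conjunction after the three steps."""
    readings = {"@CNP", "@CVP"}       # step 1: map both tags
    if coordinates_nps:
        readings = {"@CNP"}           # step 2: select @CNP where it fits
    else:
        readings.discard("@CNP")      # step 3: remove @CNP, @CVP remains
    return readings
```

Mapping both tags first means the later rules only ever have to decide against one candidate, which keeps @CVP as the fallback.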


This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for fao

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

Unknowns are made of:
    • lower-case ASCII
    • upper-case ASCII
    • select extended latin symbols
    • Faroese-specific alphabet
    • ASCII digits
    • select symbols
    • combining diacritics as individual symbols
    • various symbols from Private area (probably Microsoft), so far:
        • U+F0B7 for “x in box”
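
The distinction between word-like unknowns and unmatched strings can be sketched with a simple character-class test. This is an illustration only: it covers ASCII letters, ASCII digits and the Faroese letters áðíóúýæø, and leaves out the extended Latin, symbol, combining-diacritic and private-area classes from the list above. The names `WORDLIKE` and `classify` are assumptions for the example, not names from the pmscript.

```python
import string

# Faroese letters outside ASCII, lower and upper case.
FAROESE_LETTERS = "áðíóúýæøÁÐÍÓÚÝÆØ"

# Reduced stand-in for the character classes that make up unknowns.
WORDLIKE = set(string.ascii_letters + string.digits + FAROESE_LETTERS)


def classify(token):
    """Label a string that the morphology did not analyse."""
    if token and all(ch in WORDLIKE for ch in token):
        return "unknown-wordlike"
    return "unmatched"
```

Here `classify("føroyskt")` is treated as a word-like unknown, while a string such as `"ц"` falls outside the classes and is left as unmatched.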

Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for fao

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
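
The treatment of empty analyses described above can be pictured as a small filtering step. The sketch below models readings as plain strings and is not the hfst-tokenise implementation; `resolve_readings` is a hypothetical helper.

```python
def resolve_readings(readings):
    """Return (kept_readings, is_unknown) for one token.

    An empty string stands for an empty analysis. If every analysis is
    empty, the token counts as unknown; otherwise the empty analyses are
    dropped and the real ones are kept.
    """
    non_empty = [r for r in readings if r]
    if non_empty:
        return non_empty, False
    return [], True
```

So a token with readings `["", "ja+CC"]` keeps only the tagged reading, while a token whose only reading is empty is reported as unknown.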

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for fao

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but the built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript

  1. ` = ASCII 096