Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-fao
All doc-comment documentation in one large file.
Usage, in lang-fao:
cat text.txt|hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3
This file documents the Faroese disambiguator file.
Test: Go for minimal weight. This rule gives priority to lexicalised forms.
Mostly we map both @CNP and @CVP, then we select @CNP; after that we remove them so that @CVP remains.
CCasCNPCVP Map (@CNP @CVP) to CC
killAllahtenotCS All occurrences of “at” are CSs.
Kill Sem/ID
killAllCNP removes all remaining @CNP
XCC-CS removes CC and CS with no synttag
ErrOrth goes for correct forms
X removes readings with no syntax
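The map, select, remove order described for @CNP and @CVP can be illustrated with a toy model in Python. This is only a sketch of the idea, not how vislcg3 works internally: readings are modelled as plain tag sets, and the `np_coordination` argument is a hypothetical stand-in for whatever context conditions the real rules test.

```python
from typing import List, Set

Reading = Set[str]  # a reading is modelled as a set of tags (an assumption of this sketch)


def map_tags(readings: List[Reading], target: str, new_tags: List[str]) -> List[Reading]:
    """MAP: give every reading containing `target` one copy per syntactic tag."""
    out: List[Reading] = []
    for reading in readings:
        if target in reading:
            out.extend(reading | {tag} for tag in new_tags)
        else:
            out.append(reading)
    return out


def select(readings: List[Reading], tag: str) -> List[Reading]:
    """SELECT: keep only readings with `tag`, provided at least one has it."""
    kept = [r for r in readings if tag in r]
    return kept or readings


def remove(readings: List[Reading], tag: str) -> List[Reading]:
    """REMOVE: drop readings with `tag`, but never delete the last reading."""
    kept = [r for r in readings if tag not in r]
    return kept or readings


def disambiguate_cc(np_coordination: bool) -> List[Reading]:
    cohort = map_tags([{"CC"}], "CC", ["@CNP", "@CVP"])  # step 1: map both tags
    if np_coordination:                                   # hypothetical context test
        cohort = select(cohort, "@CNP")                   # step 2: select @CNP
    return remove(cohort, "@CNP")                         # step 3: remove leftover @CNP


print(disambiguate_cc(np_coordination=True))
print(disambiguate_cc(np_coordination=False))
```

Because the removal step never deletes the last remaining reading, a @CNP that was already selected survives it, while an unresolved conjunction ends up as @CVP.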
This (part of) documentation was generated from src/cg3/disambiguator.cg3
SYNTACTIC FUNCTIONS FOR FAROESE
Sámi language technology project 2003-2014, University of Tromsø
This file adds syntactic functions. It was copied from sme.
Syntactic sets
@X : The function is unknown, e.g. because the word is unknown
NP sets defined according to their morphosyntactic features
These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
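The barrier scan can be sketched in Python. This is an illustration of the idea only, not vislcg3's implementation; the one-tag-per-token representation, the tag names in `NP_MODIFIERS`, and the function name `scan_to_noun` are assumptions made for the example.

```python
from typing import Optional, Sequence, Set

# Hypothetical stand-in for the premodifier set: tags that may precede an NP head.
NP_MODIFIERS: Set[str] = {"Det", "A", "Num"}


def scan_to_noun(tags: Sequence[str], start: int) -> Optional[int]:
    """Return the index of the first noun to the right of `start`.

    The scan steps over tokens that can be NP premodifiers. Any other
    token acts as a barrier and aborts the search (returns None).
    """
    for i in range(start + 1, len(tags)):
        if tags[i] == "N":
            return i      # found the NP head
        if tags[i] not in NP_MODIFIERS:
            return None   # barrier: this token cannot be part of the NP
    return None


# One tag per token; the scan starts from the given position.
sentence = ["V", "Det", "A", "N", "Pr", "N"]
print(scan_to_noun(sentence, 0))  # steps over Det and A and reaches the noun
print(scan_to_noun(sentence, 3))  # the preposition blocks the scan
```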
HNOUN MAPPING
The leftovers are tagged @X
missingX adds @X to all that are missing
therestX adds @X to all that is left, often erroneously disambiguated forms
This (part of) documentation was generated from src/cg3/functions.cg3
SYNTACTIC FUNCTIONS FOR FAROESE
Sámi language technology project 2003-2014, University of Tromsø
This file adds syntactic functions. It was copied from sme.
Syntactic sets
@X : The function is unknown, e.g. because the word is unknown
NP sets defined according to their morphosyntactic features
These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.
The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
HNOUN MAPPING
The leftovers are tagged @X
missingX adds @X to all that are missing
therestX adds @X to all that is left, often erroneously disambiguated forms
This (part of) documentation was generated from src/cg3/korp.cg3
Now splitting according to POS, and according to dot or not
First collecting POS info, *-noun, *-adv, etc. Also splitting when in doubt: -noun-adj => -noun and -adj Then pointing to two contlexes, a dot-one and a non-dot-one.
LEXICON ab-dot-noun This is the lexicon for abbrs that must have a period.
LEXICON ab-dot-adj This is the lexicon for abbrs that must have a period.
LEXICON nodot-infl
LEXICON dot-infl
"kvæð" ABBR Gram/IAbbr N Abbr
in tokeniser mode also:
"ABBR Gram/IAbbr N Abbr
+ "." CLB
to account for sentence-final kvæð with no extra full stop.
"kvæða" V Imp Sg
+ "." CLB
due to homonymy.
The same treatment is applied to two and three full stops after an abbreviation at the end of the sentence:
"su" Adv Abbr
+ "." CLB Err/Orth
"su" Adv Abbr
+ "..." CLB
This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc
LEXICON ACRONOUN is the lexicon for nouns (not +Prop) like ATV
LEXICON UNIT As acro, but without paradigm
LEXICON acroconnector Here comes a set of possible symbols to put between the abbreviation and its suffix
LEXICON acronull for suffixless forms, redirecting to K_only for clitic forms
This (part of) documentation was generated from src/fst/morphology/affixes/acronyms.lexc
Adjectival case lexica
Msc
Neu
Positiv, def, u-umlj Msc
Fem
Neu
Positiv, def, ø-umlj Msc
Fem Neu
Gender tags
Case tags
Compound flags
This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc
This file contains the inflection suffixes for the Faroese nouns. The inflection classes are identical to the ones in Føroysk orðabók.
The nominal morphology is added in three layers. In this first layer we add gender tags and morphophonological diacritics. The next two layers are for indefinite and definite suffixes, respectively.
We first list 4 lexica for words waiting to be checked.
LEXICON xh25. TODO: classify xkv2. They are all f and end in a consonant.
These are lexica with number 0; they have no inflectional morphology.
These are simply split (h11/12 to h11 and h12, etc.).
These lexica split into sg and pl lexica, and add +N and gender tags. Thereafter they point to Layer 2, the case suffixes.
This is the second layer. Here we do indefinite forms and compounds.
This is the third layer. Here we do the indefinite and definite forms. These are common to (almost) all different paradigms, hence they are gathered here.
This concludes the nominal morphology.
The rest of the file contains flags, that govern the ways stems may be combined.
This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc
This lexicon just goes to #, in order to coexist with the number files in giella-shared. Those files are relevant for Sámi, not for Faroese.
Lexica:
This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc
For each group, the maltag etc. lexicon functions as a default lexicon. The other lexica are there for specific subgroups of the names.
This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc
This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc
s1 nevna = works
s2 keypa = works
SETA seta = works
s3 leiða = works
s4 frøa = works
s5 senda = works
s6 hirða = works
s7 gista = works
s8 kenna = works
s9 klippa = works
s10 fylgja = works
s11 roykja = works
s12 boyggja = works
s13 søkkja = works
s14 heingja = works
s15 skeinkja = works
s15_2 steikja = works
s16 flekja = works
s17 berja = works
s18 krevja = works
s19 dylja = works
s20 leggja = works
s21 selja = works
s22 ryðja = does not work in sup and prfptc
s22_1 ýðja = works
s23 smyrja = works
s24 flysa = does not work in pass
s25 liva = works
s26 plaga = works (the form plagdur is missing)
s26_1 mála->máldi
s27 spáa = works
s28 skaða = does not work in prfptc
s29 brúka = works
s30 kalla = works
s31 only gera and *gera = works
s32/30 útbúgva = works
s32 búgva = works
s33 rógva
s34 goyggja = works
s35 bíta = works
s36 svíkja = works
s37 bróta = works
s38 skjóta = works
s39d
s39s
s39
s40 fúka
s41 flúgva
s42 klúgva
s44 finna
s45 binda = works
s46 stinga = works
s47 svimja = works … but perhaps it should not have a passive
s48 drekka = does not work in adj because of dpkons
s48_2 renna = does not work in adj because of dpkons
s49 detta = does not work in adj because of dpkons
s49_2 treffa = does not work in adj because of dpkons
s49_3 sleppa = does not work in adj because of dpkons
s49_4 verpa = works
s50 røkka = does not work in adj because of dpkons
s51 ganga = works
s52 veva = works
s53 leypa = works
s54 bera = works
s55 fara = works
s56 geva = works
s57 sita = does not work + should probably not have a passive
s58 mala
s59 stjala
s60 taka, aka
s61 halda
s62 sova
s63 koma
s64 lata
s64_1 láta
s65 standa
s66 biðja
s67 draga
s68 hvørva
s69 sláa
s70 siga
s71 skerja
s72 eta
s73 læa
BLÍVA
EIGA
EITA
GRÁTA
HAVA
KUNNA
MEGA
MUNNA
SKULA
TYKJA
VERA
VERÐA
VILJA
VITA
SÍGGJA
FÁA
NÁA XXX check
LIGGJA
RADA
BURDA
GJALDA
VALDA
FALLA
GJALLA
BREGDA
SYNGJA XXX check
HOGGA høgga
KVODA
FLYGGJA
VAKSA
VEKSA
s30/26_1 dáma
HYGGJA
TYGGJA
MYLA
BLASA
TYSJA
GROA
KVOTTA
GALDA
TAKAST
LOYPAST loypast
sxrefl This is an ad hoc lexicon
s74 grindast
s75 balast
s76 ræðast
s77 skiftast
s78 farast
s79 skjótast
s80 trivast
s81 kíkjast
s82 fýlast
s83 samsinnast
FYRIB kopi, s83
s8/48_2 s9/30
standard_ir
standard_ir_t
ir_verb
ir_verb_t
jinf
inf
reflinf
pres_ir
pres_ir_j2
pres_jir
pres_ir_sg
pres_ar
pres_ur
pres_iur
pres_ur_j
pres_ur_j2
pres_strong_s1
pres_strong_s23
pres_strong_s23_t
pres_strong_s23_t0
pres_strong_s23_t1
pres_pl
pres_ast
pres_ist
pres_1ist
pres_23st
pres_plast
pret_adist
pret_dist
pret_tist
pret_ist
pret_st
pret_plust
pret_pltust
prt_d
prt_ð
prt_t
prt_ði
prt_ti
prt_du
prt_tu
prt_ðu
prt_dd
prt_a
prt_null
prt_null_s
prt_null_s2
prt_null_s2_t
prt_u_p
imp_prsptc
imp_prsptc_j
imp
imp_j
impsg
imppl
imppl_j
prsptc
sup
sup_t
sup_tt
sup_a kalla
sup_null stungið
sup_in kalla
sup_ið_in stungið
VANDI
p18
p26
p26_2
p34_6
p34_7
p32
p39
p5pos
p5
p6
p7
p8
This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc
@P.CmpFrst.FALSE@@P.CmpPref.FALSE@@D.CmpLast.TRUE@@D.CmpNone.TRUE@@U.CmpNone.FALSE@@P.CmpOnly.TRUE@ RReal ; are Flags to control compounding
+Cmp#:%- Nouns ;
+Cmp#:%- Adjectives ;
+Use/SpellNoSugg+Cmp/Hyph+Cmp#:-# Nouns ; For Num Cmp Noun, we do not want Num Cmp Num
This (part of) documentation was generated from src/fst/morphology/compounding.lexc
This file documents the phonology.twolc file
Here we declare all symbols.
á é ó ú í à è ò ù ì ä ë ö ü ï â ê ô û î ã ý þ ñ ð ß ç
Á É Ó Ú Í À È Ò Ù Ì Ä Ë Ö Ü Ï Â Ê Ô Û Î Ã Ý þ Ñ Ð
v2:v for invariant v
a3:a a:ø for da3n -> dønum, where normal a:o.
%^OEA:0 : ø to a
%^aAB:0 %^uAB:0 : Ablaut series subcases
Here we define some convenient sets.
These are the rules. After each rule (or rather: after many of the rules) there are test cases that are there to test whether the rules work.
Rule: Deleting g
Deleting g in sting:stakst
Rule: ng to kk Part 1 changes n to k in ng:kk before ^NGKK trigger
Rule: ng to kk Part 2 changes g to k in ng:kk before ^NGKK trigger
Rule: Deleting v in gv sequences Verschärfung II gives v:0 for gv:00 before ^GVDEL and in some other contexts
Verschärfung tests:
Rule: Deleting r in Genitive of ur stems
Rule: Deleting m in um%>num
Tests:
Rule: Deleting Double Consonant in Front of Consonant
The preceding rule is fishy: the test cases below don’t fit the context requirements, and the >s# in the right context seems to indicate passive. The rule conflicts with the “Cns Deletion in front of Pass” rule at the end of the file, but only when using the Xerox tools! XXX - please have a look!
Tests:
hjall0ar
Rule: Geminate Assimilation in Past Tense d
Rule: Geminate Assimilation in Past Tense t
Tests:
Rule: ð Assimilation in Front of Dental Past Suffix -d(i)
Tests:
Rule: Deleting Double Consonant in Front of Epenthesis mark
Tests:
Rule: Deleting stem-final s in s genitive
Tests:
Rule: Double ð Deletion
Rule: ð Assimilation in Front of Supine Suffix -t
Tests:
Rule: Adjusting Dental Past Suffix -d(i)
Tests:
Rule: Adjective neuter after nlr 1
Rule: Adjective neuter after nlr 2
Tests:
Rule: t Deletion in Neuter
j rules
Rule: Deleting j
Tests:
Rule: Realising j in front of vowels
Tests:
Vowel rules
Rule: Realising i2 as i
Tests:
Rule: Epenthetic deletion
Tests:
Rule: U-umlaut of Epenthetic vowel
Tests:
Rule: U-umlaut in Front of Nasal
Tests:
Rule: General U-umlaut
Tests:
Rule: U-umlaut for akur
Tests:
Rule: I-umlaut
Tests:
Rule: eI-umlaut for o:e, á:e, i:e
Rule: I-umlaut for bróðir
Rule: Inverted U-umlaut from ø
Tests:
Rule: Inverted U-umlaut from o
Tests:
Rule: o/ei-Umlaut I
Rule: o/ei-Umlaut II
Tests:
Rule: Vowel deletion in front of na
Rule: Stem vowel change in Weak Verbs
Tests:
Rule: Stem Vowel Shortening in Supine and Participle
Tests:
Rule: Past tense singular diphthongs I
Rule: Past tense singular diphthongs II
Tests:
Rule: Past tense singular monophthongs
Tests:
Rule: Past tense plural monophthongs
Rule: Past tense plural monophthongs to a
Rule: Supine u
Rule: Supine o
Rule: Supine i
Rule: Present tense ý
Rule: Vowel shortening in Neuter
Tests:
Rule: u in ur Deletion in front of Pass
Rule: r Deletion in front of Pass
Rule: ð Deletion in front of Pass
This (part of) documentation was generated from src/fst/morphology/phonology.twolc
+CLBfinal Sentence final abbreviated expression ending in full stop, so that the full stop is ambiguous
+Sg3 : This is inherited from common files, should be changed to +3Sg.
+Arab sub-pos
+Coll sub-pos
+Ine Sámi cases, to be removed
+MWE multiword expression
+Rom check these XXX
+Der/Adv derivation to Adverb
+Sem/Fem
+Sem/Year - year (i.e. 1000 - 2999), used only for numerals
+Sem/Txt
a3 This is for a special a Umlaut case a3:ø (normal: a:o)
%^PASS : todo ,
%> : Suffix boundary ,
Language tags
The tags are of the following form:
This entry / word should be in the following position(s):
+Use/Circ = for compound restrictions
+Use/PMatch means that the following is only used in the analyser feeding the disambiguator. This is missing.
+Use/-TTS – never retained in the HFST Text-To-Speech disambiguation tokeniser
Todo: Check whether these can be removed. They are probably obsolete.
%[%>%] - Literal >
%[%<%] - Literal <
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.
Flag | Explanation |
---|---|
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
@D.ErrOrth.ON@ | |
@C.ErrOrth@ | |
@P.ErrOrth.ON@ | |
@R.ErrOrth.ON@ | |
Set flag for compounds
Flag | Example word |
---|---|
@P.Case.MscNom@ | fyrstiflokkur |
@P.Case.MscObl@ | fyrstaflokk |
@P.Case.FemNom@ | lítlasystir |
@P.Case.FemObl@ | lítluusystur |
@P.Case.Neu@ | breiðaskarð |
@P.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds
Flag | Example word |
---|---|
@R.Case.MscNom@ | fyrstiflokkur |
@R.Case.MscObl@ | fyrstaflokk |
@R.Case.FemNom@ | lítlasystir |
@R.Case.FemObl@ | lítluusystur |
@R.Case.Neu@ | breiðaskarð |
@R.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds
Flag | Example word |
---|---|
@U.Case.MscNom@ | fyrstiflokkur |
@U.Case.MscObl@ | fyrstaflokk |
@U.Case.FemNom@ | lítlasystir |
@U.Case.FemObl@ | lítluusystur |
@U.Case.Neu@ | breiðaskarð |
@U.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
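How the P (set), R (require) and U (unify) variants of the Case flags interact can be sketched with a small Python checker. It follows the usual flag diacritic semantics but is only a simplified model: the N operator is left out, and the placement of the flags inside the example strings is hypothetical, not copied from the lexc files.

```python
import re
from typing import Dict, Optional

# Matches @OP.Feature@ and @OP.Feature.Value@ for the operators modelled here.
FLAG_RE = re.compile(r"@([PRDUC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(path: str) -> bool:
    """Check the flag diacritics in `path` from left to right.

    Modelled operators: P (set), R (require), D (disallow), U (unify)
    and C (clear). The N operator is not covered by this sketch.
    """
    state: Dict[str, str] = {}
    for op, feature, value in FLAG_RE.findall(path):
        current: Optional[str] = state.get(feature)
        if op == "P":
            state[feature] = value
        elif op == "C":
            state.pop(feature, None)
        elif op == "R":
            # Fails if the feature is unset, or set to a different value.
            if current is None or (value and current != value):
                return False
        elif op == "D":
            # Fails if the feature has this value (or any value, if none is given).
            if current is not None and (not value or current == value):
                return False
        elif op == "U":
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# Hypothetical flag placement on the example words from the tables above.
print(path_is_valid("fyrsti@P.Case.MscNom@flokkur@R.Case.MscNom@"))
print(path_is_valid("fyrsti@P.Case.MscNom@flokk@R.Case.MscObl@"))
```

In this model the first path is accepted because the required value matches the one that was set, while the second is rejected because the nominative first part is followed by a second part that requires the oblique value.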
Flag diacritic look-alikes for grammar checker & tokenisation purposes
Flag | Explanation |
---|---|
@P.Pmatch.Loc@ | Location in string used or parsed by hfst-pmatch |
@P.Pmatch.Backtrack@ | Also for hfst-pmatch |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
Flag | Explanation |
---|---|
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
Flag | Explanation |
---|---|
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
Flag diacritic | Explanation |
---|---|
@U.number.one@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.two@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.three@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.four@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.five@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.six@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.seven@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.eight@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.nine@ | Flag used to give arabic numerals in smj different cases ; |
@U.number.zero@ | Flag used to give arabic numerals in smj different cases ; |
This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.
Lexicon Acronyms is split in two:
And this is the ENDLEX of everything:
@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;
The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged with +CmpNP/Only to end here.
The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
This (part of) documentation was generated from src/fst/morphology/root.lexc
Lexica for adding tags and periods
The idea is (or may be) to use both common and language-specific abbreviations.
Splitting in 3 groups, because of the preprocessor
Abbreviation
dot% noStb.db Abbreviations that never induce sentence boundaries. The file is too large and should be shrunk.
This (part of) documentation was generated from src/fst/morphology/stems/abbreviations.lexc
The adjectives and their inflectional codes are taken from “Føroysk orðabók”.
Adjectives for the list of adjectives
This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc
We should eventually have syntactic tags here…
p for the tag +Pr
Preposition for the list of prepositions, ordered according to case they select for.
This (part of) documentation was generated from src/fst/morphology/stems/adpositions.lexc
adv for the tag +Adv
advcomp for the tag +Adv+Cmp
advsuperl for the tag +Adv+Superl
Adverb for the list of approx. 1000 adverbs
This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc
The file stems/conjunctions.lexc contains two lexica:
LEXICON CCtag for assigning the +CC tag to all the conjunctions below. It has one entry:
LEXICON Conjunction for the list of 10 or so conjunctions that are found in the file. Here are the first entries:
This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc
This (part of) documentation was generated from src/fst/morphology/stems/determiners.lexc
This documents the stems/fao-acronyms.lexc file.
Most acronyms are taken from a common generated file, this file is for the Faroese-specific acronyms.
LEXICON Acronym-fao pointing to the lexica
LEXICON Acronym-fao-list for the list itself, currently 2:
Akronymnumeralier for 0-9
anl send numbers to letterloops – this might be too liberal.
This (part of) documentation was generated from src/fst/morphology/stems/fao-acronyms.lexc
The tag +Interj
Interj
The words
Interjection okey, ááá, aj, huff, …
This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc
The lexicon names are taken from Føroysk orðabók I-II (FO). Reference is made to Thráinsson & al (“fg”).
Note that in some cases, the lexicon names and stems here deviate from FO. In that case the lexica have names ending in wordforms, written in capital letters.
Shortnouns for 1, 2 and 3 letter nouns excluded from compounding
These are now always excluded from lastpart compound and in norm from first-part compounding as well
Here come all the nouns. They are reverse-sorted. Lexica whose names begin with x have not been manually checked.
Nouns
The file contains just under 50000 lemmas.
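Reverse sorting, i.e. ordering words by their endings rather than their beginnings, can be sketched in Python as below. The word list reuses example compounds from the flag tables earlier in this documentation, and the function name `reverse_sort` is made up for the illustration. Note that Python compares code points here, which is not Faroese collation order.

```python
from typing import List


def reverse_sort(words: List[str]) -> List[str]:
    """Sort words by their reversed spelling, so that shared endings end up adjacent."""
    return sorted(words, key=lambda word: word[::-1])


words = ["fyrstiflokkur", "lítlasystir", "breiðaskarð", "fyrstaflokk"]
for word in reverse_sort(words):
    print(word)
```

Grouping by ending is convenient for a lexicon like this one, since words with the same ending often share an inflection class.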
This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc
Numeral splitting in types
1-9
TRÝsplit
nsplit
TEXTTENS
TEXTTEENS
basic
EITT
TVEY
TRÝ
PAIRNUM
n
ordinals
ord_decl
ANNAR
ANNARMORPH
This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc
Pronoun splitting into 3 sublexica:
Personal for the personal pronouns
egtu-obl
okkumtykkum
S_okkumtykkum
3obl
Reflexive
Interrogative
EINHVOR
ANNARHVOR
HANNSJALVUR
Indefinite
ONKUR
NAKAR
BADIR
HVORGIN
EINGIN
This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc
Table of content
Propernouns splitting in 3 lexica: multipartnames, names, guess
multipartnames contains only 3 names for now
names gives the list of names.
This (part of) documentation was generated from src/fst/morphology/stems/propernouns.lexc
The file stems/subjunctions.lexc contains three lexica:
LEXICON CStag assigns the +CS TAG. It has one entry: +CS: # ;
LEXICON IMtag assigns the +IM tag for the infinitive marker. The entry is: +IM: # ;
LEXICON Subjunction contains the list of some 10-20 CSs. Here are the first 4:
This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc
This file documents the file stems/verbs.lexc
The file contains one lexicon:
LEXICON Verbs = the lexicon containing all verb stems
mega, eiga, eita, gráta, liggja, … and 15 more
The lexica listed here represent the declension patterns presented in Føroysk orðabók. The lexicon names correspond to the declension codes in the dictionary.
Simple declension class verbs
Still to be classified
Double declension class verbs
Finally some candidates to be considered for verb compounding.
This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc
Table below taken from:
Standardising Pronunciation for a Grapheme-to-Phoneme Converter for Faroese
Phoneme class | Orthography | FARSAMPA | IPA |
---|---|---|---|
Stops | p | p | pʰ |
| | b | b | p |
| | t | t | tʰ |
| | d | d | t |
| | k | k | kʰ |
| | g | g | k |
Fricatives | f | f | f |
| | v | v | v |
| | ? | 4 | ð |
| | ? | 5 | θ |
| | s | s | s |
| | s | S | ʃ |
| | ? | z | ʂ |
| | h | h | h |
Affricates | b | tS | tʃʰ |
| | b | dZ | tʃ |
Nasals | m | m | m |
| | m | M | m̥ |
| | n | n | n |
| | n | x | n̥ |
| | n | N | ŋ |
| | n | X | ŋ̊ |
Laterals | l | l | l |
| | l | L | l̥ |
Approximants | ð | w | w |
| | ð | j | j |
| | r | r | ɹ |
Monophthongs | i | i | i |
| | i? | I | ɪ |
| | e | e | e |
| | e? | E | ɛ |
| | a | a | a |
| | y | y | y |
| | ? | Y | ʏ |
| | ø | 2 | ø |
| | ? | 9 | œ |
| | ú? | u | u |
| | ? | U | ʊ |
| | ? | o | o |
| | ? | O | ɔ |
| | ? | 8 | ə |
Diphthongs | æ? | EA | ɛa |
| | á | OA | ɔa |
| | oy | OJ | ʊi |
| | ? | UJ | ɛi |
| | ei | EJ | ai |
| | ei? | aJ | ai |
| | ? | aW | au |
| | ? | OJ | ɔi |
| | ? | OW | ɔu |
| | ? | 3W | ʉu |
| | ? | EW | ɛu |
| | ? | 9W | œu |
| | ? | 9J | œi |
Diacritics | ? | H | ʰ |
Others | (length) | : | ː |
| | (prim. stress) | % | ˈ |
| | (sec. stress) | ~ | ˌ |
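A conversion based on this table can be sketched in Python with greedy longest-match lookup, so that two-character symbols such as tS or EA are tried before single characters. This is an illustrative sketch and not the logic of txt2ipa.xfscript: the dictionary holds only a subset of the FARSAMPA and IPA pairs from the table, and the function name and the example string are made up.

```python
from typing import Dict, List

# Subset of the FARSAMPA-to-IPA pairs listed in the table above.
FARSAMPA_TO_IPA: Dict[str, str] = {
    "p": "pʰ", "b": "p", "t": "tʰ", "d": "t", "k": "kʰ", "g": "k",
    "tS": "tʃʰ", "dZ": "tʃ",
    "s": "s", "S": "ʃ", "m": "m", "n": "n", "N": "ŋ", "l": "l", "r": "ɹ",
    "i": "i", "I": "ɪ", "e": "e", "E": "ɛ", "a": "a", "u": "u", "U": "ʊ",
    "EA": "ɛa", "OA": "ɔa", "aW": "au",
    ":": "ː", "%": "ˈ", "~": "ˌ",
}

MAX_LEN = max(len(symbol) for symbol in FARSAMPA_TO_IPA)


def farsampa_to_ipa(text: str) -> str:
    """Convert a FARSAMPA string to IPA, preferring the longest matching symbol."""
    out: List[str] = []
    i = 0
    while i < len(text):
        for length in range(MAX_LEN, 0, -1):
            chunk = text[i:i + length]
            if chunk in FARSAMPA_TO_IPA:
                out.append(FARSAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for {text[i]!r} at position {i}")
    return "".join(out)


# A made-up symbol sequence, not a real Faroese word.
print(farsampa_to_ipa("%tSEAt"))
```

The longest-match order matters because several FARSAMPA symbols are prefixes of others: without it, tS would be read as t followed by S.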
SAMPA | IPA | Description |
---|---|---|
p | p | voiceless bilabial stop |
b | b | voiced bilabial stop |
t | t | voiceless alveolar or dental stop |
d | d | voiced alveolar or dental stop |
ts | ʦ | voiceless alveolar affricate |
dz | ʣ | voiced alveolar affricate |
tS | ʧ | voiceless postalveolar affricate |
dZ | ʤ | voiced postalveolar affricate |
c | c | voiceless palatal stop |
J\ | ɟ | (overstroked j) voiced palatal stop |
k | k | voiceless velar stop |
g | g | voiced velar stop |
q | q | voiceless uvular stop |
p\ | ɸ | (Greek phi) voiceless bilabial fricative |
B | β | (Greek beta) voiced bilabial fricative |
| | ϐ | (Greek beta alt) voiced bilabial approximant |
f | f | voiceless labiodental fricative |
v | v | voiced labiodental fricative |
T | θ | (Greek theta) voiceless dental fricative |
| | ϑ | (Greek theta alt) voiceless dental approximant |
D | ð | (Icelandic eth) voiced dental fricative |
| | δ | (Greek delta) voiced dental approximant |
s | s | voiceless alveolar fricative |
z | z | voiced alveolar fricative |
S | ʃ | voiceless postalveolar fricative |
Z | ʒ | voiced postalveolar fricative |
C | ç | (cedilla) voiceless palatal fricative |
j\ (jj) | ʝ | (j with crossed tail) voiced palatal fricative |
x | x | voiceless velar fricative |
G | γ | (Greek gamma) voiced velar fricative |
| | ɰ | voiced velar approximant |
X\ | ħ | (overstroked h) voiceless pharyngeal fricative |
?\ | ʕ | (Inverted ?) voiced pharyngeal fricative |
h | h | voiceless glottal approximant |
h\ | ɦ | (h with upper tail to the right) voiced glottal approximant |
m | m | bilabial nasal |
F | ɱ | (m with downward right tail) labiodental nasal |
n | n | alveolar or dental nasal |
J | ɲ | (n with downward left tail) palatal nasal |
N | ŋ | (n with downward right tail) velar nasal |
l | l | alveolar lateral |
L | ʎ | turned down y, alt. λ (Greek lambda) palatal lateral |
5 | ɫ | (l with middle tilde) velarized dental lateral |
4 (r) | ɾ | (r without upper-left serif) alveolar flap |
r (rr) | r | alveolar trill |
r\ | ɹ | (r rotated 180°) retroflexed alveolar approximant |
R | ʀ | (small capital R) uvular trill |
P | ʋ | labiodental approximant |
w | w | velo-labial approximant |
H | ɥ | (turned down h) palato-labial approximant |
j | j | palatal approximant |
. front near-front central near-back back
close i • y 1 • } M • u
near-close I • Y U
close-mid e • 2 @\ • 8 7 • o
mid @
open-mid E • 9 3 • 3\ V • O
near-open { 6
open a • & A • Q
(Some symbols are doubled or escaped with \ in the source to escape Markdown (mis)interpretation, they will appear correct in the rendered HTML.)
Description | SAMPA | IPA | Unicode |
---|---|---|---|
retroflex plosive, voiceless | t` 1 | ʈ | 0288, 648 |
retroflex plosive, voiced | d` 1 | ɖ | 0256, 598 |
labiodental nasal | F | ɱ | 0271, 625 |
retroflex nasal | n` 1 | ɳ | 0273, 627 |
palatal nasal | J | ɲ | 0272, 626 |
velar nasal | N | ŋ | 014B, 331 |
uvular nasal | N\ | ɴ | 0274, 628 |
bilabial trill | B\ | ʙ | 0299, 665 |
uvular trill | R\ | ʀ | 0280, 640 |
alveolar tap | 4 | ɾ | 027E, 638 |
retroflex flap | r` 1 | ɽ | 027D, 637 |
bilabial fricative, voiceless | p\ | ɸ | 0278, 632 |
bilabial fricative, voiced | B | β | 03B2, 946 |
dental fricative, voiceless | T | θ | 03B8, 952 |
dental fricative, voiced | D | ð | 00F0, 240 |
postalveolar fricative, voiceless | S | ʃ | 0283, 643 |
postalveolar fricative, voiced | Z | ʒ | 0292, 658 |
retroflex fricative, voiceless | s` 1 | ʂ | 0282, 642 |
retroflex fricative, voiced | z` 1 | ʐ | 0290, 656 |
palatal fricative, voiceless | C | ç | 00E7, 231 |
palatal fricative, voiced | j\ | ʝ | 029D, 669 |
velar fricative, voiced | G | ɣ | 0263, 611 |
uvular fricative, voiceless | X | χ | 03C7, 967 |
uvular fricative, voiced | R | ʁ | 0281, 641 |
pharyngeal fricative, voiceless | X\ | ħ | 0127, 295 |
pharyngeal fricative, voiced | ?\ | ʕ | 0295, 661 |
glottal fricative, voiced | h\ | ɦ | 0266, 614 |
alveolar lateral fricative, vl. | K | ||
alveolar lateral fricative, vd. | K\ | ||
labiodental approximant | P (or v\ ) | ||
alveolar approximant | r\ | ||
retroflex approximant | r\` 1 | ||
velar approximant | M\ | ||
retroflex lateral approximant | l` 1 | ||
palatal lateral approximant | L | ||
velar lateral approximant | L\ | ||
Clicks | |||
bilabial | O\ | (O = capital letter) | |
dental | |\ | ||
(post)alveolar | !\ | ||
palatoalveolar | =\ | ||
alveolar lateral | ||\ | ||
Ejectives, implosives | |||
ejective | _> | e.g. ejective p = p_> | |
implosive | _< | e.g. implosive b = b_< | |
Vowels | |||
close back unrounded | M | ||
close central unrounded | 1 | ||
close central rounded | } | ||
lax i | I | ||
lax y | Y | ||
lax u | U | ||
close-mid front rounded | 2 | ||
close-mid central unrounded | @\ | ||
close-mid central rounded | 8 | ||
close-mid back unrounded | 7 | ||
schwa ə | @ | ||
open-mid front unrounded | E | ||
open-mid front rounded | 9 | ||
open-mid central unrounded | 3 | ||
open-mid central rounded | 3\ | ||
open-mid back unrounded | V | ||
open-mid back rounded | O | ||
ash (ae digraph) | { | ||
open schwa (turned a) | 6 | ||
open front rounded | & | ||
open back unrounded | A | ||
open back rounded | Q | ||
Other symbols | |||
voiceless labial-velar fricative | W | ||
voiced labial-palatal approx. | H | ||
voiceless epiglottal fricative | H\ | ||
voiced epiglottal fricative | <\ | ||
epiglottal plosive | >\ | ||
alveolo-palatal fricative, vl. | s\ | ||
alveolo-palatal fricative, voiced | z\ | ||
alveolar lateral flap | l\ | ||
simultaneous S and x | x\ | ||
tie bar | _ | ||
Suprasegmentals | |||
primary stress | " | ||
secondary stress | % | ||
long | : | ||
half-long | :\ | ||
extra-short | _X | ||
linking mark | -\ | ||
Tones and word accents | |||
level extra high | _T | ||
level high | _H | ||
level mid | _M | ||
level low | _L | ||
level extra low | _B | ||
downstep | ! | ||
upstep | ^ | (caret, circumflex) | |
contour, rising | _R | ||
contour, falling | _F | ||
contour, high rising | _H_T | ||
contour, low rising | _B_L | ||
contour, rising-falling | _R_F | (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.) | |
global rise | <R> | ||
global fall | <F> | ||
Diacritics | |||
voiceless | _0 | (0 = figure), e.g. n_0 | |
voiced | _v | ||
aspirated | _h | ||
more rounded | _O | (O = letter) | |
less rounded | _c | ||
advanced | _+ | ||
retracted | _- | ||
centralized | _" | ||
syllabic | = (or _=) | e.g. n= (or n_=) | |
non-syllabic | _^ | ||
rhoticity | ` | ||
breathy voiced | _t | ||
creaky voiced | _k | ||
linguolabial | _N | ||
labialized | _w | ||
palatalized | ' (or _j) | e.g. t' (or t_j) | |
velarized | _G | ||
pharyngealized | _?\ | ||
dental | _d | ||
apical | _a | ||
laminal | _m | ||
nasalized | ~ (or _~) | e.g. A~ (or A_~) | |
nasal release | _n | ||
lateral release | _l | ||
no audible release | _} | ||
velarized or pharyngealized | _e | ||
velarized l, alternatively | 5 | ||
raised | _r | ||
lowered | _o | ||
advanced tongue root | _A | ||
retracted tongue root | _q |
This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript
We describe here how abbreviations in Faroese are read out, e.g. for text-to-speech systems.
LEXICON Root
For example:
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc
Multichar_Symbols defines flags and +Use/NG and +Use/NA.
LEXICON Root where it all begins
LEXICON smallhour giving the 30-day
LEXICON largehour giving the 30-day
LEXICON BEFpunkt before punct
LEXICON AFTpunkt after punct
LEXICON BEF
LEXICON AFT after
LEXICON TOHALF before half
LEXICON OVERHALF after half
LEXICON TO í
LEXICON OVER yvir
LEXICON HOUR split in cases (not in use)
LEXICON NOMHOUR hours 1-12 in nominative
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-clock-digit2text.lexc
Defining one tag: +Use/NG for do not generate
LEXICON Root starts.
LEXICON DAY splits days 1-9 in nominative and accusative
LEXICON DAY10 splits days 10-31 in nominative and accusative
LEXICON DAY_NOM the nominative ones (fyrsti…)
LEXICON DAY_ACC the accusative ones (fyrsta…)
LEXICON DAY10_NOM nominative tiggjundi…
LEXICON DAY10_ACC accusative tiggjunda…
LEXICON 29MONTH splits in 3 month types
LEXICON 30MONTH giving the 30-day months
LEXICON 31MONTH giving the 31-day months
LEXICON PUNCT gives punctuation
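The point of splitting the months by type is that each type accepts a different range of day numbers. A minimal Python sketch of that check follows. It assumes that 29MONTH stands for February (up to 29 days) and ignores leap-year logic; the names `MAX_DAY` and `is_valid_date` are hypothetical.

```python
from typing import Dict

# Highest day number per month, grouped the same way as the three month lexica.
MAX_DAY: Dict[int, int] = {2: 29}                                  # assumed: 29MONTH
MAX_DAY.update({month: 30 for month in (4, 6, 9, 11)})             # 30-day months
MAX_DAY.update({month: 31 for month in (1, 3, 5, 7, 8, 10, 12)})   # 31-day months


def is_valid_date(day: int, month: int) -> bool:
    """Accept a day/month pair only if the day exists in that month type."""
    return month in MAX_DAY and 1 <= day <= MAX_DAY[month]


print(is_valid_date(31, 4))   # April has no 31st
print(is_valid_date(29, 2))   # accepted without checking the year
```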
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-date-digit2text.lexc
Digits are translated to text and vice versa.
It starts with lexicon Root, which splits into thousands, hundreds, tens, ones.
LEXICON THOUSANDS
LEXICON 2to9T for two to nine thousand, pointing to THOUSAND.
LEXICON 10to99T for 10t and up
LEXICON TEENT for 10-19 thousands
LEXICON TENST
LEXICON TENCOUNTT
LEXICON OLDTENST
LEXICON OLDTEN-1T
LEXICON OLDTEN-2T
LEXICON OLDTEN-3T
LEXICON OLDTEN-4T
LEXICON OLDTEN-5T
LEXICON OLDTEN-6T
LEXICON OLDTEN-7T
LEXICON OLDTEN-8T
LEXICON OLDTEN-9T
LEXICON END1T
LEXICON END2T
LEXICON END3T
LEXICON END4T
LEXICON END5T
LEXICON END6T
LEXICON END7T
LEXICON END8T
LEXICON END9T
LEXICON HUNDREDST
LEXICON HUNDREDT
LEXICON 1to99T
LEXICON THOUSAND
LEXICON HUNDREDS
LEXICON HUNDRED
LEXICON 1to99
LEXICON 1to9
LEXICON 10to99
LEXICON TEEN
LEXICON TENS
LEXICON TENCOUNT
LEXICON ZERO
LEXICON OLDTENS
LEXICON OLDTEN-1
LEXICON OLDTEN-2
LEXICON OLDTEN-3
LEXICON OLDTEN-4
LEXICON OLDTEN-5
LEXICON OLDTEN-6
LEXICON OLDTEN-7
LEXICON OLDTEN-8
LEXICON OLDTEN-9
LEXICON END1
LEXICON END2
LEXICON END3
LEXICON END4
LEXICON END5
LEXICON END6
LEXICON END7
LEXICON END8
LEXICON END9
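The split that Root performs can be illustrated outside lexc. The Python sketch below only shows the place-value decomposition, not the Faroese number words. The function name split_places and the upper bound of 999 999 (guessed from lexicon names such as 10to99T and HUNDREDST) are assumptions.

```python
def split_places(digits: str) -> dict:
    """Split a digit string into thousands, hundreds, tens and ones.

    Hypothetical helper: accepts 1 to 6 ASCII digits, so the
    'thousands' part may itself range from 0 to 999.
    """
    if not (digits.isascii() and digits.isdigit()) or len(digits) > 6:
        raise ValueError("expected 1 to 6 ASCII digits")
    n = int(digits)
    return {
        "thousands": n // 1000,
        "hundreds": (n // 100) % 10,
        "tens": (n // 10) % 10,
        "ones": n % 10,
    }


print(split_places("1984"))
# {'thousands': 1, 'hundreds': 9, 'tens': 8, 'ones': 4}
```

Each key corresponds loosely to one branch of the lexicon tree: the thousands part would be spelled out by the lexicons whose names end in T, the rest by their counterparts without T.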
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc
This is work in progress. The main focus is on ð errors.
This file contains two parts: Definitions and rules
Here we declare all grammatical tags
Declaring all the error tags
We turn off this rule for now, it is too hard to avoid false alarms.
Num + N Sg should be Num + N Pl (we need an Arabic tag here).
Nothing here.
This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3
Usage, in lang-fao:
cat text.txt | hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g src/cg3/disambiguator.cg3
This file documents the Faroese disambiguator file.
Test: Go for minimal weight. This rule gives priority to lexicalised forms.
Mostly we map both @CNP and @CVP, then we select @CNP, and after that we remove the remaining @CNP so that @CVP remains.
CCasCNPCVP Map (@CNP @CVP) to CC
killAllahtenotCS All occurrences of “at” are CSs.
Kill Sem/ID
killAllCNP removes all remaining @CNP
XCC-CS removes CC and CS with no synttag
ErrOrth goes for correct forms
X removes readings with no syntax
This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3
Usage:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
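As an illustration of what "adjacent to words" means for these two characters, the following Python sketch splits them off as separate tokens even when no space surrounds them. It is not the pmscript implementation; the names ADJACENT and split_adjacent are hypothetical.

```python
import re
import unicodedata

# The two characters listed above: soft hyphen and zero width no-break space.
ADJACENT = "\u00ad\ufeff"


def split_adjacent(text: str) -> list:
    """Split text so that each listed character becomes its own token."""
    # The capturing group makes re.split keep the delimiters in the result.
    parts = re.split(f"([{ADJACENT}])", text)
    return [p for p in parts if p]


for ch in ADJACENT:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(split_adjacent("ja\u00adja"))
```

The loop prints the Unicode names of the two code points, which makes these otherwise invisible characters easier to identify in a text.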
.Whitespace contains ASCII white space and the List contains some Unicode white space characters.
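Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.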
Unknowns are made of:
Unknowns are tagged ?? and treated specially with hfst-tokenise
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG, they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
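The handling of empty analyses described here can be sketched in Python. This is only a toy model of the described behaviour, not the code of hfst-tokenise; the function name clean_readings and the sample analysis string are hypothetical.

```python
def clean_readings(readings: list) -> tuple:
    """Model the treatment of empty analyses for one token.

    Returns (is_unknown, readings):
    - a token whose analyses are all empty counts as unknown;
    - otherwise the empty analyses are dropped and the rest kept.
    """
    non_empty = [r for r in readings if r]
    if not non_empty:
        return True, []
    return False, non_empty


print(clean_readings([""]))            # (True, [])
print(clean_readings(["ja+CC", ""]))   # (False, ['ja+CC'])
```

Doing this filtering before the CG step means the grammar never sees a reading without tags.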
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some Unicode white space characters.
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.
TODO: Could use something like this, but the built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some Unicode white space characters.
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.
TODO: Could use something like this, but the built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get.
This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript