Faroese NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-fao

Page Content

  • src-fst-morphology-affixes-acronyms.lexc.md
  • North Saami acronyms - affix part
  • src-fst-morphology-affixes-adjectives.lexc.md
  • Adjective morphology
  • Intermediate adjectival lexica
  • Comparative
  • Superlative
  • src-fst-morphology-affixes-nouns.lexc.md
  • Faroese Noun morphology
  • Layer 1: Basic noun lexica
  • Layer 2: Case inflection
  • Layer 3: Definite inflection
  • Compound flags
  • src-fst-morphology-affixes-numerals.lexc.md
  • Numeral affixes
  • src-fst-morphology-affixes-propernouns.lexc.md
  • Proper nouns
  • src-fst-morphology-affixes-symbols.lexc.md
  • Symbol affixes
  • src-fst-morphology-affixes-verbs.lexc.md
  • Verb morphology
  • src-fst-morphology-compounding.lexc.md
  • Compounding morphology
  • Lexicon R gets flags and sends compounds over to RReal
  • Lexicon RReal is the lexicon for the Cmp tag and resending to N, A
  • Lexicon R- for compounds with hyphen
  • Lexicon RNum for compounds numeral + noun
  • src-fst-morphology-phonology.twolc.md
  • The Faroese morphophonological file
  • Rules
  • src-fst-morphology-root.lexc.md
  • Faroese morphological analyser
  • Definitions for Multichar_Symbols
  • Lexicon Root
  • Lexicon ENDLEX
  • src-fst-morphology-stems-abbreviations.lexc.md
  • File containing Faroese abbreviations
  • src-fst-morphology-stems-adjectives.lexc.md
  • Faroese adjectives
  • src-fst-morphology-stems-adpositions.lexc.md
  • Faroese prepositions
  • src-fst-morphology-stems-adverbs.lexc.md
  • Faroese adverbs
  • src-fst-morphology-stems-conjunctions.lexc.md
  • The Faroese conjunctions
  • src-fst-morphology-stems-determiners.lexc.md
  • Faroese determiners
  • src-fst-morphology-stems-fao-acronyms.lexc.md
  • Acronyms
  • src-fst-morphology-stems-interjections.lexc.md
  • Interjections
  • src-fst-morphology-stems-nouns.lexc.md
  • Faroese noun stem file
  • src-fst-morphology-stems-numerals.lexc.md
  • Faroese Numerals
  • src-fst-morphology-stems-pronouns.lexc.md
  • Faroese pronouns
  • src-fst-morphology-stems-propernouns.lexc.md
  • Proper nouns
  • src-fst-morphology-stems-subjunctions.lexc.md
  • Faroese subjunctions
  • src-fst-morphology-stems-verbs.lexc.md
  • Faroese verb stems
  • src-fst-phonetics-txt2ipa.xfscript.md
  • Phonological converter for Faroese
  • For reference: The SAMPA - IPA correspondence
  • src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md
  • Faroese abbreviations
  • src-fst-transcriptions-transcriptor-clock-digit2text.lexc.md
  • The Faroese clock
  • src-fst-transcriptions-transcriptor-date-digit2text.lexc.md
  • Faroese dates
  • src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md
  • Faroese numbers
  • tools-grammarcheckers-grammarchecker.cg3.md
  • Faroese grammarchecker
  • Definition section
  • Rule section
  • tools-grammarcheckers-grc-disambiguator.cg3.md
  • Faroese disambiguator
  • MAPPING OF CC AND CS
  • tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md
  • Tokeniser for fao
  • tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md
  • Grammar checker tokenisation for fao
  • tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md
  • TTS tokenisation for smj
  • Faroese language model documentation

    All doc-comment documentation in one large file.


    src-cg3-disambiguator.cg3.md

    Faroese disambiguator

    Usage, in lang-fao: cat text.txt | hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g src/cg3/disambiguator.cg3

    This file documents the Faroese disambiguator file.
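The usage line above can also be driven from a script. The following is a minimal sketch, not part of the repository: the function names `build_pipeline`, `as_shell` and `run_pipeline` are hypothetical, and it assumes that `hfst-tokenize` and `vislcg3` are installed and that the script is run from the lang-fao root, as the usage line implies.

```python
from __future__ import annotations

import shlex
import subprocess


def build_pipeline(text_file: str) -> list[list[str]]:
    """Return the three stages of the usage line as argument vectors."""
    return [
        ["cat", text_file],
        ["hfst-tokenize", "-cg",
         "tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst"],
        ["vislcg3", "-g", "src/cg3/disambiguator.cg3"],
    ]


def as_shell(stages: list[list[str]]) -> str:
    """Render the stages as one shell pipeline string (for logging)."""
    return " | ".join(shlex.join(stage) for stage in stages)


def run_pipeline(text_file: str) -> str:
    """Run the stages in order, feeding each stage's stdout to the next.

    Needs the external tools on PATH; raises CalledProcessError on failure.
    """
    data = None
    for argv in build_pipeline(text_file):
        result = subprocess.run(argv, input=data, capture_output=True, check=True)
        data = result.stdout
    return data.decode("utf-8")


if __name__ == "__main__":
    print(as_shell(build_pipeline("text.txt")))
```

Keeping the stages as argument lists avoids shell quoting problems when the input file name contains spaces.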

    Delimiters, tags and sets

    Test: Go for minimal weight. This rule gives priority to lexicalised forms.

    MAPPING OF CC AND CS

    In most cases we map both @CNP and @CVP, then select @CNP where it applies, and finally remove the remaining @CNP tags so that @CVP remains.


    This (part of) documentation was generated from src/cg3/disambiguator.cg3


    src-cg3-functions.cg3.md

    SYNTACTIC FUNCTIONS FOR FAROESE

    Sámi language technology project 2003-2014, University of Tromsø

    This file adds syntactic functions. It was copied from sme.

    Syntactic sets

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
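To make the barrier test concrete, here is a small sketch of how a rightward scan with a barrier behaves. It is an illustration only: cohorts are modelled as plain sets of tags, the premodifier tags in `PREMODIFIERS` are assumed for the example rather than taken from the grammar, and the sketch checks the target before the barrier on each cohort, which is an assumption about evaluation order.

```python
from __future__ import annotations

# Hypothetical premodifier tags; the real set is defined in the grammar file.
PREMODIFIERS = {"Det", "A", "Num"}


def is_noun(tags: set[str]) -> bool:
    return "N" in tags


def is_not_npmod(tags: set[str]) -> bool:
    """Model of NOT-NPMOD: any word that is not a possible NP premodifier."""
    return not (tags & PREMODIFIERS)


def scan_right(cohorts: list[set[str]], start: int) -> int | None:
    """Sketch of (*1 N BARRIER NOT-NPMOD), scanning rightwards from `start`.

    Returns the index of the first noun, or None if a barrier word is
    reached first or the window ends without a noun.
    """
    for i in range(start + 1, len(cohorts)):
        if is_noun(cohorts[i]):
            return i
        if is_not_npmod(cohorts[i]):
            return None
    return None


if __name__ == "__main__":
    # verb, determiner, adjective, noun: the scan reaches the NP head
    print(scan_right([{"V"}, {"Det"}, {"A"}, {"N"}], 0))
    # verb, preposition, noun: the preposition blocks the scan
    print(scan_right([{"V"}, {"Pr"}, {"N"}], 0))
```

In the first call the determiner and adjective are skipped because they can belong to the noun phrase; in the second the preposition acts as a barrier, so no NP head is found.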

    HNOUN MAPPING

    The leftovers are tagged @X

    missingX adds @X to all missings

    therestX adds @X to everything that is left, often erroneously disambiguated forms


    This (part of) documentation was generated from src/cg3/functions.cg3


    src-cg3-korp.cg3.md

    SYNTACTIC FUNCTIONS FOR FAROESE

    Sámi language technology project 2003-2014, University of Tromsø

    This file adds syntactic functions. It was copied from sme.

    Syntactic sets

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).

    HNOUN MAPPING

    The leftovers are tagged @X

    missingX adds @X to all missings

    therestX adds @X to everything that is left, often erroneously disambiguated forms


    This (part of) documentation was generated from src/cg3/korp.cg3


    src-fst-morphology-affixes-abbreviations.lexc.md

    Abbreviation affixes

    Now splitting according to POS, and according to dot or not

    First we collect POS info: *-noun, *-adv, etc. When in doubt we also split: -noun-adj => -noun and -adj. Each class then points to two continuation lexica, one with a final dot and one without.

    Lexicons without final period

    Lexicons with final period


    This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


    src-fst-morphology-affixes-acronyms.lexc.md

    North Saami acronyms - affix part

    The lexica giving tags and suffixes to the acronyms


    This (part of) documentation was generated from src/fst/morphology/affixes/acronyms.lexc


    src-fst-morphology-affixes-adjectives.lexc.md

    Adjective morphology

    Ad hoc lexica

    The lexicons

    Irregular adjectives

    Irregular comparatives

    Intermediate adjectival lexica

    Adjectival case lexica

    Msc

    Neu

    Definite declension

    Positiv, def, u-umlj Msc

    Fem

    Neu

    Positiv, def, ø-umlj Msc

    Fem Neu

    Gender tags

    Case tags

    Compound flags

    Comparative

    Superlative


    This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


    src-fst-morphology-affixes-nouns.lexc.md

    Faroese Noun morphology

    This file contains the inflection suffixes for the Faroese nouns. The inflection classes are identical to the ones in Føroysk orðabók.

    The morphology is ordered in three layers.

    Layer 1: Basic noun lexica

    The nominal morphology is added in three layers. In this first layer we add gender tags and morphophonological diacritics. The next two layers are for indefinite and definite suffixes, respectively.

    Lexicons still to be allocated

    We first list 4 lexica for words waiting to be checked.

    Irregular nouns

    These are lexica with number 0, they have no inflectional morphology.

    Lexica for words belonging to two paradigms.

    These are simply split (h11/12 to h11 and h12, etc).

    The ordinary lexica

    These lexica split into sg and pl lexica, and add +N and gender tags. Thereafter they point to Layer 2, the case suffixes.

    Lexica for weak masculines.

    Lexica for strong masculines

    Lexica for feminines

    Lexica for Neuter nouns

    Layer 2: Case inflection

    This is the second layer. Here we do indefinite forms and compounds.

    Lexica for masculine nouns

    Lexica for weak case suffixes.

    Singular

    Plural

    Strong case suffixes

    Nominative Sg

    Accusative Sg

    Dative Sg

    Genitive Sg

    Plural forms

    Nominative

    Accusative

    Dative

    Genitive

    Feminine forms

    Singular case suffixes.

    Nominative

    Oblique

    Plural case suffixes

    Neuter forms

    Singular

    Layer 3: Definite inflection

    This is the third layer. Here we do the indefinite and definite forms. These are common to (almost) all different paradigms, hence they are gathered here.

    Masculine forms

    Masc def sg

    Masc def pl

    Feminine forms

    Fem Sg

    Feminine plural forms

    Neuter forms

    Neuter sg

    This concludes the nominal morphology.
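The three layers can be pictured as three functions applied in sequence. The sketch below is a simplified illustration and not a rendering of the actual lexc lexica: it covers only the singular of one strong masculine noun (stem bát), the suffix tables are filled in by hand, and tag names other than +N and +Msc (here +Sg, +Nom, +Acc, +Dat, +Def, +Indef) are assumptions made for the example.

```python
from __future__ import annotations

# Layer 2 (assumed data): indefinite singular case suffixes, strong masculine.
CASE_SUFFIX = {"Nom": "ur", "Acc": "", "Dat": "i"}
# Layer 3 (assumed data): definite suffixes added after the case suffix.
DEF_SUFFIX = {"Nom": "in", "Acc": "in", "Dat": "num"}


def layer1(stem: str) -> tuple[str, str]:
    """Layer 1: add the POS and gender tags."""
    return stem, "+N+Msc"


def layer2(form: str, tags: str, case: str) -> tuple[str, str]:
    """Layer 2: add the case suffix and the number and case tags."""
    return form + CASE_SUFFIX[case], f"{tags}+Sg+{case}"


def layer3(form: str, tags: str, case: str, definite: bool) -> tuple[str, str]:
    """Layer 3: add the definite suffix, or only the indefinite tag."""
    if definite:
        return form + DEF_SUFFIX[case], tags + "+Def"
    return form, tags + "+Indef"


def generate(stem: str, case: str, definite: bool) -> tuple[str, str]:
    form, tags = layer1(stem)
    form, tags = layer2(form, tags, case)
    return layer3(form, tags, case, definite)


if __name__ == "__main__":
    for case in ("Nom", "Acc", "Dat"):
        print(generate("bát", case, False), generate("bát", case, True))
```

Because the definite suffixes attach to the already inflected form, they can be shared across paradigms, which is the reason the document gives for gathering them in one layer.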

    Compound flags

    The rest of the file contains flags that govern the ways stems may be combined.


    This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


    src-fst-morphology-affixes-numerals.lexc.md

    Numeral affixes

    This lexicon just goes to #, in order to coexist with the number files in giella-shared. Those files are relevant for Sámi, not for Faroese.

    Lexica:


    This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


    src-fst-morphology-affixes-propernouns.lexc.md

    Proper nouns

    Table of content

    The morphological tags

    For each group, the maltag etc. lexicon functions as a default lexicon. The other lexica are there for specific subgroups of the names.

    Indeclinables

    Male first names

    Female first names

    Surnames

    Place names and other names


    This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


    src-fst-morphology-affixes-symbols.lexc.md

    Symbol affixes


    This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


    src-fst-morphology-affixes-verbs.lexc.md

    Verb morphology

    s1 nevna = works

    s2 keypa = works

    SETA seta = works

    s3 leiða = works

    s4 frøa = works

    s5 senda = works

    s6 hirða = works

    s7 gista = works

    s8 kenna = works

    s9 klippa = works

    s10 fylgja = works

    s11 roykja = works

    s12 boyggja = works

    s13 søkkja = works

    s14 heingja = works

    s15 skeinkja = works

    s15_2 steikja = works

    s16 flekja = works

    s17 berja = works

    s18 krevja = works

    s19 dylja = works

    s20 leggja = works

    s21 selja = works

    s22 ryðja = does not work in sup and prfptc

    s22_1 ýðja = works

    s23 smyrja = works

    s24 flysa = does not work in pass

    s25 liva = works

    s26 plaga = works (the form plagdur is missing)

    s26_1 mála->máldi

    s27 spáa = works

    s28 skaða = does not work in prfptc

    s29 brúka = works

    s30 kalla = works

    s31 only gera and *gera = works

    s32/30 útbúgva = works

    s32 búgva = works

    s33 rógva

    s34 goyggja = works

    Strong verbs starting here

    s35 bíta = works

    s36 svíkja = works

    s37 bróta = works

    s38 skjóta = works

    s39d

    s39s

    s39

    s40 fúka

    s41 flúgva

    s42 klúgva

    s44 finna

    s45 binda = works

    s46 stinga = works

    s47 svimja = works … but perhaps it should not have a passive

    s48 drekka = does not work in adj because of dpkons

    s48_2 renna = does not work in adj because of dpkons

    s49 detta = does not work in adj because of dpkons

    s49_2 treffa = does not work in adj because of dpkons

    s49_3 sleppa = does not work in adj because of dpkons

    s49_4 verpa = works

    s50 røkka = does not work in adj because of dpkons

    s51 ganga = works

    s52 veva = works

    s53 leypa = works

    s54 bera = works

    s55 fara = works

    s56 geva = works

    s57 sita = does not work + should probably not have a passive

    s58 mala

    s59 stjala

    s60 taka, aka

    s61 halda

    s62 sova

    s63 koma

    s64 lata

    s64_1 láta

    s65 standa

    s66 biðja

    s67 draga

    s68 hvørva

    s69 sláa

    s70 siga

    s71 skerja

    s72 eta

    s73 læa

    Ad hoc, irregular

    BLÍVA

    EIGA

    EITA

    GRÁTA

    HAVA

    KUNNA

    MEGA

    MUNNA

    SKULA

    TYKJA

    VERA

    VERÐA

    VILJA

    VITA

    SÍGGJA

    FÁA

    NÁA XXX check

    LIGGJA

    RADA

    BURDA

    GJALDA

    VALDA

    FALLA

    GJALLA

    BREGDA

    SYNGJA XXX check

    HOGGA høgga

    KVODA

    FLYGGJA

    VAKSA

    VEKSA

    s30/26_1 dáma

    HYGGJA

    TYGGJA

    MYLA

    BLASA

    TYSJA

    GROA

    KVOTTA

    GALDA

    TAKAST

    LOYPAST loypast

    sxrefl This is an ad hoc lexicon

    s74 grindast

    s75 balast

    s76 ræðast

    s77 skiftast

    s78 farast

    s79 skjótast

    s80 trivast

    s81 kíkjast

    s82 fýlast

    s83 samsinnast

    FYRIB copy of s83

    Split lexica

    s8/48_2 s9/30

    Intermediate lexicon groups

    standard_ir

    standard_ir_t

    ir_verb

    ir_verb_t

    Suffix lexica

    Infinitive

    jinf

    inf

    reflinf

    Present

    pres_ir

    pres_ir_j2

    pres_jir

    pres_ir_sg

    pres_ar

    pres_ur

    pres_iur

    pres_ur_j

    pres_ur_j2

    pres_strong_s1

    pres_strong_s23

    pres_strong_s23_t

    pres_strong_s23_t0

    pres_strong_s23_t1

    pres_pl

    pres_ast

    pres_ist

    pres_1ist

    pres_23st

    pres_plast

    pret_adist

    pret_dist

    pret_tist

    pret_ist

    pret_st

    pret_plust

    pret_pltust

    Preterite

    prt_d

    prt_ð

    prt_t

    prt_ði

    prt_ti

    prt_du

    prt_tu

    prt_ðu

    prt_dd

    prt_a

    prt_null

    prt_null_s

    prt_null_s2

    prt_null_s2_t

    prt_u_p

    Passive lexica

    Imperative and present participle

    imp_prsptc

    imp_prsptc_j

    imp

    imp_j

    impsg

    imppl

    imppl_j

    prsptc

    Supine and preterite participle

    sup

    sup_t

    sup_tt

    sup_a kalla

    sup_null stungið

    sup_in kalla

    sup_ið_in stungið

    Middle lexicon

    VANDI

    Perfect Participles

    p18

    p26

    p26_2

    p34_6

    p34_7

    p32

    p39

    p5pos

    p5

    p6

    p7

    p8


    This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


    src-fst-morphology-compounding.lexc.md

    Compounding morphology

    Lexicon R gets flags and sends compounds over to RReal

    @P.CmpFrst.FALSE@@P.CmpPref.FALSE@@D.CmpLast.TRUE@@D.CmpNone.TRUE@@U.CmpNone.FALSE@@P.CmpOnly.TRUE@ RReal ; These are the flags that control compounding.

    Lexicon RReal is the lexicon for the Cmp tag and resending to N, A

    Lexicon R- for compounds with hyphen

    +Cmp#:%- Nouns ;
    +Cmp#:%- Adjectives ;

    Lexicon RNum for compounds numeral + noun

      +Use/SpellNoSugg+Cmp/Hyph+Cmp#:-# Nouns ;    ! For Num Cmp Noun; we do not want Num Cmp Num
    

    This (part of) documentation was generated from src/fst/morphology/compounding.lexc


    src-fst-morphology-phonology.twolc.md

    The Faroese morphophonological file

    This file documents the phonology.twolc file

    Alphabet

    Here we declare all symbols.

    Sets

    Here we define some convenient sets.

    Rules

    These are the rules. After each rule (or rather: after many of the rules) there are test cases that are there to test whether the rules work.

    Verschärfung

    Rule: Deleting g

    Rule: ng to kk Part 1: changes n to k in ng:kk before the ^NGKK trigger

    Rule: ng to kk Part 2: changes g to k in ng:kk before the ^NGKK trigger

    Rule: Deleting v in gv sequences (Verschärfung II): gives v:0 for gv:00 before ^GVDEL and in some other contexts

    Verschärfung tests:

    Rule: Deleting r in Genitive of ur stems

    Rule: Deleting m in um%>num

    Tests:
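A test case for the m-deletion rule above can be pictured as a pair of a lexical and a surface form. The sketch below imitates the rule with a regular expression rather than with twolc. It assumes that `>` is the morpheme boundary written `%>` in the rule name, that the boundary is realised as nothing on the surface, and that the rule applies exactly when a stem-final um meets the suffix num. The word pair is chosen for illustration and is not copied from the rule file.

```python
import re


def delete_m_before_num(lexical: str) -> str:
    """Map a lexical form to a surface form for the um>num context only."""
    # Delete m when um is followed by the boundary and the suffix num.
    surface = re.sub(r"um(?=>num)", "u", lexical)
    # Morpheme boundaries are not realised on the surface.
    return surface.replace(">", "")


if __name__ == "__main__":
    # dative plural + definite suffix
    print(delete_m_before_num("bátum>num"))
```

A form without the triggering suffix, such as the plain dative plural, passes through unchanged apart from boundary removal.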

    Rule: Deleting Double Consonant in Front of Consonant

    The preceding rule is fishy: the test cases below don’t fit the context requirements, and the >s# in the right context seems to indicate passive. The rule conflicts with the “Cns Deletion in front of Pass” rule at the end of the file, but only when using the Xerox tools! XXX - please have a look!

    Tests:

    Verbal Sandhi rules

    Rule: Geminate Assimilation in Past Tense d

    Rule: Geminate Assimilation in Past Tense t

    Tests:

    Rule: ð Assimilation in Front of Dental Past Suffix -d(i)

    Tests:

    Rule: Deleting Double Consonant in Front of Epenthesis mark

    Tests:

    Rule: Deleting stem-final s in s genitive

    Tests:

    Rule: Double ð Deletion

    Rule: ð Assimilation in Front of Supine Suffix -t

    Tests:

    Rule: Adjusting Dental Past Suffix -d(i)

    Tests:

    Adjectival sandhi rules

    Rule: Adjective neuter after nlr 1

    Rule: Adjective neuter after nlr 2

    Tests:

    Rule: t Deletion in Neuter

    j rules

    Rule: Deleting j

    Tests:

    Rule: Realising j in front of vowels

    Tests:

    Vowel rules

    Rule: Realising i2 as i

    Tests:

    Epenthetic vowel rules

    Rule: Epenthetic deletion

    Tests:

    Rule: U-umlaut of Epenthetic vowel

    Tests:

    Umlaut rules

    Rule: U-umlaut in Front of Nasal

    Tests:

    Rule: General U-umlaut

    Tests:

    Rule: U-umlaut for akur

    Tests:

    Rule: I-umlaut

    Tests:

    Rule: eI-umlaut for o:e, á:e, i:e

    Rule: I-umlaut for bróðir

    Rule: Inverted U-umlaut from ø

    Tests:

    Rule: Inverted U-umlaut from o

    Tests:

    Rule: o/ei-Umlaut I

    Rule: o/ei-Umlaut II

    Tests:

    Vowel deletion rules

    Rule: Vowel deletion in front of na

    Verbal vowel alternation rules

    Rule: Stem vowel change in Weak Verbs

    Tests:

    Rule: Stem Vowel Shortening in Supine and Participle

    Tests:

    Rule: Past tense singular diphthongs I

    Rule: Past tense singular diphthongs II

    Tests:

    Rule: Past tense singular monophthongs

    Tests:

    Rule: Past tense plural monophthongs

    Rule: Past tense plural monophthongs to a

    Rule: Supine u

    Rule: Supine o

    Rule: Supine i

    Rule: Present tense ý

    Adjectival Sandhi rule

    Rule: Vowel shortening in Neuter

    Tests:

    Other rules

    Morphological passive rules

    Rule: u in ur Deletion in front of Pass

    Rule: r Deletion in front of Pass

    Rule: ð Deletion in front of Pass


    This (part of) documentation was generated from src/fst/morphology/phonology.twolc


    src-fst-morphology-root.lexc.md

    Faroese morphological analyser

    Definitions for Multichar_Symbols

    Tags for POS

    Semantic tags

    Non-changing letters

    Triggers for Morphophonology

    Language tags

    Non-ascii letters, perhaps needed as multichar symbols

    Compounding tags

    The tags are of the following form:

    This entry / word should be in the following position(s):

    Usage tags

    Symbols that need to be escaped on the lower side (towards twolc):

    Todo: Check whether these can be removed. They are probably obsolete.

    Flag diacritics

    We have manually optimised the structure of our lexicon, using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

    @P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised

    Flags for speller suggestions

    @D.ErrOrth.ON@
    @C.ErrOrth@
    @P.ErrOrth.ON@
    @R.ErrOrth.ON@

    Flag for case harmony in compounds

    Set flag for compounds

    Flag Example word
    @P.Case.MscNom@ fyrstiflokkur
    @P.Case.MscObl@ fyrstaflokk
    @P.Case.FemNom@ lítlasystir
    @P.Case.FemObl@ lítlusystur
    @P.Case.Neu@ breiðaskarð
    @P.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

    Control flag values for compounds

    Flag Example word
    @R.Case.MscNom@ fyrstiflokkur
    @R.Case.MscObl@ fyrstaflokk
    @R.Case.FemNom@ lítlasystir
    @R.Case.FemObl@ lítlusystur
    @R.Case.Neu@ breiðaskarð
    @R.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

    Control flag values for compounds

    Flag Example word
    @U.Case.MscNom@ fyrstiflokkur
    @U.Case.MscObl@ fyrstaflokk
    @U.Case.FemNom@ lítlasystir
    @U.Case.FemObl@ lítlusystur
    @U.Case.Neu@ breiðaskarð
    @U.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

    Flag diacritic look-alikes for grammar checker & tokenisation purposes

    Flag Explanation
    @P.Pmatch.Loc@ Location in string used or parsed by hfst-pmatch
    @P.Pmatch.Backtrack@ Also for hfst-pmatch

    Flags for compound restriction

    For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

    Flag Explanation
    @P.CmpFrst.FALSE@ Require that words tagged as such only appear first
    @D.CmpPref.TRUE@ Block such words from entering ENDLEX
    @P.CmpPref.FALSE@ Block these words from making further compounds
    @D.CmpLast.TRUE@ Block such words from entering R
    @D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
    @U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
    @P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
    @D.CmpOnly.FALSE@ Disallow words coming directly from root.

    Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

    Flag Explanation
    @U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
    @U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

    Lexicon Root

    This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.

    Lexicon Acronyms is split in two:

    Lexicon ENDLEX

    And this is the ENDLEX of everything:

    @D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;
    

    The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
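How the flag diacritics listed in this file interact along one path through the lexicon can be sketched with a small interpreter. This is a simplified model for illustration, not the HFST or Xerox implementation: it handles only the P (set), D (disallow), R (require), U (unify) and C (clear) operations with plain values, and the function name `path_allowed` is hypothetical.

```python
from __future__ import annotations

import re

# One flag diacritic: @OP.Feature@ or @OP.Feature.Value@
FLAG = re.compile(r"@([PDRUC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_allowed(path: str) -> bool:
    """Return True if the flags on `path`, read left to right, are consistent."""
    state: dict[str, str | None] = {}
    for op, feature, value in FLAG.findall(path):
        val = value or None          # missing value comes back as ""
        cur = state.get(feature)
        if op == "P":                # set the feature
            state[feature] = val
        elif op == "C":              # clear the feature
            state.pop(feature, None)
        elif op == "D":              # fail if set (to this value)
            if (val is None and cur is not None) or (val is not None and cur == val):
                return False
        elif op == "R":              # fail unless set (to this value)
            if (val is None and cur is None) or (val is not None and cur != val):
                return False
        elif op == "U":              # set if unset, otherwise values must agree
            if cur is None:
                state[feature] = val
            elif cur != val:
                return False
    return True


# Flag strings quoted from the lexicon R and ENDLEX entries in this document.
R_FLAGS = ("@P.CmpFrst.FALSE@@P.CmpPref.FALSE@@D.CmpLast.TRUE@"
           "@D.CmpNone.TRUE@@U.CmpNone.FALSE@@P.CmpOnly.TRUE@")
ENDLEX_FLAGS = "@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@"

if __name__ == "__main__":
    print(path_allowed(R_FLAGS + ENDLEX_FLAGS))
    print(path_allowed("@P.NeedNoun.ON@" + ENDLEX_FLAGS))
    print(path_allowed("@P.NeedNoun.ON@@C.NeedNoun@" + ENDLEX_FLAGS))
```

Under this model a path that passes through R may end in ENDLEX, a path that sets NeedNoun and never clears it is blocked by @D.NeedNoun.ON@, and clearing the flag (as nominalisation is described as doing) opens the path again.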


    This (part of) documentation was generated from src/fst/morphology/root.lexc


    src-fst-morphology-stems-abbreviations.lexc.md

    File containing Faroese abbreviations

    Lexica for adding tags and periods

    The idea is (or may be) to use both common and language-specific abbreviations.

    Splitting in 3 groups, because of the preprocessor

    Abbreviation

    dot% noStb.db Abbreviations that never induce sentence boundaries. The file is too large and should be shrunk.


    This (part of) documentation was generated from src/fst/morphology/stems/abbreviations.lexc


    src-fst-morphology-stems-adjectives.lexc.md

    Faroese adjectives

    The adjectives and their inflectional codes are taken from “Føroysk orðabók”.

    The list of adjectives

    Adjectives for the list of adjectives

    Irregular comparatives and superlatives

    Prefixed present participles

    Regular adjectives, systematic list


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


    src-fst-morphology-stems-adpositions.lexc.md

    Faroese prepositions

    We should eventually have syntactic tags here…

    Tags

    p for the tag +Pr

    The list of prepositions

    Preposition for the list of prepositions, ordered according to case they select for.

    Foreign

    Several cases

    Accusative or dative


    Accusative or genitive

    Accusative

    Dative


    This (part of) documentation was generated from src/fst/morphology/stems/adpositions.lexc


    src-fst-morphology-stems-adverbs.lexc.md

    Faroese adverbs

    adv for the tag +Adv

    advcomp for the tag +Adv+Cmp

    advsuperl for the tag +Adv+Superl

    Adverb for the list of approximately 1000 adverbs


    This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


    src-fst-morphology-stems-conjunctions.lexc.md

    The Faroese conjunctions

    The file stems/conjunctions.lexc contains two lexica:

    LEXICON CCtag for assigning the +CC tag to all the conjunctions below. It has one entry:

    LEXICON Conjunction for the list of 10 or so conjunctions that are found in the file. Here are the first entries:


    This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


    src-fst-morphology-stems-determiners.lexc.md

    Faroese determiners


    This (part of) documentation was generated from src/fst/morphology/stems/determiners.lexc


    src-fst-morphology-stems-fao-acronyms.lexc.md

    Acronyms

    This documents the stems/fao-acronyms.lexc file. Most acronyms are taken from a common generated file, this file is for the Faroese-specific acronyms.

    LEXICON Acronym-fao pointing to the lexica

    LEXICON Acronym-fao-list for the list itself, currently 2:

    Akronymnumeralier for 0-9

    anl sends numbers to letter loops as well; this might be too liberal.


    This (part of) documentation was generated from src/fst/morphology/stems/fao-acronyms.lexc


    src-fst-morphology-stems-interjections.lexc.md

    Interjections

    The tag +Interj

    Interj

    The words

    Interjection okey, ááá, aj, huff, …


    This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


    src-fst-morphology-stems-nouns.lexc.md

    Faroese noun stem file

    The lexicon names are taken from Føroysk orðabók I-II (FO). Reference is made to Thráinsson & al (“fg”).

    Note that in some cases, the lexicon names and stems here deviate from FO. In that case the lexica have names ending in wordforms, written in capital letters.

    Short lexica

    Shortnouns for 1, 2 and 3 letter nouns excluded from compounding

    These are now always excluded from being the last part of a compound, and in the normative analyser from being the first part as well.

    The main list of nouns

    All the nouns follow here. They are sorted in reverse, by word ending. Lexica whose names begin with x have not been checked manually.

    Nouns

    The file contains just under 50,000 lemmas.


    This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


    src-fst-morphology-stems-numerals.lexc.md

    Faroese Numerals

    Numeral splitting in types

    1-9

    TRÝsplit

    nsplit

    TEXTTENS

    TEXTTEENS

    basic

    EITT

    TVEY

    TRÝ

    PAIRNUM

    n

    Ordinals

    ordinals

    ord_decl

    ANNAR

    ANNARMORPH


    This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


    src-fst-morphology-stems-pronouns.lexc.md

    Faroese pronouns

    Pronoun splitting into 4 sublexica:

    1. Personal ;
    2. Reflexive ;
    3. Interrogative ;
    4. Indefinite ;

    Personal for the personal pronouns

    egtu-obl

    okkumtykkum

    S_okkumtykkum

    3obl

    Reflexive

    Interrogative

    EIN

    ANNAR_P

    EINHVOR

    ANNARHVOR

    HANNSJALVUR

    Indefinite

    ONKUR

    NAKAR

    BADIR

    HVORGIN

    EINGIN


    This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


    src-fst-morphology-stems-propernouns.lexc.md

    Proper nouns

    Table of content

    Splitting into name types

    Propernouns splitting in 3 lexica: multipartnames, names, guess

    multipartnames contains only 3 names for now

    names gives the list of names.


    This (part of) documentation was generated from src/fst/morphology/stems/propernouns.lexc


    src-fst-morphology-stems-subjunctions.lexc.md

    Faroese subjunctions

    The file stems/subjunctions.lexc contains three lexica:

    LEXICON CStag assigns the +CS tag. It has one entry: +CS: # ;

    LEXICON IMtag assigns the +IM tag for the infinitive marker. The entry is: +IM: # ;

    LEXICON Subjunction contains the list of some 10-20 CSs. Here are the first 4:


    This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


    src-fst-morphology-stems-verbs.lexc.md

    Faroese verb stems

    This file documents the file stems/verbs.lexc

    The file contains one lexicon:

    LEXICON Verbs = the lexicon containing all verb stems

    Some irregular verbs

    mega, eiga, eita, gráta, liggja, … and 15 more

    some irregular passive verbs

    The long verb list

    The lexica listed here represent the declension patterns presented in Føroysk orðabók. The lexicon names correspond to the declension codes in the dictionary.

    Simple declension class verbs

    Still to be classified

    Double declension class verbs

    Finally some candidates to be considered for verb compounding.


    This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


    src-fst-phonetics-txt2ipa.xfscript.md

    Phonological converter for Faroese

    Table below taken from:

    Standardising Pronunciation for a Grapheme-to-Phoneme Converter for Faroese

    FARSAMPA/IPA table

    Phoneme class Orthography FARSAMPA IPA
    Stops p p
      b b p
      t t
      d d t
      k k
      g g k
    Fricatives f f f
      v v v
      ? 4 ð
      ? 5 θ
      s s s
      s S ʃ
      ? z ʂ
      h h h
    Affricates b tS tʃʰ
      b dZ
    Nasals m m m
      m M
      n n n
      n x
      n N ŋ
      n X ŋ̊
    Laterals l l l
      l L
    Approximants ð w w
      ð j j
      r r ɹ
    Monophthongs i i i
      i? I ɪ
      e e e
      e? E ɛ
      a a a
      y y y
      ? Y ʏ
      ø 2 ø
      ? 9 œ
      ú? u u
      ? U ʊ
      ? o o
      ? O ɔ
      ? 8 ə
    Diphthongs æ? EA ɛa
      á OA ɔa
      oy OJ ʊi
      ? UJ ɛi
      ei EJ ai
      ei? aJ ai
      ? aW au
      ? OJ ɔi
      ? OW ɔu
      ? 3W ʉu
      ? EW ɛu
      ? 9W œu
      ? 9J œi
    Diacritics ? H ʰ
    Others (length) : ː
      (prim. stress) % ˈ
      (sec. stress) ~ ˌ
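As an illustration of how such a correspondence table can be applied, the sketch below converts a FARSAMPA string to IPA by longest match. It is not the xfscript converter documented here. The mapping includes only rows of the table above whose FARSAMPA and IPA columns are both legible, symbols that map to themselves (f, v, s, h, m, n, l, i, e, a, y, u, o) are simply passed through, and the input strings in the usage lines are arbitrary symbol sequences, not real transcriptions.

```python
# Subset of the FARSAMPA/IPA table above; ambiguous rows are left out.
FARSAMPA_TO_IPA = {
    "EA": "ɛa", "OA": "ɔa", "aW": "au", "OW": "ɔu",   # diphthongs
    "S": "ʃ", "N": "ŋ", "4": "ð", "5": "θ", "r": "ɹ",  # consonants
    "I": "ɪ", "E": "ɛ", "Y": "ʏ", "2": "ø", "9": "œ",  # vowels
    "U": "ʊ", "O": "ɔ", "8": "ə",
    "H": "ʰ", ":": "ː",                                # diacritic, length
}

# Try longer symbols first so that "EA" wins over "E".
_KEYS = sorted(FARSAMPA_TO_IPA, key=len, reverse=True)


def farsampa_to_ipa(text: str) -> str:
    """Convert a FARSAMPA string to IPA, copying unknown symbols unchanged."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(FARSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(farsampa_to_ipa("OA"))
    print(farsampa_to_ipa("maN"))
    print(farsampa_to_ipa("sEA:"))
```

Longest-match ordering matters because several two-character symbols begin with a character that is also a symbol on its own.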

    For reference: The SAMPA - IPA correspondence

    SAMPA IPA Description
    p p voiceless bilabial stop
    b b voiced bilabial stop
    t t voiceless alveolar or dental stop
    d d voiced alveolar or dental stop
    ts ʦ voiceless alveolar affricate
    dz ʣ voiced alveolar affricate
    tS ʧ voiceless postalveolar affricate
    dZ ʤ voiced postalveolar affricate
    c c voiceless palatal stop
    J\ ɟ (overstroked j) voiced palatal stop
    k k voiceless velar stop
    g g voiced velar stop
    q q voiceless uvular stop
    p\ ɸ (Greek phi) voiceless bilabial fricative
    B β (Greek beta) voiced bilabial fricative
      ϐ (Greek beta alt) voiced bilabial approximant
    f f voiceless labiodental fricative
    v v voiced labiodental fricative
    T θ (Greek theta) voiceless dental fricative
      ϑ (Greek theta alt) voiceless dental approximant
    D ð (Icelandic eth) voiced dental fricative
      δ (Greek delta) voiced dental approximant
    s s voiceless alveolar fricative
    z z voiced alveolar fricative
    S ʃ voiceless postalveolar fricative
    Z ʒ voiced postalveolar fricative
    C ç (cedilla) voiceless palatal fricative
    j\ (jj) ʝ (j with crossed tail) voiced palatal fricative
    x x voiceless velar fricative
    G γ (Greek gamma) voiced velar fricative
      ɰ voiced velar approximant
    X\ ħ (overstroked h) voiceless pharyngeal fricative
    ?\ ʕ (Inverted ?) voiced pharyngeal fricative
    h h voiceless glottal approximant
    h\ ɦ (h with upper tail to the right) voiced glottal approximant
    m m bilabial nasal
    F ɱ (m with downward right tail) labiodental nasal
    n n alveolar or dental nasal
    J ɲ (n with downward left tail) palatal nasal
    N ŋ (n with downward right tail) velar nasal
    l l alveolar lateral
    L ʎ turned down y, alt. λ (Greek lambda) palatal lateral
    5 ɫ (l with middle tilde) velarized dental lateral
    4 (r) ɾ (r without upper-left serif) alveolar flap
    r (rr) r alveolar trill
    r\ ɹ (r rotated 180°) retroflexed alveolar approximant
    R ʀ (small capital R) uvular trill
    P ʋ labiodental approximant
    w w velo-labial approximant
    H ɥ (turned down h) palato-labial approximant
    j j palatal approximant

    Vowels

    .             front   near-front    central   near-back   back
    close          i • y               1 • }                 M • u
    near-close              I • Y                    U
    close-mid      e • 2              @\ • 8                 7 • o
    mid                                  @            
    open-mid       E • 9               3 • 3\                V • O
    near-open        {                    6           
    open           a • &                                     A • Q
    

    More SAMPA/IPA documentation

    (Some symbols are doubled or escaped with \ in the source to escape Markdown (mis)interpretation, they will appear correct in the rendered HTML.)

    Description SAMPA IPA Unicode
    retroflex plosive, voiceless t` 1 ʈ 0288, 648
    retroflex plosive, voiced d` 1 ɖ 0256, 598
    labiodental nasal F ɱ 0271, 625
    retroflex nasal n` 1 ɳ 0273, 627
    palatal nasal J ɲ 0272, 626
    velar nasal N ŋ 014B, 331
    uvular nasal N\ ɴ 0274, 628
    bilabial trill B\ ʙ 0299, 665
    uvular trill R\ ʀ 0280, 640
    alveolar tap 4 ɾ 027E, 638
    retroflex flap r` 1 ɽ 027D, 637
    bilabial fricative, voiceless p\ ɸ 0278, 632
    bilabial fricative, voiced B β 03B2, 946
    dental fricative, voiceless T θ 03B8, 952
    dental fricative, voiced D ð 00F0, 240
    postalveolar fricative, voiceless S ʃ 0283, 643
    postalveolar fricative, voiced Z ʒ 0292, 658
    retroflex fricative, voiceless s` 1 ʂ 0282, 642
    retroflex fricative, voiced z` 1 ʐ 0290, 656
    palatal fricative, voiceless C ç 00E7, 231
    palatal fricative, voiced j\ ʝ 029D, 669
    velar fricative, voiced G ɣ 0263, 611
    uvular fricative, voiceless X χ 03C7, 967
    uvular fricative, voiced R ʁ 0281, 641
    pharyngeal fricative, voiceless X\ ħ 0127, 295
    pharyngeal fricative, voiced ?\ ʕ 0295, 661
    glottal fricative, voiced h\ ɦ 0266, 614
           
    alveolar lateral fricative, vl. K    
    alveolar lateral fricative, vd. K\    
           
    labiodental approximant P (or v\ )    
    alveolar approximant r\    
    retroflex approximant r\` 1    
    velar approximant M\    
           
    retroflex lateral approximant l` 1    
    palatal lateral approximant L    
    velar lateral approximant L\    
           
    Clicks      
    bilabial O\   (O = capital letter)
    dental |\    
    (post)alveolar !\    
    palatoalveolar =\    
    alveolar lateral ||\    
           
    Ejectives, implosives      
    ejective _>   e.g. ejective p = p_>
    implosive _<   e.g. implosive b = b_<
           
    Vowels      
    close back unrounded M    
    close central unrounded 1    
    close central rounded }    
    lax i I    
    lax y Y    
    lax u U    
           
    close-mid front rounded 2    
    close-mid central unrounded @\    
    close-mid central rounded 8    
    close-mid back unrounded 7    
           
    schwa @ ə    
           
    open-mid front unrounded E    
    open-mid front rounded 9    
    open-mid central unrounded 3    
    open-mid central rounded 3\    
    open-mid back unrounded V    
    open-mid back rounded O    
           
    ash (ae digraph) {    
    open schwa (turned a) 6    
           
    open front rounded &    
    open back unrounded A    
    open back rounded Q    
           
    Other symbols      
    voiceless labial-velar fricative W    
    voiced labial-palatal approx. H    
    voiceless epiglottal fricative H\    
    voiced epiglottal fricative <\    
    epiglottal plosive >\    
           
    alveolo-palatal fricative, vl. s\    
    alveolo-palatal fricative, voiced z\    
    alveolar lateral flap l\    
    simultaneous S and x x\    
    tie bar _    
           
    Suprasegmentals      
    primary stress "    
    secondary stress %    
    long :    
    half-long :\    
    extra-short _X    
    linking mark -\    
           
    Tones and word accents      
    level extra high _T    
    level high _H    
    level mid _M    
    level low _L    
    level extra low _B    
    downstep !    
    upstep ^   (caret, circumflex)
           
    contour, rising _R    
    contour, falling _F    
    contour, high rising _H_T    
    contour, low rising _B_L    
           
    contour, rising-falling _R_F   (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
           
    global rise <R>    
    global fall <F>    
           
    Diacritics      
           
    voiceless _0   (0 = figure), e.g. n_0
    voiced _v    
    aspirated _h    
    more rounded _O   (O = letter)
    less rounded _c    
    advanced _+    
    retracted _-    
    centralized _"    
    syllabic = (or _=)   e.g. n= (or n_=)
    non-syllabic _^    
    rhoticity `    
           
    breathy voiced _t    
    creaky voiced _k    
    linguolabial _N    
    labialized _w    
    palatalized ' (or _j)   e.g. t' (or t_j)
    velarized _G    
    pharyngealized _?\    
           
    dental _d    
    apical _a    
    laminal _m    
    nasalized ~ (or _~)   e.g. A~ (or A_~)
    nasal release _n    
    lateral release _l    
    no audible release _}    
           
    velarized or pharyngealized _e    
    velarized l, alternatively 5    
    raised _r    
    lowered _o    
    advanced tongue root _A    
    retracted tongue root _q    
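
    The Unicode column in the table above gives each code point twice, first in hexadecimal and then in decimal. The short Python check below is only a sketch for verifying that the two values agree and that they match the IPA character in the same row. The rows are copied from the table; the names `ROWS` and `check_row` are made up for illustration.

```python
# (SAMPA, IPA, hex code point, decimal code point), copied from the table above.
ROWS = [
    ("t`", "ʈ", "0288", 648),
    ("N", "ŋ", "014B", 331),
    ("p\\", "ɸ", "0278", 632),
    ("B", "β", "03B2", 946),
    ("D", "ð", "00F0", 240),
    ("X\\", "ħ", "0127", 295),
]


def check_row(ipa: str, hex_cp: str, dec_cp: int) -> bool:
    """True if hex and decimal agree and both name the given IPA character."""
    return int(hex_cp, 16) == dec_cp and ord(ipa) == dec_cp


results = {sampa: check_row(ipa, hex_cp, dec_cp)
           for sampa, ipa, hex_cp, dec_cp in ROWS}
```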

    This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


    src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

    Faroese abbreviations

    We describe here how abbreviations in Faroese are read out, e.g. for text-to-speech systems.

    LEXICON Root

    For example:


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


    src-fst-transcriptions-transcriptor-clock-digit2text.lexc.md

    The Faroese clock

    Multichar_Symbols defines flags and +Use/NG and +Use/NA.

    LEXICON Root where it all begins

    LEXICON smallhour giving the 30-day

    LEXICON largehour giving the 30-day

    LEXICON BEFpunkt before punct

    LEXICON AFTpunkt after punct

    LEXICON BEF

    LEXICON AFT after

    LEXICON TOHALF before half

    LEXICON OVERHALF after half

    LEXICON TO í

    LEXICON OVER yvir

    LEXICON HOUR split in cases (not in use)

    LEXICON NOMHOUR hours 1-12 in nominative


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-clock-digit2text.lexc


    src-fst-transcriptions-transcriptor-date-digit2text.lexc.md

    Faroese dates

    Defining one tag: +Use/NG for do not generate

    LEXICON Root starts.

    LEXICON DAY splits days 1-9 in nominative and accusative

    LEXICON DAY10 splits days 10-31 in nominative and accusative

    LEXICON DAY_NOM the nominative ones (fyrsti…)

    LEXICON DAY_ACC the accusative ones (fyrsta…)

    LEXICON DAY10_NOM nominative tiggjundi…

    LEXICON DAY10_ACC accusative tiggjunda…

    LEXICON 29MONTH splits in 3 month types

    LEXICON 30MONTH giving the 30-day months

    LEXICON 31MONTH giving the 31-day months

    LEXICON PUNCT gives punctuation
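
    One plausible reading of the lexica above is that a date is first routed by day number (DAY for 1-9, DAY10 for 10-31) and then by month type, so that only valid day and month combinations are accepted. The Python sketch below illustrates that routing under this assumption. The grouping of months by length is ordinary calendar knowledge rather than something taken from the lexc source, and the function names are hypothetical.

```python
# Longest possible length of each month, keyed by month number.
# The three groups mirror the lexicon names 29MONTH, 30MONTH and 31MONTH.
MAX_DAYS = {2: 29}
MAX_DAYS.update({m: 30 for m in (4, 6, 9, 11)})
MAX_DAYS.update({m: 31 for m in (1, 3, 5, 7, 8, 10, 12)})


def day_lexicon(day: int) -> str:
    """Name of the day lexicon a day number would be routed to."""
    if not 1 <= day <= 31:
        raise ValueError(f"not a day of the month: {day}")
    return "DAY" if day <= 9 else "DAY10"


def month_type(month: int) -> str:
    """Name of the month lexicon (by month length) for a month number."""
    return f"{MAX_DAYS[month]}MONTH"


def is_valid_date(day: int, month: int) -> bool:
    """True if the day can occur in the month (leap years are not checked)."""
    return month in MAX_DAYS and 1 <= day <= MAX_DAYS[month]
```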


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-date-digit2text.lexc


    src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

    Faroese numbers

    Digits are translated to text and vice versa.

    It starts with lexicon Root, which splits into thousands, hundreds, tens, ones.
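
    The split into thousands, hundreds, tens and ones can be pictured with a small Python sketch. It only shows the arithmetic of the decomposition, not the lexc implementation or the Faroese number words. The upper limit of 999 999 is an assumption based on the lexicon names, and `split_number` is a hypothetical helper.

```python
def split_number(n: int) -> dict:
    """Decompose n into the parts the lexica are named after."""
    if not 0 <= n <= 999_999:  # assumed range, not confirmed by the source
        raise ValueError(f"number out of assumed range: {n}")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    return {
        "thousands": thousands,
        "hundreds": hundreds,
        "tens": tens,
        "ones": ones,
    }
```

    For example, `split_number(12345)` yields 12 thousands, 3 hundreds, 4 tens and 5 ones. The thousands part (here 12) would itself be spelled out by the lexica whose names end in T.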

    LEXICON THOUSANDS

    LEXICON 2to9T for two to nine thousand, pointing to THOUSAND.

    LEXICON 10to99T for 10t and up

    LEXICON TEENT for 10-19 thousands

    LEXICON TENST

    LEXICON TENCOUNTT

    LEXICON OLDTENST

    LEXICON OLDTEN-1T

    LEXICON OLDTEN-2T

    LEXICON OLDTEN-3T

    LEXICON OLDTEN-4T

    LEXICON OLDTEN-5T

    LEXICON OLDTEN-6T

    LEXICON OLDTEN-7T

    LEXICON OLDTEN-8T

    LEXICON OLDTEN-9T

    LEXICON END1T

    LEXICON END2T

    LEXICON END3T

    LEXICON END4T

    LEXICON END5T

    LEXICON END6T

    LEXICON END7T

    LEXICON END8T

    LEXICON END9T

    LEXICON HUNDREDST

    LEXICON HUNDREDT

    LEXICON 1to99T

    LEXICON THOUSAND

    LEXICON HUNDREDS

    LEXICON HUNDRED

    LEXICON 1to99

    LEXICON 1to9

    LEXICON 10to99

    LEXICON TEEN

    LEXICON TENS

    LEXICON TENCOUNT

    LEXICON ZERO

    LEXICON OLDTENS

    LEXICON OLDTEN-1

    LEXICON OLDTEN-2

    LEXICON OLDTEN-3

    LEXICON OLDTEN-4

    LEXICON OLDTEN-5

    LEXICON OLDTEN-6

    LEXICON OLDTEN-7

    LEXICON OLDTEN-8

    LEXICON OLDTEN-9

    LEXICON END1

    LEXICON END2

    LEXICON END3

    LEXICON END4

    LEXICON END5

    LEXICON END6

    LEXICON END7

    LEXICON END8

    LEXICON END9


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


    tools-grammarcheckers-grammarchecker.cg3.md

    Faroese grammarchecker

    This is work in progress. The main focus is on ð errors.

    This file contains two parts: Definitions and rules

    Definition section

    Delimiters

    Grammatical tags

    Here we declare all grammatical tags

    Declaring all the error tags

    Rule section

    Verbs

    Sg1 target forms

    RULE: Sup should be 1Sg

    RULE: Sup should be 1Sg

    RULE: sup > inf

    RULE: Neu should be 1Sg

    RULE: Imp Pl should be 1Sg

    Plural forms

    RULE: Sup should be Pl – marginal??

    RULE: Sup should be Pl – marginal??

    Supine forms

    RULEs for Pl should be Sup are not written

    RULE: Inf should be Sup

    RULE: Inf should be Sup

    RULE: Inf should be Sup

    Specific verbs

    RULE: Past tense of láta is læt, not lat

    Nouns

    Definiteness

    RULE: Neu Indef should be Neu Def

    We turn off this rule for now, it is too hard to avoid false alarms.

    Quantor phrases

    RULE: Num + N Sg should be Num + N Pl

    Num + N Sg should be Num + N Pl (We need arabic tag here)
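
    The idea behind the quantor rule can be illustrated outside Constraint Grammar with a toy Python check over tagged tokens. This is a sketch only. It assumes that the numeral is written in digits and that every numeral other than 1 requires a plural noun, which may not hold for all Faroese numerals. The tags Num, N, Sg and Pl come from the rule titles above; the token format and the function name are hypothetical.

```python
def flag_num_sg(tokens):
    """Return indices of singular nouns that directly follow a numeral != 1.

    tokens is a list of (wordform, set_of_tags) pairs.
    """
    errors = []
    for i in range(1, len(tokens)):
        prev_form, prev_tags = tokens[i - 1]
        _, tags = tokens[i]
        # Only digit-written numerals are handled in this sketch.
        wants_plural = (
            "Num" in prev_tags
            and prev_form.isdigit()
            and int(prev_form) != 1
        )
        if wants_plural and {"N", "Sg"} <= tags:
            errors.append(i)
    return errors
```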

    Subjunctives

    Nothing here.

    ta / tað rules

    RULE: ta should be tað

    Adjectives

    RULE: líti should be lítið


    This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


    tools-grammarcheckers-grc-disambiguator.cg3.md

    Faroese disambiguator

    Usage, in lang-fao: cat text.txt | hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g src/cg3/disambiguator.cg3

    This file documents the Faroese disambiguator file.

    Delimiters, tags and sets

    Test: Go for minimal weight. This rule gives priority to lexicalised forms.
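
    Choosing the reading with minimal weight can be sketched in Python as follows. This is an illustration of the principle, not the CG rule itself. It assumes each reading is a pair of an analysis string and a numeric weight, with lexicalised forms carrying lower weights than e.g. compound analyses. The function name and the data format are hypothetical.

```python
def select_min_weight(readings):
    """Keep only the readings that share the lowest weight.

    readings is a list of (analysis, weight) pairs.
    """
    if not readings:
        return []
    best = min(weight for _, weight in readings)
    return [(analysis, weight) for analysis, weight in readings
            if weight == best]
```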

    MAPPING OF CC AND CS

    Mostly we map both @CNP and @CVP, then we select @CNP; after that we remove the remaining @CNP tags so that @CVP remains.
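
    The map, select, remove sequence described above can be mimicked in a few lines of Python. This is a sketch of the control flow only. The boolean `context_is_nominal` stands in for whatever context conditions the real SELECT rules use and is purely hypothetical.

```python
def resolve_conjunction(context_is_nominal: bool) -> list:
    """Return the syntactic tags left on a conjunction after three steps."""
    # Step 1 (map): both tags are added as alternative readings.
    readings = ["@CNP", "@CVP"]
    # Step 2 (select): keep only @CNP where the context supports it.
    if context_is_nominal:
        readings = [r for r in readings if r == "@CNP"]
    # Step 3 (remove): drop @CNP elsewhere. As in CG, the last remaining
    # reading is never removed, so a selected @CNP survives this step.
    if len(readings) > 1:
        readings = [r for r in readings if r != "@CNP"]
    return readings
```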


    This (part of) documentation was generated from tools/grammarcheckers/grc-disambiguator.cg3


    tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

    Tokeniser for fao

    Usage:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are

    1. unknown word-like forms, and
    2. unmatched strings

    We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    Unknowns are made of:
      • lower-case ASCII
      • upper-case ASCII
      • select extended latin symbols
      • Faroese-specific alphabet
      • ASCII digits
      • select symbols
      • Combining diacritics as individual symbols,
      • various symbols from Private area (probably Microsoft), so far:
      • U+F0B7 for “x in box”
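
    As a rough illustration of what counts as an unknown word-like form, the character classes listed above can be approximated with a Python regular expression. This is a sketch and not the pmscript definition. The "select" extended Latin characters and symbols are left out because the list above does not name them, the Faroese letters are taken from the standard alphabet, and the combining diacritics are assumed to be the block U+0300 to U+036F.

```python
import re

# Letters of the Faroese alphabet that fall outside ASCII.
FAROESE_LETTERS = "áðíóúýæøÁÐÍÓÚÝÆØ"

UNKNOWN_CHARS = (
    "a-zA-Z"             # lower-case and upper-case ASCII
    "0-9"                # ASCII digits
    + FAROESE_LETTERS
    + "\u0300-\u036f"    # combining diacritics (assumed block)
    + "\uf0b7"           # private-use symbol mentioned in the list above
)

UNKNOWN_WORD = re.compile(f"[{UNKNOWN_CHARS}]+")


def is_unknown_wordlike(text: str) -> bool:
    """True if text consists only of characters allowed in unknown forms."""
    return UNKNOWN_WORD.fullmatch(text) is not None
```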

    Unknown handling

    Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
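
    The treatment of empty analyses can be summarised with a small Python sketch. It models the behaviour described above and is not the hfst-tokenise code. Analyses are assumed to be plain strings, with the empty string standing for an empty analysis, and the returned "??" marker is a simplification of the real output format.

```python
def clean_readings(analyses):
    """Drop empty analyses, or mark the token as unknown if nothing is left.

    analyses is a list of analysis strings for one token.
    """
    non_empty = [a for a in analyses if a]
    if not non_empty:
        # Only empty analyses were found: the token is an unknown.
        return ["??"]
    # Empty analyses are removed when real readings exist alongside them.
    return non_empty
```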

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


    tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

    Grammar checker tokenisation for fao

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


    tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

    TTS tokenisation for fao

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    make
    echo "ja, ja" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
    boasttu olmmoš, man mielde lahtuid." \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "márffibiillagáffe" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Needs hfst-tokenise to output things differently depending on the tag they get.


    This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript

    1. ` = ASCII 096