Tornedalen Finnish NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-fit

Page Content

  • src-fst-morphology-affixes-acronyms.lexc.md
  • Documenting Meänkieli acronym morphology
  • src-fst-morphology-affixes-adjectives.lexc.md
  • Documenting the file for Meänkieli adjective morphology
  • src-fst-morphology-affixes-nouns.lexc.md
  • Meänkieli noun morphology
  • The lexica themselves
  • src-fst-morphology-affixes-numerals.lexc.md
  • Meänkieli numerals
  • Numeral inflection
  • src-fst-morphology-affixes-pronouns.lexc.md
  • Pronoun morphology
  • src-fst-morphology-affixes-propernouns.lexc.md
  • Meänkieli propernoun morphology
  • src-fst-morphology-affixes-symbols.lexc.md
  • Symbol affixes
  • src-fst-morphology-affixes-verbs.lexc.md
  • Meänkieli verbs
  • Irregular verbs
  • Regular verbs
  • src-fst-morphology-phonology.twolc.md
  • Meänkieli twolc file
  • Declaring the alphabet, sets and definitions
  • Rules
  • src-fst-morphology-root.lexc.md
  • Meänkieli morphological transducer
  • src-fst-morphology-stems-adjectives.lexc.md
  • Meänkieli adjectives
  • src-fst-morphology-stems-adverbs.lexc.md
  • Meänkieli adverbs
  • src-fst-morphology-stems-conjunctions.lexc.md
  • Meänkieli conjunctions
  • src-fst-morphology-stems-fit-abbreviations.lexc.md
  • File containing Meänkieli abbreviations
  • src-fst-morphology-stems-fit-acronyms.lexc.md
  • Meänkieli acronyms
  • src-fst-morphology-stems-fit-propernouns.lexc.md
  • Meänkieli propernouns
  • src-fst-morphology-stems-interjections.lexc.md
  • Meänkieli interjections
  • src-fst-morphology-stems-nouns.lexc.md
  • Noun stems for Meänkieli
  • Vowel stems
  • The lexica themselves
  • src-fst-morphology-stems-numerals.lexc.md
  • Meänkieli numerals
  • src-fst-morphology-stems-postpositions.lexc.md
  • Meänkieli postpositions
  • src-fst-morphology-stems-prepositions.lexc.md
  • Meänkieli prepositions
  • src-fst-morphology-stems-pronouns.lexc.md
  • Meänkieli pronouns
  • src-fst-morphology-stems-subjunctions.lexc.md
  • Meänkieli subjunctions
  • src-fst-morphology-stems-verbs.lexc.md
  • Documenting the file for Meänkieli verbs
  • src-fst-phonetics-txt2ipa.xfscript.md
  • src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md
  • src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md
  • Number transcriptions
  • tools-grammarcheckers-grammarchecker.cg3.md
  • DELIMITERS
  • TAGS AND SETS
  • Meänkieli (Tornedalen Finnish) language model documentation

    All doc-comment documentation in one large file.


    src-cg3-dependency.cg3.md

    COMMON SÁMI DEPENDENCY GRAMMAR

    This dep file is for sma, sme, smj, sje.

    DELIMITERS

    Sentence delimiters are the following: <.> <!> <?> <…> <¶>

    TAGS AND SETS

    N V A Adv CC CS Inf Sup Neg Num Po Pr

    Pcle Prop

    Pron IV TV COMMA DASH CITATION to keep colouring we add a “ HYPHEN QMARK PUNCT LEFT RIGHT CLB Ind Pot Impr ImprtII Cond ConNeg Caus causative eus VGen Interj ABBR ACR Prs Prt Cmpnd RCmpnd PrfPrc PrsPrc Actor Actio Ger Indef Nom Acc Ill Com Gen Ess

    IM For fao

    POS sub-categories

    Syntactic tags and sets

    Syntactic tags in input to this file

    Syntactic tags added in this file

    fao syntags

    kal syntags

    eus syntags

    Syntactic set definitions

    Dep grammar

    Correction rules

    The finite verb

    Mapping rules

    lgRemove removes the language tags before proceeding to the dep file.


    This (part of) documentation was generated from src/cg3/dependency.cg3


    src-cg3-disambiguator.cg3.md

    Disambiguator for Meänkieli

    Usage:

    cat text.txt | hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g src/cg3/disambiguator.cg3
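
    The same pipeline can be driven from a script. The following is a minimal sketch, assuming both tools are on the PATH and the script is run from the repository root; the function names build_commands and disambiguate are made up for this illustration, and only the command-line flags shown in the usage line above are used.

```python
import subprocess

# Paths as given in the usage line above (relative to the repository root).
TOKENISER = "tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst"
GRAMMAR = "src/cg3/disambiguator.cg3"


def build_commands(tokeniser=TOKENISER, grammar=GRAMMAR):
    """Return the two argv lists of the pipeline, tokeniser first."""
    return [
        ["hfst-tokenize", "-cg", tokeniser],
        ["vislcg3", "-g", grammar],
    ]


def disambiguate(text, tokeniser=TOKENISER, grammar=GRAMMAR):
    """Pipe text through hfst-tokenize and vislcg3 and return the CG output."""
    tokenise, cg = build_commands(tokeniser, grammar)
    tokens = subprocess.run(
        tokenise, input=text, capture_output=True, text=True, check=True
    ).stdout
    return subprocess.run(
        cg, input=tokens, capture_output=True, text=True, check=True
    ).stdout
```

    Calling disambiguate() on the contents of text.txt should then correspond to the shell pipeline above, provided the tokeniser has been compiled.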

    This file documents the Meänkieli disambiguator file.

    Delimiters, tags and sets

    Sentence delimiters are the following: “<.>” “<…>” “<!>” “<?>” “<¶>”
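
    The effect of the delimiters can be pictured as cutting the token stream into windows, one per sentence. The sketch below is only an illustration of that idea in Python and is not how vislcg3 is implemented; the function name split_windows and the sample word forms are assumptions.

```python
# The five delimiter word forms listed above.
DELIMITERS = {"<.>", "<…>", "<!>", "<?>", "<¶>"}


def split_windows(wordforms):
    """Group word forms into windows, closing a window at each delimiter."""
    windows, current = [], []
    for wordform in wordforms:
        current.append(wordform)
        if wordform in DELIMITERS:
            windows.append(current)
            current = []
    if current:  # trailing material without a final delimiter
        windows.append(current)
    return windows


stream = ["<sie>", "<tulet>", "<.>", "<jos>", "<?>", "<ko>"]
print(split_windows(stream))
```

    Rules in the grammar only see one such window at a time, which is why the delimiter set matters for every rule that scans left or right.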

    Part-of-Speech

    Numerus

    Person

    Cases

    Types

    Sets with more members

    Boundaries

    Verbs

    Disambiguation rules

    Dialects

    Early rules

    Possessive suffixes

    First we put rules to choose Px forms… (forthcoming)

    Then we remove the remaining Px

    Numeral phrases

    Preposition/postposition/adverb rules

    Rules for mapping @CVP and @CNP on the CC and CS

    Case rules

    Partitive

    Genitive

    Illative

    Number rules

    More disambiguation rules

    Elative

    Propernouns

    Verbs

    Specific verbs

    ei negation verb

    eli

    Adverbs

    paljon

    kerran

    jälkhiin

    Adjectives

    toinen

    Conjunctions

    Subjunctions

    että

    jos

    ko

    mutta

    sillä

    Pronouns

    sie

    tet

    Verb rules, Verbs

    Infinitive

    Present Sg3

    Present Pl3 or PrsPrc

    Present Pl3 or Passive

    Imperative

    Past tense

    Prt Pl3 or Prt Sg2

    Relative pronouns

    HNOUN MAPPING


    This (part of) documentation was generated from src/cg3/disambiguator.cg3


    src-cg3-functions.cg3.md

    SYNTACTIC FUNCTIONS FOR SÁMI

    Sámi language technology project 2003-2018, University of Tromsø

    This file adds syntactic functions. It is common for all the Saami

    LEFT RIGHT because of apertium

    Syntactic tags

    Tag sets

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
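
    The barrier scan can be emulated in a few lines. This is a simplified sketch, assuming one tag set per word and assuming that the target is tested before the barrier at each position; the premodifier set NPMOD used here is hypothetical and much smaller than the real set definitions.

```python
# Hypothetical premodifier tags; the real grammar defines these sets in detail.
NPMOD = {"A", "Num", "Pron"}


def not_npmod(tags):
    """True for a word that cannot be a premodifier inside a noun phrase."""
    return not (tags & NPMOD)


def scan_right(cohorts, start, target, barrier):
    """Emulate (*1 target BARRIER barrier) seen from position `start`.

    Returns the index of the first word to the right carrying `target`,
    or None if a barrier word is met first or the window ends.
    """
    for i in range(start + 1, len(cohorts)):
        if target in cohorts[i]:
            return i
        if barrier(cohorts[i]):
            return None
    return None


# A premodifier between the start and the noun does not stop the scan ...
print(scan_right([{"V"}, {"A"}, {"N"}], 0, "N", not_npmod))
# ... but a verb does, because it cannot belong to the noun phrase.
print(scan_right([{"Pron"}, {"V"}, {"N"}], 0, "N", not_npmod))
```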

    ADLVCASE

    These were the set types.

    Numeral outside the sentence

    HABITIVE MAPPING

    sma object

    SUBJ MAPPING - leftovers

    OBJ MAPPING - leftovers

    MAPPING for MT - experimental

    HNOUN MAPPING

    missingX adds @X to all missings

    therestX adds @X to everything that is left, often erroneously disambiguated forms

    For Apertium:

    The analyser gives two analyses because the semantic tags are optional. We go for the one with the semtag.
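
    The selection described here amounts to a simple filter over the readings of a word. A minimal sketch, assuming that readings are plus-separated tag strings and that semantic tags start with Sem/; the sample readings and the tag Sem/Lang are made up for the illustration.

```python
def prefer_semtag(readings):
    """Keep only readings with a semantic tag, if there are any."""
    with_sem = [
        reading
        for reading in readings
        if any(tag.startswith("Sem/") for tag in reading.split("+"))
    ]
    # Fall back to all readings when none of them carries a semtag.
    return with_sem or readings


print(prefer_semtag(["kieli+N+Sg+Nom", "kieli+N+Sem/Lang+Sg+Nom"]))
```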


    This (part of) documentation was generated from src/cg3/functions.cg3


    src-fst-morphology-affixes-abbreviations.lexc.md

    Documenting the morphological tags for Meänkieli abbreviations

    This file documents affixes/abbreviations.lexc, the file for Meänkieli abbreviation morphology

    Now splitting according to POS, and according to dot or not

    LEXICON ab-noun-itrab LEXICON ab-noun-trab LEXICON ab-noun-trnumab

    Lexicons without final period

    Lexicons with final period


    This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


    src-fst-morphology-affixes-acronyms.lexc.md

    Documenting Meänkieli acronym morphology

    This file documents affixes/acronyms.lexc, the file for Meänkieli acronym morphology

    LEXICON Acronym-fit-suf for adding +ACR tag

    LEXICON ACRONOUN_cons

    LEXICON ACRONOUN_vow

    LEXICON ACRO_BERN

    LEXICON ACRO_LONDON

    LEXICON ACRO_NYSTØ

    LEXICON ACRO_cons

    LEXICON ACRO_vow


    This (part of) documentation was generated from src/fst/morphology/affixes/acronyms.lexc


    src-fst-morphology-affixes-adjectives.lexc.md

    Documenting the file for Meänkieli adjective morphology

    This file documents the file affixes/adjectives.lexc for Meänkieli adjective morphology.

    Most lexica here (a1, a_e, …) add +A, and thereafter redirect to the corresponding x1, x_e, … lexicon in affixes/nouns.lexc for case inflection. The lexicon numbers correspond to the ones for nouns.

    In addition, each lexicon also points to comparative and superlative sublexica.

    Unassigned

    LEXICON ax pointing to a1. It is for adjectives that have still not been classified.

    Regular lexica

    LEXICON a1 adding +A and sending to x1, and to 3comp, 3sup.

    LEXICON a1_e vanha, which has Err/Orth vanhee-, otherwise like a1

    LEXICON a_vasen adding +A and sending to x1, and to 3comp, 3sup.

    LEXICON a_e gets +A and goes to x_e.

    LEXICON a3 kamala gets +A and points to x3

    LEXICON a4 has no comparative or superlative, just points to x4

    LEXICON anen has no comparative or superlative, just points to xnen

    LEXICON aas has no comparative or superlative, just points to xnas

    LEXICON a_suuri has no comparative or superlative, just points to x4

    LEXICON a1_ton

    LEXICON x1_ton
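
    The redirections listed above form a small graph of continuation lexica. The sketch below records only the targets that are named explicitly in this list (lexica whose targets are not spelled out are left out) and walks the graph; it is an illustration in Python, not an excerpt from the lexc source, and the function name reachable is an assumption.

```python
# Continuations as stated in the list above; nothing else is assumed.
CONTINUATIONS = {
    "ax": ["a1"],
    "a1": ["x1", "3comp", "3sup"],
    "a_vasen": ["x1", "3comp", "3sup"],
    "a_e": ["x_e"],
    "a3": ["x3"],
    "a4": ["x4"],
    "anen": ["xnen"],
    "aas": ["xnas"],
    "a_suuri": ["x4"],
}


def reachable(lexicon):
    """Return every lexicon reachable from `lexicon`, in discovery order."""
    seen, stack = [], [lexicon]
    while stack:
        name = stack.pop()
        for target in CONTINUATIONS.get(name, []):
            if target not in seen:
                seen.append(target)
                stack.append(target)
    return seen


print(reachable("ax"))  # unclassified adjectives end up in the a1 paths
print(reachable("a4"))  # no comparative or superlative sublexicon
```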

    Comparative inflection

    LEXICON 3comp 2syll adj, 3syll comparative

    LEXICON 4comp 3syll adj, 4syll comparative

    LEXICON xcomp common for 2syll and 3syll

    Superlative inflection

    LEXICON 3sup 2syll adj, 3syll superlative

    LEXICON 4sup 3syll adj, 4syll superlative

    LEXICON xsup common for 2syll and 3syll


    This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


    src-fst-morphology-affixes-nouns.lexc.md

    Meänkieli noun morphology

    This file documents affixes/nouns.lexc, the file for Meänkieli noun morphology

    This is an overview of the continuation lexicon types.

    Special stems

    Vowel stems

    Stems for -i-words, vowel AND consonant

    Special cases for -i-words

    Consonant stems of other types

    The lexica themselves

    Lexica for unassigned words

    LEXICON nx pointing to n1.

    LEXICON n_nomorph for uninflected nouns

    LEXICON nc for consonant-final nouns, structure CVC

    LEXICON xc_sg

    LEXICON xc_pl

    Lexica for regular nouns

    LEXICON n0 for 1-syllabic: maa, suu, tie, …

    LEXICON n0_pl for plurals of the same: häät

    LEXICON x0 splitting to sg and pl

    LEXICON x0_sg sg forms x0 point here

    LEXICON x0_sg_oblique for oblique case forms in sg

    LEXICON x0_pl for plural case forms

    LEXICON n1 for 2-syll ordinary nouns (talo)

    LEXICON n1_pl for the same plural words (urut)

    LEXICON x1 for the bisyllabic, pointing to sg, pl

    LEXICON x1_sg bisyllabic sg

    LEXICON x1_sg_oblique gives the rest

    LEXICON x1_pl the pl forms

    LEXICON n_e vene, liike, säe

    LEXICON n_e_pl vehkheet

    LEXICON x_e splits in sg and pl

    LEXICON x_e_sg the sg

    LEXICON x_e_pl the pl

    LEXICON x_e_pl urvakke etc, n_e words with -lle/-lla

    LEXICON x_e_pl splits in sg and pl

    LEXICON x_e_pl the sg

    LEXICON x_e_pl the pl

    LEXICON n3 odd-syllabic: kanava

    LEXICON n3_pl haalarit

    LEXICON x3

    LEXICON x3_oblique

    LEXICON x3_sg

    LEXICON x3_oblique_sg

    LEXICON x3_pl

    LEXICON x3_pl

    LEXICON 3nc

    LEXICON xnc

    The i>e-family; kivi, kieli, käsi, lumi etc

    LEXICON n4 kivi, stem kive

    LEXICON x4 veri

    LEXICON n4_pl

    LEXICON x4_sg shared lexica for n4, n5, n5_lumi/loimi/lapsi EXCEPT SgNom, SgPar

    LEXICON x4_pl

    LEXICON n5 kieli, stem kiele

    LEXICON n5 kieli, stem kiele

    LEXICON n5_kieli kieli, stem kiele

    LEXICON n5_lumi lumi, stem lu

    LEXICON n5_loimi loimi, stem loi, like n5_lumi PLUS partitive loimea

    LEXICON n5_vuosi vuosi> vuoessa/vuessa, stem OR vu

    LEXICON n5_kasi käsi, stem kä

    LEXICON n5_kasi_pl continuation for kasi_pl

    LEXICON x5_kasi veri

    LEXICON x5_kasi_pl

    LEXICON n5_lapsi

    LEXICON n5_ie_odd

    LEXICON n5_ie_odd same as n5_ie except Pl+Part: takki>takkeja

    LEXICON n5_nuoret_pl same as n1_pl except Pl+Gen: nuoret>nuorten

    LEXICON n5_i_pl cont lexica for type n1-words ending with -i

    LEXICON x5_i_pl cont lexica for type n1-words ending with -i

    The nainen (nen) and hevonen (3nen) family

    LEXICON nen bisyllabic nainen stem nai

    LEXICON nen_sg

    LEXICON nen_pl

    LEXICON xnen

    LEXICON xnen_sg +Sg:se 2cases ; for Ade, All, Ess lla, lle, nna

    LEXICON xnen_pl

    LEXICON 3nen odd-syllabic hevonen stem hevose

    LEXICON x3nen

    LEXICON x3nen_sg

    LEXICON x3nen_pl

    LEXICON xnen_common_sg

    LEXICON xnen_common_pl

    LEXICON 3cases

    LEXICON 2cases

    LEXICON 3n_ks

    LEXICON 3n_ks_pl

    LEXICON xn_ks

    LEXICON xn_ks_sg

    LEXICON xn_ks_pl

    LEXICON n_äes

    LEXICON x_äes

    LEXICON 3n_ue

    LEXICON 3x_ue

    LEXICON 3x_ue_sg

    LEXICON 3x_ue_pl

    LEXICON 3n_ime

    LEXICON 3n_ime_sg

    LEXICON 3n_ime_pl

    LEXICON x_ime_sg

    LEXICON x_ime_pl

    LEXICON nas

    LEXICON xnas

    LEXICON xnas_sg

    LEXICON xnas_pl

    LEXICON xnas_pl

    LEXICON xnas_pl

    LEXICON nas_h_pl

    LEXICON 3mies

    LEXICON n_ien

    LEXICON n_ien_sg

    LEXICON n_uus

    LEXICON n_uus_odd

    2-syllabic LNR final stems

    LEXICON 3n_lnr ahven - ahvenheen

    LEXICON 3n_kymmen 3n_kymmen

    LEXICON 30n_lnr askel - askelheesheen

    LEXICON n_kasuven

    LEXICON 3xn_lnr tyär, short and long Ill

    LEXICON 3n_lnr_inteill not Ill, Ine, Ess but all the others

    LEXICON 4n_ks

    LEXICON x4n_ks

    LEXICON x4n_ks_sg

    LEXICON x4n_ks_pl

    Sublexica for cases

    LEXICON TRA

    Sublexica for possessive suffixes

    Px is now not in use, with one exception, comitative.

    LEXICON n_PxK has either -n or goes to Px

    LEXICON a_PxK has either -s or goes to Px with -a

    LEXICON s_PxK has either -s or goes to Px

    LEXICON sh_PxK has either -s or goes to Px with -he-

    LEXICON st_PxK has either -s or goes to Px with -te- rakuaus, rakhauteni

    LEXICON t_PxK has either -t or goes to Px

    LEXICON i_PxK Tra: -i or -e and goes to Px

    LEXICON PxK has only -nsA, compare PxxK

    LEXICON PxxK has also -Vn, thus both ..llensa and ..lleen.

    LEXICON Px

    LEXICON Px-Vn

    LEXICON n5_troppi troppi tropin troppia?

    LEXICON n5_troppi_odd


    This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


    src-fst-morphology-affixes-numerals.lexc.md

    Meänkieli numerals

    From fin via fkv.

    Numeral inflection

    Numeral inflection is like nominal inflection, except that numerals compound in all forms, which requires a great deal of care in the inflection patterns.


    This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


    src-fst-morphology-affixes-pronouns.lexc.md

    Meänkieli pronoun morphology

    This file documents affixes/pronouns.lexc, the file for Meänkieli pronoun morphology

    Pronoun morphology

    Pronouns are still only at an experimental stage.

    LEXICON 12pronsg is for 1st and 2nd person singular

    LEXICON 123pronpl

    nuoitä

    tuotä


    This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


    src-fst-morphology-affixes-propernouns.lexc.md

    Meänkieli propernoun morphology

    This file documents affixes/propernouns.lexc, the file for Meänkieli propernoun morphology. The file pointing here is stems/fit-propernouns.lexc

    The lexicon names look like this: p_mal_1 etc. They have 3 parts, divided by “_”

    We do not use _pl for names

    … and many more.

    Vowel stems, odd and even stems

    Consonant stems, odd and even stems


    This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


    src-fst-morphology-affixes-symbols.lexc.md

    Symbol affixes

    This file documents affixes/symbols.lexc, the file for the affixes added to language-independent symbols


    This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


    src-fst-morphology-affixes-verbs.lexc.md

    Meänkieli verbs

    This file documents affixes/verbs.lexc, the file for Meänkieli verb morphology

    Overview of the continuation classes

    Continuation lexica for regular verbs

    Continuation lexica for irregular verbs

    Irregular verbs

    Regular verbs

    Subparadigms

    Conditional forms

    LEXICON 2cond for -imm^A

    Infinitive paradigms

    from fkv

    LEXICON v12pers Only sg12, pl12 so far

    LEXICON PRFPRC_OBL is without nom sg from fkv


    This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


    src-fst-morphology-phonology.twolc.md

    Meänkieli twolc file

    This file documents the Meänkieli twolc file (the file governing gradation, gemination, vowel harmony and other morphophonological processes).

    The first part of the file contains definitions, the second part contains rules.

    Declaring the alphabet, sets and definitions

    Alphabet

    This defines all symbols (letters, archiphonemes, triggers) to be used.

    Sets

    Here we group the symbols in convenient sets.

    Definitions

    This defines strings used often in rules.

    WeakGrade = ([l|n|r]) (%^AE:) %^WG:

    Rules

    This chapter gives the rules themselves.

    Consonant rules

    For the gradation rules, each consonant deletion or change is given its own rule. Thus, both kk:k and k:0 are handled in the same k:0 rule. This is to avoid rule conflicts. The change rules (k:g, k:j etc.) are restricted by context (k:g only after n, etc.).
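
    The division of labour between a deletion rule and a context-restricted change rule can be sketched as follows. This is a loose Python illustration and not a rendering of the twolc rules: it covers only k:0 and k:g, ignores the weak-grade triggers and the other change rules, and the example stems kukka and kenkä are assumed for illustration rather than taken from the rule tests.

```python
def weak_grade_k(stem):
    """Apply a toy version of k gradation to the last k of a stem.

    k:g applies only after n; everything else falls under k:0, which
    also covers kk:k because only one of the two k's is deleted.
    """
    i = stem.rfind("k")
    if i == -1:
        return stem  # nothing to gradate
    if i > 0 and stem[i - 1] == "n":
        return stem[:i] + "g" + stem[i + 1:]  # k:g after n
    return stem[:i] + stem[i + 1:]  # k:0 (including kk:k)


print(weak_grade_k("kukka"))  # kk:k handled as deletion of one k
print(weak_grade_k("kenkä"))  # k:g after n
```

    Keeping the deletion in a single rule means that no two rules ever compete for the same k:0 pair, which is the conflict the text refers to.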

    f rules

    RULE: f:0

    j rules

    RULE: j:0

    k rules

    RULE: k:g

    Tests:

    RULE: k:0

    Tests:

    RULE: k:j

    RULE: k4:j

    Tests:

    RULE: k:v

    Tests:

    l rules

    RULE: k:v

    m rules

    RULE: m:0

    n rules

    RULE: n:0

    p rules

    RULE: p:0

    Tests:

    RULE: p:v

    Tests:

    RULE: p:m

    r rules

    RULE: p:m

    s rules

    RULE: r:0

    t rules

    RULE: t:j

    RULE: t4:0 where t4 is t in rt that shall not become rr

    Tests:

    RULE: t:0

    Tests:

    RULE: t:s

    Tests:

    RULE: t:l for lt:ll

    Tests:

    RULE: t:n for nt:nn

    Tests:

    RULE: t:r for rt:rr

    Tests:

    Tests:

    v rules

    RULE: v:0

    Gemination rules

    The gemination rules insert the geminated consonant (thus 0:h if h to the left). There is one subrule for each vowel context, in order to avoid conflicts.

    RULE: Gemination 0:h

    RULE: Gemination 0:j

    RULE: Gemination 0:k

    Tests:

    RULE: Gemination 0:l

    Tests:

    RULE: Gemination 0:m

    RULE: Gemination 0:n

    RULE: Gemination 0:p

    RULE: Gemination 0:s

    Tests:

    RULE: h:0

    RULE: h:0

    RULE: h:0

    kasva>hm^A^An kasva>mhaan

    saarna>^A>hm^A^An saarna>a>hmaan

    tule>hm^A^An tule>mhaan

    RULE: Gemination 0:t

    Tests:

    RULE: Gemination 0:v

    Tests:

    Assimilation rules

    These are assimilation rules for n on suffix borders of LNRS consonant stems. There is also a rule j:0 avoiding a lji sequence.

    RULE: Alveolar assimilation for consonant stem l

    Tests:

    RULE: Alveolar assimilation for consonant stem r

    RULE: Alveolar assimilation for consonant stem s in infinitives

    Tests:

    RULE: Alveolar assimilation for consonant stem s in participles

    Vowel change rules: a - ä - e - i - o - ö - u - y

    Here come the rules for stem vowel changes in front of suffix -i- (be it plural, present, comparative or conditional). Vowels are deleted or changed according to context. There are also some other vowel change rules.

    a rules

    RULE: a:e before the ^AE trigger

    RULE: a:0 before metathesis h

    Tests:

    RULE: a:o when nonrounded root vowel and before i

    Tests:

    ä rules

    RULE: ä:0

    Tests:

    RULE: ä:e

    e rules

    RULE: e:0 deletes -e- in LNR stems as well as before -i-

    Tests:

    RULE: e:i

    Tests:

    i rules

    RULE: i:0

    Tests:

    RULE: i:j

    RULE: i2:j

    RULE: i8:0

    Tests:

    RULE: i:e

    o rules

    RULE: o:0

    Tests:

    ö rules

    RULE: ö:0

    Tests:

    u rules

    RULE: u:0

    Tests:

    y rules

    RULE: y:0

    Tests:

    Vowel copying rules

    These are the rules connected to the Meänkieli -h- suffixes. The vowel must be copied from the stem to the right of the h and also deleted in the stem (cf. talo : talhoon)
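
    The copy-and-delete behaviour can be illustrated on the example talo : talhoon. The sketch below assumes a suffix of the shape h + vowel + n and a stem ending in a single vowel; it ignores gradation and every other rule in this file, and the function name h_suffix_form is made up for the illustration.

```python
VOWELS = "aeiouyäö"


def h_suffix_form(stem):
    """Attach an assumed -hVn suffix, moving the stem vowel behind the h.

    The stem-final vowel is deleted in the stem and realised twice to
    the right of the h: once as the copy, once as the suffix vowel.
    """
    vowel = stem[-1]
    if vowel not in VOWELS:
        raise ValueError("this sketch only handles vowel-final stems")
    return stem[:-1] + "h" + vowel + vowel + "n"


print(h_suffix_form("talo"))
```

    In the twolc file the same effect is split over one copying rule per vowel plus the corresponding vowel deletion rules, which is why the list of rules below has one entry for each vowel.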

    RULE: a copying for h metathesis

    Tests:

    RULE: o copying for h metathesis

    Tests:

    RULE: i copying for h metathesis

    Tests:

    RULE: ä copying for h metathesis

    RULE: e copying for h metathesis

    RULE: ö copying for h metathesis

    RULE: y copying for h metathesis

    RULE: u copying for h metathesis

    Vowel harmony rule

    All vowel harmony is taken care of with one rule.

    RULE: Back harmony
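
    As a rough picture of what a single harmony rule does, the following sketch realises the archiphoneme ^A as a when the stem contains a back vowel and as ä otherwise. It is a simplification in Python, assuming that ^A is the only archiphoneme involved and that the whole stem is inspected; the suffix string is the surface-like form from the examples earlier in this file.

```python
BACK_VOWELS = set("aou")


def harmonise(stem, suffix):
    """Realise ^A in the suffix according to the vowels of the stem."""
    is_back = any(char in BACK_VOWELS for char in stem)
    return stem + suffix.replace("^A", "a" if is_back else "ä")


print(harmonise("kasva", "mh^A^An"))
print(harmonise("tule", "mh^A^An"))
```

    Both calls give the forms kasvamhaan and tulemhaan, matching the surface sides kasva>mhaan and tule>mhaan shown earlier once the morpheme boundary is removed.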

    Tests:


    This (part of) documentation was generated from src/fst/morphology/phonology.twolc


    src-fst-morphology-root.lexc.md

    Meänkieli morphological transducer

    Beware of remnants from the Finnish and Kven files.

    Tags for POS

    Tags for grammar

    Pronoun types

    Other tags

    Number

    Case

    Possessive suffixes

    Comparatives

    Finite verbs

    Verb person tags

    Verb transitivity

    Infinite verbs

    Punctuation

    Language tags

    Speller tags

    Compounds

    Derivation

    These three tags are not added in lexc. The POS tag before derivation is converted into this tag when compiling FST for disambiguation.

    Tag

    Clitic tags

    Semantic tags

    Phonological symbols

    Miscellaneous tags

    Flag diacritics

    We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

    Flag Explanation
    @P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
    @C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
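
    How these flags interact along a path through the transducer can be sketched with a small interpreter for the P, D, U, C and R operations. This is an illustration of the general flag diacritic semantics, not of the HFST implementation, and where exactly the NeedNoun flags sit in the Meänkieli lexicon is not shown here; the paths in the example are assumptions.

```python
import re

FLAG = re.compile(r"@([PDUCR])\.([^.@]+)(?:\.([^.@]+))?@")


def path_allowed(symbols):
    """Return True if the flag diacritics along a path are consistent."""
    features = {}
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if not match:
            continue  # ordinary symbol, no flag
        op, feature, value = match.groups()
        current = features.get(feature)
        if op == "P":  # positive set
            features[feature] = value
        elif op == "C":  # clear
            features.pop(feature, None)
        elif op == "D":  # disallow: fail if set (to this value)
            if current is not None and (value is None or current == value):
                return False
        elif op == "R":  # require: fail unless set (to this value)
            if current is None or (value is not None and current != value):
                return False
        elif op == "U":  # unify: set if unset, otherwise must agree
            if current is None:
                features[feature] = value
            elif current != value:
                return False
    return True


# Setting NeedNoun and then meeting the D flag blocks the path ...
print(path_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
# ... unless the flag has been cleared in between.
print(path_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```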

    For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

    Flag Explanation
    @P.CmpFrst.FALSE@ Require that words tagged as such only appear first
    @D.CmpPref.TRUE@ Block such words from entering ENDLEX
    @P.CmpPref.FALSE@ Block these words from making further compounds
    @D.CmpLast.TRUE@ Block such words from entering R
    @D.CmpSuff.TRUE@ Block such words from entering R
    @P.CmpSuff.TRUE@ Mark that we have passed R
    @D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
    @U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
    @P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
    @D.CmpOnly.FALSE@ Disallow words coming directly from root.

    Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

    Flag Explanation
    @U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
    @U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

    These tags are for handling erroneous forms:

    Flag Explanation
    @D.ErrOrth.ON@ tbw
    @P.ErrOrth.ON@ tbw
    @C.ErrOrth@ tbw
    @R.ErrOrth.ON@ tbw

    This is for pronouns with multiple case suffixes (jommallekummalle)

    Flag Explanation
    @U.pron.nom@ tbw
    @U.pron.gen@ tbw
    @U.pron.gen2@ tbw
    @U.pron.ill@ tbw
    @U.pron.par@ tbw
    @U.pron.par2@ tbw
    @U.pron.par3@ tbw
    @U.pron.ess@ tbw
    @U.pron.tra@ tbw
    @U.pron.ine@ tbw
    @U.pron.ela@ tbw
    @U.pron.all@ tbw
    @U.pron.ade@ tbw
    @U.pron.abl@ tbw
    @P.compound.block@ tbw
    @D.compound.block@ tbw

    Flag diacritic Explanation
    @U.number.one@ Flag used to give Arabic numerals in smj different cases
    @U.number.two@ Flag used to give Arabic numerals in smj different cases
    @U.number.three@ Flag used to give Arabic numerals in smj different cases
    @U.number.four@ Flag used to give Arabic numerals in smj different cases
    @U.number.five@ Flag used to give Arabic numerals in smj different cases
    @U.number.six@ Flag used to give Arabic numerals in smj different cases
    @U.number.seven@ Flag used to give Arabic numerals in smj different cases
    @U.number.eight@ Flag used to give Arabic numerals in smj different cases
    @U.number.nine@ Flag used to give Arabic numerals in smj different cases
    @U.number.zero@ Flag used to give Arabic numerals in smj different cases
    @P.number.one@ Flag used to give Arabic numerals in smj different cases
    @P.number.two@ Flag used to give Arabic numerals in smj different cases
    @P.number.three@ Flag used to give Arabic numerals in smj different cases
    @P.number.four@ Flag used to give Arabic numerals in smj different cases
    @P.number.five@ Flag used to give Arabic numerals in smj different cases
    @P.number.six@ Flag used to give Arabic numerals in smj different cases
    @P.number.seven@ Flag used to give Arabic numerals in smj different cases
    @P.number.eight@ Flag used to give Arabic numerals in smj different cases
    @P.number.nine@ Flag used to give Arabic numerals in smj different cases
    @P.number.zero@ Flag used to give Arabic numerals in smj different cases

    These are for preprocessing

    Flag Explanation
    @P.Pmatch.Loc@  
    @P.Pmatch.Backtrack@  
    +Use/PMatch  
    +Use/-PMatch  
    +Gram/TAbbr Transitive abbreviation (it needs an argument)
    +Gram/NoAbbr Intransitive abbreviations that are homonymous with more frequent words. They should only be considered abbreviations in the middle of a sentence.
    +Gram/TNumAbbr Transitive abbreviation if the following constituent is numeric
    +Gram/NumNoAbbr Transitive abbreviations for which numerals are complements and normal words. The abbreviation usage is less common and thus only the occurrences in the middle of the sentence can be considered as true cases.
    +Gram/TIAbbr Both transitive and intransitive abbreviation
    +Gram/IAbbr Intransitive abbreviation (it takes no argument)
    +Gram/3syll trisyllabic verbs
    +Gram/Superl superlative
    +Gram/Comp comparative
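
    The four basic abbreviation tags in the table above can be read as a small decision procedure for whether an abbreviation takes the following token as its argument. The sketch below is a hedged illustration in Python: the function name is made up, "numeric" is approximated as "starts with a digit", and the position-dependent tags +Gram/NoAbbr and +Gram/NumNoAbbr are deliberately left out.

```python
def abbreviation_use(tag, next_token):
    """Classify an abbreviation as transitive, intransitive or either."""
    if tag == "+Gram/TAbbr":
        return "transitive"
    if tag == "+Gram/IAbbr":
        return "intransitive"
    if tag == "+Gram/TIAbbr":
        return "either"
    if tag == "+Gram/TNumAbbr":
        # Transitive only if the following constituent is numeric.
        is_numeric = next_token is not None and next_token[:1].isdigit()
        return "transitive" if is_numeric else "intransitive"
    raise ValueError("tag not covered by this sketch: " + tag)


print(abbreviation_use("+Gram/TNumAbbr", "12"))
print(abbreviation_use("+Gram/TNumAbbr", "ja"))
```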

    Semantic tags

    Basic lexica, pointing to the other lexicon files

    Here is the Root lexicon, pointing to all the parts of speech:

    LEXICON Root


    This (part of) documentation was generated from src/fst/morphology/root.lexc


    src-fst-morphology-stems-adjectives.lexc.md

    Meänkieli adjectives

    This file documents the file for Meänkieli adjectives.

    The continuation lexicon types

    The lemma list itself

    LEXICON AdjectiveRoot


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


    src-fst-morphology-stems-adverbs.lexc.md

    Meänkieli adverbs

    This file documents the file for Meänkieli adverbs.

    The first part of the file adds tags, and the second lists the adverbs.

    The tags

    The adverbs themselves (some 1200)


    This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


    src-fst-morphology-stems-conjunctions.lexc.md

    Meänkieli conjunctions

    This file documents the file for Meänkieli conjunctions.

    It contains two parts, one for adding tags, and one for listing conjunctions.

    Adding tags

    The conjunctions themselves


    This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


    src-fst-morphology-stems-fit-abbreviations.lexc.md

    File containing Meänkieli abbreviations

    This file documents the file for Meänkieli abbreviations.

    The file contains 5-6 abbreviations, and is thus just a placeholder. Most fit abbreviations therefore come from the common abbreviation file. Meänkieli-specific ones should be added here.

    Lexica for adding tags and periods

    1. ITRAB ;
    2. TRNUMAB ;
    3. TRAB ;

    The abbreviation lexicon itself

    Intransitive abbreviations

    Abbreviations that are transitive in front of numerals

    Transitive abbreviations


    This (part of) documentation was generated from src/fst/morphology/stems/fit-abbreviations.lexc


    src-fst-morphology-stems-fit-acronyms.lexc.md

    Meänkieli acronyms

    The file stems/fit-acronyms.lexc is a dummy file, with this content only:


    This (part of) documentation was generated from src/fst/morphology/stems/fit-acronyms.lexc


    src-fst-morphology-stems-fit-propernouns.lexc.md

    Meänkieli propernouns

    This file documents the file for Meänkieli propernouns.

    Contrary to other GiellaLT languages, the Meänkieli FST is not set up to use the language-independent name base found in the infrastructure.

    The lexicon names look like this: p_mal_1 etc. They have 3 parts, divided by “_”

    32000 names


    This (part of) documentation was generated from src/fst/morphology/stems/fit-propernouns.lexc


    src-fst-morphology-stems-interjections.lexc.md

    Meänkieli interjections

    This file documents the file for Meänkieli interjections.

    Adding tag


    This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


    src-fst-morphology-stems-nouns.lexc.md

    Noun stems for Meänkieli

    This file documents the file for Meänkieli nouns.

    Vowel stems

    This is an overview of the continuation lexicon types.

    Special stems

    Vowel stems

    Stems for -i-words, vowel AND consonant

    Special cases for -i-words

    Consonant stems of other types

    The lexica themselves

    The lemma list


    This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


    src-fst-morphology-stems-numerals.lexc.md

    Meänkieli numerals

    This file documents the file for Meänkieli numerals.

    These are taken from fkv, but originally from fin, an FST with very different ways of doing things.

    Numerals have been split in three sections, the compounding parts of cardinals and ordinals, and the non-compounding ones:

    The compounding parts of cardinals are the number multiplier words.

    The suffixes only appear after cardinal multipliers

    The compounding parts of ordinals are the number multiplier words.

    The suffixes only appear after cardinal multipliers

    There is a set of numbers or corresponding expressions that work like them, but are not basic cardinals or ordinals:

    Numeral stem variation

    Numerals follow the same stem variation patterns as nouns, some of these being very rare to extinct for nouns.


    This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


    src-fst-morphology-stems-postpositions.lexc.md

    Meänkieli postpositions

    This file documents the file for Meänkieli postpositions.

    Adding tags

    The list of 40 or so postpositions.


    This (part of) documentation was generated from src/fst/morphology/stems/postpositions.lexc


    src-fst-morphology-stems-prepositions.lexc.md

    Meänkieli prepositions

    This file documents stems/prepositions.lexc, the file for Meänkieli prepositions

    The tags

    The prepositions


    This (part of) documentation was generated from src/fst/morphology/stems/prepositions.lexc


    src-fst-morphology-stems-pronouns.lexc.md

    Meänkieli pronouns

    This file documents the file for Meänkieli pronouns.

    Personal pronouns

    Demonstrative pronouns

    From the dictionary


    This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


    src-fst-morphology-stems-subjunctions.lexc.md

    Meänkieli subjunctions

    This file documents the file for Meänkieli subjunctions.


    This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


    src-fst-morphology-stems-verbs.lexc.md

    Documenting the file for meänkieli verbs

    This file documents the file for Meänkieli verb stems.

    First, it gives an overview of the continuation lexica, and thereafter it sketches their actual content.

    Overview over the continuation lexica

    Continuation lexica for regular verbs

    Continuation lexica for irregular verbs

    The verb lexica themselves

    The rest of the file contains some 5500 verbs.

    Irregular verbs

    v1 sanoa, lukea

    v2 tryykätä

    v3 syödä, juoda

    v4 tulla, mennä

    v5 tarvita

    v6 paeta

    Then comes the long list


    This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


    src-fst-phonetics-txt2ipa.xfscript.md

    Columns: description, X-SAMPA symbol and, where given, IPA symbol and Unicode code point (hex, decimal).

    retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
    retroflex plosive, voiced d` ɖ 0256, 598
    labiodental nasal F ɱ 0271, 625
    retroflex nasal n` ɳ 0273, 627
    palatal nasal J ɲ 0272, 626
    velar nasal N ŋ 014B, 331
    uvular nasal N\ ɴ 0274, 628

    bilabial trill B\ ʙ 0299, 665
    uvular trill R\ ʀ 0280, 640
    alveolar tap 4 ɾ 027E, 638
    retroflex flap r` ɽ 027D, 637
    bilabial fricative, voiceless p\ ɸ 0278, 632
    bilabial fricative, voiced B β 03B2, 946
    dental fricative, voiceless T θ 03B8, 952
    dental fricative, voiced D ð 00F0, 240
    postalveolar fricative, voiceless S ʃ 0283, 643
    postalveolar fricative, voiced Z ʒ 0292, 658
    retroflex fricative, voiceless s` ʂ 0282, 642
    retroflex fricative, voiced z` ʐ 0290, 656
    palatal fricative, voiceless C ç 00E7, 231
    palatal fricative, voiced j\ ʝ 029D, 669
    velar fricative, voiced G ɣ 0263, 611
    uvular fricative, voiceless X χ 03C7, 967
    uvular fricative, voiced R ʁ 0281, 641
    pharyngeal fricative, voiceless X\ ħ 0127, 295
    pharyngeal fricative, voiced ?\ ʕ 0295, 661
    glottal fricative, voiced h\ ɦ 0266, 614

    alveolar lateral fricative, vl. K
    alveolar lateral fricative, vd. K\

    labiodental approximant P (or v\)
    alveolar approximant r\
    retroflex approximant r\`
    velar approximant M\

    retroflex lateral approximant l`
    palatal lateral approximant L
    velar lateral approximant L\

    Clicks

    bilabial O\ (O = capital letter)
    dental |\
    (post)alveolar !\
    palatoalveolar =\
    alveolar lateral |\|\

    Ejectives, implosives

    ejective _> e.g. ejective p p_>
    implosive _< e.g. implosive b b_<

    Vowels

    close back unrounded M
    close central unrounded 1
    close central rounded }
    lax i I
    lax y Y
    lax u U

    close-mid front rounded 2
    close-mid central unrounded @\
    close-mid central rounded 8
    close-mid back unrounded 7

    schwa @ ə

    open-mid front unrounded E
    open-mid front rounded 9
    open-mid central unrounded 3
    open-mid central rounded 3\
    open-mid back unrounded V
    open-mid back rounded O

    ash (ae digraph) {
    open schwa (turned a) 6

    open front rounded &
    open back unrounded A
    open back rounded Q

    Other symbols

    voiceless labial-velar fricative W
    voiced labial-palatal approx. H
    voiceless epiglottal fricative H\
    voiced epiglottal fricative <\
    epiglottal plosive >\

    alveolo-palatal fricative, vl. s\
    alveolo-palatal fricative, voiced z\
    alveolar lateral flap l\
    simultaneous S and x x\
    tie bar _

    Suprasegmentals

    primary stress "
    secondary stress %
    long :
    half-long :\
    extra-short _X
    linking mark -\

    Tones and word accents

    level extra high _T
    level high _H
    level mid _M
    level low _L
    level extra low _B
    downstep !
    upstep ^ (caret, circumflex)

    contour, rising _R
    contour, falling _F
    contour, high rising _H_T
    contour, low rising _B_L
    contour, rising-falling _R_F

    (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

    global rise <R>
    global fall <F>

    Diacritics

    voiceless _0 (0 = figure), e.g. n_0
    voiced _v
    aspirated _h
    more rounded _O (O = letter)
    less rounded _c
    advanced _+
    retracted _-
    centralized _"
    syllabic = (or _=) e.g. n= (or n_=)
    non-syllabic _^
    rhoticity `

    breathy voiced _t
    creaky voiced _k
    linguolabial _N
    labialized _w
    palatalized ' (or _j) e.g. t' (or t_j)
    velarized _G
    pharyngealized _?\

    dental _d
    apical _a
    laminal _m
    nasalized ~ (or _~) e.g. A~ (or A_~)
    nasal release _n
    lateral release _l
    no audible release _}

    velarized or pharyngealized _e
    velarized l, alternatively 5
    raised _r
    lowered _o
    advanced tongue root _A
    retracted tongue root _q
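    The symbol inventory above suggests a longest-match conversion from X-SAMPA to IPA. The following is a minimal Python sketch of that idea, not the xfscript implementation: the function name `xsampa_to_ipa` and the dictionary are illustrative assumptions, and only a handful of pairs from the list are included.

```python
# A few X-SAMPA -> IPA pairs copied from the symbol list (not the full inventory).
# The dictionary and function names are hypothetical, chosen for this sketch only.
XSAMPA_TO_IPA = {
    "N\\": "ɴ",  # uvular nasal
    "N": "ŋ",    # velar nasal
    "J": "ɲ",    # palatal nasal
    "B\\": "ʙ",  # bilabial trill
    "B": "β",    # bilabial fricative, voiced
    "T": "θ",    # dental fricative, voiceless
    "D": "ð",    # dental fricative, voiced
    "S": "ʃ",    # postalveolar fricative, voiceless
    "Z": "ʒ",    # postalveolar fricative, voiced
    "z`": "ʐ",   # retroflex fricative, voiced
}

# Try longer keys first so that "N\" wins over "N".
_KEYS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text: str) -> str:
    """Replace known X-SAMPA symbols with IPA; copy everything else unchanged."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            # No symbol matched at this position: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

    Sorting the keys by length is the design choice that matters here: without it, a two-character symbol such as N\ would be read as N followed by a stray backslash.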


    This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


    src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

    We describe here how abbreviations in Tornedalen Finnish are read out, e.g. for text-to-speech systems.

    For example:


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


    src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

    Number transcriptions

    This file is copied from the Finnish one and should therefore be adapted to Meänkieli ("Meänkielified"). Transcribing numbers to words in Finnish is not completely trivial. One reason is that numbers in Finnish are written as compounds, regardless of length: 123456 is satakaksikymmentäkolmetuhattaneljäsataaviisikymmentäkuusi. Another complication is that inflection can be left unmarked in running text: a digit expression is assumed to agree in case with the phrase it is in. For example, 27 is kaksikymmentäseittemän and 27:lle is kahdellekymmenelleseittemälle, but in the phrase “tarjosin 27 osanottajalle” 27 takes the allative case without any marking, and this is the preferred form in good writing.

    Tags

    Flag diacritics

    Flag diacritics are used in number transcription to control case agreement: in Finnish numeral compounds all words agree in case, except in the nominative singular, where the multipliers for powers of ten are in the partitive singular.
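    The agreement pattern can be illustrated outside the FST. The sketch below is plain Python and does not model flag diacritics themselves; it only reproduces the two readings of 27 quoted earlier in this file. The table `FORMS`, the case labels and the function `read_27` are hypothetical names introduced for the illustration.

```python
# Word forms taken from the two readings of 27 given in this documentation:
# kaksikymmentäseittemän (nominative) and kahdellekymmenelleseittemälle (allative).
FORMS = {
    "kaksi": {"Nom": "kaksi", "All": "kahdelle"},
    "kymmenen": {"Par": "kymmentä", "All": "kymmenelle"},
    "seittemän": {"Nom": "seittemän", "All": "seittemälle"},
}


def read_27(case: str) -> str:
    """Read out 27 in the given case ("Nom" or "All")."""
    # In the nominative the power-of-ten multiplier is in the partitive singular;
    # in every other case it agrees with the rest of the compound.
    multiplier_case = "Par" if case == "Nom" else case
    return (
        FORMS["kaksi"][case]
        + FORMS["kymmenen"][multiplier_case]
        + FORMS["seittemän"][case]
    )
```

    In the transducer the same restriction would be enforced by the flags rather than by an explicit condition like this one; the sketch only shows which combinations should survive.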

    Lexica

    Morphotactics of digit strings

    The morphotactics of numbers and their transcriptions require knowing the length of the whole digit string in order to know where to start reading; zeroes are not read out, but they affect the reading. The numerals are systematic and perfectly compositional: the implementation of 100 000–999 999 is almost exactly the same as that of 100 000 000–999 000 000 and everything afterwards, with only the word changing: tuhat~tuhatta, miljoona~miljoonaa, miljardia, biljoonaa, biljardia and so forth. This follows the long-scale British (French) system, where the American billion corresponds to a milliard, etc. The numbers are built from blocks of roughly one word each, in decreasing order of magnitude, with the exception of the zig-zag over the numbers 11–19, where the second digit is read before the first. The rest of this documentation describes the morphotactic implementation by lexicon structure, in descending order of magnitude, with examples.
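    The compositional scheme described above can be sketched procedurally. The following Python is an illustration only, not a rendering of the lexc lexica: it covers the nominative reading up to the milliards, uses standard Finnish word forms (the Meänkieli forms in the actual file may differ, e.g. seittemän, yheksän), and all names in it (`ONES`, `SCALES`, `below_thousand`, `digits_to_text`) are assumptions made for the example.

```python
# Standard Finnish forms, assumed here for illustration.
ONES = ["", "yksi", "kaksi", "kolme", "neljä", "viisi",
        "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]

# (form after a bare 1, partitive form after any other multiplier)
SCALES = [("", ""), ("tuhat", "tuhatta"),
          ("miljoona", "miljoonaa"), ("miljardi", "miljardia")]


def below_thousand(n: int) -> str:
    """Read a block 1..999; zero digits are skipped rather than read out."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds == 1:
        parts.append("sata")
    elif hundreds > 1:
        parts.append(ONES[hundreds] + "sataa")
    if rest == 10:
        parts.append("kymmenen")
    elif 11 <= rest <= 19:
        # The zig-zag: the second digit is read before "toista".
        parts.append(ONES[rest - 10] + "toista")
    else:
        tens, ones = divmod(rest, 10)
        if tens:
            parts.append(ONES[tens] + "kymmentä")
        if ones:
            parts.append(ONES[ones])
    return "".join(parts)


def digits_to_text(digits: str) -> str:
    """Read a digit string as one nominative compound word."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of ASCII digits")
    n = int(digits)
    if n == 0:
        return "nolla"
    blocks = []  # three-digit blocks, least significant first
    while n:
        n, block = divmod(n, 1000)
        blocks.append(block)
    if len(blocks) > len(SCALES):
        raise ValueError("number too large for this sketch")
    words = []
    for idx in reversed(range(len(blocks))):
        block = blocks[idx]
        if block == 0:
            continue  # an all-zero block contributes no words
        singular, partitive = SCALES[idx]
        if idx == 0:
            words.append(below_thousand(block))
        elif block == 1:
            words.append(singular)
        else:
            words.append(below_thousand(block) + partitive)
    return "".join(words)
```

    The sketch splits the string into blocks of three digits from the right, which is one way to see why the length of the whole digit string has to be known before the first word can be produced.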

    Lexicon HUNDREDSMRD contains numbers 2-9 that need to be followed by exactly 11 digits: 200 000 000 000–999 999 999 999 this is to implement Nsataa…miljardia…

    Lexicon CUODIMRD contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…miljardia…

    Lexicon HUNDREDMRD is for numbers in range: 100 000 000 000–199 000 000 000 this is to implement sata…miljardia…

    Lexicon TEENSMRD is for numbers with 11 000 000 000–19 000 000 000 this is to implement …Ntoista…miljardia…

    Lexicon TEENMRD is for numbers with 11 000 000 000–19 000 000 000 this is to implement …Ntoista…miljardia…

    Lexicon TENSMRD is for numbers with 20 000 000 000–90 000 000 000 this is to implement …Nkymmentä…miljardia…

    Lexicon TENMRD is for numbers with 10 000 000 000–10 999 999 999 this is to implement …kymmenenmiljardia…

    Lexicon LÅGEVMRD is for numbers with 20 000 000 000–90 000 000 000 this is to implement …Nkymmentä…miljardia…

    Lexicon ONESMRD is for numbers with 1 000 000 000–9 000 000 000 this is to implement …Nmiljardia…

    Lexicon MILJARD is for numbers with 1 000 000 000–9 000 000 000 this is to implement …Nmiljardia

    Lexicon OVERMILLIONS is for the millions part of numbers greater than 1 milliard

    Lexicon HUNDREDSM contains numbers 2-9 that need to be followed by exactly 8 digits: 200 000 000–999 999 999 this is to implement Nsataa…miljoonaa…

    Lexicon CUODIM contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…miljoonaa…

    Lexicon HUNDREDM is for numbers in range: 100 000 000–199 000 000 this is to implement sata…miljoonaa…

    Lexicon TEENSM is for numbers with 11 000 000–19 000 000 this is to implement …Ntoista…miljoonaa…

    Lexicon TEENM is for numbers with 11 000 000–19 000 000 this is to implement …Ntoista…miljoonaa…

    Lexicon TENSM is for numbers with 20 000 000–90 000 000 this is to implement …Nkymmentä…miljoonaa…

    Lexicon TENM is for numbers with 10 000 000–10 999 999 this is to implement …kymmenenmiljoonaa…

    Lexicon LÅGEVM is for numbers with 20 000 000–90 000 000 this is to implement …Nkymmentä…miljoonaa…

    Lexicon ONESM is for numbers with 1 000 000–9 000 000 this is to implement …Nmiljoonaa…

    Lexicon MILJON is for numbers with 1 000 000–9 000 000 this is to implement …Nmiljoonaa

    Lexicon UNDERMILLION is for numbers with 100 000–900 000 after milliards

    Lexicon OVERTHOUSANDS is for the thousands part of numbers greater than 1 million

    Lexicon HUNDREDST contains numbers 2-9 that need to be followed by exactly 5 digits: 200 000–999 999 this is to implement Nsataa…tuhatta…

    Lexicon CUODIT contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…tuhatta…

    Lexicon HUNDREDT is for numbers in range: 100 000–199 000 this is to implement sata…tuhatta…

    Lexicon TEENST is for numbers with 11 000–19 000 this is to implement …Ntoista…tuhatta…

    Lexicon TEENT is for numbers with 11 000–19 000 this is to implement …Ntoista…tuhatta…

    Lexicon TENST is for numbers with 20 000–90 000 this is to implement …Nkymmentä…tuhatta…

    Lexicon TENT is for numbers with 10 000–10 999 this is to implement …kymmenentuhatta…

    Lexicon LÅGEVT is for numbers with 20 000–90 000 this is to implement …Nkymmentä…tuhatta…

    Lexicon ONEST is for numbers with 1 000–9 000 this is to implement …Ntuhatta…

    Lexicon THOUSANDS is for numbers with 1 000–9 000 this is to implement …Ntuhatta

    Lexicon THOUSAND is for the ones-tens-hundreds of numbers greater than thousand

    Lexicon UNDERTHOUSAND is for numbers with 100–900 after thousands

    Lexicon HUNDREDS contains numbers 2-9 that need to be followed by exactly 2 digits: 200–999 this is to implement Nsataa…

    Lexicon CUODI contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa

    Lexicon HUNDRED is for numbers in range: 100–999

    Lexicon TEENS is for numbers with 11–19 this is to implement …Ntoista

    Lexicon TEEN is for numbers with 11–19 this is to implement …Ntoista

    Lexicon TENS is for numbers with 20–90 this is to implement …Nkymmentä…

    Lexicon LÅGEV is for numbers with 20–90 this is to implement …Nkymmentä

    Lexicon JUSTTEN is for number 10 this is to implement …kymmenen

    Lexicon ONES is for numbers with 1–9 this is to implement yksi, kaksi, kolme…, yheksän

    Lexicon ZERO is for number 0 nolla

    Lexicon LOPPU is to implement potential case inflection with a colon.
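    The colon notation handled by LOPPU (as in 27:lle) separates the digit string from an optional case ending. A minimal Python sketch of that split follows; it is an illustration under assumptions, not the lexc implementation, and the function name `split_digit_token` and the regular expression are hypothetical.

```python
import re

# Digits, optionally followed by a colon and a case ending made of letters.
_DIGIT_TOKEN = re.compile(r"^(\d+)(?::([^\W\d_]+))?$")


def split_digit_token(token: str):
    """Return (digits, ending); ending is None when no colon suffix is present."""
    match = _DIGIT_TOKEN.match(token)
    if match is None:
        raise ValueError(f"not a digit token: {token!r}")
    return match.group(1), match.group(2)
```

    For example, `split_digit_token("27:lle")` yields the digit string and the ending separately, so that the ending can select the case in which the whole compound is read.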


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


    tools-grammarcheckers-grammarchecker.cg3.md

    [ L A N G U A G E ] G R A M M A R C H E C K E R

    DELIMITERS

    TAGS AND SETS

    Tags

    This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

    Beginning and end of sentence

    BOS EOS

    Parts of speech tags

    N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT PUNCT

    COMMA ¶

    Tags for POS sub-categories

    Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

    Tags for morphosyntactic properties

    Nom Acc Gen Ill Loc Com Ess Ess Sg Du Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

    Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

    Err/Orth

    Semantic tags

    Sem/Act Sem/Ani Sem/Atr Sem/Body Sem/Clth Sem/Domain Sem/Feat-phys Sem/Fem Sem/Group Sem/Lang Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

    HUMAN

    PROP-ATTR PROP-SUR

    TIME-N-SET

    Syntactic tags

    @+FAUXV @+FMAINV @-FAUXV @-FMAINV @-FSUBJ> @-F<OBJ @-FOBJ> @-FSPRED<OBJ @-F<ADVL @-FADVL> @-F<SPRED @-F<OPRED @-FSPRED> @-FOPRED> @>ADVL @ADVL< @<ADVL @ADVL> @ADVL @HAB> @<HAB @>N @Interj @N< @>A @P< @>P @HNOUN @INTERJ @>Num @Pron< @>Pron @Num< @OBJ @<OBJ @OBJ> @OPRED @<OPRED @OPRED> @PCLE @COMP-CS< @SPRED @<SPRED @SPRED> @SUBJ @<SUBJ @SUBJ> SUBJ SPRED OPRED @PPRED @APP @APP-N< @APP-Pron< @APP>Pron @APP-Num< @APP-ADVL< @VOC @CVP @CNP OBJ

    -OTHERS SYN-V @X

### Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

#### Sets for Single-word sets

INITIAL

#### Sets for word or not

WORD NOT-COMMA

#### Case sets

ADLVCASE CASE-AGREEMENT CASE NOT-NOM NOT-GEN NOT-ACC

#### Verb sets

NOT-V

#### Sets for finiteness and mood

REAL-NEG MOOD-V NOT-PRFPRC

#### Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

#### Pronoun sets

#### Adjectival sets and their complements

#### Adverbial sets and their complements

#### Sets of elements with common syntactic behaviour

#### NP sets defined according to their morphosyntactic features

#### The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression **WORD - premodifiers**.

#### Border sets and their complements

#### Grammarchecker sets

* * *

This (part of) documentation was generated from [tools/grammarcheckers/grammarchecker.cg3](https://github.com/giellalt/lang-fit/blob/main/tools/grammarcheckers/grammarchecker.cg3)

---

## tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

## Tokeniser for fit

Usage:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters:

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`.

Unknowns are made of:

* lower-case ASCII
* upper-case ASCII
* select extended latin symbols
* ASCII digits
* select symbols
* combining diacritics as individual symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

### Unknown handling

Unknowns are tagged ?? and treated specially with `hfst-tokenise`.

`hfst-tokenise --giella-cg` will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.pmscript](https://github.com/giellalt/lang-fit/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.pmscript)

---

## tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

## Grammar checker tokenisation for fit

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters:

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-ins don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript](https://github.com/giellalt/lang-fit/blob/main/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript)

---

## tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

## TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters:

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-ins don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-tts-cggt-desc.pmscript](https://github.com/giellalt/lang-fit/blob/main/tools/tokenisers/tokeniser-tts-cggt-desc.pmscript)
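
The whitespace inventory listed in the tokeniser sections above can be expressed as a small membership test. This is a Python sketch for illustration, not the pmscript definition: the exact set of ASCII white space characters is an assumption, the ranges are assumed to be inclusive, and the names `LISTED_WHITESPACE_RANGES` and `is_tokeniser_whitespace` are hypothetical.

```python
# Code points named in the whitespace list (assumed to be inclusive ranges).
LISTED_WHITESPACE_RANGES = [
    (0x2000, 0x200D),  # En Quad .. Zero-Width Joiner
    (0x202F, 0x202F),  # Narrow No-Break Space
    (0x205F, 0x205F),  # Medium Mathematical Space
    (0x2060, 0x2060),  # Word Joiner
]

# Assumption: "ASCII white space" means space, tab, newline, CR, VT and FF.
ASCII_WHITESPACE = " \t\n\r\x0b\x0c"


def is_tokeniser_whitespace(ch: str) -> bool:
    """Return True if the single character ch is in the described whitespace set."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    if ch in ASCII_WHITESPACE:
        return True
    code_point = ord(ch)
    return any(low <= code_point <= high
               for low, high in LISTED_WHITESPACE_RANGES)
```

Note that the soft hyphen `U+00AD` and the byte-order mark `U+FEFF` are described above as adjacent-to-word symbols rather than whitespace, so this sketch does not treat them as whitespace.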
