Tornedalen Finnish NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-fit


Meänkieli (Tornedalen Finnish) language model documentation

All doc-comment documentation in one large file.


src-cg3-dependency.cg3.md

C O M M O N   S Á M I   D E P E N D E N C Y   G R A M M A R

This dep file is for sma, sme, smj, sje.

DELIMITERS

Sentence delimiters are the following: <.> <!> <?> <…> <¶>

TAGS AND SETS

N V A Adv CC CS Inf Sup Neg Num Po Pr

Pcle Prop

Pron IV TV COMMA DASH CITATION to keep colouring we add a “ HYPHEN QMARK PUNCT LEFT RIGHT CLB Ind Pot Impr ImprtII Cond ConNeg Caus causative eus VGen Interj ABBR ACR Prs Prt Cmpnd RCmpnd PrfPrc PrsPrc Actor Actio Ger Indef Nom Acc Ill Com Gen Ess

IM For fao

POS sub-categories

Syntactic tags and sets

Syntactic tags in input to this file

Syntactic tags added in this file

fao syntags

kal syntags

eus syntags

Syntactic set definitions

Dep grammar

Correction rules

The finite verb

Mapping rules

lgRemove removes the language tags before proceeding to the dep file.


This (part of) documentation was generated from src/cg3/dependency.cg3


src-cg3-disambiguator.cg3.md

Disambiguator for Meänkieli

Usage:

cat text.txt | hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g src/cg3/disambiguator.cg3
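The same pipeline can be driven from a script. The following is a minimal sketch, assuming that `hfst-tokenize` and `vislcg3` are on the PATH and that it is run from the root of the language repository; the function names `build_pipeline` and `disambiguate` are made up for this illustration, and only the commands and flags shown in the usage line above are used.

```python
import subprocess


def build_pipeline(tokeniser="tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst",
                   grammar="src/cg3/disambiguator.cg3"):
    """Return the two commands of the usage line as argument lists."""
    return [
        ["hfst-tokenize", "-cg", tokeniser],
        ["vislcg3", "-g", grammar],
    ]


def disambiguate(text, commands=None):
    """Feed `text` through each command in turn and return the final output."""
    data = text.encode("utf-8")
    for argv in commands or build_pipeline():
        # check=True raises CalledProcessError if a tool exits with an error.
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")
```

Calling `disambiguate(open("text.txt", encoding="utf-8").read())` would then correspond to the shell command above, provided the tokeniser and grammar have been compiled.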

This file documents the Meänkieli disambiguator file.

Delimiters, tags and sets

Sentence delimiters are the following: “<.>” “<…>” “<!>” “<?>” “<¶>”

Part-of-Speech

Numerus

Person

Cases

Types

Sets with more members

Boundaries

Verbs

Disambiguation rules

Dialects

Early rules

Possessive suffixes

First we put rules to choose Px forms… (forthcoming)

Then we remove the remaining Px

Numeral phrases

Preposition/postposition/adverb rules

Rules for mapping @CVP and @CNP on the CC and CS

Case rules

Partitive

Genitive

Illative

Number rules

More disambiguation rules

Elative

Propernouns

Verbs

Specific verbs

ei negation verb

eli

Adverbs

paljon

kerran

jälkhiin

Adjectives

toinen

Conjunctions

Subjunctions

että

jos

ko

mutta

sillä

Pronouns

sie

tet

Verb rules, Verbs

Infinitive

Present Sg3

Present Pl3 or PrsPrc

Present Pl3 or Passive

Imperative

Past tense

Prt Pl3 or Prt Sg2

Relative pronouns

HNOUN MAPPING


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions.cg3.md

S Y N T A C T I C   F U N C T I O N S   F O R   S Á M I

Sámi language technology project 2003-2018, University of Tromsø

This file adds syntactic functions. It is common for all the Saami

LEFT RIGHT because of apertium

Syntactic tags

Tag sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: Scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
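The barrier scan can be imitated outside CG-3 to see what it accepts. The sketch below is only a simulation in Python, not CG-3 code: cohorts are reduced to sets of tags, the premodifier set `{"A", "Num"}` is a hypothetical stand-in for the real set definitions, and the sketch assumes that the target is tested before the barrier at each position.

```python
PREMODIFIERS = {"A", "Num"}  # hypothetical; the real premodifier sets live in the grammar


def is_noun(cohort):
    return "N" in cohort


def is_not_npmod(cohort):
    """Anything that cannot stand in front of an NP head (WORD - premodifiers)."""
    return not (cohort & PREMODIFIERS)


def scan_right(cohorts, start, target, barrier):
    """Imitate (*1 TARGET BARRIER BARRIER-SET) from position `start`.

    Returns the index of the first matching cohort to the right,
    or None if a barrier cohort is met first or nothing matches.
    """
    for i in range(start + 1, len(cohorts)):
        if target(cohorts[i]):      # target is tested first (assumption of this sketch)
            return i
        if barrier(cohorts[i]):
            return None
    return None


# V A N: the adjective is a premodifier, so the scan reaches the noun.
found = scan_right([{"V"}, {"A"}, {"N"}], 0, is_noun, is_not_npmod)
# V Adv N: the adverb is not an NP modifier, so it blocks the scan.
blocked = scan_right([{"V"}, {"Adv"}, {"N"}], 0, is_noun, is_not_npmod)
```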

ADLVCASE

These were the set types.

Numeral outside the sentence

HABITIVE MAPPING

sma object

SUBJ MAPPING - leftovers

OBJ MAPPING - leftovers

MAPPING for MT - experimental

HNOUN MAPPING

missingX adds @X to all missings

therestX adds @X to everything that is left, often erroneously disambiguated forms

For Apertium:

The analyser gives double analyses because of optional semtags. We go for the one with a semtag.


This (part of) documentation was generated from src/cg3/functions.cg3


src-fst-morphology-affixes-abbreviations.lexc.md

Documenting the morphological tags for Meänkieli abbreviations

This file documents affixes/abbreviations.lexc, the file for Meänkieli abbreviation morphology

Now splitting according to POS, and according to dot or not

LEXICON ab-noun-itrab LEXICON ab-noun-trab LEXICON ab-noun-trnumab

Lexicons without final period

Lexicons with final period


This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


src-fst-morphology-affixes-acronyms.lexc.md

Documenting Meänkieli acronym morphology

This file documents affixes/acronyms.lexc, the file for Meänkieli acronym morphology

LEXICON Acronym-fit-suf for adding +ACR tag

LEXICON ACRONOUN_cons

LEXICON ACRONOUN_vow

LEXICON ACRO_BERN

LEXICON ACRO_LONDON

LEXICON ACRO_NYSTØ

LEXICON ACRO_cons

LEXICON ACRO_vow


This (part of) documentation was generated from src/fst/morphology/affixes/acronyms.lexc


src-fst-morphology-affixes-adjectives.lexc.md

Documenting the file for Meänkieli adjective morphology

This file documents the file affixes/adjectives.lexc for Meänkieli adjective morphology.

Most lexica here (a1, a_e, …) add +A, and thereafter redirect to the corresponding x1, x_e, … lexicon in affixes/nouns.lexc for case inflection. The lexicon numbers correspond to the ones for nouns.

In addition, each lexicon also points to comparative and superlative sublexica.

Unassigned

LEXICON ax pointing to a1. It is for adjectives that have still not been classified.

Regular lexica

LEXICON a1 adding +A and sending to x1, and to 3comp, 3sup.

LEXICON a1_e vanha, which has Err/Orth vanhee-, otherwise like a1

LEXICON a_vasen adding +A and sending to x1, and to 3comp, 3sup.

LEXICON a_e gets +A and goes to x_e.

LEXICON a3 kamala gets +A and points to x3

LEXICON a4 has no comparative or superlative, just points to x4

LEXICON anen has no comparative or superlative, just points to xnen

LEXICON aas has no comparative or superlative, just points to xnas

LEXICON a_suuri has no comparative or superlative, just points to x4

LEXICON a1_ton

LEXICON x1_ton
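The routing described for these lexica can be written down as a small table. This is an illustrative Python sketch, not lexc: the dictionary only records the continuations that the descriptions above state explicitly, and the helper name `reachable` is invented for the example.

```python
# Continuations as stated in the lexicon descriptions above; nothing else is assumed.
CONTINUATIONS = {
    "ax": ["a1"],
    "a1": ["x1", "3comp", "3sup"],
    "a_vasen": ["x1", "3comp", "3sup"],
    "a_e": ["x_e"],
    "a3": ["x3"],
    "a4": ["x4"],
    "anen": ["xnen"],
    "aas": ["xnas"],
    "a_suuri": ["x4"],
}


def reachable(lexicon):
    """Return every lexicon that can be reached from `lexicon`, in discovery order."""
    seen, stack = [], [lexicon]
    while stack:
        name = stack.pop()
        for nxt in CONTINUATIONS.get(name, []):
            if nxt not in seen:
                seen.append(nxt)
                stack.append(nxt)
    return seen
```

For example, `reachable("ax")` passes through a1 and reaches x1 as well as the comparative and superlative sublexica, whereas `reachable("a4")` only reaches x4.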

Comparative inflection

LEXICON 3comp 2syll adj, 3syll comparative

LEXICON 4comp 3syll adj, 4syll comparative

LEXICON xcomp common for 2syll and 3syll

Superlative inflection

LEXICON 3sup 2syll adj, 3syll superlative

LEXICON 4sup 3syll adj, 4syll superlative

LEXICON xsup common for 2syll and 3syll


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-nouns.lexc.md

Meänkieli noun morphology

This file documents affixes/nouns.lexc, the file for Meänkieli noun morphology

This is an overview of the continuation lexicon types.

Special stems

Vowel stems

Stems for -i-words, vowel AND consonant

Special cases for -i-words

Consonant stems of other types

The lexica themselves

Lexica for unassigned words

LEXICON nx pointing to n1.

LEXICON n_nomorph for uninflected nouns

LEXICON nc for consonant-final nouns, structure CVC

LEXICON xc_sg

LEXICON xc_pl

Lexica for regular nouns

LEXICON n0 for 1-syllabic: maa, suu, tie, …

LEXICON n0_pl for plurals of the same: häät

LEXICON x0 splitting to sg and pl

LEXICON x0_sg sg forms x0 point here

LEXICON x0_sg_oblique for oblique case forms in sg

LEXICON x0_pl for plural case forms

LEXICON n1 for 2-syll ordinary nouns (talo)

LEXICON n1_pl for the same plural words (urut)

LEXICON x1 for the bisyllabic, pointing to sg, pl

LEXICON x1_sg bisyllabic sg

LEXICON x1_sg_oblique gives the rest

LEXICON x1_pl the pl forms

LEXICON n_e vene, liike, säe

LEXICON n_e_pl vehkheet

LEXICON x_e splits in sg and pl

LEXICON x_e_sg the sg

LEXICON x_e_pl the pl

LEXICON x_e_pl urvakke etc., n_e words with -lle/-lla

LEXICON x_e_pl splits in sg and pl

LEXICON x_e_pl the sg

LEXICON x_e_pl the pl

LEXICON n3 odd-syllabic: kanava

LEXICON n3_pl haalarit

LEXICON x3

LEXICON x3_oblique

LEXICON x3_sg

LEXICON x3_oblique_sg

LEXICON x3_pl

LEXICON x3_pl

LEXICON 3nc

LEXICON xnc

The i>e-family; kivi, kieli, käsi, lumi etc

LEXICON n4 kivi, stem kive

LEXICON x4 veri

LEXICON n4_pl

LEXICON x4_sg shared lexica for n4, n5, n5_lumi/loimi/lapsi EXCEPT SgNom, SgPar

LEXICON x4_pl

LEXICON n5 kieli, stem kiele

LEXICON n5 kieli, stem kiele

LEXICON n5_kieli kieli, stem kiele

LEXICON n5_lumi lumi, stem lu

LEXICON n5_loimi loimi, stem loi, like n5_lumi PLUS partitive loimea

LEXICON n5_vuosi vuosi> vuoessa/vuessa, stem OR vu

LEXICON n5_kasi käsi, stem kä

LEXICON n5_kasi_pl continuation for kasi_pl

LEXICON x5_kasi veri

LEXICON x5_kasi_pl

LEXICON n5_lapsi

LEXICON n5_ie_odd

LEXICON n5_ie_odd same as n5_ie except Pl+Part: takki>takkeja

LEXICON n5_nuoret_pl same as n1_pl except Pl+Gen: nuoret>nuorten

LEXICON n5_i_pl cont lexica for type n1-words ending with -i

LEXICON x5_i_pl cont lexica for type n1-words ending with -i

The nainen (nen) and hevonen (3nen) family

LEXICON nen bisyllabic nainen stem nai

LEXICON nen_sg

LEXICON nen_pl

LEXICON xnen

LEXICON xnen_sg +Sg:se 2cases ; for Ade, All, Ess lla, lle, nna

LEXICON xnen_pl

LEXICON 3nen odd-syllabic hevonen stem hevose

LEXICON x3nen

LEXICON x3nen_sg

LEXICON x3nen_pl

LEXICON xnen_common_sg

LEXICON xnen_common_pl

LEXICON 3cases

LEXICON 2cases

LEXICON 3n_ks

LEXICON 3n_ks_pl

LEXICON xn_ks

LEXICON xn_ks_sg

LEXICON xn_ks_pl

LEXICON n_äes

LEXICON x_äes

LEXICON 3n_ue

LEXICON 3x_ue

LEXICON 3x_ue_sg

LEXICON 3x_ue_pl

LEXICON 3n_ime

LEXICON 3n_ime_sg

LEXICON 3n_ime_pl

LEXICON x_ime_sg

LEXICON x_ime_pl

LEXICON nas

LEXICON xnas

LEXICON xnas_sg

LEXICON xnas_pl

LEXICON xnas_pl

LEXICON xnas_pl

LEXICON nas_h_pl

LEXICON 3mies

LEXICON n_ien

LEXICON n_ien_sg

LEXICON n_uus

LEXICON n_uus_odd

2-syllabic LNR final stems

LEXICON 3n_lnr ahven - ahvenheen

LEXICON 3n_kymmen 3n_kymmen

LEXICON 30n_lnr askel - askelheesheen

LEXICON n_kasuven

LEXICON 3xn_lnr tyär, short and long Ill

LEXICON 3n_lnr_inteill not Ill, Ine, Ess but all others

LEXICON 4n_ks

LEXICON x4n_ks

LEXICON x4n_ks_sg

LEXICON x4n_ks_pl

Sublexica for cases

LEXICON TRA

Sublexica for possessive suffixes

Px is now not in use, with one exception, comitative.

LEXICON n_PxK has either -n or goes to Px

LEXICON a_PxK has either -s or goes to Px with -a

LEXICON s_PxK has either -s or goes to Px

LEXICON sh_PxK has either -s or goes to Px with -he-

LEXICON st_PxK has either -s or goes to Px with -te- rakuaus, rakhauteni

LEXICON t_PxK has either -t or goes to Px

LEXICON i_PxK Tra: -i or -e and goes to Px

LEXICON PxK has only -nsA, compare PxxK

LEXICON PxxK has also -Vn, thus both .. llensa and ..lleen.

LEXICON Px

LEXICON Px-Vn

LEXICON n5_troppi troppi tropin troppia?

LEXICON n5_troppi_odd


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-numerals.lexc.md

Meänkieli numerals

From fin via fkv.

Numeral inflection

Numeral inflection is like nominal inflection, except that numerals compound in all forms, which requires a great amount of care in the inflection patterns.


This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


src-fst-morphology-affixes-pronouns.lexc.md

Meänkieli pronoun morphology

This file documents affixes/pronouns.lexc, the file for Meänkieli pronoun morphology

Pronoun morphology

The pronouns are still only at an experimental stage.

LEXICON 12pronsg is for 1st and 2nd person singular

LEXICON 123pronpl

nuoitä

tuotä


This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Meänkieli propernoun morphology

This file documents affixes/propernouns.lexc, the file for Meänkieli propernoun morphology. The file pointing here is stems/fit-propernouns.lexc

The lexicon names look like this: p_mal_1 etc. They have 3 parts, divided by “_”

We do not use _pl for names

… and many more.

Vowel stems, odd and even stems

Consonant stems, odd and even stems


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes

This file documents affixes/symbols.lexc, the file for the affixes added to language-independent symbols


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Meänkieli verbs

This file documents affixes/verbs.lexc, the file for Meänkieli verb morphology

Overview of the continuation classes

Continuation lexica for regular verbs

Continuation lexica for irregular verbs

Irregular verbs

Regular verbs

Subparadigms

Conditional forms

LEXICON 2cond for -imm^A

Infinitive paradigms

from fkv

LEXICON v12pers Only sg12, pl12 so far

LEXICON PRFPRC_OBL is without nom sg from fkv


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-phonology.twolc.md

Meänkieli twolc file

This file documents the Meänkieli twolc file (the file governing gradation, gemination, vowel harmony and other morphophonological processes).

The first part of the file contains definitions, the second part contains rules.

Declaring the alphabet, sets and definitions

Alphabet

This defines all symbols (letters, archiphonemes, triggers) to be used.

Sets

Here we group the symbols in convenient sets.

Definitions

This defines strings used often in rules.

WeakGrade = ([l|n|r]) (%^AE:) %^WG:

Rules

This chapter gives the rules themselves.

Consonant rules

For the gradation rules, each consonant deletion or change is given its own rule. Thus, both kk:k and k:0 are handled in the same k:0 rule. This is to avoid rule conflicts. The change rules (k:g, k:j etc.) are restricted by context (k:g only after n, etc.).
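Why one deletion rule can cover both kk:k and k:0 is easy to see in a toy model. The following Python sketch is not twolc and ignores every other context (k:j, k:v, the ^WG trigger and so on); the example words are ordinary standard Finnish genitives chosen for illustration, not forms taken from the rule tests in the source file.

```python
def weak_grade_k(stem):
    """Weaken the last k of `stem`: k -> g after n, otherwise delete it."""
    i = stem.rfind("k")
    if i == -1:
        return stem                              # nothing to weaken
    if i > 0 and stem[i - 1] == "n":
        return stem[:i] + "g" + stem[i + 1:]     # k:g, only after n
    return stem[:i] + stem[i + 1:]               # k:0, covers both kk:k and k:0


genitives = {word: weak_grade_k(word) + "n" for word in ("takki", "jalka", "kenkä")}
```

Deleting one k from takki gives the kk:k alternation (takin), and deleting the only k from jalka gives k:0 (jalan), so both fall out of the same operation; the context-restricted change applies in kenkä (kengän).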

f rules

RULE: f:0

j rules

RULE: j:0

k rules

RULE: k:g

Tests:

RULE: k:0

Tests:

RULE: k:j

RULE: k4:j

Tests:

RULE: k:v

Tests:

l rules

RULE: k:v

m rules

RULE: m:0

n rules

RULE: n:0

p rules

RULE: p:0

Tests:

RULE: p:v

Tests:

RULE: p:m

r rules

RULE: p:m

s rules

RULE: r:0

t rules

RULE: t:0

Tests:

RULE: t4:0 where t4 is t in rt that shall not become rr

Tests:

RULE: t:j

Tests:

RULE: t:l for lt:ll

Tests:

RULE: t:n for nt:nn

Tests:

RULE: t:r for rt:rr

Tests:

RULE: t:s

Tests:

v rules

RULE: v:0

Gemination rules

The gemination rules insert the geminated consonant (thus 0:h if h to the left). There is one subrule for each vowel context, in order to avoid conflicts.

RULE: Gemination 0:h

RULE: Gemination 0:j

RULE: Gemination 0:k

Tests:

RULE: Gemination 0:l

Tests:

RULE: Gemination 0:m

RULE: Gemination 0:n

RULE: Gemination 0:p

RULE: Gemination 0:s

Tests:

RULE: h:0

RULE: h:0

RULE: h:0

kasva>hm^A^An kasva>mhaan

saarna>^A>hm^A^An saarna>a>hmaan

tule>hm^A^An tule>mhaan

RULE: Gemination 0:t

Tests:

RULE: Gemination 0:v

Tests:

Assimilation rules

These are assimilation rules for n on suffix borders of LNRS consonant stems. There is also a rule j:0 avoiding a lji sequence.

RULE: Alveolar assimilation for consonant stem l

Tests:

RULE: Alveolar assimilation for consonant stem r

RULE: Alveolar assimilation for consonant stem s in infinitives

Tests:

RULE: Alveolar assimilation for consonant stem s in participles

Vowel change rules: a - ä - e - i - o - ö - u - y

Here come the rules for stem vowel changes in front of suffix -i- (be it plural, present, comparative or conditional). Vowels are deleted or changed according to context. There are also some other vowel change rules.

a rules

RULE: a:e before the ^AE trigger

RULE: a:0 before metathesis h

Tests:

RULE: a:o when nonrounded root vowel and before i

Tests:

ä rules

RULE: ä:0

Tests:

RULE: ä:e

e rules

RULE: e:0 deletes -e- in LNR stems as well as before -i-

Tests:

RULE: e:i

Tests:

i rules

RULE: i:0

Tests:

RULE: i:j

RULE: i2:j

RULE: i8:0

Tests:

RULE: i:e

o rules

RULE: o:0

Tests:

ö rules

RULE: ö:0

Tests:

u rules

RULE: u:0

Tests:

y rules

RULE: y:0

Tests:

Vowel copying rules

These are the rules connected to the Meänkieli -h- suffixes. The vowel must be copied from the stem to the right of the h and also deleted in the stem (cf. talo : talhoon)
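As a toy illustration of the copy-and-delete pattern, the Python sketch below reproduces the single example given above (talo : talhoon). It assumes a suffix of the shape h + vowel + n, which the text does not spell out, and it does not model the actual two-level rules; the function name is invented.

```python
VOWELS = "aeiouyäö"


def h_suffix_form(stem):
    """Delete the stem-final vowel, add h, then the copied vowel twice, then n."""
    vowel = stem[-1]
    if vowel not in VOWELS:
        raise ValueError("this sketch only handles vowel-final stems")
    # talo -> tal + h + oo + n
    return stem[:-1] + "h" + vowel + vowel + "n"


form = h_suffix_form("talo")
```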

RULE: a copying for h metathesis

Tests:

RULE: o copying for h metathesis

Tests:

RULE: i copying for h metathesis

Tests:

RULE: ä copying for h metathesis

RULE: e copying for h metathesis

RULE: ö copying for h metathesis

RULE: y copying for h metathesis

RULE: u copying for h metathesis

Vowel harmony rule

All vowel harmony is taken care of with one rule.
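A rough picture of what such a rule does to the archiphoneme ^A, sketched in Python rather than twolc. The sketch assumes that ^A surfaces as a when the preceding material contains a back vowel (a, o, u) and as ä otherwise; it only resolves ^A and leaves the h metathesis and gemination to the other rules. The first two inputs are lexical strings quoted in the gemination section, the third is a hypothetical front-vowel string.

```python
BACK_VOWELS = "aou"


def realise_A(lexical):
    """Replace every ^A by a or ä, depending on the material before the first ^A."""
    before = lexical.split("^A", 1)[0]
    vowel = "a" if any(ch in BACK_VOWELS for ch in before) else "ä"
    return lexical.replace("^A", vowel)


examples = [realise_A(s) for s in ("saarna>^A>hm^A^An", "tule>hm^A^An", "syö>hm^A^An")]
```

For saarna>^A>hm^A^An this gives saarna>a>hmaan, the same string that the gemination section lists next to it.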

RULE: Back harmony

Tests:


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Meänkieli morphological transducer

Beware of remnants from the Finnish and Kven files.

Tags for POS

Tags for grammar

Pronoun types

Other tags

Number

Case

Possessive suffixes

Comparatives

Finite verbs

Verb person tags

Verb transitivity

Infinite verbs

Punctuation

Language tags

Speller tags

Compounds

Derivation

These three tags are not added in lexc. The POS tag before derivation is converted into this tag when compiling FST for disambiguation.

Tag

Clitic tags

Semantic tags

Phonological symbols

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics, allowing compounds with verbs only if the verb is further derived into a noun again:

Flag Explanation
@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised
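How such a triple of flags can block a path is sketched below in Python. The operator semantics follow the usual reading of flag diacritics (P sets a feature, C clears it, D fails if the feature has the given value, R requires it, U unifies); where exactly the three NeedNoun flags sit in the lexicon is an assumption made for the example, and the plain symbols `verb-stem` and `nominaliser` are placeholders.

```python
import re

FLAG = re.compile(r"@([PDCRU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols):
    """Walk a symbol sequence and report whether its flag diacritics are consistent."""
    state = {}
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if not m:
            continue                                  # ordinary symbol, no effect
        op, feat, val = m.groups()
        cur = state.get(feat)
        if op == "P":                                 # positive set
            state[feat] = val
        elif op == "C":                               # clear
            state.pop(feat, None)
        elif op == "D":                               # disallow
            if cur is not None and (val is None or cur == val):
                return False
        elif op == "R":                               # require
            if cur is None or (val is not None and cur != val):
                return False
        elif op == "U":                               # unify
            if cur is None:
                state[feat] = val
            elif cur != val:
                return False
    return True


# Assumed placement: P when a verb enters a compound, C on nominalisation, D at the end.
bare_verb = ["@P.NeedNoun.ON@", "verb-stem", "@D.NeedNoun.ON@"]
nominalised = ["@P.NeedNoun.ON@", "verb-stem", "nominaliser", "@C.NeedNoun@", "@D.NeedNoun.ON@"]
```

Under these assumptions `path_is_valid(bare_verb)` is False and `path_is_valid(nominalised)` is True, which matches the stated intent: a verb inside a compound is accepted only once it has been turned into a noun.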

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

Flag Explanation
@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpSuff.TRUE@ Block such words from entering R
@P.CmpSuff.TRUE@ Mark that we have passed R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

Flag Explanation
@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

These tags are for handling erroneous forms

Flag Explanation
@D.ErrOrth.ON@ tbw
@P.ErrOrth.ON@ tbw
@C.ErrOrth@ tbw
@R.ErrOrth.ON@ tbw

This is for pronouns with multiple case suffixes (jommallekummalle)

Flag Explanation
@U.pron.nom@ tbw
@U.pron.gen@ tbw
@U.pron.gen2@ tbw
@U.pron.ill@ tbw
@U.pron.par@ tbw
@U.pron.par2@ tbw
@U.pron.par3@ tbw
@U.pron.ess@ tbw
@U.pron.tra@ tbw
@U.pron.ine@ tbw
@U.pron.ela@ tbw
@U.pron.all@ tbw
@U.pron.ade@ tbw
@U.pron.abl@ tbw
@P.compound.block@ tbw
@D.compound.block@ tbw

These are for preprocessing

Flag Explanation
@P.Pmatch.Loc@  
@P.Pmatch.Backtrack@  
+Use/PMatch  
+Use/-PMatch  
+Gram/TAbbr Transitive abbreviation (it needs an argument)
+Gram/NoAbbr Intransitive abbreviations that are homonymous with more frequent words. They should only be considered abbreviations in the middle of a sentence.
+Gram/TNumAbbr Transitive abbreviation if the following constituent is numeric
+Gram/NumNoAbbr Transitive abbreviations for which numerals are complements and normal words. The abbreviation usage is less common and thus only the occurrences in the middle of the sentence can be considered as true cases.
+Gram/TIAbbr Both transitive and intransitive abbreviation
+Gram/IAbbr Intransitive abbreviation (it takes no argument)
+Gram/3syll trisyllabic verbs
+Gram/Superl superlative
+Gram/Comp comparative

Semantic tags

Basic lexica, pointing to the other lexicon files

Here is the Root lexicon, pointing to all the parts of speech:

LEXICON Root


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-adjectives.lexc.md

Meänkieli adjectives

This file documents the file for Meänkieli adjectives.

The continuation lexicon types

The lemma list itself

LEXICON AdjectiveRoot


This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


src-fst-morphology-stems-adverbs.lexc.md

Meänkieli adverbs

This file documents the file for Meänkieli adverbs.

The first part of the file adds tags, and the second lists the adverbs.

The tags

The adverbs themselves (some 1200)


This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


src-fst-morphology-stems-conjunctions.lexc.md

Meänkieli conjunctions

This file documents the file for Meänkieli conjunctions.

It contains two parts, one for adding tags, and one for listing conjunctions.

Adding tags

The conjunctions themselves


This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


src-fst-morphology-stems-fit-abbreviations.lexc.md

File containing meänkieli abbreviations

This file documents the file for Meänkieli abbreviations.

The file contains 5-6 abbreviations, and is thus just a placeholder. Most fit abbreviations thus come from the common abbreviation file. Here we should add meänkieli-specific ones.

Lexica for adding tags and periods

  1. ITRAB ;
  2. TRNUMAB ;
  3. TRAB ;

The abbreviation lexicon itself

Intransitive abbreviations

Abbreviations that are transitive in front of numerals

Transitive abbreviations


This (part of) documentation was generated from src/fst/morphology/stems/fit-abbreviations.lexc


src-fst-morphology-stems-fit-acronyms.lexc.md

Meänkieli acronyms

The file stems/fit-acronyms.lexc is a dummy file, with this content only:


This (part of) documentation was generated from src/fst/morphology/stems/fit-acronyms.lexc


src-fst-morphology-stems-fit-propernouns.lexc.md

Meänkieli propernouns

This file documents the file for Meänkieli propernouns.

Contrary to other GiellaLT languages, the Meänkieli FST is not set up to use the language-independent name base found in the infrastructure.

The lexicon names look like this: p_mal_1 etc. They have 3 parts, divided by “_”

32000 names


This (part of) documentation was generated from src/fst/morphology/stems/fit-propernouns.lexc


src-fst-morphology-stems-interjections.lexc.md

Meänkieli interjections

This file documents the file for Meänkieli interjections.

Adding tag


This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


src-fst-morphology-stems-nouns.lexc.md

Noun stems for Meänkieli

This file documents the file for Meänkieli nouns.

Vowel stems

This is an overview of the continuation lexicon types.

Special stems

Vowel stems

Stems for -i-words, vowel AND consonant

Special cases for -i-words

Consonant stems of other types

The lexica themselves

The lemma list


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


src-fst-morphology-stems-numerals.lexc.md

Meänkieli numerals

This file documents the file for Meänkieli numerals.

These are taken from fkv, but originally from fin, an FST with very different ways of doing things.

Numerals have been split in three sections, the compounding parts of cardinals and ordinals, and the non-compounding ones:

The compounding parts of cardinals are the number multiplier words.

The suffixes only appear after cardinal multipliers

The compounding parts of ordinals are the number multiplier words.

The suffixes only appear after cardinal multipliers

There is a set of numbers or corresponding expressions that work like them, but are not basic cardinals or ordinals:

Numeral stem variation

Numerals follow the same stem variation patterns as nouns, some of these being very rare to extinct for nouns.


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-morphology-stems-postpositions.lexc.md

Meänkieli postpositions

This file documents the file for Meänkieli postpositions.

Adding tags

The list of 40 or so postpositions.


This (part of) documentation was generated from src/fst/morphology/stems/postpositions.lexc


src-fst-morphology-stems-prepositions.lexc.md

Meänkieli prepositions

This file documents stems/prepositions.lexc, the file for Meänkieli prepositions

The tags

The prepositions


This (part of) documentation was generated from src/fst/morphology/stems/prepositions.lexc


src-fst-morphology-stems-pronouns.lexc.md

Meänkieli pronouns

This file documents the file for Meänkieli pronouns.

Personal pronouns

Demonstrative pronouns

From the dictionary


This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


src-fst-morphology-stems-subjunctions.lexc.md

Meänkieli subjunctions

This file documents the file for Meänkieli subjunctions.


This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


src-fst-morphology-stems-verbs.lexc.md

Documenting the file for meänkieli verbs

This file documents the file for Meänkieli verb stems.

First, it gives an overview of the continuation lexica, and thereafter it sketches their actual content.

Overview of the continuation lexica

Continuation lexica for regular verbs

Continuation lexica for irregular verbs

The verb lexica themselves

The rest of the file contains some 5500 verbs.

Irregular verbs

v1 sanoa, lukea

v2 tryykätä

v3 syödä, juoda

v4 tulla, mennä

v5 tarvita

v6 paeta

Then comes the long list


This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


src-fst-phonetics-txt2ipa.xfscript.md

retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v)
alveolar approximant r\
retroflex approximant r`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L

Clicks

bilabial O\ (O = capital letter)
dental |
(post)alveolar !\
palatoalveolar =\
alveolar lateral ||

Ejectives, implosives

ejective > e.g. ejective p p>
implosive < e.g. implosive b b<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa ə @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress “
secondary stress %
long :
half-long :\
extra-short _X
linking mark -

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L

contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)
global rise
global fall

Diacritics

voiceless 0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _”
syllabic = (or _=) e.g. n= (or n=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ‘ (or _j) e.g. t’ (or t_j)
velarized _G
pharyngealized _?\

dental d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in Tornedalen Finnish are read out, e.g. for text-to-speech systems.

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

Number transcriptions

This file is copied from the Finnish one. It should thus be Meänkielified. Transcribing numbers to words in Finnish is not completely trivial. One reason is that numbers in Finnish are written as compounds, regardless of length: 123456 is satakaksikymmentäkolmetuhattaneljäsataaviisikymmentäkuusi. Another complication is that inflection can be unmarked in running text, that is, a digit expression is assumed to agree in case with the phrase it is in. For example, 27 is kaksikymmentäseittemän and 27:lle is kahdellekymmenelleseittemälle, but in the phrase “tarjosin 27 osanottajalle” 27 takes the allative case without marking, and this is the preferred grammatical form in good writing.

Tags

Flag diacritics

Flag diacritics in number transcribing are used to control case agreement: in Finnish numeral compounds all words agree in case except in nominative singular where 10’s exponential multipliers are in singular partitive.

Lexica

Morphotactics of digit strings

To transcribe a number we need to know the length of the whole digit string in order to know where to start reading; zeroes are not read out but do affect the readout. The numerals are systematic and perfectly compositional: the implementation of 100 000–999 999 is almost exactly the same as that of 100 000 000–999 000 000 and everything afterwards, with the change of word tuhat~tuhatta, miljoona~miljoonaa, miljardia, biljoonaa, biljardia and so forth, that is, following the long-scale British (French) system where American billion = milliard etc. The numbers are built from blocks of roughly single-word length in decreasing order, with the exception of zig-zagging over the numbers 11–19, where the second digit comes before the first. The rest of this documentation describes the morphotactic implementation by the lexicon structure in descending order of magnitude, with examples.
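The compositional structure can be illustrated with a short Python sketch for 1–999 999 in the nominative. It uses standard Finnish number words as an assumption (Meänkieli forms may differ; this documentation itself writes kaksikymmentäseittemän), it does not handle case inflection, and it is unrelated to the actual lexc implementation. The function names are invented.

```python
ONES = ["", "yksi", "kaksi", "kolme", "neljä", "viisi",
        "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]


def below_thousand(n):
    """Spell out 0-999 as one compound (empty string for 0)."""
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if hundreds == 1:
        parts.append("sata")
    elif hundreds > 1:
        parts.append(ONES[hundreds] + "sataa")    # multiplier in the partitive
    if rest == 10:
        parts.append("kymmenen")
    elif 11 <= rest <= 19:
        parts.append(ONES[ones] + "toista")       # zig-zag: second digit is read first
    else:
        if tens > 1:
            parts.append(ONES[tens] + "kymmentä")
        if ones:
            parts.append(ONES[ones])
    return "".join(parts)


def to_words(n):
    """Spell out 1-999999; zero digits are skipped, they are never read out."""
    if not 0 < n < 1_000_000:
        raise ValueError("this sketch covers 1-999999 only")
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands == 1:
        parts.append("tuhat")
    elif thousands > 1:
        parts.append(below_thousand(thousands) + "tuhatta")
    parts.append(below_thousand(rest))
    return "".join(parts)
```

The thousands block reuses `below_thousand` unchanged, which is the sense in which the blocks repeat at each order of magnitude with only the multiplier word changing.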

Lexicon HUNDREDSMRD contains numbers 2-9 that need to be followed by exactly 11 digits: 200 000 000 000–999 999 999 999 this is to implement Nsataa…miljardia…

Lexicon CUODIMRD contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…miljardia…

Lexicon HUNDREDMRD is for numbers in range: 100 000 000 000–199 000 000 000. This is to implement sata…miljardia…

Lexicon TEENSMRD is for numbers with 11 000 000 000–19 000 000 000. This is to implement …Ntoista…miljardia…

Lexicon TEENMRD is for numbers with 11 000 000 000–19 000 000 000. This is to implement …Ntoista…miljardia…

Lexicon TENSMRD is for numbers with 20 000 000 000–90 000 000 000. This is to implement …Nkymmentä…miljardia…

Lexicon TENMRD is for numbers with 10 000 000 000–10 999 999 999. This is to implement …kymmenenmiljardia…

Lexicon LÅGEVMRD is for numbers with 20 000 000 000–90 000 000 000. This is to implement …Nkymmentä…miljardia…

Lexicon ONESMRD is for numbers with 1 000 000 000–9 000 000 000. This is to implement …Nmiljardia…

Lexicon MILJARD is for numbers with 1 000 000 000–9 000 000 000. This is to implement …Nmiljardia

Lexicon OVERMILLIONS is for the millions part of numbers greater than 1 milliard

Lexicon HUNDREDSM contains numbers 2-9 that need to be followed by exactly 8 digits: 200 000 000–999 999 999. This is to implement Nsataa…miljoonaa…

Lexicon CUODIM contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…miljoonaa…

Lexicon HUNDREDM is for numbers in range: 100 000 000–199 000 000. This is to implement sata…miljoonaa…

Lexicon TEENSM is for numbers with 11 000 000–19 000 000. This is to implement …Ntoista…miljoonaa…

Lexicon TEENM is for numbers with 11 000 000–19 000 000. This is to implement …Ntoista…miljoonaa…

Lexicon TENSM is for numbers with 20 000 000–90 000 000. This is to implement …Nkymmentä…miljoonaa…

Lexicon TENM is for numbers with 10 000 000–10 999 999. This is to implement …kymmenenmiljoonaa…

Lexicon LÅGEVM is for numbers with 20 000 000–90 000 000. This is to implement …Nkymmentä…miljoonaa…

Lexicon ONESM is for numbers with 1 000 000–9 000 000. This is to implement …Nmiljoonaa…

Lexicon MILJON is for numbers with 1 000 000–9 000 000. This is to implement …Nmiljoonaa

Lexicon UNDERMILLION is for numbers with 100 000–900 000 after milliards

Lexicon OVERTHOUSANDS is for the thousands part of numbers greater than 1 million

Lexicon HUNDREDST contains numbers 2-9 that need to be followed by exactly 5 digits: 200 000–999 999. This is to implement Nsataa…tuhatta…

Lexicon CUODIT contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…tuhatta…

Lexicon HUNDREDT is for numbers in range: 100 000–199 000. This is to implement sata…tuhatta…

Lexicon TEENST is for numbers with 11 000–19 000. This is to implement …Ntoista…tuhatta…

Lexicon TEENT is for numbers with 11 000–19 000. This is to implement …Ntoista…tuhatta…

Lexicon TENST is for numbers with 20 000–90 000. This is to implement …Nkymmentä…tuhatta…

Lexicon TENT is for numbers with 10 000–10 999. This is to implement …kymmenentuhatta…

Lexicon LÅGEVT is for numbers with 20 000–90 000. This is to implement …Nkymmentä…tuhatta…

Lexicon ONEST is for numbers with 1 000–9 000. This is to implement …Ntuhatta…

Lexicon THOUSANDS is for numbers with 1 000–9 000. This is to implement …Ntuhatta

Lexicon THOUSAND is for the ones-tens-hundreds of numbers greater than thousand

Lexicon UNDERTHOUSAND is for numbers with 100–900 after thousands

Lexicon HUNDREDS contains numbers 2-9 that need to be followed by exactly 2 digits: 200–999. This is to implement Nsataa…
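The "followed by exactly N digits" conditions stated for HUNDREDSMRD, HUNDREDSM, HUNDREDST and HUNDREDS amount to choosing an entry point from the length of the digit string. A minimal Python sketch of that selection is given below. The dictionary and function names are hypothetical, and only the four lexica whose digit counts are stated above are included.

```python
# Digits that must follow a leading 2-9, mapped to the lexicon that reads it.
# Counts are taken from the lexicon descriptions above.
HUNDREDS_LEXICA = {
    11: "HUNDREDSMRD",  # 200 000 000 000–999 999 999 999
    8: "HUNDREDSM",     # 200 000 000–999 999 999
    5: "HUNDREDST",     # 200 000–999 999
    2: "HUNDREDS",      # 200–999
}


def hundreds_lexicon(digits):
    """Return the hundreds lexicon for a digit string, or None if none applies."""
    if not digits or digits[0] not in "23456789":
        return None  # a leading 0 or 1 is handled elsewhere
    return HUNDREDS_LEXICA.get(len(digits) - 1)


print(hundreds_lexicon("456789"))  # HUNDREDST
print(hundreds_lexicon("456"))     # HUNDREDS
```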

Lexicon CUODI contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa

Lexicon HUNDRED is for numbers in range: 100–999

Lexicon TEENS is for numbers with 11–19. This is to implement …Ntoista

Lexicon TEEN is for numbers with 11–19. This is to implement …Ntoista

Lexicon TENS is for numbers with 20–90. This is to implement …Nkymmentä…

Lexicon LÅGEV is for numbers with 20–90. This is to implement …Nkymmentä

Lexicon JUSTTEN is for number 10. This is to implement …kymmenen

Lexicon ONES is for numbers with 1–9. This is to implement yksi, kaksi, kolme, …, yheksän

Lexicon ZERO is for number 0: nolla

Lexicon LOPPU is to implement potential case inflection with a colon.
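To illustrate what the colon-marked inflection looks like on the input side, here is a small Python sketch that splits a token such as 27:lle into its digit string and case ending. It is not part of the transducer. The function name and the `SUFFIX_TO_CASE` table are hypothetical, and the table holds only the one ending mentioned in this documentation.

```python
# Hypothetical ending table; only the ending mentioned in the text is listed.
SUFFIX_TO_CASE = {"lle": "allative"}


def split_inflected_number(token):
    """Split '27:lle' into ('27', 'lle', 'allative').

    A token without a colon is unmarked: its case is returned as None,
    because it has to be taken from the surrounding phrase.
    """
    digits, colon, suffix = token.partition(":")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected digits before the optional colon")
    case = SUFFIX_TO_CASE.get(suffix) if colon else None
    return digits, suffix, case


print(split_inflected_number("27:lle"))  # ('27', 'lle', 'allative')
print(split_inflected_number("27"))      # ('27', '', None)
```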


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

[ L A N G U A G E ] G R A M M A R C H E C K E R

DELIMITERS

TAGS AND SETS

Tags

This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT PUNCT

COMMA ¶

Tags for POS sub-categories

Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

Tags for morphosyntactic properties

Nom Acc Gen Ill Loc Com Ess Ess Sg Du Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

Err/Orth

Semantic tags

Sem/Act Sem/Ani Sem/Atr Sem/Body Sem/Clth Sem/Domain Sem/Feat-phys Sem/Fem Sem/Group Sem/Lang Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

HUMAN

PROP-ATTR PROP-SUR

TIME-N-SET

Syntactic tags

@+FAUXV @+FMAINV @-FAUXV @-FMAINV @-FSUBJ> @-F<OBJ @-FOBJ> @-FSPRED<OBJ @-F<ADVL @-FADVL> @-F<SPRED @-F<OPRED @-FSPRED> @-FOPRED> @>ADVL @ADVL< @<ADVL @ADVL> @ADVL @HAB> @<HAB @>N @Interj @N< @>A @P< @>P @HNOUN @INTERJ @>Num @Pron< @>Pron @Num< @OBJ @<OBJ @OBJ> @OPRED @<OPRED @OPRED> @PCLE @COMP-CS< @SPRED @<SPRED @SPRED> @SUBJ @<SUBJ @SUBJ> SUBJ SPRED OPRED @PPRED @APP @APP-N< @APP-Pron< @APP>Pron @APP-Num< @APP-ADVL< @VOC @CVP @CNP OBJ

-OTHERS SYN-V @X

## Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the sourcefile itself to inspect the sets, what follows here is an overview of the set types.

### Sets for Single-word sets

INITIAL

### Sets for word or not

WORD NOT-COMMA

### Case sets

ADLVCASE CASE-AGREEMENT CASE NOT-NOM NOT-GEN NOT-ACC

### Verb sets

NOT-V

### Sets for finiteness and mood

REAL-NEG MOOD-V NOT-PRFPRC

### Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

### Pronoun sets

### Adjectival sets and their complements

### Adverbial sets and their complements

### Sets of elements with common syntactic behaviour

### NP sets defined according to their morphosyntactic features

### The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression **WORD - premodifiers**.

### Border sets and their complements

### Grammarchecker sets

* * *

This (part of) documentation was generated from [tools/grammarcheckers/grammarchecker.cg3](https://github.com/giellalt/lang-fit/blob/main/tools/grammarcheckers/grammarchecker.cg3)

---

# tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

# Tokeniser for fit

Usage:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`

Unknowns are made of:

* lower-case ASCII
* upper-case ASCII
* select extended latin symbols
* ASCII digits
* select symbols
* Combining diacritics as individual symbols,
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

## Unknown handling

Unknowns are tagged ?? and treated specially with `hfst-tokenise`.

hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

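As a rough illustration of the whitespace class described earlier in this section, the Python sketch below builds the same set of code points and tests single characters against it. It is not derived from the pmscript source. The names are hypothetical, and the exact set of ASCII white space characters used by the tokeniser is an assumption.

```python
# ASCII white space (assumed set: space, tab, newline, CR, VT, FF).
ASCII_WHITESPACE = set(" \t\n\r\x0b\x0c")

# Unicode white space listed above.
UNICODE_WHITESPACE = {chr(cp) for cp in range(0x2000, 0x200D + 1)}  # En Quad .. Zero-Width Joiner
UNICODE_WHITESPACE |= {
    "\u202f",  # Narrow No-Break Space
    "\u205f",  # Medium Mathematical Space
    "\u2060",  # Word joiner
}

TOKENISER_WHITESPACE = ASCII_WHITESPACE | UNICODE_WHITESPACE


def is_tokeniser_whitespace(ch):
    """Return True if the single character ch is in the whitespace class."""
    return ch in TOKENISER_WHITESPACE


print(is_tokeniser_whitespace("\u2003"))  # True (Em Space, inside U+2000..U+200D)
print(is_tokeniser_whitespace("\u00ad"))  # False (soft hyphen is listed separately)
```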
Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.pmscript](https://github.com/giellalt/lang-fit/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

# Grammar checker tokenisation for fit

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1) unknown word-like forms, and
2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript](https://github.com/giellalt/lang-fit/blob/main/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

# TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1) unknown word-like forms, and
2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-tts-cggt-desc.pmscript](https://github.com/giellalt/lang-fit/blob/main/tools/tokenisers/tokeniser-tts-cggt-desc.pmscript)