Võro NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-vro


Võro language model documentation

All doc-comment documentation in one large file.


src-cg3-dependency.cg3.md

COMMON SÁMI DEPENDENCY GRAMMAR

This dep file is for sma, sme, smj, sje.

DELIMITERS

Sentence delimiters are the following: <.> <!> <?> <…> <¶>

TAGS AND SETS

N V A Adv CC CS Inf Sup Neg Num Po Pr

Pcle Prop

Pron IV TV COMMA DASH CITATION to keep colouring we add a “ HYPHEN QMARK PUNCT LEFT RIGHT CLB Ind Pot Impr ImprtII Cond ConNeg Caus causative eus VGen Interj ABBR ACR Prs Prt Cmpnd RCmpnd PrfPrc PrsPrc Actor Actio Ger Indef Nom Acc Ill Com Gen Ess

IM For fao

POS sub-categories

Syntactic tags and sets

Syntactic tags in input to this file

Syntactic tags added in this file

fao syntags

kal syntags

eus syntags

Syntactic set definitions

Dep grammar

Correction rules

The finite verb

Mapping rules

lgRemove removes the language tags , , etc, before proceeding to the dep file.


This (part of) documentation was generated from src/cg3/dependency.cg3


src-cg3-disambiguator.cg3.md

Disambiguator for Võro

Sets

Sentence delimiters are the following: “<.>” “<…>” “<!>” “<?>” “<¶>”
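
To illustrate what the delimiter list does, here is a minimal Python sketch that cuts a stream of wordforms into windows at each delimiter. It is not part of the grammar: the function name `split_windows` and the sample wordforms are assumptions made for illustration, and the real windowing is done by the CG-3 engine.

```python
# Delimiter wordforms as listed in the documentation above.
DELIMITERS = {"<.>", "<…>", "<!>", "<?>", "<¶>"}

def split_windows(wordforms):
    """Group wordforms into windows, closing a window at each delimiter."""
    windows, current = [], []
    for wordform in wordforms:
        current.append(wordform)
        if wordform in DELIMITERS:
            windows.append(current)
            current = []
    if current:  # trailing wordforms without a final delimiter
        windows.append(current)
    return windows

# Hypothetical input stream, for illustration only.
stream = ["<pini>", "<.>", "<kana>", "<?>", "<talo>"]
windows = split_windows(stream)
```

Each disambiguation rule then only sees the cohorts inside one such window.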

Part-of-Speech

Numerus

Cases

Types

Sets with more members

Boundaries

Verbs

Disambiguation rules

Dialects

Early rules

Possessive suffixes

Numeral phrases

Preposition/postposition/adverb rules

Rules for mapping @CVP and @CNP on the CC and CS

Case rules

Partitive

Genitive

Illative

Number rules

More disambiguation rules

Elative

Propernouns

Verbs

Specific verbs

ei negation verb

eli

Adverbs

paljon

kerran

jälkhiin

Adjectives

Conjunctions

Subjunctions

että

jos

ko

sillä

Pronouns

Verb rules, Verbs

Infinitive

Present Sg3

Present Pl3 or PrsPrc

Present Pl3 or Passive

Imperative

Past tense

Prt Pl3 or Prt Sg2

Negative verb

Relative pronouns

HNOUN MAPPING


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions.cg3.md

SYNTACTIC FUNCTIONS FOR SÁMI

Sámi language technology project 2003-2018, University of Tromsø

This file adds syntactic functions. It is common for all the Saami

LEFT RIGHT because of apertium

Syntactic tags

Tag sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) …, meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
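
The scan described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the premodifier tags in `PREMODIFIERS` are hypothetical (the real set is defined in functions.cg3), each cohort is reduced to a plain set of tags, and the target is tested before the barrier on every cohort.

```python
# Hypothetical premodifier tags; the real set lives in functions.cg3.
PREMODIFIERS = {"A", "Num", "Pron"}

def is_not_npmod(tags):
    """NOT-NPMOD: any word that cannot be an NP premodifier."""
    return not (tags & PREMODIFIERS)

def scan_to_noun(cohorts, position):
    """Mimic (*1 N BARRIER NOT-NPMOD): index of the next NP head, or None."""
    for index in range(position + 1, len(cohorts)):
        if "N" in cohorts[index]:
            return index          # found the head noun
        if is_not_npmod(cohorts[index]):
            return None           # barrier reached before any noun
    return None

# Hypothetical sentence: verb, numeral, adjective, noun, adverb, noun.
sentence = [{"V"}, {"Num"}, {"A"}, {"N"}, {"Adv"}, {"N"}]
head_after_verb = scan_to_noun(sentence, 0)
head_after_noun = scan_to_noun(sentence, 3)
```

From the verb, the scan skips the numeral and the adjective and lands on the noun. From that noun, the adverb acts as a barrier, so the second noun is not reached.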

ADLVCASE

These were the set types.

Numeral outside the sentence

HABITIVE MAPPING

sma object

SUBJ MAPPING - leftovers

OBJ MAPPING - leftovers

MAPPING for MT - experimental

HNOUN MAPPING

missingX adds @X to all missings

therestX adds @X to everything that is left, often erroneously disambiguated forms

For Apertium:

The analyser gives two analyses because semtags (semantic tags) are optional. We go for the one with the semtag.
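
A minimal Python sketch of that choice, assuming readings are plain tag strings and that semantic tags carry a `Sem/` prefix. The sample readings and the function name `prefer_semtag` are hypothetical; the actual selection is done by rules in functions.cg3.

```python
def has_semtag(reading):
    """True if any tag in the reading looks like a semantic tag."""
    return any(tag.startswith("Sem/") for tag in reading.split("+"))

def prefer_semtag(readings):
    """Keep readings with a semantic tag; fall back to all readings otherwise."""
    with_semtag = [reading for reading in readings if has_semtag(reading)]
    return with_semtag or list(readings)

# Hypothetical double analysis of one wordform.
readings = ["kotus+N+Sem/Plc+Sg+Nom", "kotus+N+Sg+Nom"]
chosen = prefer_semtag(readings)
```

The fallback keeps every reading when none of them has a semantic tag, so no cohort is left empty.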


This (part of) documentation was generated from src/cg3/functions.cg3


src-fst-morphology-affixes-adjectives.lexc.md

Adjective inflection

Võro adjectives inflect for comparison.

LEXICON A_1HANS1A 1 hanśa:hanśa

LEXICON A_1HERRAE 1 herrä:herrä

LEXICON A_2ARTIKLI suhvli:suhvli

LEXICON A_2KERGE 1 kerge:

LEXICON A_3ALADU aladu:aladu

LEXICON A_3PERAEDUE perädü:perädü

LEXICON A_4AINUS ainus:ainus

LEXICON A_11AINWQ ainõq:ainõ

LEXICON A_11KELMEQ kelmeq:kelme

LEXICON A_13ALONW alonõ:alo

LEXICON A_13TAEHINE tähine:tähi

LEXICON A_13TAEHINE_PL tähine:tähi

LEXICON A_14RITS1KAS ritśkas:ritśka%{sØ%}

LEXICON A_14HAMMAS rikas:ri%{kØ%}ka%{sØ%}

LEXICON A_14IKAES rikas:ri%{kØ%}ka%{sØ%}

LEXICON A_16ABILINW inemine:inemi

LEXICON A_16INEMINE inemine:inemi

LEXICON A_19ALOMANW alomanõ:aloma

LEXICON A_19PEDAEJAENE pedäjäne:pedäjä

LEXICON A_19PEDAEJAENE_PL pedäjäne:pedäjä

LEXICON A_22VWROKWNW võrokõnõ:võrokõ

LEXICON A_22VAEHAEKENE võrokõnõ:võrokõ

gradation: no

gradation: yes

gradation: no


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-adverbs.lexc.md

Adverbs

The Võro language adverbs…

Spatial adverbs

adjective modifiers

What is this 2017-03-27


This (part of) documentation was generated from src/fst/morphology/affixes/adverbs.lexc


src-fst-morphology-affixes-nouns.lexc.md

Noun inflection for Võro

LEXICON N_1HANS1A 1 hanśa:hanśa

LEXICON N_1VIU 1 viu:viu

LEXICON N_1HERRAE 1 herrä:herrä

LEXICON N_1PREI 1 prei:prei

LEXICON N_3PERAEDUE perädü:perädü

LEXICON N_3ALADU aladu:aladu

kipõń:kipõn

allaś:allas

sinneĺ:sinnel

veteĺ:vetel

tukõv:tukõv

verrev:verrev

sallai:sallai

elläi:elläi

herre:herre

villõ:villõ

LEXICON N_10HWRAK hõrak:hõrak

LEXICON N_10HAIDAK haidak:haidak

LEXICON N_10ESAEK esäk:esäk

LEXICON N_10RAAMAT raamat:raamat

LEXICON N_10LEMBIT esäk:esäk

LEXICON N_10AABITS aabits:aabits

LEXICON N_10HEERITS heerits:heerits

LEXICON N_10AADRWS1 aadrõś:aadrõs

LEXICON N_10AMMAT1 ammat́:ammat

LEXICON N_10HUEPAETS1 hüpätś:hüpä%{td%}s

LEXICON N_11LAETEQ läteq:lä%{tØ%}te

LEXICON N_11ANNWQ annõq:andõ

LEXICON N_11AINWQ ainõq:ainõ

LEXICON N_11KELMEQ kelmeq:ainõ

LEXICON N_11VAIH vaih:vaih

LEXICON N_12NWKWS1 nõkõś:nõ%{kg%}õ%{sś%}

LEXICON N_13ALONW alonõ:alo

LEXICON N_13TAEHINE tähine:tähi

Gradation: No

LEXICON N_14RITS1KAS ritśkas:ritśka%{sØ%}

Gradation: No

LEXICON N_14HAMMAS hammas:ham%{bm%}a%{sØ%}, saabas:saapa distinguished from 14RITS1KAS due to gradation

LEXICON N_14IKAES ikäs:ikkä distinguished from 14RITS1KAS due to gradation

LEXICON N_14NUMMWR1 nummõŕ:numbõr distinguished from 14RITS1KAS due to gradation

vowel_harmony: front gradation: yes

vowel_harmony: back gradation: yes

kotus:kotus

kotus:kotus

LEXICON N_16INEMINE inemine:inemi

LEXICON N_16ABILINW abilinõ:abili

LEXICON N_16TERAEKENE inemine:inemi

LEXICON N_16TSIRGUKWNW tsirgukõnõ:abili

LEXICON N_19ALOMANW alomanõ:aloma

LEXICON N_19PEDAEJAENE pedäjäne:pedäjä

LEXICON N_20LATS1 latś:lat%{sś%}

LEXICON N_20TAEUES1 täüś:täü

LEXICON N_20VIIS1 täüś:täü

LEXICON N_20ORS1 täüś:täü

LEXICON N_20HIRS1 täüś:täü

LEXICON N_20VAEITS1 väitś:väits

LEXICON N_20KUEUEDS1 küüdś:küüds

LEXICON N_20MIIS1 miiś:m

LEXICON N_21HUEDSI hüdsi:hü

LEXICON N_21KUSI kusi:kus

LEXICON N_22VWROKWNW võrokõnõ:võrokõ

LEXICON N_22NAANW naanõ:naa

LEXICON N_22VAEHAEKENE vähäkene:vähäke

gradation: yes

tarõ:tar

uma:uma

pesä:pesä

nimi:ni%{mØ%}m

lumi:lum

LEXICON N_36TUUM1 tuuḿ:t%{ou%}%{ou%}m

LEXICON N_36HANG1 hanǵ:hang

LEXICON N_36SAERG1 särǵ:sär%{gǵØ%}

LEXICON N_36LAHT1 laht́:lah%{tt́Ø%}

LEXICON N_36PAEIV päiv:päiv

LEXICON N_36LEIB päiv:päiv

kogõr:kogõr

kokr:ko%{kg%}r

sõbõr:sõbõr oblique plural in o

kubõl:ku%{pb%}õl oblique plural in õ

LEXICON N_37PINI pini:pini

LEXICON N_37WLI õli:õli

LEXICON N_37MUNA muna:mu%{nØ%}na

LEXICON N_40TALO talo:ta%{lØ%}lo

LEXICON N_40HELUE helü:helü

LEXICON N_40UJA uja:u%{jØ%}ja

LEXICON N_40IJAE ijä:i%{jØ%}jä

LEXICON N_40SAVV savv:savvu

LEXICON N_40TUEKK tükk:tü%{kØ%}kü

LEXICON N_41JUHT1 juht́:juht

LEXICON N_41AIG aig:a

LEXICON N_41ASK aig:aig

LEXICON N_41MAENG aig:aig

LEXICON N_41VIIT aig:aig

LEXICON N_43KANARIK usklik+A:%{ˋØ%}#uskli%{kØ%}%{kg%}

LEXICON N_43ELAENIK elänik+N:eläni%{kØ%}%{kg%}

LEXICON N_43SASLWK1

LEXICON N_43APRIL1

LEXICON N_43SEKRETAER1

LEXICON N_43AASTAK

LEXICON N_44SWDA sõda:sõ%{tØ%}%{tdØ%}a

LEXICON N_45KANA kana:ka%{nØ%}na

LEXICON N_45RIHAE rihä:ri%{hØ%}hä

LEXICON N_46HAIN hain:hain

LEXICON N_46TARK tark:tark

LEXICON N_47ASI asi:asi

LEXICON N_47VELI veli:ve%{lØ%}l

LEXICON N_47KIRI kiri:kiri

NOMINAL DECLENSIONS

LEXICON NMN_1HANS1A 1 hanśa:hanśa

in d

LEXICON NMN_1HERRAE 1 herrä:herrä

in d

LEXICON NMN_3PERAEDUE perädü:perädü

LEXICON NMN_3ALADU aladu:aladu

ainus:ainus

Secondary

kuldnõ:kuld

Secondary

Secondary

Secondary

Secondary

Secondary

Secondary

Secondary

LEXICON NMN_9KIPWN1/ELLAEI kipõń:kipõń fixme 2016-08-27

LEXICON NMN_9ALLWV1/XX allõv́:ki%{pb%}õ%{nń%}

LEXICON NMN_9ALLAS1/SINNEL1 allaś:allas

LEXICON NMN_9TUKWV/VERREV tukõv:tu%{kg%}õv

LEXICON NMN_9SALLAI/ELLAEI elläi:e%{lØ%}lä%{ij%}

SHOULD THIS BE HERE, c.f. yaml

LEXICON NMN_9TAHHE/HERRE tahhe:ta%{hØ%}he

LEXICON NMN_9VILLW/XX villõ:vi%{lØ%}lõ

Noun (10) perit

vowel_harmony: ONLY FRONT N-lembit10

N-hwrak10

LEXICON NMN_11AINWQ/KELMEQ ainõq:ainõ

A-vaih11

LEXICON NMN_11ANNWQ/LAETEQ läteq:lä%{tØ%}te

A-ainwq11

N-repaenj12

N-suekues12

N-suekues12

LEXICON NMN_13ALONW/TAEHINE alonõ:alo

A-alonw

LEXICON NMN_13VAHTSWNW vahtsõnõ:vah

A-vahtswnw

LEXICON NMN_13XX/SAEAENE sääne:sää A-alonw

Distinguished from 14RITS1KAS due to gradation Yaml: N-hammas_gt-norm.yaml

Distinguished from 14RITS1KAS due to word final h vowel_harmony_variant: hamõh Yaml: N-pereh_gt-norm.yaml

LEXICON NMN_16ABILINW/INEMINE inemine:inemi abilinõ:abili

LEXICON NMN_16TSIRGUKWNW/TERAEKENE inemine:inemi tsirgukõnõ:abili

LEXICON NMN_19ALOMANW/PEDAEJAENE alomanõ:aloma

LEXICON NMN_22NAANW naanõ:naa

LEXICON NMN_22VWROKWNW/VAEHAEKENE vähäkene:vähäke

nimi:nim

LEXICON NMN_46SWBWR sõbõr:sõbõr Oblique plural in o

kubõl:ku%{pb%}õl Oblique plural in õ

pini:pi%{nØ%}ni

pini:pi%{nØ%}ni

pung:pung

kuld:kul%{dl%}

kuld:kul%{dl%}

Derived from PUHM, Gradation=”yes”, stem=”+Sg+Nom” stem_vowel=”o”

LEXICON NMN_46HAIN jalg:jalg gradation: no

LEXICON NMN_46TARK jalg:jal%{gØ%} gradation: yes

SINGULAR GENITIVE STEMS

PLURAL ALLATIVE STEMS

TAGS THAT CAN BE FOLLOWED BY CLITICS “K”

PLURAL TAGS

SINGULAR TAGS

LEXICON Harm_Neutr_SG_INE_hn RARE

TAGS THAT CANNOT BE FOLLOWED BY CLITICS

CASES ONLY

TAGS THAT CAN BE FOLLOWED BY CLITICS

TAGS WITH NO ADDED MORPHOLOGY THAT CANNOT BE FOLLOWED BY CLITICS

digits


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-numerals.lexc.md

Noun inflection for Võro

kipõnʼ:kipõn

allaś:allas

veteĺ:vetel

tukõv:tukõv

elläi:elläi

verrev:verrev

gradation: no

gradation: yes distinguished from 14RITS1KAS due to gradation

distinguished from 14RITS1KAS due to word final h

distinguished from 14RITS1KAS due to word final h

kotus:kotus

inemine:inemi

abilinõ:abili

LEXICON NUM_22VWROKWNW võrokõnõ:võrokõ

LEXICON NUM_22NAANW naanõ:naa

Gradation: No

vowel_harmony: front

Gradation: No

tarõ:tar

pesä:tar

nimi:nim

kokr:ko%{kg%}r

sõbõr:sõbõr

LEXICON NUM_43KANARIK

LEXICON NUM_44SWDA sõda:sõda

vro-digits


This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


src-fst-morphology-affixes-postpositions.lexc.md

Postpositions

The Võro language postpositions …

POSTPOSITIONS WITH READY CASE ENDINGS


This (part of) documentation was generated from src/fst/morphology/affixes/postpositions.lexc


src-fst-morphology-affixes-pronouns.lexc.md

Pronoun inflection

The Võro language pronouns inflect in the same cases as regular nouns, but with a colon (‘:’) as separator.

PERSONAL PRONOUN

CHECKME vowel harmony

LEXICON PERS_PL1 maq:m

LEXICON PERS_PL2 saq:

LEXICON PERS_PL3 timä:

DEMONSTRATIVE PRONOUNS

INDEFINITE PRONOUNS

INTERROGATIVE PRONOUNS


This (part of) documentation was generated from src/fst/morphology/affixes/pronouns.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper noun inflection

The Võro language proper nouns inflect in the same cases as regular nouns, but with a colon (‘:’) as separator.

LEXICON PROP_1HANS1A 1 hanśa:hanśa

LEXICON PROP_1VIU 1 viu:viu

LEXICON PROP_1HERRAE 1 herrä:herrä

LEXICON PROP_3ALADU aladu:aladu

LEXICON PROP_VERE Rakvere:Rakv

harmony: front

kipõń:kipõń

sallai:sallai

elläi:elläi

tukõv:tukõv

LEXICON PROP_10AMEERIGA Ameeriga:Ameerik cf. _10HWRAK

LEXICON PROP_10ESAEK esäk:esäk

LEXICON PROP_10LEMBIT Lembit:Lembi%{td%}

LEXICON PROP_10VIDRIK vidrik:vidrik gradation: no

Gradation: No

Gradation: No

Gradation: No

LEXICON PROP_14HAMMAS hammas:hamba, saabas:saapa gradation: yes distinguished from 14RITS1KAS due to gradation

distinguished from 14RITS1KAS due to word final h

distinguished from 14RITS1KAS due to word final h

kotus:kotus

kotus:kotus

kotus:kotus

Gradation: No

LEXICON PROP_16ABILINW abilinõ:abili

Gradation: No

Gradation: No

Gradation: No

gradation: yes vowel_harmony: front

gradation: yes vowel_harmony: front

gradation: yes

Gradation: No

gradation: yes

gradation: yes

tarõ:tar

nimi:nim

pesä:pesä

pesä:pesä

LEXICON PROP_36TUUM1 tuuḿ:t%{ou%}%{ou%}m :%{back%} NMN_36TUUM1/XX1-SG_OBL ; This allows for place names, which, for the most part, have nominative singulars that are identical to their genitive singulars.

LEXICON PROP_36SAERG1 särǵ:särgʼ

LEXICON PROP_36PAEIV päiv:päiv

kogõr:kogõr

LEXICON PROP_37PINI pini:pini

LEXICON PROP_37WLI pini:pini

LEXICON PROP_40TALO talo:talo

LEXICON PROP_40UJA uja:uja

LEXICON PROP_41ASK ask:asko

LEXICON PROP_44SWDA sõda:sõda

LEXICON PROP_46HAIN hain:hain


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-quantifiers.lexc.md

Quantifier inflection

The Võro language quantifiers inflect in cases.


This (part of) documentation was generated from src/fst/morphology/affixes/quantifiers.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Verb inflection

Võro language verbs inflect for person and number.

There are other verbs here, cf. V_ELAEMAE

There are other verbs here, cf. V_ELAEMAE

There are other verbs here, cf. V_ELAEMAE

There are other verbs here, cf. V_ELAEMAE

There are other verbs here, cf. V_ELAEMAE

Pss_PrfPrc: sadat

taplõma:tapõl

võitlõma:võitõl

kullõma+V:ku%{lØ%}l%{õØ%}%{lĺ%}

+Pss+Ind+Prs+Sg1, +Pss+Ind+Prt+Sg1 +Pss+PrsPrc, +Pss+PrfPrc

+Act+Ind+Prs+Sg1, +Act+Ind+ConNegII, +Act+Imprt+Sg2 +Act+Ind+Prs+Neg, +Act+Ind+Prt+Neg, +Act+Ind+ConNegI

+Act+Ind+Prs+Sg3, +Act+Ind+Prs+Pl3

+Act+Ind+Prs+Sg2, +Err/Dial+Act+Ind+Prs+Sg2, +Act+Ind+Prs+Pl1, +Act+Ind+Prs+Pl2

+Act+Ind+Prt+Sg3

argnõma:arg

+Pss+Ind+Prt +Sg1-+Pl3, ConNeg

THIS FAR 2016-08-27

Act_Ind_Prs_Pl3: essüseq

V_Inf/mA: miildümä

Pss+PrfPrc, Pss+PrsPrc

Retain consonant and stem vowel

Weaken consonant and semi-retention of stem vowel

Act+Ind+Prs+Sg1/Sg2/Pl1/Pl2, Ind+ConNegII, Ind+Prs+ConNeg Pss+Ind

Retain consonant and stem vowel

Weaken consonant and replace stem vowel with i

Retain consonant remove stem vowel and add i

+Jus

Pss+PrfPrc, Pss+PrsPrc

Retain consonant and stem vowel

Weaken consonant and semi-retention of stem vowel

Act+Ind+Prs+Sg1/Sg2/Pl1/Pl2, Ind+ConNegII, Ind+Prs+ConNeg Pss+Ind

Retain consonant and stem vowel

Weaken consonant and replace stem vowel with i

Retain consonant remove stem vowel and add i

+Jus

Retain consonant and stem vowel

Pss+PrfPrc, Pss+PrsPrc

Weaken consonant and semi-retention of stem vowel

Weaken consonant and semi-retention of stem vowel

Act+Ind+Prs+Sg1/Sg2/Pl1/Pl2, Ind+ConNegII, Ind+Prs+ConNeg Pss+Ind

Retain consonant and stem vowel

Weaken consonant and replace stem vowel with i

Retain consonant remove stem vowel and add i

Remainder is in exceptions.lexc minemä to go/ mennä

Retain consonant and stem vowel

Retain consonant and stem vowel

Strengthen consonant

Retain consonant and stem vowel

Retain consonant and add õ

Retain consonant and stem vowel

Strengthen consonant and replace stem vowel with i

consonant and add i

Retain consonant and stem vowel

Strengthen consonant

Retain consonant and stem vowel

Retain consonant

Retain consonant and add õ

Act+Ind+Prs+Sg1/Sg2/Pl1/Pl2, Ind+ConNegII, Ind+Prs+ConNeg Pss+Ind

Retain consonant and stem vowel

Strengthen consonant and replace stem vowel with i

Strengthen consonant and add ʼ

tegemä to do/ tehdä

nägemä to see/nähdä

IS THIS RIGHT? 2015-09-02

sõida

IS THIS RIGHT? 2015-09-02

sõida

HERE is the distinction 2016-10-04

IS THIS RIGHT? 2015-09-02

IS THIS RIGHT? 2015-09-02

IS THIS RIGHT? 2015-09-02

sõida

sõida

SETS BY CONSONANT QUALITY

INDICATIVE PRESENT ACTIVE CONJUGATION

JUS

CHECK THIS

PASSIVE INDICATIVE PRESENT CONJUGATION

INDICATIVE PRETERIT SUBJECT CONJUGATION

PASSIVE INDICATIVE PRETERIT CONJUGATION

NON-FINITES

PASSIVE DISTRIBUTION


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-clitics.lexc.md

Clitics in Võro


This (part of) documentation was generated from src/fst/morphology/clitics.lexc


src-fst-morphology-phonology.twolc.md

The Võro morphophonological/twolc rules file

This file documents the phonology.twolc file

Special letters

Vowel harmony with “(t)a/ä”

 %{aä%}:0    — Vowel harmony with "(t)a/ä" AÄ1:a AÄ1:ä AÄ1:0
 %{ae%}:a   — Vowel harmony with "a/e/õ" passive tahetu
 %{aõ%}:a   — Vowel harmony with "a/e/õ" passive sõidõtu
 %{äe%}:ä    — Vowel harmony with "ä/e/õ" passive
 %{eõ%}:0    — Vowel harmony with "e/õ"
 %{uü%}:0    — Vowel harmony with "u/ü"
 %{öü%}:ö    — Vowel raising
 %{ou%}:o    — Vowel raising
 %{ei%}:e    — Vowel raising
 %{õy%}:õ    — Vowel raising
 %{ao%}:a    — Vowel raising

 %{eØ%}:e    — ütlemä:üt%{eØ%}l  
 %{õØ%}:õ    — ütlemä:üt%{eØ%}l  
 %{Øõ%}:0    — juurdlõma:juur%{dØ%}%{0õ%}l

 %{dØ%}:d    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{dv%}:d    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{dn%}:d    — HJK and KimmoK ideas lammas:lam%{bm%}a%{sØ%}
 %{dl%}:d    — HJK and KimmoK ideas lammas:lam%{bm%}a%{sØ%}

 %{ij%}:i    ellä%{ij%}
 %{gv%}:g    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{gl%}:g    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{gØ%}:g    — HJK and KimmoK ideas argnõma:ar%{gØ%}
 %{uv%}:u    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{üv%}:ü    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{hØ%}:h    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{jØ%}:j    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kØ%}:k    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{lØ%}:l    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{mØ%}:m    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{nØ%}:n    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pØ%}:p    — HJK and KimmoK ideas oppama:o%{pØ%}pama
 %{rØ%}:r    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{sØ%}:s    — HJK and KimmoK ideas närväs:när%{bv%}ä%{sØ%}
 %{vØ%}:v    — HJK and KimmoK ideas kana:ka%{nØ%}na

 %{pØ%}:0    — häbü:häbü+N:hä%{pØ%}%{pbØ%}ü
 %{tØ%}:0    — koda:ko%{tØ%}%{tdØ%}a
 %{kØ%}:0    — nägo:nä%{kØ%}%{kgØ%}o

 %{bv%}:b    — HJK and KimmoK ideas närväs:när%{bv%}ä%{sØ%}
 %{dr%}:d    — HJK and KimmoK ideas parras:par%{dr%}a%{sØ%}
 %{bm%}:b    — HJK and KimmoK ideas lammas:lam%{bm%}a%{sØ%}
 %{pb%}:p    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pb%}:b    — HJK and KimmoK ideas kana:ka%{nØ%}na

 %{tØ%}:t    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{t́Ø%}:t    — HJK and KimmoK ideas jaht́lõma:jah%{t́Ø%}%{eØ%}%{lĺ%}
 %{td%}:t    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{t́d́%}:t́    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kg%}:k    — HJK and KimmoK ideas kaigas:kai%{kg%}as

 %{pbØ%}:p   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pbØ%}:b   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pbØ%}:0   — HJK and KimmoK ideas kana:ka%{nØ%}na

 %{pbv%}:p   %{pbv%}:b   %{pbv%}:v   — tõbi: tõvõ tõpõ tõppõ

 %{tdØ%}:d   — HJK and KimmoK ideas kana:ka%{nØ%}na

 %{kgØ%}:k   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kgØ%}:g   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kgØ%}:0   — HJK and KimmoK ideas kana:ka%{nØ%}na

 %{jiØ%}:i   — HJK and KimmoK ideas vari:var%{jiØ%}o
 %{qmn%}:q   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{qn%}:q    — HJK and KimmoK ideas kana:ka%{nØ%}na

 %{dd́Ø%}:d   
 %{dd́n%}:d   
 %{dd́r%}:d   
 %{dd́v%}:d   
 %{dd́Ø%}:d   
 %{gǵv%}:g   
 %{gǵØ%}:g   
 %{kḱg%}:k    %{kḱg%}:ḱ    %{kḱg%}:g   
 %{kḱØ%}:k   
 %{pṕb%}:p   %{pṕb%}:ṕ    %{pṕb%}:b   
 %{tt́d%}:t    %{tt́d%}:t́    %{tt́d%}:d   
 %{tt́Ø%}:t    täh%{tt́Ø%}
 %{pṕØ%}:p   

Palatalization of consonants

 %{bb́%}:b    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{cć%}:c    — HJK and KimmoK ideas Isaać:Isaa%{cć%}:ci
 %{dd́%}:d    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{ff́%}:f    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{gǵ%}:g    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{hh́%}:h    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kḱ%}:k    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{lĺ%}:l     — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{lĺ%}:ĺ     — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{mḿ%}:m    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{nń%}:n    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pṕ%}:p    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{rŕ%}:r    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{sś%}:s    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{sś%}:ś    — HJK and KimmoK ideas vaśma:va%{sØ%}%{sś%}
 %{tt́%}:t    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{vv́%}:v    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{ḱǵj%}:ḱ   — HJK and KimmoK ideas laǵa:la%{ḱǵj%}a
 %{zź%}:z    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{dd́n%}:d 

Miscellaneous other symbols

 %{XV%}:0    — This is used for echoing the previous vowel
 %{XC%}:0    — This is used for lengthening a consonant
 %^I7:0      — This appears in stem vaoma:va%^I7o for vaio
 %^K7:0      — This appears in stem väemä:vä%^K7e for väkeq
 %^V7:0      — This appears in stem häömä:hä%^V7ö for hävvü
 %^T7:0      — This appears in stem kaoma:ka%^T7o for katoq
 %^Y7:õ      — This appears for syna = s%^Y7na and is rendered as õ in the norm

Triggers

   %^OO2Õ:0    — joo%^OO2Õ%>i:j0õ0%>i
   %^CC2C:0    — att%^CC2C%>m%{aä%} atma
 %^PSS:0       vowel in passive tahetu, sõidõtu, eletü
 %^ÄI2ÄÄ:0    — päiv%^ÄI2ÄÄ%>ä: päävä
 %{front%}:0    — front harmony
 %{back%}:0    — back harmony
%^ErrorBack:0  — +Err/Orth+Clt:%>kinaq in front harmony context BHARM disallowance
 %{PrsSg1%}:0  — this helps with %{eõ%}:i̬

%^NoGrad:0     — This will be placed after a stem to break Gradation
%^APOCH:0      — This causes apocope: puhksama vs puhastaq
%^StrD2T:0     — This changes g,d,b => k,t,p

%^G1:0	       — This is used with %{pØ%} %{pbØ%} for 0 0, also t, k
%^G2:0	       — This is used with %{pØ%} %{pbØ%} for 0 b, also t, k
%^G3:0	       — This is used with %{pØ%} %{pbØ%} for 0 p, also t, k
%^G4:0	       — This is used with %{pØ%} %{pbØ%} for p p, also t, k
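
The four grade triggers can be read as a small lookup table over the stop pair. The Python sketch below is a hedged illustration, not the twolc implementation: it assumes the pair is always written as `%{pØ%}%{pbØ%}` (and likewise for t and k), and the function name `realize_grade` is hypothetical. The stems are taken from this page (häbü, koda/sõda type, nägo).

```python
# Weak counterparts of the stops, as in %{pbØ%}, %{tdØ%}, %{kgØ%}.
WEAK = {"p": "b", "t": "d", "k": "g"}

def realize_grade(stem, grade):
    """Replace the archiphoneme pair in a stem according to G1..G4."""
    for stop, weak in WEAK.items():
        pair = "%{" + stop + "Ø%}" + "%{" + stop + weak + "Ø%}"
        outcome = {
            "G1": "",           # 0 0
            "G2": weak,         # 0 b
            "G3": stop,         # 0 p
            "G4": stop + stop,  # p p
        }[grade]
        stem = stem.replace(pair, outcome)
    return stem

strong = realize_grade("hä%{pØ%}%{pbØ%}ü", "G4")
weak = realize_grade("hä%{pØ%}%{pbØ%}ü", "G2")
```

In the real grammar the triggers are zero-realised symbols placed after the stem, and the two-level rules inspect them as right context; the lookup table here only mirrors the outcomes listed above.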

%^WGStem:0     — This weakens "kipõń" to "kibõna", "ompel" to "ommel"
%^StrGStem:0   — This strengthens "perädü" to "perätüt"
%^ShortGStem:0   — This shortens "pu%{tØ%}tu" to "putma", an orthographic convention
%^LongGStem:0     — This lengthens "pu%{tØ%}tu" to "puttuq"

%^Pen:0        — This moves us to penultimate coda
%^PAL:0	       — Palatalization
%^NoPAL:0	       — NoPalatalization

%^JI20:0	       — in vari: vaŕo
%^JI2I:0	       — in vari vari
%^JI2J:0	       — in vari: varjo

%^PenWGStem:0  — This weakens "kipõń" to "kibõna"
%^PenVowRM:0   — syncope tapõld : taplõma 
%^D2S:0        — The ti => si
%^TS2S:0       — The -ts- => -s-
%^I2J:0        — The i => j change
%^PLPRT:0      — The a:o attested in Plural kana:kanno and prt
%^VOWRaise:0   — Raises vowel
%^VOWLower:0   — Lowers vowel
%^XLowerVow:0  — Lowers vowel two levels
%^VOWLowerDelab:0   — Lowers vowel and delabializes it
%^XLowerVowDelab:0  — Lowers vowel two levels and delabializes it
%^U2E:0        — lowers and delabializes: u:õ and ü:e
%^U2A:0        — lowers and delabializes: u:a and ü:ä
%^VowRM:0      — this will remove stem final vowel
%^CnsRM:0      — this will remove stem final consonant tervüs:tervü

Onset consonant or word boundary

Right context for gradation

Rules

VOWEL HARMONY

Vowel harmony suffixes Front

%{aä%}:a

%{aä%}:ä

%{uü%}:u

%{uü%}:ü

%{eõ%}:õ

%{eõ%}:e

%{ae%}:e tahtma+V+Pss+PrfPrc+Sg+Nom: want/haluta

%{aõ%}:õ

%{äe%}:e
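
The harmony pairs listed above can be sketched as a lookup keyed on the harmony class of the stem. This Python sketch is an assumption-laden simplification: it covers only three archiphonemes, takes the harmony class as an explicit argument instead of reading the `%{front%}` or `%{back%}` marker from the stem, and ignores the zero and `i̬` realisations mentioned elsewhere on this page. The name `realize_harmony` is hypothetical.

```python
# Back and front realisations of three harmony archiphonemes.
HARMONY = {
    "%{aä%}": {"back": "a", "front": "ä"},
    "%{uü%}": {"back": "u", "front": "ü"},
    "%{eõ%}": {"back": "õ", "front": "e"},
}

def realize_harmony(symbols, harmony):
    """Spell out a list of symbols for a 'back' or 'front' stem."""
    realised = []
    for symbol in symbols:
        choices = HARMONY.get(symbol)
        realised.append(choices[harmony] if choices else symbol)
    return "".join(realised)

back_suffix = realize_harmony(["m", "%{aä%}"], "back")
front_suffix = realize_harmony(["m", "%{aä%}"], "front")
```

This matches the example pair on this page where a back stem followed by `>m{aä}` surfaces with `>ma`.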

VOWEL LOWERING

u:o

ü:ö

o2õ

u2õ

ö2e

Delabializing o and ö

VOWEL RAISING

Delabializing o and ö

PALATALIZATION

n2ń palatalization all kestmä+V+Act+Ind+Prt+Sg3:

akaŕ+A+Sg+Nom

asi+N+Sg+Gen:

alostama+V+Act+Ind+Prt+Sg3:

%{kḱ%}:ḱ kakma

n2n no palatalization all

rehksämä+V+Inf/mA:

{dd́n}:d́ palatalization for 3-way

särǵ+N+Sg+Nom: roach/särki

{dd́n}:n weaken 3-way

andma+V+Act+Ind+Prs+Sg1

püüdmä+V+Act+Ind+Prs+Sg1

%{dd́v%}:v

%{pṕb%}:p loroṕ+N+Sg+Par:

%{tt́d%}:t

hainatama+V+Inf/mA

%{kḱg%}:k

%{pṕb%}:ṕ loroṕ

%{tt́d%}:t́

%{kḱg%}:ḱ

kõiḱ+Pron+Sg+Nom

VOWEL CHANGE WITH PLURAL

tegemä+V+Act+Ind+Prs+Sg1: do

õ2õ̭

o2u̬

Vx%{ou%}:Vyo

hoolas+A+Sg+Nom:

Vx%{ou%}2Vyu̬ nuuĺ+N+Sg+Nom: arrow

kiiĺ+N+Sg+Gen: tongue/kieli

i2e pini+N+Pl+Par: dog/koira

i:ä päiv+N+Sg+Gen: day/päivä

a2o

* *ka%{nØ%}na%{back%}%^Pen%^StrGStem%^PLPRT*
* *kanno0000*

{ao}o

* *ka%{nØ%}n%{ao%}%{back%}%^G3%^PLPRT*
* *kanno000*
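
The example pairs above show a lexical string and its surface string, with `0` standing for a symbol that is realised as nothing. A minimal Python sketch, assuming every `%^` trigger is followed by another `%` symbol or the end of the string, shows how the two levels line up symbol by symbol. The names `tokenize`, `align` and `surface_word` are hypothetical helpers, not part of the toolchain.

```python
import re

# One lexical symbol: %{...%}, a %^Trigger, the boundary %>, or a single character.
SYMBOL = re.compile(r"%\{.*?%\}|%\^\w+|%>|.")

def tokenize(lexical):
    """Split a lexical string into its (multi-character) symbols."""
    return SYMBOL.findall(lexical)

def align(lexical, surface):
    """Pair each lexical symbol with the surface character below it."""
    upper, lower = tokenize(lexical), list(surface)
    if len(upper) != len(lower):
        raise ValueError("lexical and surface strings differ in length")
    return list(zip(upper, lower))

def surface_word(surface):
    """Drop the zeros to get the written form."""
    return surface.replace("0", "")

pairs = align("ka%{nØ%}na%{back%}%^Pen%^StrGStem%^PLPRT", "kanno0000")
word = surface_word("kanno0000")
```

In the first pair the `a:o` correspondence sits at the fifth symbol, and all four trailing symbols are realised as zero.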

VOWEL LOSS

a:0 a _ (HarmDummiesVar) %> i ;

sõda+N+Pl+Par:

ä:0 pügämä+V+Pss+PrfPrc:

U:0 Vx

* *hirnu{back}^Pen^CC2C^VowRM>m{aä}*
* *hirn00000>ma*
* *kut{sś}u{back}^Pen^VOWRaise^Pen^PAL^VowRM*
* *kutś0000000*
* *tervüs{front}^VowRM^CnsRM>i>t*
* *terv00000>i>t*
juusk+N+Sg+Nom: ____
* *j{ou}{ou}s{kØ}u{back}^VOWRaise^VowRM*
* *ju̬u̬sk0000*

* *kuu{back}^VOWLower^VowRM>i>d*
* *ku0000>i>d*

[ Cns: |ArchCns:| Vow: ] _ (s:) (HarmDummiesVar) (%^Pen: %^CC2C:|%^Pen: %^G3:|%^Pen: %^G4:|PenVOWHite %^Pen: %^G1:) %^VowRM: ;

e:0

o:0 juuma+V+Inf

Vx%{ou%}:0 juuma+V+Inf

Vx%{äe%}:0 Passive stem vowel nõstma+V+Inf/mA

ö:0

i:0 hüdsi+N+Sg+Par:

õ:0

%{eØ%}: 0

%{õØ%}: 0

VOWEL LENGTHENING

%{XV%}:u

%{XV%}:ü

%{XV%}:o

%{XV%}:a

%{XV%}:ä

%{XV%}:õ kannõĺ+N+Sg+Gen: kantele

%{XV%}:i
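
Since `%{XV%}` is described as echoing the previous vowel, the rules above amount to copying a vowel. The Python sketch below illustrates that idea only; treating "previous vowel" as the nearest preceding vowel is an assumption, the input is hypothetical, and the function name `echo_vowel` is not from the source.

```python
VOWELS = set("aeiouõäöü")

def echo_vowel(symbols):
    """Replace each %{XV%} with a copy of the nearest preceding vowel."""
    realised = []
    for symbol in symbols:
        if symbol == "%{XV%}":
            previous = next((s for s in reversed(realised) if s in VOWELS), "")
            realised.append(previous)
        else:
            realised.append(symbol)
    return "".join(realised)

# Hypothetical input, for illustration only.
lengthened = echo_vowel(["m", "a", "%{XV%}"])
```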

i2j

%{ij%}:j

%{jiØ%}:j

%{jiØ%}:i

%{jiØ%}:0 vari+N+Sg+Gen: shadow/varjo

%{jØ%}:0 vari+N+Sg+Gen: shadow/varjo

u2v (deprecate in favour of “%{uv%}:v”)

%{uv%}:v

{üv}:v

%^I7:i

%^I7:i

CONSONANT %{pṕØ%}:ṕ

%{tt́Ø%}:t́

%{tt́Ø%}:t

täht́+N+Err/Orth-no-pal+Sg+Nom: star/tähti

%{kḱØ%}:ḱ

SECONDARY CONSONANT LENGTHENING

%{pØ%}:p

* *hä%{pØ%}%{pbØ%}ü%{front%}%^Pen%^G4*
* *häppü000*
* *tõ%{pØ%}%{pbv%}%{back%}%^G4%>%{eõ%}*
* *tõpp00%>õ*
* *se%{pØ%}p%{front%}%^StrGStem*
* *sepp00*
* *nu%{pØ%}pu%{back%}%^Pen%^VOWRaise%^Pen%^StrGStem%^VowRM*
* *nupp0000000*

{tØ}:t

%{t́Ø%}:t́

%{Øk%}:k igä+N+Sg+Ill

%{XC%}:s

%{XC%}:l

%{XC%}:ĺ

%{XC%}:k

%{cć%}:ć

%{cć%}:c

Consonant weakening

kToZero

%{pṕØ%}:0

%{tt́Ø%}:0

%{kḱØ%}:0

%{sØ%}:0

%{vØ%}:0
kruv́ma+V+Inf/mA

%{rØ%}:0

%{nØ%}:0

%{lØ%}:0

%{mØ%}:0

%{kØ%}:0

nätsk+A+Sg+Gen

kakma:

kõiḱ+Pron+Sg+Nom

pToZero

%{pØ%}:0

XØToZero agras+A+Sg+Gen

XØToSelf villui+A+Sg+Nom

kevväi+N+Sg+Gen: spring

%{sØ%}:s ratas+N+Sg+Nom

%{hØ%}:h hamõh+N+Sg+Nom

%{kØ%}:k rehksämä+V+Inf/mA:

%{pb%}:p

%{t́d́%}:d́

%{t́d́%}:t́

%{td%}:t

%{kg%}:k akaŕ+A+Sg+Nom

%{kg%}:g apteḱ+N+Sg+Gen:

nõkõś+N+Sg+Ill

%{td%}:d

kaotama+V+Act+Ind+Prs+Sg1:

%{tt́d%}:d kergütämä+V+Act+Ind+Prs+Sg1:

tToZero hüdsi+N+Sg+Par:

%{tØ%}:0

sõda+N+Sg+Gen:

%{t́Ø%}:0

CONSONANT QUALITY CHANGE

%{pṕb%}:b

%{pb%}:b habras+A+Sg+Nom

p2b

b20

%{pbØ%}:b

%{dr%}:r murrõq+N+Sg+Nom

%{dr%}:d murrõq+N+Sg+Gen

%{ḱǵj%}:ǵ

%{ḱǵj%}:ḱ

%{ḱǵj%}:0

%{tdØ%}:d

%{dØ%}:d väärdlemä+V+Inf/mA

kaardas+N+Sg+Nom

%{kgØ%}:g jõgi+N+Sg+Nom: river / joki

%{pbv%}:b

hammas

%{bm%}:m

%{bm%}:b

%{bv%}:v

%{dn%}:n kannõĺ+N+Sg+Nom: kantele

%{dl%}:l

%{dv%}:v

VdVToVtV

dTos

tTos

tTod kaotama+V+Act+Ind+Prs+Sg1:

There should always be a trigger

%{dn%}:d

j2i

{kḱg}:g

kõiḱ+Pron+Sg+Gen

k2g

igä+N+Sg+Ill

bTop

%{pbv%}:p

%{pbØ%}:p

%{tdØ%}:t

%{kgØ%}:k

STEM-FINAL CONSONANT LOSS

s20 kirotus+N+Pl+Gen:

usś+N+Sg+Par door

vaśma+V+Inf/mA

%{bv%}:b närväs+A+Sg+Gen:

%{gØ%}:g liig+A+Sg+Nom:

d20

%{dØ%}:0

g20 deprecation to {gǵØ}:0

%{gØ%}:0

{gǵØ}:0 särǵ+N+Sg+Gen: roach/särki

{gǵØ}:g särǵ+N+Sg+Ill: roach/särki

%{pbv%}:v

%{pbØ%}:0

%{tdØ%}:0

%{kgØ%}:0

püüdmä+V+Act+Ind+Prs+Sg3

pereq

naŕma

Other marks

Disallow %^ErrorBack:0 in BHARM

Disallow %^ErrorBack:0 in BHARM


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Võru tags and basic lexica

Definitions for Multichar_Symbols

Analysis symbols

The morphological analyses of word forms for the Võro language are presented in this system in terms of the following symbols. (Following existing standards is strongly recommended when adding new tags.)

The parts-of-speech are:

The parts of speech are further split up into:

The Usage extents are marked using following tags:

The nominals are inflected in the following Case and Number

The possession is marked as such: There are no possessive markers

The comparative forms are:

Verb personal forms are:

Subject conjugation

Passive conjugation

Special symbols are classified with:

Question and Focus particles:

Tags distinguishing different versions of the same lemma (before POS)

Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

Morphophonology

To represent phonologic variations in word forms we use the following symbols in the lexicon files:

 %{aä%}    — Vowel harmony with "(t)a/ä" AÄ1:a AÄ1:ä AÄ1:0
 %{ae%}   — Vowel harmony with "a/e/õ" passive tahetu
 %{aõ%}   — Vowel harmony with "a/e/õ" passive sõidõtu
 %{äe%}    — Vowel harmony with "ä/e/õ" passive
 %{eõ%}    — Vowel harmony with "e/õ"
 %{uü%}    — Vowel harmony with "u/ü"
 %{öü%}    — Vowel raising
 %{ou%}    — Vowel raising
 %{ei%}    — Vowel raising
 %{õy%}    — Vowel raising
 %{ao%}    — Vowel raising
 %{eØ%}    — ütlemä:üt%{eØ%}l  
 %{õØ%}    — ütlemä:üt%{eØ%}l  
 %{Øõ%}    — juurdlõma:juur%{dØ%}%{0õ%}l
 %{XV%}    — This is used for echoing the previous vowel
 %{XC%}    — This is used for lengthening a consonant
 %{dØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{tØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{t́Ø%}    — HJK and KimmoK ideas jaht́lõma:jah%{t́Ø%}%{eØ%}%{lĺ%}
 %{dv%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{ij%}    ellä%{ij%}
 %{gv%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{gl%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{gØ%}    — HJK and KimmoK ideas argnõma:ar%{gØ%}
 %{uv%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{üv%}    — HJK and KimmoK ideas kana:ka%{nØ%}na

Gemination

 %{hØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{jØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{lØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{mØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{nØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pØ%}    — HJK and KimmoK ideas oppama:o%{pØ%}pama
 %{rØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{sØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{vØ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{Øp%}    — häbü:hä%{Øp%}%{pbØ%}ü
 %{Øt%}    — koda:ko%{Øt%}%{tdØ%}a
 %{Øk%}    — nägo:nä%{Øk%}%{kgØ%}o

Strong and weak

 %{pb%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{td%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{t́d́%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kg%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{bv%}    — HJK and KimmoK ideas närväs:när%{bv%}ä%{sØ%}
 %{dr%}    — HJK and KimmoK ideas parras:par%{dr%}a%{sØ%}
 %{bm%}    — HJK and KimmoK ideas lammas:lam%{bm%}a%{sØ%}
 %{dn%}    — HJK and KimmoK ideas lammas:lam%{bm%}a%{sØ%}
 %{dl%}    — HJK and KimmoK ideas lammas:lam%{bm%}a%{sØ%}
 %{pbØ%}   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pbv%}   — tõbi: tõvõ tõpõ tõppõ
 %{tdØ%}   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kgØ%}   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{jiØ%}   — HJK and KimmoK ideas vari:var%{jiØ%}o
 %{qmn%}   — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{qn%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{dd́Ø%}   
 %{dd́n%}   
 %{dd́r%}   
 %{dd́v%}   
 %{dd́Ø%}   
 %{gǵv%}   
 %{gǵØ%}   
 %{tt́d%}   
 %{tt́Ø%}    täh%{tt́Ø%}
 %{kḱg%}   
 %{kḱØ%}   
 %{pṕb%}   
 %{pṕØ%}   

 %{dśtv%}    tä%{üv%}%{śtv%}
 %{djśt%}    vii%{jśt%}
 %{drśt%}    var%{rśt%}

Palatalization

 %{bb́%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{dd́%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{ff́%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{gǵ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{hh́%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{kḱ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{lĺ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{mḿ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{nń%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{pṕ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{rŕ%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{sś%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{tt́%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{vv́%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{ḱǵj%}   — HJK and KimmoK ideas laǵa:la%{ḱǵj%}a
 %{zź%}    — HJK and KimmoK ideas kana:ka%{nØ%}na
 %{dd́n%}  

 %^I7    This appears in stem vaoma:va%^I7o for vaio
 %^K7    This appears in stem väemä:vä%^K7e for väkeq
 %^V7    This appears in stem häömä:hä%^V7ö for hävvü
 %^T7    This appears in stem kaoma:ka%^T7o for katoq
 %^Y7    This appears for syna = s%^Y7na and is rendered as õ in the norm

The following triggers control variation:

 %^ErrorBack    +Err/Orth+Clt:%>kinaq in front harmony context BHARM disallowance
 %^CC2C    att%^CC2C%>m%{aä%} atma
 %^OO2Õ    joo%^OO2Õ%>i:j0õ0%>i
 %^PSS    vowel in passive tahetu, sõidõtu, eletü
 %^ÄI2ÄÄ    päiv%^ÄI2ÄÄ%>ä: päävä
 %{PrsSg1%}    this helps with %{eõ%}:i̬
 %^StrD2T    This changes g, d, b => k, t, p

 %^VowRM    this will remove stem final vowel
 %^CnsRM    this will remove stem final consonant tervüs:tervü
 %^StrGStem    This strengthens “perädü” to “perätüt”
 %^NoGrad
 %^WGStem    This weakens
 %^G1    This is used with %{pØ%} %{pbØ%} for 0 0, also t, k
 %^G2    This is used with %{pØ%} %{pbØ%} for 0 b, also t, k
 %^G3    This is used with %{pØ%} %{pbØ%} for 0 p, also t, k
 %^G4    This is used with %{pØ%} %{pbØ%} for p p, also t, k “sõda” to “sõtta”
 %^ShortGStem    This shortens “pu%{tØ%}tu” to “putma”, an orthographic convention
 %^LongGStem    This lengthens “pu%{tØ%}tu” to “puttuq”
 %^Pen    This moves us to penultimate coda
 %^PAL    Palatalization
 %^NoPAL    NoPalatalization
 %^JI20    in vari: vaŕo
 %^JI2I    in vari: vari
 %^JI2J    in vari: varjo
 %^PenWGStem    This weakens “kipõń” to “kibõna”
 %^PenVowRM    syncope tapõld : taplõma
 %^D2S    käsi, susi
 %^TS2S    The -ts- => -s-
 %^I2J    The i => j change
 %^PLPRT    The a:o attested in Plural kana:kanno and prt
 %^VOWRaise    Raises vowel
 %^VOWLower    Lowers vowel
 %^XLowerVow    Lowers vowel two levels
 %^VOWLowerDelab    Lowers vowel and delabializes it
 %^XLowerVowDelab    Lowers vowel two levels and delabializes it
 %^U2E    lowers u:õ and ü:e delabializes and lowers
 %^U2A    lowers u:a and ü:ä delabializes and lowers
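
One possible reading of the %^G1 to %^G4 entries is that each trigger selects a pair of surface realisations for a `%{pØ%}` plus `%{pbØ%}` sequence (and likewise for t and k). The Python sketch below illustrates that reading only; it is not the twolc rule set of this project. The function name `realise`, the lookup tables and the sample stem `sõ%{tØ%}%{tdØ%}a` are hypothetical, and the t/d and k/g values are assumed by analogy with the p/b values listed.

```python
import re

# Assumed reading of the trigger list: (realisation of %{XØ%}, realisation of %{XYØ%}).
# "" is deletion (0); "strong" is p/t/k; "weak" is b/d/g.
GRADE = {
    "G1": ("", ""),
    "G2": ("", "weak"),
    "G3": ("", "strong"),
    "G4": ("strong", "strong"),
}
WEAK = {"p": "b", "t": "d", "k": "g"}


def _pick(kind: str, strong: str) -> str:
    """Turn a grade label into a surface consonant (or nothing)."""
    if kind == "":
        return ""
    return strong if kind == "strong" else WEAK[strong]


def realise(stem: str, trigger: str) -> str:
    """Replace %{XØ%} and %{XYØ%} in a stem according to the chosen trigger."""
    first, second = GRADE[trigger]
    # Three-way symbols such as %{tdØ%} take the second value of the pair.
    stem = re.sub(r"%\{([ptk])[bdg]Ø%\}", lambda m: _pick(second, m.group(1)), stem)
    # Two-way symbols such as %{tØ%} take the first value of the pair.
    stem = re.sub(r"%\{([ptk])Ø%\}", lambda m: _pick(first, m.group(1)), stem)
    return stem


# Hypothetical stem, chosen so that G2 and G4 give the forms "sõda" and "sõtta".
for trig in ("G1", "G2", "G3", "G4"):
    print(trig, realise("sõ%{tØ%}%{tdØ%}a", trig))
```

In the real grammar the triggers are consumed by two-level rules operating in context; a lookup table like this only helps to see which surface pair each trigger is meant to select.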

= a symbol used in front of # to block backtracking and mwe reanalysis in hfst-tokenise (e.g. in dynamic compounds). Makes it possible to distinguish lexical and dynamic compounds in rules. It is converted to zero together with #.

Flag Explanation
@D.ErrOrth.ON@  
@C.ErrOrth@  
@P.ErrOrth.ON@  
@R.ErrOrth.ON@  

Oahpa Place names and case used

The tagged part of the compound should make a compound using:

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics - only allow compounds with verbs if the verb is further derived into a noun again:

| Flag | Explanation |
| --- | --- |
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| Flag | Explanation |
| --- | --- |
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.
@U.Case.Abe@ Abessive
@U.Case.Abl@ Ablative
@U.Case.Ade@ Adessive
@U.Case.All@ Allative
@U.Case.Com@ Comitative
@U.Case.Ela@ Elative
@U.Case.Gen@ Genitive
@U.Case.Ill@ Illative
@U.Case.Ine@ Inessive
@U.Case.Nom@ Nominative
@U.Case.Par@ Partitive
@U.Case.Ter@ Terminative
@U.Case.Tra@ Translative
@U.Number.Pl@ Plural
@U.Number.Sg@ Singular

The following flag diacritics are being applied for vowel harmony variation:

| Flag | Explanation |
| --- | --- |
| @U.VowHarm.B@ | Back harmony, used with subsequent Err/Orth-front |
| @U.VowHarm.F@ | Front harmony, used with subsequent Err/Orth-back |
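
To make the interplay of the P, D, C and U flags above more concrete, here is a minimal Python sketch of the usual flag-diacritic semantics: P sets a feature, D disallows a path if the feature has the named value, C clears it, R requires it, and U unifies. This is an illustration under stated assumptions, not how HFST evaluates flags internally; the function name `accepts` and the sample paths are hypothetical, and the negative-set operator N is left out.

```python
import re

# Matches @OP.Feature@ and @OP.Feature.Value@ for the operators sketched here.
FLAG = re.compile(r"@([PDRUC])\.([^.@]+)(?:\.([^.@]+))?@")


def accepts(path: str) -> bool:
    """Return True if the flag diacritics met along `path` are consistent."""
    state = {}  # feature name -> current value
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":        # positive set: always succeeds
            state[feature] = value
        elif op == "C":      # clear: forget the feature
            state.pop(feature, None)
        elif op == "D":      # disallow: fail if set to this value (or set at all)
            if current is not None and (value == "" or current == value):
                return False
        elif op == "R":      # require: fail unless set (to this value)
            if current is None or (value != "" and current != value):
                return False
        elif op == "U":      # unify: set if unset, otherwise values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# Hypothetical paths: a verb stem sets NeedNoun, a compound boundary checks it.
print(accepts("verb@P.NeedNoun.ON@ #@D.NeedNoun.ON@noun"))
print(accepts("verb@P.NeedNoun.ON@ nominaliser@C.NeedNoun@ #@D.NeedNoun.ON@noun"))
print(accepts("stem@U.VowHarm.B@ suffix@U.VowHarm.F@"))
```

Under these assumptions the first path is rejected, the second is accepted because the nominaliser clears the flag, and the third is rejected because back and front harmony cannot unify.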

The Root lexicon

The word forms in the Võro language start from the lexeme roots of basic word classes, or optionally from prefixes:

Incoming

less complex word classes


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-acronyms.lexc.md

Acronyms Veps acronyms …


This (part of) documentation was generated from src/fst/morphology/stems/acronyms.lexc


src-fst-morphology-stems-adjectives_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

kõhna+A:kõhna A_1HANS1A “” ;

ADD NOUNS BELOW



This (part of) documentation was generated from src/fst/morphology/stems/adjectives_newwords.lexc


src-fst-morphology-stems-adpositions_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

perrä:perrä PO_ “(eng) /(est) /(fin) “ ;

ADD NOUNS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/adpositions_newwords.lexc


src-fst-morphology-stems-adverbs_newwords.lexc.md

CHECKME


This (part of) documentation was generated from src/fst/morphology/stems/adverbs_newwords.lexc


src-fst-morphology-stems-determiners_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

perrä:perrä PO_ “(eng) /(est) /(fin) “ ;

ADD DETERMINERS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/determiners_newwords.lexc


src-fst-morphology-stems-exceptions.lexc.md

ADVERBS

ADJECTIVES

CONJUNCTIONS

GENITIVE ATTRIBUTES

NOUNS

PROPER NOUNS

PLURAL NOUNS

NUMERALS

POSTPOSITIONS

PRONOUNS

VERBS

andma to give/antaa

VERBS WITH FORMS TO STUDY

kündma to plow/kyntää

nakkama to begin/ alkaa

olõma to be/ olla

nakkama to start/ alkaa

pandma to put/panna

pidämä to keep/ pitää

tundma to feel/tuntea


This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


src-fst-morphology-stems-interjections_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

ADD INTERJECTIONS BELOW


This (part of) documentation was generated from src/fst/morphology/stems/interjections_newwords.lexc


src-fst-morphology-stems-nouns.lexc.md

hanśa+N:hanśa N_1HANS1A “” ;


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


src-fst-morphology-stems-nouns_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

hanśa+N:hanśa N_1HANS1A “” ;

ADD NOUNS BELOW

 N_HAIDAK, N_10ESAEK    in -gu
 N_10AABITS    in -dsa, -ga
 N_10HWRAK    in -ga ~ -gu
 N_10HEERITS    in -dsä
 N_10RAAMAT, N_LEMBIT    in -du/dü

two-syllable

Three-syllable words


This (part of) documentation was generated from src/fst/morphology/stems/nouns_newwords.lexc


src-fst-morphology-stems-verbs.lexc.md

atma+V:atta, ikma+V:ikkõ petmä+V:pettä


This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


src-fst-morphology-stems-verbs_newwords.lexc.md

This is where new words are added as lexc entries before they are added to the xml source files.

ADD VERBS BELOW

verb type split

atma+V:atta, ikma+V:ikkõ petmä+V:pettä


This (part of) documentation was generated from src/fst/morphology/stems/verbs_newwords.lexc


src-fst-phonetics-txt2ipa.xfscript.md

| Description | X-SAMPA | IPA | Unicode (hex) | Unicode (decimal) |
| --- | --- | --- | --- | --- |
| retroflex plosive, voiceless | t` | ʈ | 0288 | 648 |
| retroflex plosive, voiced | d` | ɖ | 0256 | 598 |
| labiodental nasal | F | ɱ | 0271 | 625 |
| retroflex nasal | n` | ɳ | 0273 | 627 |
| palatal nasal | J | ɲ | 0272 | 626 |
| velar nasal | N | ŋ | 014B | 331 |
| uvular nasal | N\ | ɴ | 0274 | 628 |
| bilabial trill | B\ | ʙ | 0299 | 665 |
| uvular trill | R\ | ʀ | 0280 | 640 |
| alveolar tap | 4 | ɾ | 027E | 638 |
| retroflex flap | r` | ɽ | 027D | 637 |
| bilabial fricative, voiceless | p\ | ɸ | 0278 | 632 |
| bilabial fricative, voiced | B | β | 03B2 | 946 |
| dental fricative, voiceless | T | θ | 03B8 | 952 |
| dental fricative, voiced | D | ð | 00F0 | 240 |
| postalveolar fricative, voiceless | S | ʃ | 0283 | 643 |
| postalveolar fricative, voiced | Z | ʒ | 0292 | 658 |
| retroflex fricative, voiceless | s` | ʂ | 0282 | 642 |
| retroflex fricative, voiced | z` | ʐ | 0290 | 656 |
| palatal fricative, voiceless | C | ç | 00E7 | 231 |
| palatal fricative, voiced | j\ | ʝ | 029D | 669 |
| velar fricative, voiced | G | ɣ | 0263 | 611 |
| uvular fricative, voiceless | X | χ | 03C7 | 967 |
| uvular fricative, voiced | R | ʁ | 0281 | 641 |
| pharyngeal fricative, voiceless | X\ | ħ | 0127 | 295 |
| pharyngeal fricative, voiced | ?\ | ʕ | 0295 | 661 |
| glottal fricative, voiced | h\ | ɦ | 0266 | 614 |

The backtick (`) that marks the retroflex symbols is ASCII 096.

 alveolar lateral fricative, vl.    K
 alveolar lateral fricative, vd.    K\

 labiodental approximant    P (or v\)
 alveolar approximant    r\
 retroflex approximant    r\`
 velar approximant    M\

 retroflex lateral approximant    l`
 palatal lateral approximant    L
 velar lateral approximant    L\

Clicks

 bilabial    O\ (O = capital letter)
 dental    |\
 (post)alveolar    !\
 palatoalveolar    =\
 alveolar lateral    |\|\

Ejectives, implosives

 ejective    _>    e.g. ejective p    p_>
 implosive    _<    e.g. implosive b    b_<

Vowels

 close back unrounded    M
 close central unrounded    1
 close central rounded    }
 lax i    I
 lax y    Y
 lax u    U

 close-mid front rounded    2
 close-mid central unrounded    @\
 close-mid central rounded    8
 close-mid back unrounded    7

 schwa (ə)    @

 open-mid front unrounded    E
 open-mid front rounded    9
 open-mid central unrounded    3
 open-mid central rounded    3\
 open-mid back unrounded    V
 open-mid back rounded    O

 ash (ae digraph)    {
 open schwa (turned a)    6

 open front rounded    &
 open back unrounded    A
 open back rounded    Q

Other symbols

 voiceless labial-velar fricative    W
 voiced labial-palatal approx.    H
 voiceless epiglottal fricative    H\
 voiced epiglottal fricative    <\
 epiglottal plosive    >\

 alveolo-palatal fricative, vl.    s\
 alveolo-palatal fricative, voiced    z\
 alveolar lateral flap    l\
 simultaneous S and x    x\
 tie bar    _

Suprasegmentals

 primary stress    "
 secondary stress    %
 long    :
 half-long    :\
 extra-short    _X
 linking mark    -\

Tones and word accents

 level extra high    _T
 level high    _H
 level mid    _M
 level low    _L
 level extra low    _B
 downstep    !
 upstep    ^ (caret, circumflex)

 contour, rising    _R
 contour, falling    _F
 contour, high rising    _H_T
 contour, low rising    _B_L
 contour, rising-falling    _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

 global rise    <R>
 global fall    <F>

Diacritics

 voiceless    _0 (0 = figure), e.g. n_0
 voiced    _v
 aspirated    _h
 more rounded    _O (O = letter)
 less rounded    _c
 advanced    _+
 retracted    _-
 centralized    _"
 syllabic    = (or _=), e.g. n= (or n_=)
 non-syllabic    _^
 rhoticity    `

 breathy voiced    _t
 creaky voiced    _k
 linguolabial    _N
 labialized    _w
 palatalized    ' (or _j), e.g. t' (or t_j)
 velarized    _G
 pharyngealized    _?\

 dental    _d
 apical    _a
 laminal    _m
 nasalized    ~ (or _~), e.g. A~ (or A_~)
 nasal release    _n
 lateral release    _l
 no audible release    _}

 velarized or pharyngealized    _e
 velarized l, alternatively    5
 raised    _r
 lowered    _o
 advanced tongue root    _A
 retracted tongue root    _q
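
As a small illustration of how a symbol table like the one above can drive a transliteration, the following Python sketch converts X-SAMPA to IPA with a greedy longest-match lookup. It is only a sketch and not the logic of txt2ipa.xfscript; the names `XSAMPA_TO_IPA` and `xsampa_to_ipa` are hypothetical, and the mapping contains just a handful of the pairs for which the table gives an IPA character.

```python
# A few X-SAMPA -> IPA pairs taken from the table above (far from complete).
XSAMPA_TO_IPA = {
    "N": "ŋ", "N\\": "ɴ",
    "B": "β", "B\\": "ʙ",
    "X": "χ", "X\\": "ħ",
    "R": "ʁ", "R\\": "ʀ",
    "S": "ʃ", "Z": "ʒ",
    "T": "θ", "D": "ð",
    "J": "ɲ", "F": "ɱ",
    "G": "ɣ", "C": "ç",
    "4": "ɾ",
}


def xsampa_to_ipa(text: str, table: dict = XSAMPA_TO_IPA) -> str:
    """Convert known X-SAMPA symbols to IPA, leaving everything else untouched."""
    # Try longer symbols first so that "N\" wins over "N".
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # not in the table: copy the character as-is
            i += 1
    return "".join(out)


print(xsampa_to_ipa("SiN"))    # ʃiŋ
print(xsampa_to_ipa("aN\\a"))  # aɴa
```

Longest match matters because several X-SAMPA symbols are prefixes of others, for example `N` and `N\`.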


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in Võro are read out, e.g. for text-to-speech systems.

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

Ordinal numerals begin


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

[ L A N G U A G E ] G R A M M A R C H E C K E R

DELIMITERS

TAGS AND SETS

Tags

This section lists all the tags inherited from the fst, and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT PUNCT

COMMA ¶

Tags for POS sub-categories

Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

Tags for morphosyntactic properties

Nom Acc Gen Ill Loc Com Ess Ess Sg Du Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

Err/Orth

Semantic tags

Sem/Act Sem/Ani Sem/Atr Sem/Body Sem/Clth Sem/Domain Sem/Feat-phys Sem/Fem Sem/Group Sem/Lang Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

HUMAN

PROP-ATTR PROP-SUR

TIME-N-SET

Syntactic tags

@+FAUXV @+FMAINV @-FAUXV @-FMAINV @-FSUBJ> @-F<OBJ @-FOBJ> @-FSPRED<OBJ @-F<ADVL @-FADVL> @-F<SPRED @-F<OPRED @-FSPRED> @-FOPRED> @>ADVL @ADVL< @<ADVL @ADVL> @ADVL @HAB> @<HAB @>N @Interj @N< @>A @P< @>P @HNOUN @INTERJ @>Num @Pron< @>Pron @Num< @OBJ @<OBJ @OBJ> @OPRED @<OPRED @OPRED> @PCLE @COMP-CS< @SPRED @<SPRED @SPRED> @SUBJ @<SUBJ @SUBJ> SUBJ SPRED OPRED @PPRED @APP @APP-N< @APP-Pron< @APP>Pron @APP-Num< @APP-ADVL< @VOC @CVP @CNP OBJ

-OTHERS SYN-V @X

## Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

### Sets for Single-word sets

INITIAL

### Sets for word or not

WORD NOT-COMMA

### Case sets

ADLVCASE CASE-AGREEMENT CASE NOT-NOM NOT-GEN NOT-ACC

### Verb sets

NOT-V

### Sets for finiteness and mood

REAL-NEG MOOD-V NOT-PRFPRC

### Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

### Pronoun sets

### Adjectival sets and their complements

### Adverbial sets and their complements

### Sets of elements with common syntactic behaviour

### NP sets defined according to their morphosyntactic features

### The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression **WORD - premodifiers**.

### Border sets and their complements

### Grammarchecker sets

* * *

This (part of) documentation was generated from [tools/grammarcheckers/grammarchecker.cg3](https://github.com/giellalt/lang-vro/blob/main/tools/grammarcheckers/grammarchecker.cg3)

---

# tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

# Tokeniser for vro

Usage:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`

Unknowns are made of:

* lower-case ASCII
* upper-case ASCII
* select extended latin symbols
* ASCII digits
* select symbols
* Combining diacritics as individual symbols,
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

## Unknown handling

Unknowns are tagged ?? and treated specially with `hfst-tokenise`

hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.pmscript](https://github.com/giellalt/lang-vro/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

# Grammar checker tokenisation for vro

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript](https://github.com/giellalt/lang-vro/blob/main/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

# TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-tts-cggt-desc.pmscript](https://github.com/giellalt/lang-vro/blob/main/tools/tokenisers/tokeniser-tts-cggt-desc.pmscript)
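
The whitespace inventory described in the tokeniser sections above (ASCII white space plus the listed Unicode ranges) can be written down as a small membership test. The Python sketch below is an illustration only, not part of the pmscript; the names `EXTRA_WHITESPACE` and `is_tokeniser_whitespace` are hypothetical, and the exact set of ASCII white space characters is an assumption, since the text does not enumerate them.

```python
# Code points listed in the documentation:
# En Quad U+2000 to Zero-Width Joiner U+200D, plus U+202F, U+205F and U+2060.
EXTRA_WHITESPACE = set(range(0x2000, 0x200D + 1)) | {0x202F, 0x205F, 0x2060}

# Assumption: "ASCII white space" means space, tab, LF, CR, VT and FF.
ASCII_WHITESPACE = set(" \t\n\r\x0b\x0c")


def is_tokeniser_whitespace(ch: str) -> bool:
    """Return True if the single character `ch` is in the sketched whitespace set."""
    return ch in ASCII_WHITESPACE or ord(ch) in EXTRA_WHITESPACE


for sample in (" ", "\u2003", "\u202f", "\u2060", "a"):
    print(f"U+{ord(sample):04X}", is_tokeniser_whitespace(sample))
```

Note that Python's own `str.isspace()` gives a different answer for some of these code points (for example the zero-width characters U+200B to U+200D), which is why the set is spelled out explicitly here.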