Multichar_Symbols and Root lexicon for Veps
Miscellaneous tags
- +TYÄ
- +Span
These are to be evaluated (are they in use?). TODO: have a look at these:
- +Arab
- +CLBfinal
- +Cmp
- +CmpNP/First
- +CmpNP/None
- +Cmp/SplitR
- +Cmp/Hyph
- +Coll
- +Com
- +Err/Hyph
- +Err/Lex
- +Err/SpaceCmp
- +Err/MissingSpace
- +MWE
- +OLang/ENG
- +OLang/FIN
- +OLang/NNO
- +OLang/NOB
- +OLang/RUS
- +OLang/SMA
- +OLang/SME
- +OLang/SWE
- +OLang/UND
- +Prf
- +PrfPrs
- +Rom
- +Use/-PMatch
- +Use/Circ
- +Use/NG do not generate
- +Use/GC ??? typo?, occurs once.
- +Use/PMatch
- +Use/SpellNoSugg
- +Use/TTS – only retained in the HFST Text-To-Speech disambiguation tokeniser
- +Use/-TTS – never retained in the HFST Text-To-Speech disambiguation tokeniser
- +Hom1
- +Hom2
- +Hom3
- +v1
- +v2
- +v3
- +v4
- @C.ErrOrth@
- @D.ErrOrth.ON@
- @P.ErrOrth.ON@
- @R.ErrOrth.ON@
- @P.Pmatch.Backtrack@
- @P.Pmatch.Loc@
Grammatical tags
The morphological analyses of Veps word forms are presented in this system in terms of the following symbols. (Following existing standards when adding new tags is strongly recommended.)
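To make the shape of an analysis concrete, here is a minimal Python sketch that splits an analysis string into its lemma and its sequence of tags. The function name `split_analysis` and the example strings are illustrative assumptions, not output of the actual analyser, and the sketch assumes the lemma itself contains no `+`.

```python
import re

def split_analysis(analysis):
    """Split 'lemma+Tag1+Tag2' into the lemma and a list of '+Tag' symbols.

    Assumption: the lemma contains no '+' character.
    """
    first_plus = analysis.find("+")
    if first_plus == -1:
        return analysis, []          # no tags at all
    lemma = analysis[:first_plus]
    # Each tag runs from one '+' up to (not including) the next '+'.
    tags = re.findall(r"\+[^+]+", analysis[first_plus:])
    return lemma, tags

# Hypothetical analysis string, for illustration only.
lemma, tags = split_analysis("kodi+N+Sg+Ine")
print(lemma, tags)
```

Tags that contain a slash, such as `+Der/Toi` or `+Sem/Hum`, stay in one piece because only `+` is treated as a separator.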
The parts-of-speech
- +A = adjective
- +Adp = adposition
- +Adv = adverb
- +CS = subordinating conjunction
- +CC = coordinating conjunction
- +Det = determiner
- +Interj = interjection
- +N = noun
- +Num = numeral
- +Pcle = particle
- +Pr = preposition
- +Po = postposition
- +Pron = pronoun
- +Qnt = quantifier
- +V = verb
Subtags
Noun subtags
- +Prop = proper
Pronoun tags
- +Dem = demonstrative
- +Indef = indefinite
- +Interr = interrogative
- +Pers = personal
- +Recipr = reciprocal
- +Refl = reflexive
- +Rel = relative
Verb tags
Voice and transitivity
- +Aux = auxiliary verb
- +Act = active voice
- +Pss = passive voice
- +TV = transitive verb
- +IV = intransitive verb
Verb moods are:
- +Cond = conditional
- +Ind = indicative
- +Imprt = imperative
- +Pot = potential linne-
Tenses
- +Prs =
- +Prt =
- +Pos =
Verb personal forms are:
- +Sg1 Singular First Person
- +Sg2 Singular Second Person
- +Sg3 Singular Third Person
- +Pl1 Plural First Person
- +Pl2 Plural Second Person
- +Pl3 Plural Third Person
- +RcSg1
- +RcSg2
- +RcSg3
- +RcPl1
- +RcPl2
- +RcPl3 =
- +RcSg
- +RcPl
- +ScSg
- +ScPl =
Other verb forms are
- +Inf
- +Ger
- +Neg =
- +ConNeg
- +ConNegII =
- +ImprtII =
- +PrsPrc
- +PrfPrc = nu
- +Sup
- +VGen
- +VAbess =
Nominal tags
- +Sg = singular
- +Pl = plural
- +Abe = abessive
- +Acc = accusative
- +Abl = ablative case
- +Ade = adessive
- +All = allative
- +Dat = dative case
- +Ela = elative
- +Ess = essive
- +Exe = exessive
- +Gen = genitive case
- +Ill = illative
- +Ine = inessive
- +Ins = instrumental
- +Instr = instructive -IN
- +Lat = Lative
- +Loc = Locative
- +Nom = nominative case
- +Par = partitive
- +Prl = prolative
- +Tra = translative
- +Voc = Vocative
- +Pros =
- +Adc =
- +Apr =
- +Egr =
- +Ter1 =
- +Ter2 =
- +Ter3 =
- +Add1 =
- +Add2 =
- +Apr1 =
- +Apr2 =
- +EssInst =
Possessive suffixes:
- +PxSg1 = -in
- +PxSg2 = -iž
- +PxSg3 = -ze
- +PxPl1 = -moi
- +PxPl2 = -toi
- +PxPl3 = -ze
Comparative tags:
- +Comp =
- +Superl =
Subtags for Numerals:
- +Attr =
- +Card =
- +Ord =
ADVERBS
- +Manner =
- +Spat =
- +Temp =
Abbreviated words are classified with:
- +ABBR
- +Symbol = independent symbols in the text stream, like £, €, ©
- +ACR =
Special symbols are classified with:
- +CLB
- +PUNCT
- +LEFT
- +RIGHT
- +MIDDLE =
Special multiword units are analysed with:
- +Multi =
Guess tag, used to catch new words
- +Guess =
Question and Focus particles:
- +Qst
- +Foc =
- +Clt =
- +Foc/i = perhaps +Addative
- +Foc/ki =
- +Foc/žo =
- +Appr =
- +Advc =
- +Ter =
- +Pro =
- +Car =
- +PstI =
- +PstII =
Error (non-standard language) tags
- +Err/Orth substandard, not in normative fst
- +Err/Orth-no-pal = palatalization mark missing
Usage tags:
- +Use/-Spell Orthographically correct, typically peripheral words, excluded from the speller because they cause trouble for frequent words
- +Use/-PLX Excluded in PLX-speller
- +Use/SpellNoSugg recognized but not suggested in speller
- +Use/Circ circular paths (old ^C^)
- +Use/CircN circular paths for the numerals (old ^N^)
- +Use/MT Generate for MT only, for restricting analyses needed for MT generation not to pop up elsewhere (NOT IN FUNCTION)
- +Use/NG not-generate, for ped generation isme-ped.fst and MT
- +Use/NGminip Not for miniparadigm in NDS dicts
- +Use/PMatch means that the following is only used in the analyser feeding the disambiguator
- +Use/-PMatch Do not include in fst’s made for hfst-pmatch
- +Use/GC – only retained in the HFST Grammar Checker disambiguation analyser
- +Use/-GC – never retained in the HFST Grammar Checker disambiguation analyser
- +Use/TTS – only retained in the HFST Text-To-Speech disambiguation tokeniser
- +Cmp/SgNom = compound words
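The effect of a usage tag such as +Use/NG can be pictured as a filter over candidate analyses. The Python sketch below is only an illustration of that idea: in the real build such filtering is done on the transducers themselves, and the helper `drop_tagged` as well as the example strings are hypothetical.

```python
def drop_tagged(analyses, blocked_tag):
    """Return the analyses that do not carry blocked_tag as a complete tag."""
    kept = []
    for analysis in analyses:
        # Everything after the lemma is a '+'-separated sequence of tags.
        tags = ["+" + t for t in analysis.split("+")[1:]]
        if blocked_tag not in tags:
            kept.append(analysis)
    return kept

# Hypothetical candidate analyses, for illustration only.
candidates = [
    "sana+N+Sg+Nom",
    "sana+N+Use/NG+Sg+Nom",
    "sana+N+Use/NGminip+Sg+Nom",
]
for_generation = drop_tagged(candidates, "+Use/NG")
print(for_generation)
```

Comparing whole tags rather than substrings matters here: +Use/NGminip begins with the same characters as +Use/NG but is a different tag, so it is not removed.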
Semtags
- +Sem/Mal
- +Sem/Fem
- +Sem/Sur =
- +Sem/Plc =
- +Sem/Org =
- +Sem/Obj =
- +Sem/Ani =
- +Sem/Hum =
- +Sem/Plant =
- +Sem/Group =
- +Sem/Time =
- +Sem/Txt =
- +Sem/Route =
- +Sem/Measr =
- +Sem/Wthr =
- +Sem/Build =
- +Sem/Edu =
- +Sem/Veh =
- +Sem/Clth =
More semtags
- +Sem/Amount
- +Sem/Build-room
- +Sem/Cat
- +Sem/Curr
- +Sem/Date
- +Sem/Domain
- +Sem/Domain_Hum
- +Sem/Dummytag
- +Sem/Edu_Hum
- +Sem/Event
- +Sem/Food-med
- +Sem/Group_Hum
- +Sem/ID
- +Sem/Lang
- +Sem/Mat
- +Sem/Money
- +Sem/Obj-el
- +Sem/Obj-ling
- +Sem/Org_Prod-audio
- +Sem/Org_Prod-vis
- +Sem/Part
- +Sem/Prod-vis
- +Sem/Rule
- +Sem/Sign
- +Sem/State
- +Sem/State-sick
- +Sem/Substnc
- +Sem/Time-clock
- +Sem/Tool-it
- +Sem/Year
Derivations
Derivations are classified by the morphophonemic form of the suffix and by the source and target parts of speech.
- +V→N
- +V→V
- +V→A =
- +Der =
- +Der/NomAg = tehta : tegii
- +Der/Uz1 = sur»uz’ A»N
- +Der/Ta =
- +Der/Te =
- +Der/ma =
- +Der/Tu =
- +Der/IA =
- +Der/Toi = nime»toi N»A
- +Der/Matoi = V»A
- +Der/Mine = V»N
- +Der/V = V»V
Morphophonology
To represent phonologic variations in word forms we use the following symbols in the lexicon files:
Archiphonemes and fluctuation symbols
- %{eØ%} vowel loss in oiged:oiktan
- %{uØ%} vowel loss in sapug:sapkan
- %{iØ%} vowel loss in paltin:paltnan
- %{aØ%} vowel loss in samal:samlan
- %{oØ%} vowel loss in zerkol:zerklon
- {aä}
- {oö}
- {uü}
More archiphonemes (Protoletters for xfst)
- %^DEVOICE = haikta: haig
- %^PEN Control final vs penultimate
- QAQ1
- QAO1
- QÄQ1
- EH1
- QEQ1
- INE1
- ZD1
- ZS1
- V1
- AO1
- A1
- EI1
- ZS1
- ZD1 These are for developing underlying morphology rules
- D1
- E1
- U1
- I1
- AÄ1
- OÖ1
- UY1
- V1
- V2
- V3
And following triggers to control variation
- {front} front vowel stems
- {back} back vowel stems
- %^RmVow for removing vowels
- %^WGStem
- %^TS
- %^RVow
- %^LVow
- %^LCns
- %^WCns
- %^AtoO
- %^ÄtoÖ
- %^OddSyll
- %^StretchSyll2
- %^SyllBr
- %^E1
Boundary symbols
- %> =
- %- =
Flag diacritics
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics. They allow compounds with verbs only if the verb is further derived into a noun again:
| Flag | Explanation |
|---|---|
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
| Flag | Explanation |
|---|---|
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
| Flag | Explanation |
|---|---|
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
| Flag diacritic | Explanation |
|---|---|
| @U.number.one@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.two@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.three@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.four@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.five@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.six@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.seven@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.eight@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.nine@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.zero@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.one@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.two@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.three@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.four@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.five@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.six@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.seven@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.eight@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.nine@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.ten@ | Flag used to give Arabic numerals in smj different cases |
| @P.number.zero@ | Flag used to give Arabic numerals in smj different cases |
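The flag diacritics above follow the usual operator semantics: P sets a feature, D disallows, C clears, R requires, U unifies. As a rough illustration, the Python sketch below checks whether a sequence of flags along one path is accepted. It is a simplified model, not the HFST implementation: the N (negative set) operator is left out, the function name `path_is_allowed` is hypothetical, and the ordering of the NeedNoun flags in the example is an assumed reading of the table, not taken from the lexicon source.

```python
import re

# Matches @OP.Feature@ or @OP.Feature.Value@ for the operators modelled here.
FLAG = re.compile(r"@([PDCRU])\.([^.@]+)(?:\.([^.@]+))?@")

def path_is_allowed(flags):
    """Apply flag diacritics left to right; return False as soon as one fails."""
    state = {}  # feature name -> current value
    for flag in flags:
        match = FLAG.fullmatch(flag)
        if match is None:
            raise ValueError(f"not a supported flag diacritic: {flag}")
        op, feature, value = match.groups()
        if op in "PU" and value is None:
            raise ValueError(f"{flag} needs a value")
        current = state.get(feature)
        if op == "P":      # positive set: overwrite the feature
            state[feature] = value
        elif op == "C":    # clear: forget the feature
            state.pop(feature, None)
        elif op == "D":    # disallow: fail if set (to this value, if one is given)
            if current is not None and (value is None or current == value):
                return False
        elif op == "R":    # require: fail unless set (to this value, if one is given)
            if current is None or (value is not None and current != value):
                return False
        elif op == "U":    # unify: set if unset, otherwise the values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True

# Assumed ordering: a verb sets NeedNoun, a nominalising suffix clears it,
# and a final check disallows paths where it is still ON.
print(path_is_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
print(path_is_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
print(path_is_allowed(["@U.number.one@", "@U.number.two@"]))
```

Under these assumptions the first path is rejected, the second is accepted because the feature was cleared before the check, and the third is rejected because the two U flags carry conflicting values.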
Lexc lexica
Root lexicon
The word forms in Veps start from the lexeme roots of basic word classes.
Other lexica
CC_
CS_
INTERJ_
ADV_
ADV_MANNER
ADV_ADE ADV_ABL ADV_ALL ADV_ELA ADV_ILL ADV_INE ADV_LAT
ADV_TEMP
This (part of) documentation was generated from src/fst/morphology/root.lexc