Finnish NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

Finnish language model documentation

All doc-comment documentation in one large file.


src-cg3-disambiguator.cg3.md

Idiomatic cases


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions.cg3.md

SYNTACTIC FUNCTIONS FOR SÁMI

Sámi language technology project 2003-2018, University of Tromsø

This file adds syntactic functions. It is common for all the Saami

LEFT RIGHT because of apertium

Syntactic tags

Tag sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) … meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
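As an illustration only, the barrier scan described above can be mimicked in a few lines of Python. The tag names (`N`, `A`, `Det`, `Num`, `V`) and the helpers `is_not_npmod` and `scan_to_np_head` are hypothetical and not taken from functions.cg3; CG-3 evaluates this as a contextual test inside a rule, so this is a sketch of the idea under those assumptions, not of the implementation.

```python
# Hypothetical premodifier tags; the real sets are defined in functions.cg3.
PREMODIFIERS = {"A", "Det", "Num"}


def is_not_npmod(tags):
    """WORD - premodifiers: true for any word that is not an NP premodifier."""
    return not (set(tags) & PREMODIFIERS)


def scan_to_np_head(sentence, start):
    """Mimic (*1 N BARRIER NOT-NPMOD) from position `start`.

    `sentence` is a list of tag lists, one per word. Returns the index of
    the first noun to the right, or None if a barrier word comes first.
    """
    for i in range(start + 1, len(sentence)):
        tags = sentence[i]
        if "N" in tags:
            return i          # reached the NP head
        if is_not_npmod(tags):
            return None       # barrier: something outside the NP
    return None
```

With the tag lists `[["V"], ["Det"], ["A"], ["N"]]` and `start=0`, the scan skips the determiner and the adjective and returns index 3; if a verb stood between the adjective and the noun, the scan would stop at the verb and return `None`.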

ADLVCASE

These were the set types.

Numeral outside the sentence

HABITIVE MAPPING

sma object

SUBJ MAPPING - leftovers

OBJ MAPPING - leftovers

MAPPING for MT - experimental

HNOUN MAPPING

missingX adds @X to all missings

therestX adds @X to everything that is left, often erroneously disambiguated forms

For Apertium:

The analyser gives two analyses because semtags are optional. We choose the one with the semtag.
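The selection described here can be pictured as a simple filter over readings. The sketch below assumes that semantic tags are recognisable by a `Sem/` prefix and that a reading is a plain list of tags; both are assumptions made for illustration, and `prefer_semtag` is a hypothetical helper, not a rule from the grammar.

```python
def prefer_semtag(readings):
    """Keep only readings that carry a semantic tag, if any reading has one.

    Assumption: semantic tags start with "Sem/". If no reading has such a
    tag, all readings are returned unchanged.
    """
    with_sem = [r for r in readings if any(t.startswith("Sem/") for t in r)]
    return with_sem if with_sem else readings
```

Falling back to the full list when no reading has a semtag keeps the filter from ever removing the last remaining reading.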


This (part of) documentation was generated from src/cg3/functions.cg3


src-fst-guess-patterns.lexc.md

Guesser

A rule-based morphological guesser is based on using the paradigms from the dictionary-based analyser but replacing the roots with patterns. For Finnish we have quite neat paradigms with well-defined stem patterns: vowel harmony, stem vowels, and in some cases specific syllable counts.
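To make the idea of "patterns instead of roots" concrete, here is a minimal Python sketch. The two paradigm names are taken from the adjective documentation; the regular expressions and the function `guess_paradigms` are illustrative assumptions and are much cruder than the patterns in guess-patterns.lexc.

```python
import re

# Hypothetical pattern table: (regex over the unknown word, paradigm name).
GUESS_PATTERNS = [
    (re.compile(r"^[^äöy]+hko$"), "ADJ_TUMMAHKO"),  # back-vowel moderative
    (re.compile(r"^[^aou]+hkö$"), "ADJ_HÖLÖ"),      # front-vowel moderative
]


def guess_paradigms(word):
    """Return every paradigm whose stem pattern matches the unknown word."""
    return [name for pattern, name in GUESS_PATTERNS if pattern.match(word)]
```

Each pattern encodes both the ending and the vowel harmony of the stem, so an unknown word such as isohko matches only the back-vowel class and pienehkö only the front-vowel one.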

Symbols used for guesser Multichar_Symbols

The guesser uses a subset of the morphological analyser’s alphabet. For documentation, cf. the morphology root.


This (part of) documentation was generated from src/fst/guess-patterns.lexc


src-fst-morphology-affixes-abbreviations.lexc.md

Continuation lexicons for Finnish abbreviations

Lexica for adding tags and periods

The sublexica

Continuation lexicons for abbreviations both with and without final period

Lexicons without final period

Lexicons with final period


This (part of) documentation was generated from src/fst/morphology/affixes/abbreviations.lexc


src-fst-morphology-affixes-acronyms.lexc.md

Acronym inflection

Acronyms are inflected using a colon followed by the inflectional endings, which depend on either the last letter of the word or the inflection class of the last word of the abbreviation. The exception to this scheme is the singular nominative, which appears without a colon. Pronounceable abbreviations such as aids, hiv, kela and alko are counted as regular words with regular inflection patterns. Cf. VISK § 169.
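The colon scheme can be sketched for a single case. The Python below covers only the "last letter" branch and only the inessive; the set of letters treated as back-harmonic (those whose Finnish letter names contain a back vowel) is an assumption made for this illustration, and both function names are hypothetical.

```python
# Assumption: letters whose Finnish names contain a back vowel
# (aa, hoo, koo, oo, kuu, uu); every other letter name is treated as front.
BACK_LETTERS = set("AHKOQU")


def acronym_inessive(acronym):
    """Attach the inessive ending after a colon, e.g. EU -> EU:ssa."""
    suffix = "ssa" if acronym[-1].upper() in BACK_LETTERS else "ssä"
    return f"{acronym}:{suffix}"


def acronym_nominative(acronym):
    """The singular nominative is the exception: no colon, no ending."""
    return acronym
```

Acronyms whose ending follows the inflection class of the last word of the abbreviation would need a lexical lookup instead, which this sketch does not attempt.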

Acronyms ending in numbers inflect like the numbers are pronounced.


This (part of) documentation was generated from src/fst/morphology/affixes/acronyms.lexc


src-fst-morphology-affixes-adjectives.lexc.md

Adjective inflection

The adjectives are inflected like regular nouns. The only morphological difference between adjectives and other nouns is the higher likelihood of comparative derivations, which are fully productive. For adjectives that do not compare, use the qualifier classification instead. VISK § 300

Adjective stem variation and allomorph selection

Adjective stems are formed like noun stems, with similar patterns. Adjectives additionally have the productive comparative derivations, which may have their own stems, particularly an e-stem for a-stem words. The examples in this chapter are the same set of cases as with nouns: singular nominative, singular essive, singular inessive, plural essive, plural elative, singular partitives, singular illatives, plural partitives, plural genitives, plural illatives and the compound forms. In addition there are the comparative derivations: comparative singular nominative and superlative singular nominative. The majority of adjectives are equivalent to the corresponding noun classes, so some examples have been omitted.

Bisyllabic / derivational adjective stems without stem variation

The most basic adjective stems do not have any stem internal variation. They end in o, u, y or ö, and with some limited set of new words, e. This class has the fewest allomorphs. There are a number of productive adjective classes in this section, including all lexicalised nut participle’s passives (-tu, -ty), moderative derivations (-hko, -hkö) and … Examples follow in specific sub-classes.

The words in this class ending in o belong to ADJ_TUMMAHKO, the old dictionaries use class ¹. The stems should be entered in dictionary like: tummahko+A:tummahko A_TUMMAHKO ; This class includes back vowel moderative derivations. N.B. the comparative derivation of moderatives is semantically awkward, but morphologically plausible.
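The entry shape shown above (lemma, tag, colon, stem, continuation class, semicolon) is regular enough to generate. The following Python helper is a hypothetical convenience for illustration, not part of the lang-fin tooling; it assumes that the stem is identical to the lemma, as it is in the tummahko example.

```python
def lexc_entry(lemma, continuation, pos_tag="A"):
    """Format a root entry as lemma+TAG:stem CONTINUATION ;

    Assumes the surface stem equals the lemma, as in the example entry.
    """
    return f"{lemma}+{pos_tag}:{lemma} {continuation} ;"
```

Calling `lexc_entry("tummahko", "A_TUMMAHKO")` reproduces the entry quoted in the paragraph above.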

The stems ending in u are in class ADJ_VALKAISTU, and in old dictionaries the class is ¹. These stems should be entered in dictionary like: valkaistu+A:valkaistu A_VALKAISTU ; Common part of this class is formed by nut participle passive’s back vowel versions after s stem verbs:

The stems ending in y are in class ADJ_HÄPÄISTY, and in old dictionaries the class is ¹. Common part of this class is formed by nut participle passive’s front vowel versions after s stem verbs:

The words in this class ending in ö belong to ADJ_HÖLÖ, the old dictionaries use class ¹. This class includes front vowel moderative derivations.

The new words with e stem have same allomorph selection as old short unchanging bisyllabic u, y, o and ö stems, and no stem-internal variation. The classification for the back vowel variant of this class is ADJ_TOOPE, and old dictionaries used the class ⁸.

The front variation of unchanging e stems is class @LEXNAME, and in old dictionaries ⁸.

Trisyllabic and longer non-derived adjective stems

The trisyllabic and longer words with stem vowels o, u, y and ö have no stem variation either, but the selection of suffix allomorphs for plural genitives and partitives is wider than for bisyllabic and derived ones.

The o final trisyllabic stems are in class ADJ_KOHELO, and the old dictionaries used ².

And the trisyllabic ö stem is classified ADJ_LÖPERÖ.

Unchanging long vowel stems

The words with stem vowels o, u, y and ö preceded by vowels still have no stem variation, but use yet another pattern of allomorphs for singular and plural partitives and plural genitive.

The class for o final long vowel stems is ADJ_AUTIO, and old dictionaries used ³.

The front voweled stems with ö after vowels go to class ADJ_RIIVIÖ, and used the old dictionary class ³.

There are no examples of new loan word adjectives ending in a long vowel. Furthermore, there are no examples yet of adjectives in other classes without stem variation. There are some examples of these in nouns if you need new classes at some point.

The abovementioned o, u, y and ö stems as well as new e stems can all form combinations with gradation feature as well. Not all combinations are yet found for adjectives, for full reference, read the noun classes.

The quantitative k gradation in bisyllabic o stems uses class ADJ_KOLKKO, and old dictionaries use classes ¹⁻A and ¹⁻D.

The quantitative k gradation in bisyllabic u stems uses class ADJ_VIRKKU, and old dictionaries use classes ¹⁻A and ¹⁻D.

The quantitative k gradation in bisyllabic y stems uses class ADJ_SÄIKKY, and old dictionaries use classes ¹⁻A and ¹⁻D.

The quantitative k gradation in bisyllabic ö stems uses class ADJ_KÖKKÖ, and old dictionaries use classes ¹⁻A and ¹⁻D.

There is no unvarying e final adjective with k ~ 0 gradation.

The quantitative gradation of p before o is in class ADJ_SUIPPO and old dictionaries would use ¹⁻B.

The quantitative gradation of p before u is in class ADJ_IKÄLOPPU and old dictionaries would use ¹⁻B. It is only a nominal compound based adjective that ends in u and has p ~ 0 gradation here:

None of the adjectives ends in y with quantitative p gradation.

The quantitative gradation of p before ö is in class ADJ_LÖRPPÖ and old dictionaries would use ¹⁻B.

The quantitative gradation of t before o is in class ADJ_VELTTO, which was ¹⁻C in the dictionary.

The quantitative gradation of t before u is in class ADJ_VIMMATTU, which was ¹⁻C in the dictionary. The u stems with quantitative t gradation are commonest with nut participle passive derivation’s back form (-ttu).

The quantitative gradation of t before y is in class ADJ_YLENNETTY, which was ¹⁻C in the dictionary. The y stems with quantitative t gradation are commonest with nut participle passive derivation’s front form (-tty).

The quantitative gradation of t before ö is in class ADJ_KYYTTÖ, which was ¹⁻C in the dictionary.

The quantitative k gradation has a variant that allows use of apostrophe instead of nothing in the weak grade.

The class for o final bisyllabic stems with optional ’ is ADJ_LAKO, this is a subset of dictionary class ¹⁻D.

There is no k to optional apostrophe gradation with u, nor with y or ö. There is none with k always gradating to an apostrophe either. For examples of these, see the noun classes.

The qualitative gradation of p between vowels in o stems goes to v, the class for this is ADJ_KELPO, the dictionary class for this is ¹⁻E.

There are none ending in vowel + pu, nor with y and p, nor with ö and p.

The gradation of t ~ d after o is in class ADJ_MIETO, the dictionary class for this is ¹⁻F.

The gradation of t ~ d after u is in class ADJ_VIIPALOITU, the dictionary class for this is ¹⁻F. The commonest t ~ d variation in u stems comes from nut participle’s passive’s back form (-tu).

The gradation of t ~ d after y is in class ADJ_YKSILÖITY, the dictionary class for this is ¹⁻F. The commonest t ~ d variation in y stems comes from nut participle’s passive’s front form (-ty).

And there is none with t and ö.

The adjectives with -nko stem belong to class ADJ_LENKO, and the dictionary class was ¹⁻G.

There are no adjectives ending in nku, nor with y and nk, nor with ö and nk. There are no gradating p’s after m’s in unchanging stems. For all these, check the noun patterns.

The quantitative gradation of t after l in o stems is in class ADJ_MELTO, which corresponds to dictionary class ¹⁻I.

The quantitative gradation of t after l in u stems is in class ADJ_PARANNELTU, which corresponds to dictionary class ¹⁻I. The common u stem after l is in nut participle’s passive (-tu):

The quantitative gradation of t after l in y stems is in class ADJ_VÄHÄTELTY, which corresponds to dictionary class ¹⁻I. As with y and t:

There’s no adjective ending -ltö in our lexical database.

The quantitative gradation of t after n in o stems is in class ADJ_VENTO, which corresponds to dictionary class ¹⁻J.

The quantitative gradation of t after n in u stems is in class ADJ_PANTU, which corresponds to dictionary class ¹⁻J. The common u stem after n is in nut participle’s passive’s back form (-tu):

The quantitative gradation of t after n in y stems is in class ADJ_MENTY, which corresponds to dictionary class ¹⁻J. The common y stem after n is in nut participle’s passive’s front form (-ty):

There are no adjectives ending in -ntö

The quantitative gradation of t after r in o stems is in class ADJ_MARTO, which corresponds to dictionary class ¹⁻J.

The quantitative gradation of t after r in u stems is in class ADJ_PURTU, which corresponds to dictionary class ¹⁻J. The common u stem after r is in nut participle’s passive’s back form (-tu):

The quantitative gradation of t after r in y stems is in class ADJ_PIERTY, which corresponds to dictionary class ¹⁻J. The common y stem after r is in nut participle’s passive’s front form (-ty):

There are no adjectives ending in -rtö. Likewise, the class for UkU : UvU- is limited to the few nouns we know.
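The gradation pairs that the classes above refer to can be collected into one small table. The Python sketch below is an illustration built from the class descriptions in this section (kk, pp, tt, k ~ 0, p ~ v, t ~ d, nk, lt, nt, rt); the function `weak_stem` is hypothetical, handles only vowel-final stems, and does not decide whether a stem gradates at all, since that is exactly what the class assignment encodes.

```python
# Strong -> weak grade pairs named in the class descriptions above,
# longest patterns first so that "tt" wins over "t", and so on.
GRADATIONS = [
    ("kk", "k"), ("pp", "p"), ("tt", "t"),            # quantitative
    ("nk", "ng"), ("lt", "ll"), ("nt", "nn"), ("rt", "rr"),
    ("t", "d"), ("p", "v"), ("k", ""),                # t ~ d, p ~ v, k ~ 0
]


def weak_stem(strong_stem):
    """Return the weak-grade stem of a gradating, vowel-final stem.

    Only call this for stems whose class is a gradating one; a stem such
    as tummahko must not be passed in, because the table would wrongly
    delete its k.
    """
    body, final_vowel = strong_stem[:-1], strong_stem[-1]
    for strong, weak in GRADATIONS:
        if body.endswith(strong):
            return body[: len(body) - len(strong)] + weak + final_vowel
    return strong_stem
```

For example, kolkko gives kolko, kelpo gives kelvo and mieto gives miedo, which are the stems that the genitive ending -n attaches to.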

The special illative alternation with k gradation, unaltering stems

The trisyllabic words ending with gradating long k have plural illative in both strong and weak forms.

The class for trisyllabic -kko stems is ADJ_HUPAKKO, the corresponding dictionary class is ⁴⁻D.

Adjective stems with i:e variations

The i stems of new i final words have i : e : 0 variation. These classes include new loans ending in a consonant, which use -i to form inflectional stems. The i stems combined with gradation will form five separate stem variants:

The i finals with back vowel harmony go to class ADJ_ABNORMI, where old dictionary classification was ⁵.

The i finals with front vowel harmony go to class ADJ_STYDI, where old dictionary classification was ⁵.

Stems with quantitative k gradation, i final and back harmony are in class ADJ_OPAAKKI and dictionary class ⁵⁻A or ⁵⁻D.

Stems with quantitative k gradation, i final and front harmony are in class ADJ_PINKKI and dictionary class ⁵⁻A or ⁵⁻D.

There’s no back vowel version of the bisyllabic gradating -ppi form.

Stems with quantitative p gradation, i final and front harmony are in class ADJ_SIPPI and dictionary class ⁵⁻B.

Stems with quantitative t gradation, i final and back harmony are in class ADJ_HURTTI and dictionary class ⁵⁻C.

Stems with quantitative t gradation, i final and front harmony are in class ADJ_VÄÄRTTI and dictionary class ⁵⁻C.

There are no bisyllabic adjectives ending in vowel and gradating -pi.

Stems with t ~ d gradation, i final and back harmony are in class ADJ_TUHTI and dictionary class ⁵⁻F.

Stems with t ~ d gradation, i final and front harmony are in class ADJ_REHTI and dictionary class ⁵⁻F.

There are no adjectives with i stems with other gradations.

Trisyllabic and longer i stems

The back vowel i stems with trisyllabic allomorph sets have class ADJ_ABNORMAALI, and dictionary class of ⁶.

The front vowel i stems with trisyllabic allomorph sets have class ADJ_ÖYKKÄRI, and dictionary class of ⁶.

There are no adjectives acting like nouns where i-final nominatives have singular e stems.

Bisyllabic A-stem adjectives

The a stems differ from regular nouns in one more feature: some of them, but not all, have a:e variation before the comparative stem. The selection of this feature may be phonological, but it is complex, and often there is variation between speakers. The plural partitives and genitives are a good indicator of the classification of a, ä final stems, as are the vowels of comparative and plural stems.
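The a:e variation before the comparative suffix can be shown with a small Python sketch. It takes the weak-grade singular stem as input and leaves the decision about a:e variation to the caller, because, as noted above, that choice is lexical rather than predictable. The function name and its parameter are hypothetical.

```python
def comparative(stem, a_to_e=False):
    """Form the comparative nominative from a weak-grade singular stem.

    `a_to_e` is set by the caller for classes with a:e variation
    (e.g. ADJ_AAVA, ADJ_RUMA); the sketch does not try to predict it.
    """
    if a_to_e and stem[-1] in "aä":
        stem = stem[:-1] + "e"
    return stem + "mpi"
```

So aava and ruma with the flag set give aavempi and rumempi, while a stem of the a : 0 type such as aaltoileva keeps its vowel and gives aaltoilevampi.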

Bisyllabic a stems with e comparative and j plurals are in class ADJ_AAVA, and dictionary class ⁹.

The ka stem with e comparative and j plurals is ADJ_TARKKA, and the dictionary class is ⁹⁻A or ⁹⁻D.

There are no a final adjectives with quantitative p gradation.

The ta stem with j plurals is ADJ_MATTA, and the dictionary class is ⁹⁻C.

The pa : va stem with e comparative and j plurals is ADJ_HALPA, and the dictionary class is ⁹⁻E.

The ta : da stem with a comparative and j plurals is ADJ_EHTA, and the dictionary class is ⁹⁻F.

None with k:g gradation.

The pa : ma stem with e comparative and j plurals is ADJ_RAMPA, and the dictionary class is ⁹⁻H.

No a stems with t:l gradations

The ta : na stem with a comparative and j plurals is ADJ_VIHANTA, and the dictionary class is ⁹⁻J.

And finally, no a stems with t:r or t:l gradations.

Some other trisyllabic a finals with a : 0 plurals

Mostly regular a comparatives, no a : o variation and more syllables. Common for va participles, and other derivations

The a : 0 stem is in class ADJ_AALTOILEVA and the old dictionary used ¹⁰ as the paradigm class.

The ä : 0 stem is in class ADJ_TYÖLLISTETTÄVÄ and the old dictionary used ¹⁰ as the paradigm class.

For a:e comparatives in a:0 class use ADJ_RUMA. No dictionary classification, or ~¹⁰.

For ä:e comparatives in a:0 class use ADJ_TYHMÄ. No dictionary classification, or ~¹⁰.

THE EPSILON-to-ZILCH SEEMS TO INTERFER WITH COMPILATION 2015-08-23, Jaska 0:  » 0:0 SO TOMMI FINDS THEM

The quantitative k and t gradations are not found for adjectives with this a stem.

The t:d is missing from this a stem.

p:m is missing from these a and ä stems.

T:l and t:n are missing from this a stem and t:l from ä stem.

K:j is missing.

Certain trisyllabic or longer a stems allow a lot of allomorphs and both a : o : 0 variations:

Special illative gradation in a stems

The a : o stem variation combines with trisyllabic class of special illatives

A-final words with long vowels and syllable boundary

Lexicalised comparatives

Most of the lexicalised comparatives are adjectives that go to this class. The comparatives that are not lexicalised inflect exactly the same, though some versions of morphology may cut off long comparative chains.

Long vowel stems

There are no other bisyllabic long vowel stems in adjectives

There are no other monosyllabic long vowel stems for adjectives. For full listing of possibilities, see nouns.

Some loan words inflect irregularly, either more along the written form or the pronunciation.

There are not many direct adjective loans in general.

Old e stems with i nominative

Some of the old e stems have i nominative but e as stem vowel for singular forms. Most of these are not adjectives though, see full listing from the noun pages.

There are no adjective examples of other gradation variants or consonant cluster simplifications in this class.

Consonant-final stems

The consonant stems use inverted gradation if applicable, that is, the nominatives end in consonants and their gradating consonants are in weak form. Most of these are rarer for adjectives than nouns.

There are no back vowel variants or gradating words in the basic e conjoining pattern.

There are no other examples of n:m final variation before conjoining e.

Caritives

The common case of n:m variation with conjoining a before singular stems is from caritive suffix -tOn, that forms adjectives productively.

This one word, hapan, also takes the same variation as normative variant. The expected e variant is not normative, but used.

Lexicalised superlatives

Vasen inflects almost like a superlative

nen suffixes

Adjectives are commonly formed with nen derivations.

s-final adjectives

Most of the cases here are nouns from noun derivations.

No adjectives end in -päs

Gaps.

t-finals

Lexicalised nut-participles

The majority of lexicalised nut participles are adjectives.

Old -e^ final stems

Gapping almost all variants of gradations with e, as well as all dual nominative stems.

Exceptional adjectives

The ones that do not fit in the official classes shown in dictionaries.

THE EPSILON-to-ZILCH SEEMS TO INTERFER WITH COMPILATION 2015-08-23, Jaska 0:  » 0:0 SO TOMMI FINDS THEM

Plurales tantum?

Adjectives aren’t typically plural words, but there are some in the dictionaries.

Adjective inflection proper

The superlative derivation is formed by in suffix, which creates a new adjective baseform. This baseform is handled separately to avoid double superlatives.

The comparative derivation is formed by mpi suffix, which creates a new adjective baseform. This adjective is handled separately to avoid double comparative forms.

This inflectional part is attached to adjective comparative stems to avoid circularity in comparative derivations:

This inflectional part is attached to adjective superlative stems to avoid circularity in superlative derivations:

Regular adjective inflection

The adjective inflection apart from the comparative and superlative derivations is same as with nouns. I will only show examples here.

Adjectives can usually be derived into sti adverbs productively.


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-adv.lexc.md

Adverb inflection

Most adverbs are morphologically either sti-derivations of adjectives or some specific form of an existing or archaic noun, and they have limited inflection in form of possessives and clitics carried over. According to modern dictionaries different forms of same root are separate adverbs, so they are not inflected here, but listed in roots.


This (part of) documentation was generated from src/fst/morphology/affixes/adv.lexc


src-fst-morphology-affixes-digits.lexc.md

Digit strings inflect with colons, much like abbreviations.

The digit strings ending in digit 1 pronounced as number

The digit strings ending in digit 2 pronounced as number

The digit strings ending in digit 3 pronounced as number

The digit strings ending in digit 4 pronounced as number

The digit strings ending in digit 5 pronounced as number

The digit strings ending in digit 6 pronounced as number

The digit strings ending in digit 7 pronounced as number

The digit strings ending in digit 8 pronounced as number

The digit strings ending in digit 9 pronounced as number

The digit strings ending in digit 0 pronounced as number

The digit string ending in 0 pronounced as tens

The digit string ending in 00 pronounced as hundreds

The digit string ending in 000 pronounced as thousands

The digit string ending in 0’s pronounced as millions

The digit string ending in 0’s pronounced as milliards

The digit string ending in 1. pronounced as first

The digit string ending in 2. pronounced as second

The digit string ending in 3, 6, 8, 00 or 2 pronounced as ordinal

The digit string ending in 4, 5, 7, 9, 0, or 1. pronounced as ordinal

The roman digit string ending in II, III, VI, VIII, C, L or M

The roman digit string ending in I, IV, V, VII, IX, X or
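The cardinal classes listed above amount to choosing the ending from how the end of the digit string is read aloud. The Python sketch below illustrates this for the inessive only. The assumption that harmony follows the spoken form of the last digit, the treatment of round numbers, and the function name are all mine; ordinals and roman numerals are not covered.

```python
# Assumption: harmony follows the spoken form of the final digit
# (yksi, neljä, viisi, seitsemän, yhdeksän are front; the rest are back).
FRONT_DIGITS = set("14579")


def digit_inessive(digits):
    """Attach the inessive ending to a digit string, e.g. 5 -> 5:ssä."""
    if len(digits) > 1 and digits.endswith("0"):
        # Round numbers are read as tens (front) when exactly one zero
        # ends the string, otherwise as hundreds, thousands, ... (back).
        front = not digits.endswith("00")
    else:
        front = digits[-1] in FRONT_DIGITS
    return f"{digits}:{'ssä' if front else 'ssa'}"
```

The separate lexicons for tens, hundreds, thousands, millions and milliards in the source serve the same purpose as the zero-counting branch here, while also selecting the right stem for each case form.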


This (part of) documentation was generated from src/fst/morphology/affixes/digits.lexc


src-fst-morphology-affixes-nouns.lexc.md

Noun inflection and derivation

Noun stem variation and allomorph selection

The nominal stems were classified according to their stem variation and the allomorphs they select. This section lists all the possible variations and all their allomorph selections. Each class description gives the key forms that can be used when classifying new words, the examples of inflection and negative examples for the most obvious missed allomorphs and differentiating factors. The example list should be at least: singular nominative, singular essive, singular inessive, plural essive, plural elative, singular partitives, singular illatives, plural partitives, plural genitives, plural illatives and the compound forms.

Noun stems without stem variation

The most basic noun stem does not have any stem internal variation and uses few commonest allomorphs. The words in this class are either bisyllabic or have one of common derivational suffixes: sto, …. The nouns in this class end in o, u, y or ö, which determines their illative suffix and therefore exact classification:

The stems ending in u are also without variation, and the bisyllabic ones have the same simple allomorph pattern:

Basic gradation cases

The basic stems without variations other than consonant gradation.

Between two vowels, the weak grade of k is optionally an apostrophe instead. For k that is not optionally ’, for example when it is after a consonant other than s, the variation is k ~ 0 instead (e.g. NOUN_UKKO).

Between three vowels where k is surrounded by same vowels the k becomes obligatorily ’. When the vowels are different, it becomes optionally ’ (as in NOUN_TEKO), and after consonant other than s it is k ~ 0 (as in NOUN_UKKO).
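The three cases described above (obligatory apostrophe, optional apostrophe, plain k ~ 0) can be written out as a decision procedure. The Python sketch below follows the wording of these two paragraphs and nothing more: it ignores other k gradations such as nk : ng, uses an ASCII apostrophe, and the function name is hypothetical.

```python
VOWELS = set("aeiouyäö")


def weak_k_variants(stem):
    """Weak-grade variants of a stem whose last k is the gradating one."""
    i = stem.rfind("k")
    if i <= 0 or i == len(stem) - 1:
        return [stem]                       # no k in a gradating position
    before, after = stem[:i], stem[i + 1:]
    prev = before[-1]
    if prev == "s":
        return [stem]                       # k after s does not gradate
    if prev not in VOWELS:
        return [before + after]             # ukko : uko-
    same_vowels = prev == after[0]
    if same_vowels and len(before) > 1 and before[-2] in VOWELS:
        return [before + "'" + after]       # vaaka : vaa'a-
    return [before + after, before + "'" + after]  # teko : teo- ~ te'o-
```

Returning a list rather than a single string keeps the optional variants visible, which mirrors how the analyser has to accept both spellings.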

There is a gap in paradigms in y and ö finals of k:’

Other gradations can be more easily caught from the preceding context.

There is a gap in paradigms in y finals with p:m variation

There is a gap that misses all other stem vowels of r:t variation except o

The k:v variation is unique to handful of words of form CUkU, such as

The trisyllabic and longer words with stem vowels o, u, y, ö have no stem variation either, but selection of suffix allomorphs for plural genitives and partitives is different:

The words with stem vowel o, u, y, ö preceded by vowels still have no more stem variation than other cases, but give yet another pattern of allomorphs for plural partitives and genitives:

Similar inflection exists in limited amounts in new loan words that are written as pronounced (thus taking no stem variation but still ending with long vowels with a syllable boundary):

This class includes a set of new proper nouns that get nativised a bit:

Optional gradation with illatives

In some trisyllabic words ending with quantitative consonant gradation, the illative form can attach to either strong or weak stem, even in standard written Finnish. Otherwise words in this class behave like other trisyllabic stems.

I final stems

The basic variation of stem final i in nominative is that it becomes e in plural stems. In the plural genitive form, the stem vowel disappears, making the suffix allomorph -ien instead of -jen.
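The -ien versus -jen contrast can be illustrated with a two-branch Python function. It covers only bisyllabic stems without gradation and only this one allomorph per stem type; the example words lasi and talo are my own illustrations, and the function name is hypothetical.

```python
def plural_genitive(stem):
    """Plural genitive for a bisyllabic stem without gradation."""
    if stem.endswith("i"):
        return stem[:-1] + "ien"   # stem vowel drops: lasi -> lasien
    return stem + "jen"            # o, u, y, ö stems: talo -> talojen
```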

There is a gap in i final words for p:v variation and front harmony

There is a gap in i final paradigm with t:l variation and back vowels.

New loan words ending in consonant may be inflected as i stem words

The trisyllabic -i final stems work like their o, u, y, ö counterparts; they combine the e:i and e:0 variation with the additional allomorphs for plural partitive and genitive:

I-final nominatives with e stems

Some of the words with i-final nominative forms have i:e variation for singular stem, and i:0 for whole plural stems:

New e-final stems

The new words with e stem work exactly like the bisyllabic o, u, y, and ö stems: no stem variation and the same allomorph set. This variant can be recognised from the singular illative:

A-final stems

The basic variation of a-final stems is a:o in plural forms:

Notably, the basic a:o paradigm does not support many ä:ö cases.

Some -A stems do not have the a:o variation, exhibiting a:0 variation for plural stems instead. This class notably contains all the -jA suffixed deverbal nouns.

There is a gap for ä-final word with p:m variation

There is a gap for k:j gradation in a-final stems.

Certain trisyllabic stems allow both variations of a:o and a:0 for plural forms.

Some a stems with a:o variation have slightly different set of allomorphs

There is a possible class for further variation of a:o in the old dictionary that is worth re-evaluating.

There’s one more allomorph pattern.

The trisyllabic a-final words with quantitative consonant gradation allow same illative variation as o, u, y, and ö finals described earlier.

The a-final words ending in long vowels with syllable boundary have a:0 variation and more allomorphs for plural genitive or illative.

Lexicalised comparatives

Lexicalised comparatives have the same special inflection pattern as comparatives have: stem final i varies with a, and mp gradates into mm. There are not many comparatives that lexicalise into nouns.

Long vowel stems

The long vowel stems have shortening variation in plural inflection, and special singular illatives, partitives.

Bisyllabic and longer stems with long vowels have -seen illative suffix. This class has some of the old -UU derivations.

Monosyllabic long vowel stems have illative suffixes of form -hVn.
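The two illative patterns for long vowel stems can be put side by side in a short Python sketch. Syllable counting is left to the caller through a flag, and the example words are my own illustrations, not taken from the lexicon; the function name is hypothetical.

```python
def long_vowel_illative(stem, monosyllabic):
    """Singular illative of a stem ending in a long vowel."""
    if monosyllabic:
        # -hVn: the vowel of the suffix copies the stem-final vowel.
        return stem + "h" + stem[-1] + "n"   # maa -> maahan
    return stem + "seen"                      # vapaa -> vapaaseen
```

Because the suffix vowel in -hVn copies the stem vowel, vowel harmony comes out correctly without a separate harmony check.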

Opening diphthong stems

In old opening diphthong final words, the diphthong simplifies for plural forms by dropping the initial vowel. For new words of this class, no stem mutations happen and they are in the above-mentioned classes (e.g. NOUN_ZOMBIE).

THIS WAS MISSING 2015-08-23, REDIRECTING Jaska

Newer loan words

The loan words that end in a long vowel and have been adapted to Finnish orthography have a combination of the long vowel stem’s allomorphs for e.g. illatives. Sometimes the rules for these classes of words are vague.

There’s a gap in -ii final loan stems.

Some loan words don’t end in a long vowel but work as if they did. The official dictionary says these words should avoid plural genitive itten but not iden; in reality they may be in absolutely free variation. In general the rules for loan words are vague and do not always seem to work.

The rules are even more wonky when the vowel harmony does not follow the orthography, or the orthography leaves things open to interpretation. The only way to even begin to understand the norm is to look up examples on the RILF site (to me, some of the forms on that normative guide are just bizarre).

The loan words that end in consonant when written but vowel when pronounced are inflected with an apostrophe ’. With half-vowels the rule is a bit shaky, but officially apostrophe is the only way.

I-final stems (old e stems)

Some i-final stems have i:e variation in singular forms, as they originate from -e forms; only the nominative has -i. They also have some consonant stem forms that are archaic for other classes of words. The differences between these classes are in the selection of singular partitives and plural genitives (but the boundaries of the norm are not clear-cut, and most variants are found in the wild):

It is worth noting that in the official dictionary classification, classes 24 and 26 are identical. The distinction should probably not be retained in future versions.

The -mi stems will rarely undergo m:n variation for consonant stem forms.

The -si words that originate from old -te stems have the consonant gradation patterns left in their stems. The si is only in nominative stem and this class mainly concerns stems that are old enough to have undergone ti>si transformation.

A few -psi, -ksi, -tsi stems have consonant simplification for consonant stems. Other variation with these stems is the selection of plural genitive allomorphs.

The -ksi stem in haaksi includes k:h variation.

Consonant final nouns

The consonant stems use inverted gradation if applicable, that is, the nominatives end in consonants and their gradating consonants are in weak form. The singular forms of consonant final words have intervening e before suffixes. The basic consonant final words have no stem modifications.

Some of the n-final stems have n:m variation.

-tOn suffixes

The caritive suffix -ton inflects with A before the singular suffixes.

Lexicalised superlatives

The lexicalised superlatives have special inflection pattern.

Vasen inflects almost like superlative:

-nen suffixed forms

A number of derivations end in -nen, which has a special alternation pattern.

-s final words

The s final words have some variation patterns that are determined lexically.

The basic variation is s:ks, with e before the singular suffixes.

Some of the s final stems have additional s:t:d variation in singular stems. Most notably, the UUs derivations are in this class.

Some s final words have special lengthening inflection.

The word mies has special s:h variation pattern.

t-final words

The t-final words have t:0 variation in the stem, and the singular suffixes are as usual joined with e. It is common to see non-standard forms of these words.

A few t-final words have lengthening in singular stems.

Nominalised nut participles have special inflection just as well.

Old e^ stems

The e final words that have lost final consonant inflect like consonant final words, including the inverse consonant gradation. This class includes all deverbal -e suffixed nouns.

Dual nominative paradigms

A handful of words can use two completely distinct inflection patterns where a bit of overlapping inflection has been cut out. These words have two nominatives, and thus often two dictionary entries: one which is regular entry from the e^ class of words (like NOUN_ASTE), and one which is consonant final, and may have inverse gradation.

Exceptions to dictionary inflection

There are a few cases where dictionaries traditionally have never indicated correct inflection by classification. In a computational implementation we need to assign some classes or exceptional paths to them, and they are described here.

Two words have exceptions in their vowel harmony patterns:

It is not noted anywhere that the common inflection pattern for veli is exceptional:

A few ika-final nouns, but not all, have the shifting half vowel written as j in normative orthography.

Noun forms of numerals with special inflection

The numerals are not really nouns in this morphology (for details see [numeral-stems]), but some of their compounds are nouns, and the following classes are for those that have special stem or suffix patterns not available with other nouns. The numerals 1 and 2 are in a paradigm that currently contains only one other noun, haaksi, so nominals with 2 go to that class but 1 gets a new class for being front-vowelled. 3 has its own paradigm, 4 is like koira, 5 like hiisi and 6 like kausi. The numerals 7, 8 and 9 have their own paradigm, which, other than the nominative having an extra n at the end, is the same as the trisyllabic a, ä stems; similarly, 10 is quite like sisar with the extra en in the nominative.

Plurales tantum

For some words, the singular forms are rare, odd, or even deemed ungrammatical; these words have separate classes. In Finnish, words commonly in this class are events like häät (wedding), juhlat (party), etc., and things that are semantically paired, like clothes with two of something (as in English): farkut (jeans), housut (pants). It is noteworthy that dictionaries sometimes classify common words as plurale tantum for semantic reasons: joukot (troops) versus joukko (group). We don’t need that at this point. The compounds of plurale tantum words are made from singular forms: farkkukangas (jean fabric), hääjuhla (wedding party).

Adjective-initial Compounds with Agreeing Inflection

The words in dictionary paradigm class ⁵¹ refer to an old closed class of adjective-initial compounds, which follow the same agreeing compound pattern as numbers. The number of these words is relatively small, so they have been spelled out here in full form rather than using more complex methods of agreeing compounding. Further reading: VISK § 420. FIXME: some are still missing

Nouns and other nominals inflect in number, case, possessives and with clitics, in that order. The number of combinations that this regular inflection can form is approximately 2×15×5×26=3900, so we do not show all variants in test cases and examples, but just the central ones that are interesting and likely to break.

Nominatives

Singular nominative is the dictionary reference form for most of the words.

The plural nominative attaches to singular stem. For plural words it is also the form that is used for dictionary lookups:

Singular case inflection

The nouns can inflect in 15 regular cases. Most of the cases have one or two case endings, with the only varying part being the harmony vowel. For nouns with direct consonant gradation, most of the singular case suffixes attach to the weak singular stem:

For words with direct gradation, the only forms that attach to the strong singular stem may be the essive and the possessives of the nominative or genitive:

Plural inflection

The strong plural stem of words with direct gradation contains only essive and comitative:

Cases with allomorphic variation

The nominal cases which can take several different suffixes are singular and plural partitives, singular and plural illatives and plural genitives. After stem variation the selection of these allomorphs is the main factor of morphological classification of nouns.

The reconstructed historical suffix for partitives of Finnish is ða, ðä, the current variants according to that theory would be the different realisations of extinct ð.

For basic vowel stems the partitive suffix is a, ä.

In stems other than -a, -ä stems, the 3rd possessive suffix may appear in -an, -än form after the partitive suffix.

The consonant stems and long vowel stems regularly take -ta, -tä suffix for singular partitives. It is up to interpretation whether the partitive suffix of the -e^ stem is considered to be -tta, -ttä, or just -ta, -tä. In principle the consonant that disappeared from -e^ stems could be from that.

The singular illatives have more variants.

The variant with intervening h attaches to long vowel stems:

The stems with short vowel do not have the intervening h.

Bisyllabic long vowel stems have illative suffix of -seen.
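
The three illative patterns described above can be sketched in Python. This is only an illustration under simplifying assumptions, not the lexc implementation: the function name `illative_sg` and the diphthong list are introduced here, the input is assumed to be a vowel-final singular stem that already has the right grade, and the bisyllabic condition is approximated as "more than one vowel group".

```python
# Hypothetical helper for illustration; not part of the lexc sources.
VOWELS = set("aeiouyäö")
DIPHTHONGS = {
    "ai", "ei", "oi", "ui", "yi", "äi", "öi",
    "au", "eu", "iu", "ou", "ey", "iy", "äy", "öy",
    "ie", "uo", "yö",
}


def vowel_groups(stem):
    """Count maximal runs of vowels (a rough stand-in for syllable count)."""
    count, in_run = 0, False
    for ch in stem:
        if ch in VOWELS:
            if not in_run:
                count += 1
            in_run = True
        else:
            in_run = False
    return count


def illative_sg(stem):
    """Pick the singular illative allomorph for a vowel-final stem."""
    if not stem or stem[-1] not in VOWELS:
        raise ValueError("this sketch only handles vowel-final stems")
    long_vowel = len(stem) >= 2 and stem[-1] == stem[-2]
    if long_vowel and vowel_groups(stem) > 1:
        return stem + "seen"                  # vapaa -> vapaaseen
    if long_vowel or stem[-2:] in DIPHTHONGS:
        return stem + "h" + stem[-1] + "n"    # maa -> maahan
    return stem + stem[-1] + "n"              # talo -> taloon


for stem in ["maa", "työ", "vapaa", "talo", "kylä"]:
    print(stem, illative_sg(stem))
```

The sketch ignores stem alternation entirely, so a word such as käsi would have to be passed in as its inflectional stem käte.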

Plural genitives have the most variants. In particular, many words have more or less free variation between a handful of choices.

The -in suffix for plural genitive that goes with singular stem is always markedly archaic. Most commonly it appears in compound words.

Plural partitive has a few variants:

The plural illative also has a few variants:

Possessive suffixes

Possessives come optionally after the case suffixes. For consonant final cases the possessives assimilate or eat the final part of the case ending or stem.

The possessive suffix of form -an, -än, attaches to some long vowel stems:

Noun clitics

Clitics can attach to any word-form, including one that already has a clitic. Clitics do not modify the form they attach to and are simply concatenated to the end.
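
Because clitics are plain concatenation, the behaviour can be sketched in a few lines of Python. The function `attach_clitics` is a hypothetical helper, and the upper-case A and O in the clitic templates stand for harmony vowels, following the notation used elsewhere in this documentation (-kA, -tOn). Vowel harmony is decided naively from the whole word form, which is an assumption that does not hold for all compounds and loan words.

```python
# Hypothetical helper for illustration; not part of the lexc sources.
BACK_VOWELS = set("aou")


def attach_clitics(form, clitics):
    """Append clitics to a word form without modifying the form itself."""
    back = any(ch in BACK_VOWELS for ch in form.lower())
    for clitic in clitics:
        if back:
            realised = clitic.replace("A", "a").replace("O", "o")
        else:
            realised = clitic.replace("A", "ä").replace("O", "ö")
        form += realised
    return form


print(attach_clitics("talossa", ["kO", "hAn"]))  # talossakohan
print(attach_clitics("kylässä", ["kin"]))        # kylässäkin
```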

Noun compounding

Nouns form compounds productively. The non-final parts of regular compounds are singular nominatives, or singular or plural genitives of nouns. The final parts are nominals and inflect regularly.


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-numerals.lexc.md

Numeral inflection

Numeral inflection is like nominal inflection, except that numerals compound in all forms, which requires a great amount of care in the inflection patterns.

Original file


This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Verb inflection and derivation

The verbs’ conjugation includes voice (in Finnish grammars also verbal genus), tense/mood (tempus/modus), personal endings and the negation marker. The verbs also have productive derivations to defective nouns as infinitives, and to adjectives as participles, which are considered to be part of inflection and thus included in most versions (VISK § 105). The morphology of participles and the historical 4th infinitive is further detailed in section Deverbal nouns’ morphology. The analysis strings of verbs are not as systematic as those of nouns, as many categories collapse together in forms, e.g. tense and mood are only distinct in the indicative past and non-past; otherwise mood implies tense in the semantic sense.

Verb stem variation

Verbs have no allomorphic variation per se, except for some assimilation and variation of ð forms, but the stem variation is the same as in nouns. The examples for verb stems are given for each class: A infinitive’s lative, e infinitive’s inessive, indicative present 1st singular, indicative present 3rd singular, indicative past 1st singular, indicative past 3rd singular, conditional past 3rd singular, imperative 2nd plural, potential 1st singular, present passive, past passive, nut participle passive

Verb stems without stem variation

The u stems have no stem variation:

The o stems have no stem variation:

The ö stems have no stem variation:

The y stems have no stem variation:

Verb stems with only gradation

Verbs with -a stems

Some of the a stems have t:s in past stems by ti>si variation.

In some cases t:s variation is optionally alongside the regular gradation:

Other a stems undergo a:o variation

In some of the a:o variations the t:s variant is also possible.

Verbs with e stems

Some of the e stems allow for optional t:s variation in past form.

The rare ht:ks kind of variation is also possible.

Verbs with i stems

Verbs with long vowel stem

These verbs also have da variant of a infinitive forms.

Monosyllabic verbs with widening diphthong

Widening diphthongs are simplified before past and conditional suffix i’s by removal of first component.
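
As a small illustration of the simplification just described, the Python sketch below removes the first component of a widening diphthong before the i of the past and conditional. The function names are hypothetical, the input is assumed to be the monosyllabic verb stem (juo, syö, vie), and käydä is deliberately not covered.

```python
# Hypothetical helpers for illustration; not part of the lexc sources.
WIDENING_DIPHTHONGS = {"ie", "uo", "yö"}


def past_3sg(stem):
    """Past 3rd singular of a stem ending in a widening diphthong."""
    if stem[-2:] not in WIDENING_DIPHTHONGS:
        raise ValueError("stem does not end in a widening diphthong")
    # Drop the first component of the diphthong, then add the past i.
    return stem[:-2] + stem[-1] + "i"


def conditional_3sg(stem):
    """Conditional 3rd singular: the same simplified stem plus -isi."""
    return past_3sg(stem) + "si"


for stem in ["juo", "syö", "vie"]:
    print(stem, past_3sg(stem), conditional_3sg(stem))
```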

In past and conditional forms of käydä, the glide before suffix is marked even in normative orthography.

Verbs with consonant stems

Verbs with momentane derivation are common consonant stems.

Verbs with n, r, l, s stems

Notably, the d of the a infinitive forms assimilates to the preceding consonant.

Frequentative derivations are most common source of l stemmed verbs.

tse stuff

Some verbs have possible optional heteroclitic indicative stems:

In these stems the tse formed stem is the only one.

A few words have special consonant cluster simplification for ks forms.

nähdä has special h:k variation.

Verbs with -ne- stems

Vowel lengthening(?) stems

Vowel stems with t:s variation

Verbs with defective paradigms

For some verbs, the normative inflection does not allow full set of forms:

Verbs with exceptional inflection patterns

There are a handful of verbs that do not fit the patterns of old dictionaries.

The verb olla has a very peculiar and heteroclitic inflection with a lot of common short forms in standard spoken Finnish:

Verb inflection proper

Present vowel stems

The strong form of present indicative endings in strong stems and ma infinitive, and maisilla derivation (nee infinitive).

Verb 3rd singular forms

The third singular form of the present tense has a few allomorphs according to the preceding vowel context, either lengthening or zero after a long vowel stem:

Past forms

Imperatives

Conditionals

Potentials

Passive forms

The passive forms usually contain -ta-, -tä-, -da-, -dä-, element in them. The variation between the realisations is one key factor of determining the classification of the verb roots.

The form of present passive assimilates leftwards, varying between -ta, -tä, -da, -dä, -la, -lä, -ra, -rä, -na, -nä.
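
The assimilation can be sketched as follows for consonant stems and long vowel stems. This is an illustrative Python fragment, not the rule set of the analyser: `passive_present` is a hypothetical name, the stems are assumed to be given without the infinitive ending (tul-, pur-, men-, juos-, saa-), and short vowel stems are left out because they also involve gradation and stem vowel changes.

```python
# Hypothetical helper for illustration; not part of the lexc sources.
VOWELS = set("aeiouyäö")


def passive_present(stem):
    """Build the present passive from a consonant or long vowel stem."""
    harmony = "a" if any(ch in "aou" for ch in stem) else "ä"
    last = stem[-1]
    if last in "lrn":
        marker = last          # assimilated: -la, -ra, -na
    elif last == "s":
        marker = "t"           # -ta after s
    elif last in VOWELS and len(stem) >= 2 and stem[-2] in VOWELS:
        marker = "d"           # -da after a long vowel or diphthong
    else:
        raise ValueError("stem type not covered by this sketch")
    # Passive marker with harmony vowel, then the personal ending -Vn.
    return stem + marker + harmony + harmony + "n"


for stem in ["tul", "pur", "men", "juos", "saa", "syö"]:
    print(stem, passive_present(stem))
```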

Infinite verb forms

Participles

Possessives for infinite verb forms

Verb clitics

Deverbal derivations

Part of the deverbal derivation system in Finnish is so regular that it has been included as part of inflectional morphology in many traditional systems. These derivations are treated as inflection in our system as well.

-minen, “Fourth infinitive”

Participles


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-compounding.lexc.md

Prefixing and compounding

Prefixes are not put here so far

The circular lexicon

The compound part sub-set NOMINAL

The nominal forms can be used as non-initial parts of typical compounds


This (part of) documentation was generated from src/fst/morphology/compounding.lexc


src-fst-morphology-phonology.twolc.md

This file documents the phonology.twolc file


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Morphology

The morphological division of Finnish words has three classes: verbal, nominal and others. The verbs are identified by personal, temporal, modal and infinite inflection. The nominals are identified by number and case inflection. The others are, apart from being the rest, identified by defective or missing inflection.

Symbols used for analysis Multichar_Symbols

The Finnish morphological implementation uses analysis symbols mainly to encode morphological analyses; the rest are implemented elsewhere. Some non-morphological analyses or classifications are retained for interoperability and historical reasons. There are further details and examples of this classification in other parts of this document; this page merely summarises the codes used in this version of the system.

Parts-of-speech

The main morphological division of words is merely: Verbs, Nominals, Rest. The syntactic and semantic subdivision is realised in POS tags. The nominals consist of nouns (substantiivi), adjectives, pro words and numerals. Verbs are non-divisible, but include infinitive and participle forms. The others are subdivided into adpositions, adverbs and particles. Further reading: [VISK s.v. sanaluokka|http://scripta.kotus.fi/visk/visk_termit.cgi?h_id=sCABBIDAI], VISK § 438.

Temporary list of added tags

These tags were added as part of a tag unification, and should be put where they belong.

Are there tags not declared in root.lexc or misspelled? Have a look at these:

Parts of speech

The part-of-speech analyses are typically the first:

Nouns

In nominal analyses, the proper nouns have an additional subanalysis. Proper nouns are usually written with initial capitals, or more recently with totally arbitrary capitalisations, such as in the brand names nVidia and ATi. Proper nouns do have full inflectional morphology exactly as other nouns, but work slightly differently in derivation and compounding. Some capitalised nouns may also lose capitalisation in derivation. VISK § 98

The code for proper nouns:

Proper noun tag follows noun analysis:

Pronouns

Pronominal analyses have some semantic classes. VISK § 101–104. Codes for various semantic classes:

Semantic tags follow pronoun analyses:

Numerals

In numeral analyses, there are multiple analyses. The numerals have semantic subcategories (VISK § 770). The classical ordinal numbers have been adjectivised in current descriptions (VISK § 771), the ordinal interpretation is still spelled out in subcategories. The numbers are often written with digits or other specific notations. Numeral class tags:

Particles

The particles are subcategorised syntax-wise into conjunctions for all words that govern subclauses (VISK § 812). The conjunctions are further divided according to whether the subclause is coordinate or subordinate to the governing clause, and into a few other syntactic types (VISK § 816). N.B. the division into subordinating and coordinating conjunctions is motivated by other systems, including legacy systems, whereas the grammar also presents different categorisations for conjunctions (including naming subordination adverbials). Conjunction syntax tags:

The conjunction tags take place of part-of-speech tags for legacy reasons:

Adpositions

In adposition analyses, the syntactic tendencies are shown in sub-analyses: whether they typically appear before or after their heads (VISK § 687).

Adposition syntax tags:

Adpositions are tagged in POS position:

Tags for sub-POS

Bound root morphs

The lexical items that appear as bound morphemes before head word are classified as prefixes ([VISK § 172|http://scripta.kotus.fi/visk/sisallys.php?p=172]). Prefixes are rare and mostly of foreign origin. The singular forms of plurale tantums are also potential prefixes.

Suffixes are typically word forms or derivations that only appear as bound morphs. Other than that Finnish does not really have proper suffixes. This means that suffixed words are in effect compounds of where the last word just doesn’t appear as free morph.

Symbols

Symbols are not part of linguistic data per se so we classify them according to the needs of end user applications

The analyses for symbols are like POSes:

Nominal analyses

The analyses of nominals show the inflection in number. Nominals inflect in number, to mark plurality of the word. The number for nouns is either singular or plural. Further reading: VISK § 79 Number tags:

Number tags are next to POSes in nominal analyses, and in order of morphs:

The analyses of nominals have case inflection marked. The nominals have case inflection (VISK § 81) to mark syntactic roles (nominative, partitive, accusative-genitive) and semantics (others, partially even syntactic cases).

The case is next to number and last obligatory analysis in nominals:

The analyses of a infinitive short form have lative ending; this is largely historical (VISK § 120). Some adpositions might have same analysis in diachronic analyses.

The analyses of certain nominals give explicit analysis for accusative case. The accusative case has distinctive marker in few pronouns and these are only cases that are analysed as accusatives. (VISK § 81). Other accusatives have the same case marking as genitive form, and only use that analysis in synchronic analyses.

Adverbs and adpositions may have some special analyses in diachronic analyses. Further reading: [VISK § 371|http://scripta.kotus.fi/visk/sisallys.php?p=371] – 385

Possessives

The analyses of nominals include the possessive if present. The possessive ending indicates ownership. The possessive can take six possible values from singular and plural, first, second and third person references, where the third person form is always ambiguous over plurality. Further reading: VISK § 95

Compound forms

In compound analyses, the derived compound form that is not a free morph is marked with special analysis. Some words have forms only appearing in compounds. Further reading: VISK § 406 Compound form

Finite verbs

All verb analyses contain voice marking. For finite verb forms active voice is tied to personal forms and passive voice to non-personal verb endings. The voice is also marked in the infinite verb forms. Further reading: VISK § 110

It is the first analysis of verb strings:

Finite verb form analyses have a reading for tense. The tense has two values. For moods other than indicative the tense is not distinctive in surface form, and therefore not marked in the analyses. The morphologically distinct forms in Finnish are only past and non-past tenses, while other are created syntactically and not marked in morphological analyses. Further reading: VISK § 111 – 112

The tense is marked in indicative forms after mood:

Finite verb form analyses have a reading for mood. Mood has four central readings and few archaic and marginal. The mood is marked in analyses for all finite forms, even the unmarked indicative. Further reading: VISK § 115 – 118

The mood is after voice in the analysis string and in morph order:

Finite verb form analyses have a reading for person. Personal ending of verb defines the actors. The person analysis has seven possible values, six for the singular and plural groups of first, second and third person forms, and one specifically for passive. The passive personal form is encoded as fourth person passive, which had been the common practice in past systems. Further reading: VISK § 106 – 107

The person is the last required analysis for verbs, after the mood:
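
The ordering constraints stated in this section (voice first, then mood, tense only in the indicative, person last) can be summarised with a short Python sketch. The tag spellings used here (`V`, `Act`, `Ind`, `Prs`, `Sg1` and so on) and the function `verb_analysis` are assumptions made for illustration; the actual tag inventory is the one declared in root.lexc.

```python
# Hypothetical helper; tag names are illustrative assumptions only.
def verb_analysis(lemma, voice, mood, person, tense=None):
    """Assemble a finite verb analysis string in the documented tag order."""
    if mood == "Ind" and tense is None:
        raise ValueError("indicative analyses carry a tense tag")
    if mood != "Ind" and tense is not None:
        raise ValueError("tense is only marked in the indicative")
    tags = ["V", voice, mood]      # POS, then voice, then mood
    if tense is not None:
        tags.append(tense)         # tense follows mood
    tags.append(person)            # person is the last required tag
    return lemma + "".join("+" + tag for tag in tags)


print(verb_analysis("tulla", "Act", "Ind", "Sg1", tense="Prs"))
print(verb_analysis("tulla", "Act", "Cond", "Sg3"))
```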

Negation and verbs

The analyses of verb for the forms that require negation verb have a special analysis for it.

The suitable negation verbs have sub-analysis that can be matched to negated forms on syntactic level.

Infinite verb forms

Infinitive verb forms have infinitive or nominal derivation analyses. In traditional grammars the infinitive forms were called I, II, III, IV and V infinitive; the modern grammar replaces the first three with A, E and MA respectively. The IV infinitive, which has the minen suffix marker, has been re-analysed as derivational and this is reflected in omorfi. The V infinitive is also assumed to be mainly derivational, but is included here for reference. Further reading: VISK § 120 – 121. The infinitives have limited nominal inflection.

Infinitive analysis comes after voice, followed by nominal analyses:

Participles

Participial verb forms have participle readings. There are 4 participle forms. Like infinitives, participles in traditional grammars were named I and II where NUT and VA are used in modern grammars. The agent and negation participles have sometimes been considered outside regular inflection, but in modern Finnish grammars they are alongside other participles and so they are included in inflection in omorfi as well. In some grammars the NUT and VA participles have been called past and present participles respectively, drawing parallels from other languages. The modern grammar avoids these terms as misleading, but this description uses them. Further reading: VISK § 122

Participle analyses are right after voice, followed by adjectival analyses:

There are a number of implementations that mix up MA infinitives and agent participles; they share part of the same forms but no semantics and very little syntax.

Comparation

Adjective and some adverbial analyses are marked for comparation. The marked forms are comparative and superlative. For adjectives, comparative suffixes precede the nominal inflection. Cf. VISK § 300

The comparison analysis occupies derivation spot, after POS:

Enclitic focus particles

All word forms can have clitics which are analysed by their orthography. Clitics are suffixes which can attach almost anywhere in the ends of words, both verb forms and nominals. They also attach on end of other clitics, theoretically infinite chains. In practice it is usual to see at most three in one word form. Two clitics have limited use: -s only appears in few verb forms and combined to other clitics and -kA only appears with few adverbs and negation verb. VISK § 126 – 131

Derivation

The derivation is not a central feature of this morphology, it is mainly used to collect new roots for dictionaries. This is roughly in order of perceived productivity already:

Usage

The analyses of some words and word-forms indicate limitedness of usage. This includes common misspellings, archaic words and forms and otherwise rare words and forms. In particular, the forms that are in parentheses in lexical sources and word-forms that had a usage annotation there have been carried over.

Usage tags are pushed wherever appropriate:

Homonym tags

Dialects

The informal language use contains different Finnish than the literary standard; this is marked as standard dialect (yleispuhekieli). Common features include dropping final vowels, dropping the final i components of unstressed diphthongs, and a few other shortenings. Other dialects are also sometimes analysed. The geographical division has three levels: East versus West, with East containing Savo and South-East (North?), and West containing North, perä-, keski- and eteläpohjalaiset, South-West and Häme. The third-level dialect division is traditionally by “town” borders. Be cautious when adding these, though; it is not the main target of this morphology.

Tags for language of unassimilated name

Compounding tags

The tags are of the following form:

This entry / word should be in the following position(s):

If unmarked, any position goes.

The tagged part of the compound should make a compound using:

Unmarked = Default, i.e. +CmpN/SgN for SME.

The second part of the compound may require that the previous (left part) is:

These tags describe the parts of the compound.

The prefix (before “/”) is Cmp.

Others

The boundaries of compounds that are not lexicalised in the dictionary will have compound analyses, the compounds may also have usage tags. The compounding analyses concern also syntagmatic melting mishmash.

Compound boundary

The word and morpheme boundaries are used to limit the effective range of far-reaching rules, such as vowel harmony. The boundaries are marked by curly bracketed hashes or underscores. The word boundaries are marked by #, the lexical item boundaries by ##, the inflectional morpheme boundaries by >, the derivational morpheme boundaries by », and some etymological and soft boundaries by _.
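
To make the role of the boundary symbols concrete, the following Python sketch splits a lexical-level string at the markers listed above and removes them again. It assumes the markers appear literally as `#`, `##`, `>`, `»` and `_` (without the curly brackets), and both function names and the example string are hypothetical.

```python
import re

# Hypothetical helpers for illustration; not part of the lexc sources.
# "##" must be tried before "#" so that it is matched as one marker.
BOUNDARY = re.compile(r"(##|#|>|»|_)")


def split_on_boundaries(lexical):
    """Return the morphs and boundary markers as a flat list."""
    return [part for part in BOUNDARY.split(lexical) if part]


def strip_boundaries(lexical):
    """Remove all boundary markers from the string."""
    return BOUNDARY.sub("", lexical)


print(split_on_boundaries("puu#talo>ssa"))  # ['puu', '#', 'talo', '>', 'ssa']
print(strip_boundaries("puu#talo>ssa"))     # puutalossa
```

A rule such as vowel harmony would then only look back as far as the nearest `#` when choosing the suffix vowel.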

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics. They only allow compounds with verbs if the verb is further derived into a noun again:

| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |

@C.ErrOrth@
@D.ErrOrth.ON@
@P.ErrOrth.ON@

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
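
For readers unfamiliar with flag diacritics, the Python sketch below simulates how a path through the lexicon is accepted or rejected. It implements the commonly documented meaning of the P (set), D (disallow), C (clear), U (unify) and R (require) operators; the N operator is left out. The function `path_allowed` is hypothetical, and where exactly the NeedNoun flags sit in the Finnish lexicon is an assumption made for the example.

```python
import re

# Hypothetical simulator for illustration; not part of the lexc sources.
FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def path_allowed(symbols):
    """Return True if the flag diacritics along a path are all satisfied."""
    state = {}
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue                      # ordinary symbol, no effect
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":                     # positive set
            state[feature] = value
        elif op == "C":                   # clear
            state.pop(feature, None)
        elif op == "D":                   # disallow
            if value is None and current is not None:
                return False
            if value is not None and current == value:
                return False
        elif op == "R":                   # require
            if value is None and current is None:
                return False
            if value is not None and current != value:
                return False
        elif op == "U":                   # unify
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# Assumed placement: the verb sets the flag, a nominalising
# derivation clears it, and the compound continuation disallows it.
print(path_allowed(["@P.NeedNoun.ON@", "juosta", "@D.NeedNoun.ON@"]))
print(path_allowed(["@P.NeedNoun.ON@", "juosta", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```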

The start of the dictionary: Root

The Finnish morphological description starts from any of the part-of-speech dictionaries, a prefix or a hyphenated suffix


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-adjectives.lexc.md

Adjective classification

Adjectives are words that are inflected like nouns, with a few additions. For adjectives, the comparative derivations are usually allowed and the possessive suffixes are unlikely. The syntactic adjectives that do not have comparative derivations are nouns, if they have nominal inflection, or particles, if they do not inflect. The examples you need to find the correct classification are the same as for nouns, with the addition of the comparative and superlative.

The classification of adjectives combines the stem changes, the final allomorph selection and the harmony. See the list from:


This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


src-fst-morphology-stems-adverbs.lexc.md

Adverb classification

Adverbs are a heterogeneous mass of words with defective inflection, usually sourced from various forms of nominals. It would be possible to classify adverbs along etymology and semantics, but we do not yet use such a classification. Only the morphology is recorded in the continuation classes and analyses.

The classification of the adverbs in morphology goes along the possessives and clitics they take or require:


This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


src-fst-morphology-stems-conjunctions.lexc.md

Coordinating conjunctions

Coordinating conjunctions combine equal clauses and phrases. As a subset of particles, they do not inflect. The classification is solely syntactic and semantic, but it is used in this system for compatibility with other systems.

The coordinating conjunctions are: eli(kkä), ja, joko – tai, kuin – myös, ‑kä, mutta, niin – kuin ‑kin/ myös, sekä, sekä – että, sun; tai, vaan, vai, ynnä, (saati), (sillä). Further reading: VISK § 816


This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


src-fst-morphology-stems-determiners.lexc.md

Determiners

Finnish doesn’t traditionally have determiners. Some claim that a few pronouns are used like determiners, and can be analysed as such.


This (part of) documentation was generated from src/fst/morphology/stems/determiners.lexc


src-fst-morphology-stems-digits.lexc.md

Digits and such expressions

Digit-strings are used in place of numerals. They inflect with colon, like acronyms, and compound with hyphen only.

Digits are constructed as several cyclic structures: integers, decimals or roman numerals. Zero alone works quite differently:

**LEXICON ARABICLOOP_pirinen ** essentially allows any number-sign combination, but is like the other lgs

**LEXICON ARABICLOOP_pirinen ** is for entries not looping back

The digit strings that end in 10 to 12 + 6n 0’s are inflected alike:

The digit strings that end in 6 to 9 + 6n 0’s are inflected alike:

Decimal digit strings start with any number of digits 0 to 9, followed by decimal separator comma. The decimal dot may be allowed as substandard variant.

The decimal digit strings end in any number of digits 0 to 9, inflected along the last part.

The decimal digit strings with dot may be allowed as sub-standard option with respective analysis.
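
The digit-string types described above can be told apart with a single regular expression. The Python sketch below is only an approximation of what the lexicon accepts: the function `parse_digit_token`, the labels it returns, and the requirement of at least one digit on each side of the separator are assumptions made for illustration.

```python
import re

# Hypothetical helper for illustration; not part of the lexc sources.
DIGIT_TOKEN = re.compile(r"(\d+)(?:([,.])(\d+))?(?::(\w+))?")


def parse_digit_token(token):
    """Classify a digit string and return (kind, inflectional suffix)."""
    match = DIGIT_TOKEN.fullmatch(token)
    if match is None:
        return None
    _, separator, _, suffix = match.groups()
    if separator is None:
        kind = "integer"
    elif separator == ",":
        kind = "decimal"                  # comma is the standard separator
    else:
        kind = "decimal-substandard"      # dot as sub-standard variant
    return kind, suffix                   # suffix follows the colon, if any


print(parse_digit_token("12:een"))
print(parse_digit_token("3,14"))
print(parse_digit_token("3.14"))
```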

Roman numerals with inflection

Roman numerals are composed of the symbols M, D, C, L, X, V, I (listed here in descending order of value) and some combinations of them; they denote ordinal numbers and inflect like them.

Main lexicon for roman digits

This lexicon divides into four groups

Roman numerals according to digital class, one by one

Roman thousands

Thousands can be followed by any of other parts

Roman hundreds

Hundreds can be followed by anything but thousands:

Roman tens

Tens can be followed by ones:

Roman ones

Ones come alone
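
The cascade of thousands, hundreds, tens and ones can be mirrored with one regular expression group per digital class. This Python sketch uses the conventional subtractive notation (CM, XL, IV and so on) and a limit of three thousands, which are assumptions; the lexicon may accept a slightly different set of strings, and `roman_parts` is a hypothetical name.

```python
import re

# Hypothetical helper for illustration; not part of the lexc sources.
ROMAN = re.compile(
    r"(M{0,3})"                # thousands: may be followed by any other part
    r"(CM|CD|D?C{0,3})"        # hundreds: followed by tens or ones
    r"(XC|XL|L?X{0,3})"        # tens: followed by ones
    r"(IX|IV|V?I{0,3})"        # ones: come alone
)


def roman_parts(text):
    """Split a roman numeral into its four digital classes, or return None."""
    if not text:
        return None
    match = ROMAN.fullmatch(text)
    return match.groups() if match else None


print(roman_parts("MCMXCIV"))  # ('M', 'CM', 'XC', 'IV')
print(roman_parts("XIV"))      # ('', '', 'X', 'IV')
print(roman_parts("IIII"))     # None
```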


This (part of) documentation was generated from src/fst/morphology/stems/digits.lexc


src-fst-morphology-stems-exceptions.lexc.md

Exceptions are quite strange word-forms, the ones that do not fit anywhere else. This file contains all enumerated word forms that cannot reasonably be created from lexical data by regular inflection. Usually there should be next to no exceptions; it is always better to have a paradigm that covers only one or a few words than an exception, since exceptions will not work nicely with e.g. the compounding scheme or possibly many end applications.

negation verb has partial inflection:

Some verbs only have few word-forms left:

The noun ruoka has irregular forms:

The adjective hyvä has heteroclitic comparative derivations too:

Some of the nouns have archaic consonant stem forms left:

A few verbs have shortened forms in standard spoken Finnish


This (part of) documentation was generated from src/fst/morphology/stems/exceptions.lexc


src-fst-morphology-stems-fin-abbreviations.lexc.md

Continuation lexicons for Finnish abbreviations

Abbreviations are shortened forms that do not inflect. They have whatever classification they would have if they were read out, mostly that of nouns. A lot of abbreviations end in a full stop, which may complicate analysis and tokenisation in real-world applications


This (part of) documentation was generated from src/fst/morphology/stems/fin-abbreviations.lexc


src-fst-morphology-stems-fin-acronyms.lexc.md

Acronym classification

Acronyms are shortenings that inflect. They all have two inflection patterns, one read letter by letter, and one word-by-word. They are separate entries in this dictionary. For example OY will have singular illatives OY:hyn and OY:öön for yyhyn and yhtiöön resp., although the latter is much rarer. A big majority of popular acronyms in everyday use come from English, and the word-based inflection is virtually non-existent and would be very confusing, so there’s no high priority for adding that.

The first classification for acronyms should be along the final letter, then if the final word inflection is used, the class of that word.
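
The letter-by-letter pattern can be illustrated for the singular illative. The Python sketch below covers only letters whose Finnish names end in a long vowel (such as Y, read yy), where the illative is h plus the name vowel plus n; the table, the function `acronym_illative` and the decision to reject other final letters are assumptions made for this example.

```python
# Hypothetical helper for illustration; not part of the lexc sources.
# Finnish letter names that end in a long vowel.
LETTER_NAMES = {
    "A": "aa", "B": "bee", "C": "see", "D": "dee", "E": "ee",
    "G": "gee", "H": "hoo", "I": "ii", "J": "jii", "K": "koo",
    "O": "oo", "P": "pee", "Q": "kuu", "T": "tee", "U": "uu",
    "V": "vee", "Y": "yy", "Ä": "ää", "Ö": "öö",
}


def acronym_illative(acronym):
    """Singular illative of an acronym read letter by letter."""
    name = LETTER_NAMES.get(acronym[-1].upper())
    if name is None:
        raise ValueError("final letter not covered by this sketch")
    # Long vowel names take the h illative: yy -> yyhyn, hence OY:hyn.
    return f"{acronym}:h{name[-1]}n"


for acronym in ["OY", "EU", "TV", "USA"]:
    print(acronym_illative(acronym))
```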


This (part of) documentation was generated from src/fst/morphology/stems/fin-acronyms.lexc


src-fst-morphology-stems-interj.lexc.md

Interjections

Interjections are mainly parts of spoken language that are minimal turns in dialogue, curses, onomatopoeia and such. Interjections are a subset of particles, and do not inflect. They are quite productive, though limited in form; they stem from arbitrary combinations of characters to

Only add new interjections that are found from corpora.


This (part of) documentation was generated from src/fst/morphology/stems/interj.lexc


src-fst-morphology-stems-nouns.lexc.md

Nouns and their classification

Noun is the part-of-speech for words which require declension in number and case. Additionally, nouns may have optional possessive suffixes and clitics combined freely at the end. While some of the nouns may exhibit limited comparative derivations, generally words that can undergo comparation must be classified as adjectives. The proper nouns that are written with initial capital letters except when derived are handled separately under proper nouns, but the classification is the same.

The nominals are classified by combination of the stem variations, suffix allomorphs and the vowel harmony. The nouns have number, case, possessive and clitic suffixes:

naan+N:naan also naan is an Indian bread with NOUN_PUNK paradigm


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


src-fst-morphology-stems-numerals.lexc.md

Finnish Numerals

Numerals have been split in three sections, the compounding parts of cardinals and ordinals, and the non-compounding ones:

The compounding parts of cardinals are the number multiplier words.

The suffixes only appear after cardinal multipliers

The compounding parts of ordinals are the number multiplier words.

The suffixes only appear after cardinal multipliers

There is a set of numbers or corresponding expressions that work like them, but are not basic cardinals or ordinals:

Numeral stem variation

Numerals follow the same stem variation patterns as nouns, some of these being very rare to extinct for nouns.


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-morphology-stems-particles.lexc.md

Particles

The particles are all words that do not inflect at all. For compatibility reasons subsets of particles have been set off to classes like conjunctions, adverbs, adpositions, and interjections. The ones that are not in those classes are left here as particles.

Examples:


This (part of) documentation was generated from src/fst/morphology/stems/particles.lexc


src-fst-morphology-stems-pp.lexc.md

Adpositions

Adpositions are morphologically nominals that have defective inflection patterns. Some of them come from forms of nominals that are no longer used. The adpositions are classified according to whether or not they take possessive clitics. They also have slight syntactic and semantic differences. The syntactic differences are coded in the analyses to be compatible with other languages, but for most intents and purposes all adpositions can appear in both syntactic positions, after and before the head word.

Examples:


This (part of) documentation was generated from src/fst/morphology/stems/pp.lexc


src-fst-morphology-stems-prefixes.lexc.md

Prefixes

Prefixes are bound morphs that can appear at the beginning of compounds; they are mostly forms of nominals. Finnish has almost no real prefix morphemes.


This (part of) documentation was generated from src/fst/morphology/stems/prefixes.lexc


src-fst-morphology-stems-pron.lexc.md

Pronouns

Pronouns are a closed special subclass of nouns. Morphologically, pronouns often have defective, heteroclitic or otherwise irregular inflectional patterns, and certain pronouns have a morphophonologically distinct accusative case that is extinct in other noun classes. Further reading: [VISK §§ 100|http://scripta.kotus.fi/visk/sisalto.php?p=100] – 104, Semantics … VISK § 7XX

Pronouns are subdivided into categories by semantic and syntactic means. Semantic categories delimit the type of referents (human, sentient, object), qualification and quantification (interrogative, quantor). The proadjectives, which inflect and act like adjectives, form a morphosyntactically distinct class. There are six personal pronouns for the six deictic references used: first, second and third person, singular and plural. The personal pronouns have a separate accusative case marked by the suffix t. The pronouns in standard literary Finnish are minä (I), sinä (you), hän (he), me (we), te (you), he (they). Further reading: VISK § 100

The personal pronouns are among the most dialectally varied words of the Finnish language. The pronoun forms are one of the factors separating the eastern dialects from the western ones. The personal pronouns of the eastern dialects are mie, sie, (hää, hiä), myö, työ, hyö respectively, the third singular being rare in modern use. |citation-needed|

In the western dialects the pronouns are mää, sää for first and second singular, and more variedly meitti, teitti, heitti for plurals.

In standard spoken Finnish, and in many cases even in written form, the words mä and sä are more common than, and preferred to, the longer minä and sinä for first and second singular respectively. In practice the distinction is much like that between the corresponding Estonian pronouns, but the official norm still recommends only the long forms. For third singular the nominative form is hän, as in the standard written language; however, the inflection lacks the intervening -ne- part. In old literary Finnish and poetic language the forms ma and sa are still used.

There are six demonstrative pronouns for six non-personal references. In standard written Finnish these are tämä (this), tuo (that), se (it), nämä (these), nuo (those), ne (those).

Further reading: VISK § 101

In standard spoken Finnish the demonstrative pronouns are commonly tää, toi, nää, noi instead of tämä, tuo, nämä, nuo.

Interrogative pronouns are used in question clauses. The basic interrogatives in standard written Finnish are kuka (who), mikä (what), kumpi (which); millainen (what kind of), kuinka (how), miksi (what for). Further reading: VISK §734

The stem of kuka is shortened from kene to ke in spoken language.

A few forms of kuka based on the ken stem and the ku stem have become archaic. Further reading: VISK §102. The short form mi is also archaic and limited to poetic use. |citation-needed|

Relative pronouns are kuka, joka and mikä (which, whose). Further reading: VISK §735. They are morphologically indistinct from the corresponding interrogative pronouns.

Quantor pronouns correspond to existential and universal quantifiers and their negations. The generic quantors are joku (someone), jokin (something), jokainen (everyone), kaikki (everything), kukin (each one), kukaan (no one), mikään (nothing), jokunen, muutama, harva (a few), moni (many) and useampi (more). The dual quantors, which quantify over a set of two objects, are jompikumpi (either or), kumpikin, molemmat (both), kumpikaan (neither). VISK §740 The quantor pronouns subsume the class of indefinite pronouns used in older grammar definitions. VISK §742 The indefinite quantifiers are classified as indefinite quantors for the sake of compatibility. This covers joku, jokin, jompikumpi, as well as the specific eräs, muuan (some), yksi (one). Further reading: VISK §746 – 749.

The reflexive pronoun is the word itse, referring to self, usually but not always coupled with a possessive suffix to denote the referent. Further reading: VISK §729

The reciprocal pronoun is toinen, referring to each other. It uses a possessive suffix to delimit the reciprocal group. Further reading: VISK §732

Proadjectives are pronouns that act in place of adjectives syntactically. They are formed by compounds (or derivations) of pronoun and lainen or moinen (such as). Further reading: VISK §715

Proadverbs are the pronouns that have lexicalised into adverbs by their syntax and semantics. Further reading: VISK §715

Forms of jompi may not exist as free morphs. The marginal forms of monias are extinct. Oddly enough, the semireduplicative intensifier monituinen is nowhere to be found in VISK either.

Marginally in the pro-word category are nouns, adjectives and adverbs referring to equivalence in a comparative context, since they too otherwise lack meaning, like other pro-words. This group includes the words sama (same), eri (different), muu (other), toinen (another), and their derivations. Further reading: VISK §766

In spoken language the supposedly non-inflecting eri has common inflected forms.


This (part of) documentation was generated from src/fst/morphology/stems/pron.lexc


src-fst-morphology-stems-propernouns.lexc.md

Proper nouns

Proper nouns are a morphologically indistinct subset of nouns. They have some orthographical differences: required capitalisation and compounding with hyphens. Derivations may be lowercased. They may be classified semantically to match other giellatekno things in the future.

The proper nouns are classified and inflected along noun patterns; for details see [noun-stems.html].

Many proper nouns inflect like nouns… however, they compound differently


This (part of) documentation was generated from src/fst/morphology/stems/propernouns.lexc


src-fst-morphology-stems-punct.lexc.md

Other symbols

Punctuation characters detailed here are the characters that appear commonly in Finnish texts, but are not part of words or linguistic content. Punctuation marks control clause and sentence level annotations, and range from full stops and commas to brackets. While punctuation symbols might have limited use as inflecting units, the ones described here are punctuation symbols used for their primary purpose, in isolation. The norms governing punctuation belong to orthography, so the references for good usage are not the grammar but issues of the Kielikello journal on good Finnish language use. The most current issue on punctuation was [Kielikello 2/2006|http://arkisto.kielikello.fi/index.php?mid=2&pid=12&maid=110] (N.B. you may need to buy a subscription or route through university servers).

The primary punctuation marks are the sentence-final ones, which mark the end of a sentence. The most typical of these is the full stop, which ends neutral sentences. The exclamation mark and question mark end exclamative and questioning sentences respectively. An elliptic or unfinished sentence is ended with three successive full stops. In sloppy writing style it is common to use two, four or more full stops to mark an elliptic sentence. The Unicode compatibility character ellipsis has never been used for Finnish and must not be used. The same applies to other combinations of sentence-final punctuation marks; the most common of these have separate analyses.

The clause level punctuation marks are used at clause boundaries. The most typical of these is the comma. The colon and semicolon are also used this way. The clause boundary marks do not have separate semantics needed in applications, so they only have analyses as clause boundaries.

The brackets are used to offset portions of text in opening and closing pairs. The most common pair is round brackets. Others used in Finnish are square, curly and angle brackets, in somewhat decreasing order of commonness. The angle brackets are commonly replaced by the less-than sign for the opening bracket and the greater-than sign for the closing bracket. The bracketed question mark is used to indicate uncertainty and the bracketed exclamation mark to indicate surprise; both of these annotations are used within a sentence like other bracketed constructions.

The quotation marks are used to offset quotations. The typical ones in Finnish are the 9-shaped double quotation marks and apostrophes. Angle quotation marks can also be used, primarily in books and newspapers. It is possible to replace curly quotation marks with neutral typewriter ones where technology limits. It is also common to see foreign quotation marks or accent marks in place of quotation marks in sloppy writing style.

There are two different dashes in Finnish. The hyphen is used mainly word-internally and won’t appear on its own. The dash is used to offset some sentences or to mark elision. The dash symbol can be either of the Unicode dash symbols, or it can be replaced with a hyphen offset by spaces. In sloppy writing, two hyphens are often used in place of a dash.

The space is used to separate words. For most applications the space has separate meaning so it rarely gets used as a symbol in applications of

Less used symbols that appear in Finnish texts; these do not have special analyses. A slash can be used as a replacement for the meaning ‘or’, as a division slash or as a separator of verses in a poem. Backslash is used only in computer systems. Underscore is used only in computer systems. The pipe is used in dictionaries as a morpheme boundary, and in computer systems. At sign is used only in computer systems. An ampersand can be used as a replacement for the meaning ‘and’. Percent symbol is used after numeric expressions, meaning a 0.01 multiplier. Permille symbol is used after numeric expressions, meaning a 0.001 multiplier. § sign is used for numbering sections etc. Degree sign is used with measurements. The second and minute signs can be used in conjunction with the degree sign. The plus sign, specific minus sign and plus-minus signs are used in numeric expressions. The multiplication sign is used in numeric expressions. The equals sign is used in formulae. The asterisk is used as a marker for ungrammatical constructions and in computer and other expressions. Registered and trademark symbols are rarely used. Copyright symbol is rarely used. Hash sign is used in phones and computer systems. The doubled § sign has been used as a chapter range sign. The pilcrow sign can be used to mark chapters. The currency signs for euro, dollar, pound sterling, cent and yen can be used. I don’t think anyone uses the currency sign ¤ ever. The lines below this one are not from any referenced source


This (part of) documentation was generated from src/fst/morphology/stems/punct.lexc


src-fst-morphology-stems-subjunctions.lexc.md

Adverbial conjunctions

The adverbial conjunctions join two unequal clauses or phrases together. The traditional term for this is subordinating conjunction; it is assumed here for compatibility with other languages. Adverbial conjunctions are a subset of particles, so they do not inflect at all.

The adverbial conjunctions are: ellei, että, jahka, jollei, jos, joskin, jos kohta, jotta, koska, kun, kunhan, mikäli, vaikka, (kunnes). Further reading: VISK § 818


This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


src-fst-morphology-stems-suffixes.lexc.md

Suffixes

Suffixes are bound morphs that come after nominals in compounds. Finnish doesn’t quite have real suffixes; these are mostly compound parts.

Examples:


This (part of) documentation was generated from src/fst/morphology/stems/suffixes.lexc


src-fst-morphology-stems-verbs.lexc.md

Verbs are the words that inflect for tense, mood, personal suffixes and clitics, but verbs also have a so-called non-finite inflection pattern, which is basically nominal derivation. The dictionary entries of verbs are A-infinitive forms; there are no verbs in the dictionary that do not end in a or ä. Verbs are very distinct from other classes, so their classification is not difficult. The key to finding a unique class for a verb is to pick stems and suffixes from: indicative non-past 1st singular and 3rd singular, indicative past 1st singular, …

The auxiliary verbs require infinitive verbal phrase objects. Infinitives usually: aion tappaa, joudun kuolemaan

The verbs are classified along the stem mutations, suffix assimilation, and harmony:


This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

Number transcriptions

Transcribing numbers to words in Finnish is not completely trivial. One reason is that numbers in Finnish are written as compounds, regardless of length: 123456 is satakaksikymmentäkolmetuhattaneljäsataaviisikymmentäkuusi. Another complication is that inflection can be unmarked in running text; that is, a digit expression is assumed to agree with the case of the phrase it is in. For example, 27 is kaksikymmentäseitsemän and 27:lle is kahdellekymmenelleseitsemälle, but in the phrase “tarjosin 27 osanottajalle” 27 takes the allative case without marking, and this is the preferred grammatical form in good writing.

Flag diacritics

Flag diacritics in number transcribing are used to control case agreement: in Finnish numeral compounds all words agree in case, except in the nominative singular, where the multipliers that are powers of ten are in the singular partitive.
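The agreement pattern can be illustrated outside the FST with a small Python sketch. This is not the flag-diacritic implementation in the lexc source; the function names and the stem table are assumptions made only for illustration, and the sketch covers just 1–99 in the nominative and the allative.

```python
# Illustrative sketch only (hypothetical helper names, not the lexc implementation).
# Nominative forms and inflecting stems of the basic numerals 1-9.
NOMINATIVE = {1: "yksi", 2: "kaksi", 3: "kolme", 4: "neljä", 5: "viisi",
              6: "kuusi", 7: "seitsemän", 8: "kahdeksan", 9: "yhdeksän"}
STEM = {1: "yhde", 2: "kahde", 3: "kolme", 4: "neljä", 5: "viide",
        6: "kuude", 7: "seitsemä", 8: "kahdeksa", 9: "yhdeksä"}


def nominative(n: int) -> str:
    """Nominative singular of 1-99: the multiplier 'ten' is in the partitive (kymmentä)."""
    if not 1 <= n <= 99:
        raise ValueError("sketch covers 1-99 only")
    if n < 10:
        return NOMINATIVE[n]
    if n == 10:
        return "kymmenen"
    if n < 20:
        return NOMINATIVE[n - 10] + "toista"
    tens, ones = divmod(n, 10)
    return NOMINATIVE[tens] + "kymmentä" + (NOMINATIVE[ones] if ones else "")


def allative(n: int) -> str:
    """Allative of 1-99: every inflecting part of the compound carries -lle."""
    if not 1 <= n <= 99:
        raise ValueError("sketch covers 1-99 only")
    if n < 10:
        return STEM[n] + "lle"
    if n == 10:
        return "kymmenelle"
    if n < 20:
        # In the teens only the first part inflects; 'toista' stays as it is.
        return STEM[n - 10] + "lle" + "toista"
    tens, ones = divmod(n, 10)
    return STEM[tens] + "lle" + "kymmenelle" + (STEM[ones] + "lle" if ones else "")


print(nominative(27))  # kaksikymmentäseitsemän
print(allative(27))    # kahdellekymmenelleseitsemälle
```

In the FST the same effect is obtained by letting flag diacritics block paths where the parts disagree in case, rather than by generating each case with a separate function as here.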

Morphotactics of digit strings

The morphotactics of numbers and their transcriptions require knowing the length of the whole digit string in order to know where to start reading; zeroes are not read out but do affect the readout. The numerals are systematic and perfectly compositional: the implementation of 100 000–999 999 is almost exactly the same as that of 100 000 000–999 000 000 and everything afterwards, with the word changed to tuhat~tuhatta, miljoona~miljoonaa, miljardia, biljoonaa, biljardia and so forth, that is, following the long-scale British (French) system, where American billion = milliard etc. The numbers are built from roughly single-word blocks in decreasing order of magnitude, with the exception of zig-zagging over the numbers 11–19, where the second digit is read before the first. The rest of this documentation describes the morphotactic implementation by the lexicon structure in descending order of magnitude with examples.
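The compositional structure can be sketched procedurally in Python. This is only an illustration of the block-by-block readout in the nominative; the function names are hypothetical, the range is limited to below 10^12, and the lexc implementation itself works through continuation lexicons rather than arithmetic.

```python
# Illustrative sketch only (hypothetical names); nominative forms, 0 <= n < 10**12.
ONES = ["", "yksi", "kaksi", "kolme", "neljä", "viisi",
        "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]
# (form after a bare 1, partitive form after any other multiplier) per block of three digits
SCALES = [("", ""), ("tuhat", "tuhatta"),
          ("miljoona", "miljoonaa"), ("miljardi", "miljardia")]


def under_thousand(n: int) -> str:
    """Read out 1-999 as one compound; zero digits contribute nothing."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds == 1:
        parts.append("sata")
    elif hundreds > 1:
        parts.append(ONES[hundreds] + "sataa")
    if rest == 10:
        parts.append("kymmenen")
    elif 11 <= rest <= 19:
        # Zig-zag: the second digit is read before the first.
        parts.append(ONES[rest - 10] + "toista")
    else:
        tens, ones = divmod(rest, 10)
        if tens:
            parts.append(ONES[tens] + "kymmentä")
        if ones:
            parts.append(ONES[ones])
    return "".join(parts)


def number_to_finnish(n: int) -> str:
    """Read out n block by block in decreasing order of magnitude."""
    if not 0 <= n < 10**12:
        raise ValueError("sketch covers 0 <= n < 10**12 only")
    if n == 0:
        return "nolla"
    blocks = []
    while n:
        n, block = divmod(n, 1000)
        blocks.append(block)
    words = []
    for idx in reversed(range(len(blocks))):
        block = blocks[idx]
        if block == 0:
            continue  # an all-zero block is not read out
        singular, partitive = SCALES[idx]
        if idx == 0:
            words.append(under_thousand(block))
        elif block == 1:
            words.append(singular)
        else:
            words.append(under_thousand(block) + partitive)
    return "".join(words)


print(number_to_finnish(123456))
# satakaksikymmentäkolmetuhattaneljäsataaviisikymmentäkuusi
```

The same `under_thousand` helper is reused for every block, which mirrors the observation that the thousands, millions and milliards parts differ only in the scale word attached.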

Lexicon HUNDREDSMRD contains numbers 2-9 that need to be followed by exactly 11 digits: 200 000 000 000–999 999 999 999 this is to implement Nsataa…miljardia…

Lexicon CUODIMRD contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…miljardia…

Lexicon HUNDREDMRD is for numbers in range: 100 000 000 000–199 000 000 000 this is to implement sata…miljardia…

Lexicon TEENSMRD is for numbers with 11 000 000 000–19 000 000 000 this is to implement …Ntoista…miljardia…

Lexicon TEENMRD is for numbers with 11 000 000 000–19 000 000 000 this is to implement …Ntoista…miljardia…

Lexicon TENSMRD is for numbers with 20 000 000 000–90 000 000 000 this is to implement …Nkymmentä…miljardia…

Lexicon TENMRD is for numbers with 10 000 000 000–10 999 999 999 this is to implement …kymmenenmiljardia…

Lexicon LÅGEVMRD is for numbers with 20 000 000 000–90 000 000 000 this is to implement …Nkymmentä…miljardia…

Lexicon ONESMRD is for numbers with 1 000 000 000–9 000 000 000 this is to implement …Nmiljardia…

Lexicon MILJARD is for numbers with 1 000 000 000–9 000 000 000 this is to implement …Nmiljardia

Lexicon OVERMILLIONS is for the millions part of numbers greater than 1 milliard

Lexicon HUNDREDSM contains numbers 2-9 that need to be followed by exactly 8 digits: 200 000 000–999 999 999 this is to implement Nsataa…miljoonaa…

Lexicon CUODIM contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…miljoonaa…

Lexicon HUNDREDM is for numbers in range: 100 000 000–199 000 000 this is to implement sata…miljoonaa…

Lexicon TEENSM is for numbers with 11 000 000–19 000 000 this is to implement …Ntoista…miljoonaa…

Lexicon TEENM is for numbers with 11 000 000–19 000 000 this is to implement …Ntoista…miljoonaa…

Lexicon TENSM is for numbers with 20 000 000–90 000 000 this is to implement …Nkymmentä…miljoonaa…

Lexicon TENM is for numbers with 10 000 000–10 999 999 this is to implement …kymmenenmiljoonaa…

Lexicon LÅGEVM is for numbers with 20 000 000–90 000 000 this is to implement …Nkymmentä…miljoonaa..

Lexicon ONESM is for numbers with 1 000 000–9 000 000 this is to implement …Nmiljoonaa…

Lexicon MILJON is for numbers with 1 000 000–9 000 000 this is to implement …Nmiljoonaa

Lexicon UNDERMILLION is for numbers with 100 000–900 000 after milliards

Lexicon OVERTHOUSANDS is for the thousands part of numbers greater than 1 million

Lexicon HUNDREDST contains numbers 2-9 that need to be followed by exactly 5 digits: 200 000–999 999 this is to implement Nsataa…tuhatta…

Lexicon CUODIT contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa…tuhatta…

Lexicon HUNDREDT is for numbers in range: 100 000–199 000 this is to implement sata…tuhatta…

Lexicon TEENST is for numbers with 11 000–19 000 this is to implement …Ntoista…tuhatta…

Lexicon TEENT is for numbers with 11 000–19 000 this is to implement …Ntoista…tuhatta…

Lexicon TENST is for numbers with 20 000–90 000 this is to implement …Nkymmentä…tuhatta…

Lexicon TENT is for numbers with 10 000–10 999 this is to implement …kymmenentuhatta…

Lexicon LÅGEVT is for numbers with 20 000–90 000 this is to implement …Nkymmentä…tuhatta..

Lexicon ONEST is for numbers with 1 000–9 000 this is to implement …Ntuhatta…

Lexicon THOUSANDS is for numbers with 1 000–9 000 this is to implement …Ntuhatta

Lexicon THOUSAND is for the ones-tens-hundreds of numbers greater than thousand

Lexicon UNDERTHOUSAND is for numbers with 100–900 after thousands

Lexicon HUNDREDS contains numbers 2-9 that need to be followed by exactly 2 digits: 200–999 this is to implement Nsataa…

Lexicon CUODI contains numbers 2-9 that need to be followed by exactly this is to implement Nsataa

Lexicon HUNDRED is for numbers in range: 100–199

Lexicon TEENS is for numbers with 11–19 this is to implement …Ntoista

Lexicon TEEN is for numbers with 11–19 this is to implement …Ntoista

Lexicon TENS is for numbers with 20–90 this is to implement …Nkymmentä…

Lexicon LÅGEV is for numbers with 20–90 this is to implement …Nkymmentä

Lexicon JUSTTEN is for number 10 this is to implement …kymmenen

Lexicon ONES is for numbers with 1–9 this is to implement yksi, kaksi, kolme…, yhdeksän

Lexicon ZERO is for number 0 nolla

Lexicon LOPPU is to implement potential case inflection with a colon.
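Because the entry lexicon depends on how many digits follow the first one, the choice for the hundreds-level lexicons can be sketched as a lookup on string length. The Python below is only an illustration under that reading of the descriptions above; the function name `hundreds_lexicon` is hypothetical, and the mapping for a leading 1 (HUNDREDMRD, HUNDREDM, HUNDREDT, HUNDRED) is assumed to use the same digit counts as the 2-9 lexicons.

```python
from typing import Optional

# Number of digits that must follow the leading digit -> lexicon name suffix.
# 11 following digits: milliards, 8: millions, 5: thousands, 2: plain hundreds.
SCALE_SUFFIX = {11: "MRD", 8: "M", 5: "T", 2: ""}


def hundreds_lexicon(digits: str) -> Optional[str]:
    """Return the hundreds-level lexicon a digit string would start in, if any."""
    if not digits.isdigit():
        return None
    following = len(digits) - 1
    if following not in SCALE_SUFFIX:
        return None  # the string starts at a tens- or ones-level lexicon instead
    suffix = SCALE_SUFFIX[following]
    lead = digits[0]
    if lead == "1":
        return "HUNDRED" + suffix   # sata...
    if lead in "23456789":
        return "HUNDREDS" + suffix  # Nsataa...
    return None                     # a leading zero is not handled in this sketch


print(hundreds_lexicon("123456"))  # HUNDREDT
print(hundreds_lexicon("999"))     # HUNDREDS
```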


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


tools-grammarcheckers-grammarchecker.cg3.md

F I N N I S H G R A M M A R C H E C K E R

DELIMITERS

TAGS AND SETS

Tags

This section lists all the tags inherited from the fst, and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here, those set names are not visible in the output.

Beginning and end of sentence

BOS EOS

Parts of speech tags

N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT PUNCT

COMMA ¶

Tags for POS sub-categories

Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

Tags for morphosyntactic properties

Nom Acc Gen Ill Ine Com Ess Ess Tra Sg Pl

Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen PxSg1 PxSg2 PxSg3 PxDu1 PxDu2 PxDu3 PxPl1 PxPl2 PxPl3 Px

Comp Superl Attr Ord Qst IV TV Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

Err/Orth

Semantic tags

Sem/Act Sem/Ani Sem/Atr Sem/Body Sem/Clth Sem/Domain Sem/Feat-phys Sem/Fem Sem/Group Sem/Lang Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

HUMAN

PROP-ATTR PROP-SUR

TIME-N-SET

Syntactic tags

@+FAUXV @+FMAINV @-FAUXV @-FMAINV @-FSUBJ> @-F<OBJ @-FOBJ> @-FSPRED<OBJ @-F<ADVL @-FADVL> @-F<SPRED @-F<OPRED @-FSPRED> @-FOPRED> @>ADVL @ADVL< @<ADVL @ADVL> @ADVL @HAB> @<HAB @>N @Interj @N< @>A @P< @>P @HNOUN @INTERJ @>Num @Pron< @>Pron @Num< @OBJ @<OBJ @OBJ> @OPRED @<OPRED @OPRED> @PCLE @COMP-CS< @SPRED @<SPRED @SPRED> @SUBJ @<SUBJ @SUBJ> SUBJ SPRED OPRED @PPRED @APP @APP-N< @APP-Pron< @APP>Pron @APP-Num< @APP-ADVL< @VOC @CVP @CNP OBJ

-OTHERS SYN-V @X

## Sets containing sets of lists and tags

This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the sourcefile itself to inspect the sets, what follows here is an overview of the set types.

### Sets for Single-word sets

INITIAL

### Sets for word or not

WORD NOT-COMMA

### Case sets

ADLVCASE CASE-AGREEMENT CASE NOT-NOM NOT-GEN NOT-ACC

### Verb sets

NOT-V

### Sets for finiteness and mood

REAL-NEG MOOD-V NOT-PRFPRC

### Sets for person

SG1-V SG2-V SG3-V DU1-V DU2-V DU3-V PL1-V PL2-V PL3-V

### Pronoun sets

### Adjectival sets and their complements

### Adverbial sets and their complements

### Sets of elements with common syntactic behaviour

### NP sets defined according to their morphosyntactic features

### The PRE-NP-HEAD family of sets

These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression **WORD - premodifiers**.

### Postposition sets

### Border sets and their complements

Grammarchecker rules begin here

## Grammarchecker sets

## Grammarchecker rules

### Speller rules

### Agreement rules

#### regular congruence rules

### Negation verb rules

### Postposition rules

### L2 rules

### NP internal rules

### Punctuation rules

### Spacing errors

* * *

This (part of) documentation was generated from [tools/grammarcheckers/grammarchecker.cg3](https://github.com/giellalt/lang-fin/blob/main/tools/grammarcheckers/grammarchecker.cg3)

---

# tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

# Tokeniser for fin

Usage:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by `hfst-tokenise -a`

Unknowns are made of:

* lower-case ASCII
* upper-case ASCII
* select extended latin symbols
* ASCII digits
* select symbols
* Combining diacritics as individual symbols,
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

## Unknown handling

Unknowns are tagged ?? and treated specially with `hfst-tokenise`

hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-disamb-gt-desc.pmscript](https://github.com/giellalt/lang-fin/blob/main/tools/tokenisers/tokeniser-disamb-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

# Grammar checker tokenisation for fin

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://github.com/hfst/hfst/wiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.
Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1) unknown word-like forms, and
2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:

* known word in context
* unknown (OOV) token in context
* sequence of word and punctuation
* URL in context

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript](https://github.com/giellalt/lang-fin/blob/main/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript)

---

# tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

# TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc)

Then just:

```sh
make
echo "ja, ja" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

```sh
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

Pmatch documentation: <https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch>

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

* Punct contains ASCII punctuation marks
* The symbol after m-dash is soft-hyphen `U+00AD`
* The symbol following {•} is byte-order-mark / zero-width no-break space `U+FEFF`.

Whitespace contains ASCII white space and the List contains some unicode white space characters

* En Quad U+2000 to Zero-Width Joiner U+200D
* Narrow No-Break Space U+202F
* Medium Mathematical Space U+205F
* Word joiner U+2060

Apart from what's in our morphology, there are

1) unknown word-like forms, and
2) unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

* select extended latin symbols
* select symbols
* various symbols from Private area (probably Microsoft), so far:
  * U+F0B7 for "x in box"

TODO: Could use something like this, but built-in's don't include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, they get a default baseform equal to the wordform, but no tag to check, so it's safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get

* * *

This (part of) documentation was generated from [tools/tokenisers/tokeniser-tts-cggt-desc.pmscript](https://github.com/giellalt/lang-fin/blob/main/tools/tokenisers/tokeniser-tts-cggt-desc.pmscript)