Finnish NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-fin


Morphology

The morphological division of Finnish words has three classes: verbal, nominal and others. Verbs are identified by personal, temporal, modal and infinite inflection. Nominals are identified by number and case inflection. The others are, apart from being the rest, identified by defective or missing inflection.

Symbols used for analysis (Multichar_Symbols)

The Finnish morphological implementation uses analysis symbols mainly to encode morphological analyses; the rest are implemented elsewhere. Some non-morphological analyses or classifications are retained for interoperability and historical reasons. There are further details and examples of this classification in other parts of this documentation; this page merely summarises the codes used in this version of the system.

Parts-of-speech

The main morphological division of words is simply: verbs, nominals, rest. The syntactic and semantic subdivision is realised in POS tags. The nominals consist of nouns (substantiivi), adjectives, pro-words and numerals. Verbs are not subdivided, but include infinitive and participle forms. The others are subdivided into adpositions, adverbs and particles. Further reading: [VISK s.v. sanaluokka|http://scripta.kotus.fi/visk/visk_termit.cgi?h_id=sCABBIDAI], VISK § 438.

Temporary list of added tags

These tags were added as part of a tag unification, and should be put where they belong.

Are there tags not declared in root.lexc or misspelled? Have a look at these:

Parts of speech

The part-of-speech analyses are typically the first:

Nouns

In nominal analyses, proper nouns have an additional sub-analysis. Proper nouns are usually written with initial capitals or, more recently, with arbitrary capitalisation, as in the brand names nVidia and ATi. Proper nouns have full inflectional morphology exactly like other nouns, but work slightly differently in derivation and compounding. Some capitalised nouns may also lose their capitalisation in derivation. Further reading: VISK § 98.

The code for proper nouns:

Proper noun tag follows noun analysis:

Pronouns

Pronominal analyses have some semantic classes. VISK § 101–104. Codes for various semantic classes:

Semantic tags follow pronoun analyses:

Numerals

Numeral analyses carry several sub-analyses. The numerals have semantic subcategories (VISK § 770). The classical ordinal numbers have been adjectivised in current descriptions (VISK § 771), but the ordinal interpretation is still spelled out in subcategories. Numbers are often written with digits or other specific notations. Numeral class tags:

Particles

The particles are subcategorised syntactically: all words that govern subclauses are conjunctions (VISK § 812). The conjunctions are further divided according to whether the subclause is coordinate or subordinate to the governing clause, plus a few other syntactic types (VISK § 816). N.B. the division into subordinating and coordinating conjunctions is motivated by other systems, including legacy systems, whereas the grammar also presents different categorisations for conjunctions (including naming subordination adverbials). Conjunction syntax tags:

The conjunction tags take the place of part-of-speech tags for legacy reasons:

Adpositions

In adposition analyses, the syntactic tendencies are shown in sub-analyses: whether they typically appear before or after their heads (VISK § 687).

Adposition syntax tags:

Adpositions are tagged in POS position:

Tags for sub-POS

Bound root morphs

The lexical items that appear as bound morphemes before the head word are classified as prefixes ([VISK § 172|http://scripta.kotus.fi/visk/sisallys.php?p=172]). Prefixes are rare and mostly of foreign origin. The singular forms of pluralia tantum are also potential prefixes.

Suffixes are typically word forms or derivations that only appear as bound morphs. Other than that, Finnish does not really have proper suffixes. This means that suffixed words are in effect compounds where the last word just doesn’t appear as a free morph.

Symbols

Symbols are not part of linguistic data per se, so we classify them according to the needs of end-user applications.

The analyses for symbols are like POSes:

Nominal analyses

The analyses of nominals show inflection in number, which marks the plurality of the word. The number for nouns is either singular or plural. Further reading: VISK § 79. Number tags:

Number tags are next to POSes in nominal analyses, and in order of morphs:

The analyses of nominals have case inflection marked. The nominals have case inflection (VISK § 81) to mark syntactic roles (nominative, partitive, accusative-genitive) and semantics (others, partially even syntactic cases).

The case is next to number and last obligatory analysis in nominals:
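As a rough illustration of the ordering described above (POS first, then number, then case as the last obligatory analysis), the following Python sketch splits a nominal analysis string into its fields. The tag names used here (`N`, `Prop`, `Sg`, `Pl`, the case abbreviations and `PxSg1`) and the `lemma+Tag+Tag` string layout are assumptions for illustration only; the authoritative tag inventory is the one declared in root.lexc.

```python
# Hypothetical tag inventories, chosen for illustration only.
NUMBER_TAGS = {"Sg", "Pl"}
CASE_TAGS = {
    "Nom", "Gen", "Par", "Acc", "Ine", "Ela", "Ill", "Ade",
    "Abl", "All", "Ess", "Tra", "Abe", "Com", "Ins", "Lat",
}


def parse_nominal(analysis):
    """Split 'lemma+POS[+subPOS...]+Number+Case[+rest...]' into fields."""
    lemma, *tags = analysis.split("+")
    if not tags:
        raise ValueError("analysis has no tags: %r" % analysis)
    pos = tags[0]
    # The number tag follows the POS (and any sub-POS tag such as Prop).
    number_idx = next(
        (i for i, tag in enumerate(tags) if i >= 1 and tag in NUMBER_TAGS),
        None,
    )
    if number_idx is None:
        raise ValueError("no number tag after POS: %r" % analysis)
    # The case tag must come directly after the number tag.
    case_idx = number_idx + 1
    if case_idx >= len(tags) or tags[case_idx] not in CASE_TAGS:
        raise ValueError("no case tag after number: %r" % analysis)
    return {
        "lemma": lemma,
        "pos": pos,
        "sub_pos": tags[1:number_idx],
        "number": tags[number_idx],
        "case": tags[case_idx],
        "rest": tags[case_idx + 1:],  # e.g. possessives, clitics
    }
```

For example, `parse_nominal("talo+N+Pl+Ine")` yields the lemma `talo` with POS `N`, number `Pl` and case `Ine`; anything after the case, such as a possessive tag, ends up in `rest`.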

The analyses of an infinitive short form have a lative ending; this is largely historical (VISK § 120). Some adpositions might have the same analysis in diachronic analyses.

The analyses of certain nominals give an explicit analysis for the accusative case. The accusative case has a distinctive marker in a few pronouns, and these are the only cases that are analysed as accusatives (VISK § 81). Other accusatives have the same case marking as the genitive form, and only use that analysis in synchronic analyses.

Adverbs and adpositions may have some special analyses in diachronic analyses. Further reading: [VISK § 371|http://scripta.kotus.fi/visk/sisallys.php?p=371] – 385

Possessives

The analyses of nominals include the possessive if present. The possessive ending indicates ownership. The possessive can take six possible values from singular and plural, first, second and third person references, where the third person form is always ambiguous over plurality. Further reading: VISK § 95

Compound forms

In compound analyses, the derived compound form that is not a free morph is marked with special analysis. Some words have forms only appearing in compounds. Further reading: VISK § 406 Compound form

Finite verbs

All verb analyses contain voice marking. For finite verb forms active voice is tied to personal forms and passive voice to non-personal verb endings. The voice is also marked in the infinite verb forms. Further reading: VISK § 110

It is the first analysis of verb strings:

Finite verb form analyses have a reading for tense. The tense has two values. For moods other than indicative the tense is not distinctive in surface form, and is therefore not marked in the analyses. The only morphologically distinct tenses in Finnish are past and non-past, while the others are created syntactically and not marked in morphological analyses. Further reading: VISK § 111 – 112

The tense is marked in indicative forms after mood:

Finite verb form analyses have a reading for mood. Mood has four central readings and a few archaic and marginal ones. The mood is marked in analyses for all finite forms, even the unmarked indicative. Further reading: VISK § 115 – 118

The mood is after voice in the analysis string and in morph order:

Finite verb form analyses have a reading for person. The personal ending of a verb defines the actors. The person analysis has seven possible values: six for the singular and plural groups of first, second and third person forms, and one specifically for the passive. The passive personal form is encoded as fourth person passive, which has been the common practice in past systems. Further reading: VISK § 106 – 107

The person is the last required analysis for verbs, after the mood:
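The ordering constraints for finite verbs stated above (voice first, then mood, tense only in the indicative, person last, and a dedicated person value for the passive) can be sketched as a small builder function. This is a minimal sketch, not the lexicon's own logic: the tag names (`V`, `Act`, `Pss`, `Ind`, `Prs`, `Prt`, `Sg1` ... `Pe4`) and the assumption that the POS tag precedes the voice are illustrative guesses, not taken from this page.

```python
# Hypothetical tag names, used only to illustrate the ordering rules.
PERSONS = {"Sg1", "Sg2", "Sg3", "Pl1", "Pl2", "Pl3", "Pe4"}
TENSES = {"Prs", "Prt"}  # non-past and past


def verb_analysis(lemma, voice, mood, person, tense=None):
    """Build 'lemma+V+Voice+Mood[+Tense]+Person' and check the constraints."""
    if person not in PERSONS:
        raise ValueError("unknown person: %r" % person)
    # Passive is tied to the fourth-person value, active to the other six.
    if (voice == "Pss") != (person == "Pe4"):
        raise ValueError("passive voice and Pe4 must occur together")
    # Tense is only marked in the indicative.
    if mood == "Ind":
        if tense not in TENSES:
            raise ValueError("indicative forms need a tense")
    elif tense is not None:
        raise ValueError("tense is only marked in the indicative")
    tags = ["V", voice, mood]
    if tense is not None:
        tags.append(tense)
    tags.append(person)
    return lemma + "+" + "+".join(tags)
```

Under these assumptions, `verb_analysis("tulla", "Act", "Ind", "Sg1", tense="Prs")` gives `tulla+V+Act+Ind+Prs+Sg1`, while a conditional form simply omits the tense slot.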

Negation and verbs

Verb forms that require a negation verb have a special analysis marking this.

The suitable negation verbs have a sub-analysis that can be matched to negated forms at the syntactic level.

Infinite verb forms

Infinitive verb forms have infinitive or nominal derivation analyses. In traditional grammars the infinitive forms were called I, II, III, IV and V infinitive; the modern grammar replaces the first three with A, E and MA respectively. The IV infinitive, which has the -minen suffix marker, has been re-analysed as derivational, and this is reflected in omorfi. The V infinitive is also assumed to be mainly derivational, but is included here for reference. Further reading: VISK § 120 – 121. The infinitives have limited nominal inflection.

Infinitive analysis comes after voice, followed by nominal analyses:

Participles

Participial verb forms have participle readings. There are four participle forms. As with infinitives, the participles that traditional grammars named I and II are called NUT and VA in modern grammars. The agent and negation participles have sometimes been considered to fall outside regular inflection, but modern Finnish grammars place them alongside the other participles, so they are included in inflection in omorfi as well. In some grammars the NUT and VA participles have been called past and present participles respectively, drawing parallels from other languages. The modern grammar avoids these names as misleading, but this description uses them. Further reading: VISK § 122

Participle analyses are right after voice, followed by adjectival analyses:

There are a number of implementations that mix up MA infinitives and agent participles; the two share part of the same forms, but no semantics and very little syntax.

Comparison

Adjective and some adverbial analyses are marked for comparison. The marked forms are the comparative and the superlative; the positive is unmarked. For adjectives, comparative suffixes precede the nominal inflection. Cf. VISK § 300

The comparison analysis occupies derivation spot, after POS:

Enclitic focus particles

All word forms can have clitics, which are analysed by their orthography. Clitics are suffixes which can attach almost anywhere at the ends of words, both verb forms and nominals. They also attach to the end of other clitics, in theory forming infinite chains. In practice it is usual to see at most three in one word form. Two clitics have limited use: -s only appears in a few verb forms and in combination with other clitics, and -kA only appears with a few adverbs and the negation verb. VISK § 126 – 131
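To make the chaining behaviour concrete, here is a minimal Python sketch that peels clitics off the end of a word form purely by orthography, stopping at three by default as the practical limit mentioned above. The clitic inventory below (-kin, -kAAn, -kO, -hAn, -pA) is an assumption for illustration; the limited-use -s and -kA are deliberately left out, and the function name and parameters are hypothetical.

```python
# Assumed clitic inventory in both vowel-harmony variants; longest first.
CLITICS = ("kaan", "kään", "kin", "han", "hän", "ko", "kö", "pa", "pä")


def split_clitics(word, max_clitics=3, min_stem=2):
    """Strip up to max_clitics clitics from the end of word, by spelling only."""
    found = []
    stem = word
    while len(found) < max_clitics:
        for clitic in CLITICS:
            # Never strip so much that fewer than min_stem letters remain.
            if stem.endswith(clitic) and len(stem) - len(clitic) >= min_stem:
                found.append(clitic)
                stem = stem[: -len(clitic)]
                break
        else:
            break  # no clitic matched; stop peeling
    found.reverse()  # report clitics in surface order
    return stem, found
```

Because the matching is purely orthographic, it over-splits ordinary words that happen to end in a clitic-like string (for instance `lakko` would lose its final `ko`); a real analyser resolves this through the lexicon rather than by string matching.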

Derivation

Derivation is not a central feature of this morphology; it is mainly used to collect new roots for dictionaries. The list is already roughly in order of perceived productivity:

Usage

The analyses of some words and word-forms indicate limited usage. This includes common misspellings, archaic words and forms, and otherwise rare words and forms. In particular, the forms that are in parentheses in lexical sources, and word-forms that had a usage annotation there, have been carried over.

Usage tags are pushed wherever appropriate:

Homonym tags

Dialects

Informal language use differs from the literary standard; this is marked as the standard dialect (yleispuhekieli). Common features include dropping final vowels, dropping the final i component of unstressed diphthongs, and a few other shortenings. Other dialects are also sometimes analysed. The geographical division has three levels. The first is East versus West. On the second level, East contains Savo and South-East (North?), and West contains North, perä-, keski- and eteläpohjalaiset, Southwest and Häme. The third-level dialect division is traditionally by “town” borders; be cautious when adding these, though, as it is not the main target of this morphology.

Tags for language of unassimilated name

Compounding tags

The tags are of the following form:

This entry / word should be in the following position(s):

If unmarked, any position goes.

The tagged part of the compound should make a compound using:

Unmarked = default, i.e. +CmpN/SgN for SME.

The second part of the compound may require that the previous (left part) is:

These tags describe the parts of the compound.

The prefix (before “/”) is Cmp.

Others

The boundaries of compounds that are not lexicalised in the dictionary will have compound analyses; the compounds may also have usage tags. The compounding analyses also concern syntagmatic melting mishmash.

Compound boundary

The word and morpheme boundaries are used to limit the effective range of far-reaching rules, such as vowel harmony. The boundaries are marked by curly-bracketed hashes or underscores. Word boundaries are marked by #, lexical item boundaries by ##, inflectional morpheme boundaries by >, derivational morpheme boundaries by », and some etymological and soft boundaries by _.
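The reason a boundary limits the range of vowel harmony can be shown with a small Python sketch: the harmony class of a suffix is decided only by the compound part after the last word boundary, not by the whole string. This is a simplified illustration, not the rule set used in this morphology. It assumes a bare `#` as the word boundary, uppercase `A`, `O`, `U` as harmony archiphonemes in the suffix, and the textbook rule that a part containing a, o or u takes back-vowel suffixes.

```python
BACK_VOWELS = set("aou")


def realise_suffix(stem, suffix):
    """Attach suffix to stem, resolving A/O/U by the last '#'-delimited part."""
    # Only the material after the last word boundary decides harmony.
    last_part = stem.rsplit("#", 1)[-1]
    back = any(ch in BACK_VOWELS for ch in last_part.lower())
    table = {
        "A": "a" if back else "ä",
        "O": "o" if back else "ö",
        "U": "u" if back else "y",
    }
    surface_suffix = "".join(table.get(ch, ch) for ch in suffix)
    # Boundaries are not part of the surface form.
    return stem.replace("#", "") + surface_suffix
```

With `koulu#työ` and the suffix `ssA`, only `työ` is inspected, so the result is `koulutyössä`; a rule that looked at the whole string would see the back vowels of `koulu` and pick the wrong variant.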

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: only allow compounds with verbs if the verb is further derived into a noun again.

| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |

@C.ErrOrth@
@D.ErrOrth.ON@
@P.ErrOrth.ON@
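For readers unfamiliar with flag diacritics, the following Python sketch emulates how a path through the lexicon is accepted or rejected, using the standard operator meanings: P sets a feature, C clears it, D blocks the path if the feature has the given value (or any value, if none is given), R requires it, and U unifies. It is a simplified model for illustration (the N operator and negated values are omitted), and where exactly each flag sits in the lexicon is an assumption, not something stated on this page.

```python
import re

# @OP.FEATURE@ or @OP.FEATURE.VALUE@
FLAG_RE = re.compile(r"^@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@$")


def path_is_allowed(symbols):
    """Return True if the flag diacritics along the path are all satisfied."""
    state = {}
    for symbol in symbols:
        match = FLAG_RE.match(symbol)
        if match is None:
            continue  # ordinary symbol, not a flag diacritic
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":  # positive set
            state[feature] = value
        elif op == "C":  # clear
            state.pop(feature, None)
        elif op == "D":  # disallow
            if value is None and current is not None:
                return False
            if value is not None and current == value:
                return False
        elif op == "R":  # require
            if value is None and current is None:
                return False
            if value is not None and current != value:
                return False
        elif op == "U":  # unify
            if current is not None and current != value:
                return False
            state[feature] = value
    return True
```

In this model, a path where a verb sets `@P.NeedNoun.ON@` and later meets `@D.NeedNoun.ON@` is rejected, while the same path with an intervening `@C.NeedNoun@` (as a nominalising derivation might supply) is accepted.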

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |

The start of the dictionary

Root: The Finnish morphological description starts from any of the parts of speech dictionaries, prefix or hyphenated suffix.


This (part of) documentation was generated from src/fst/morphology/root.lexc