North Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sme

The disambiguation file itself

Here, we discuss the structure of the North Saami disambiguation rule files For a documentation of the linguistic considerations behind some specific linguistic issues, see this document.

File macrostructure

The disambiguation file disambiguation.cg3 consists of the following parts:

The file format is documented in Tapanainen 1996. Cf. also the general discussion on CG-2 usage.

The delimiters

Four sentence delimiters are declared in disambiguation.cg3, “.”, “?”, “!” and “¶”. This is thus where we find the difference between this file and the file sme-tdis.rle, which has only one delimiter, “¶”.

Tag and tag list declarations

This section consists of two parts. In the first part, every single lexc tag has been given a list declaration of its own.

In the second part comes the real set and list declarations.

Mapping rules

These rules define mappings from morphological tags to syntactic functions. Note that it is important here to select the TARGET and the conditions carefully. For example, if one writes MAP (@-FAUXV) TARGET (“leat”) IF (0 Inf), not only Inf readings will get the tag (@-FAUXV). The tag will be applied to every member of the relevant cohorts, even to the Sg2 and Pl1 and Pl3 readings. If one writes instead MAP (@-FAUXV) TARGET Inf IF (0 (“leat”)), only the Inf reading gets that tag, and the other readings can be taken care of by other rules.

The constraints

The constraints are organised in cycles. CG-2 makes it possible to arrange the rules in blocks. Each set of constraints is introduced with the key word CONSTRAINTS, and the last section is closed off with the word END.

First we have disambiguation rules that each refer to one single cohort (marked with “cycle 0”). Then comes local disambiguation, referring mainly to words one cohort to the left or right (possibly with intervening adverbials) (“cycle 1”). Then follows the main part of the disambiguation file (“cycle 2”),which contains rules with local and long-distance scope. The next cycle consists of rules with global scope, along with some rules that need to come as late as possible (“cycle 3”). Cycle 4 contains rules for syntactic disambiguation, and finally, there is cycle 5 with post-syntactic morphological rules.

The rules within some of the main cycles are organised in subcycles, set apart by the key word CONSTRAINTS. This gives better control of the order in which the rules apply.

Also note that rules are often organised such that a block of rules is followed by a set of examples. In these cases, the first rule will go with the first example, the second rule with the second example, and so on. The idea is that this gives a better overview than if each rule is immediately followed by an example.

The majority of the examples show where a rule hits. But in some cases, an example shows where the rule does NOT hit. Then the example is there to illustrate the need for some specific condition in the rule. Which condition is normally pointed out explicitly.

More on the cycles

This information may be somewhat outdated. The structure of the file, as well as all rules, is shown in this document (generated from the source file itself).

Cycle 0

Much cohort-internal disambiguation is taken care of in the preprocessor phase (in the script lookup2cg, see also the documentation in the script).

The rules in this section are there to reject analyses where there is a better analysis in the same cohort. Typically, this is the case where a lexicalised derivation competes with a dynamically derived form. For example, adverbs derived from adjectives are rejected if there is an alternative analysis with the lexicalised adverb. Then there are certain tag combinations that are excluded, for example, one rule says REMOVE Imprt IF (0 Qst). The idea is that a verb cannot be imperative if it has a cliticised question marker. Hence, all Imprt readings are removed from verbs with question markers. All in all, this cycle eliminates readings that we do not ever want to end up with.

Cycle 1

This cycle consists of two subcycles. In cycle 1a we deal with some personal pronouns that have homonyms in other parts of speech, and with the pronoun “eanaš/eanas”, we pick out some Px readings, and select verb readings that depend on personal pronouns. There is also a rule for “dušše” and two rules related to “ahte”, which have to come early so that only the CS reading of “ahte” survives to cycle 2. In cycle 1b most of the remaining Px readings are removed.

Cycle 2

This is the largest cycle in the disambiguation file. It consists of several subcycles. The main organising principle is that rules that are relevant to one and the same part of speech should come as one rule block. However, some exceptions to this principle have turned out to be necessary. For example, although the main verb rule block follows the main noun rule block, certain verb rules precede the noun rules, thereby improving the effect of the latter.

In the following, the main subcycles of Cycle 2 are dealt with one by one.

Noun or not?

In this subcycle some relatively certain noun readings are picked out. In most cases adjective readings are thereby removed, but also some verb readings. The selected noun readings will serve as context for later rules.

Adjectives and adverbs

In this subcycle some relatively certain adverb and adjective readings are picked out. The competing readings are verb or noun readings.

Disambiguating clitics

So far two rules here. They remove the Qst reading from the particle “dego” and from adverbs like “nugo”.

Disambiguating adpositions

Two subcycles. The first one deals with some individual adpositions, with some adpositions that take modifiers, and with adpositions of the GASKAL class, in the cases where they combine with a coordination. The second subcycle consists of rules that select and remove Po and Pr readings in a general fashion. The ordering of rules is crucial here. The present ordering is not necessarily the best one, although it has been worked on quite a lot.

Disambiguating subjunctions

Two subcycles. The first one contains two general CS rules and a number of rules related to individual subjunctions with homonyms in other POS. The second one contains rules that disambiguate those ambiguous instances that survive the first subcycle.

Disambiguating adverbs

Consists of some general adverb rules, some rules that distinguish between adverbs and other specified POS, and a series of rules related to individual adverbs. The latter set of rules deals with ambiguities that are not resolved by the general adverb rules.

Disambiguating pronouns

Since personal pronouns are dealt with in cycle 1, this cycle consists of rule blocks for interrogative pronouns, reflexive pronouns, reciprocal pronouns, numerals (placed here because of “nubbi”), indefinite pronouns, and demonstrative pronouns, in that order, so that the rules for demonstratives, for example, can build on the output of all other pronoun rules.

Note that there are particularly many rules for “dat”, since “dat” can be a demonstrative pronoun or a personal pronoun. We have chosen here to treat “dat” as a personal pronoun whenever it stands alone.

Disambiguating adjectives

This cycle is introduced by some rules for individual adjectives, followed by some rules that pick out comparatives. The next block of rules identify attributes that come between a demonstrative and a noun with compatible cases. Between these elements we expect only DP-internal material, so it is relatively easy to identify attributes in this domain. It is much harder to identify attributes that are not preceded by demonstratives, which is why the next rule block, for other attribute rules, is much longer. After the main rule block for attributes follow rules for coordinated attributes and for “buorre”. The last part of the adjective cycle contains rules for predicative adjectives, and finally, a couple of adverb rules.

Disambiguating verbs - part 1

In this cycle we deal with some verb forms that are relatively easy to disambiguate. We select ConNeg and imperative forms, and also some infinitives. In addition, in the middle of the rule block for infinitives there are some rules that remove infinitive. Then we have rules for VGen and for PrfPrc, including cases where “leamaš” appears without a finite verb. We select some Actio and some PrsPrc readings, and finally, there are rules for the individual verbs “orrut” and “addit”.

Disambiguating nouns

The first rules in this cycle deal with proper nouns-selection of proper nouns, removal of proper nouns, and choosing between different proper nouns. Then follow a series of rules that remove certain adjective readings when corresponding noun readings are available. After that the case disambiguation begins. We start with rules for some specific constructions that need to be picked out early. Then for each case there is a rule section with the following structure: 1) rules that identify single occurences of the case in question, 2) rules that find coordinated elements with that case, 3) rules that REMOVE that case in certain environments. This is to make the job easier for later case rules, by removing wrong readings as early as possible.

Since there are quite a few syntactic environments where only genitive case is found, we pick out genitive readings first. Then nominative rules follow, and after that accusative. These three sections are all relatively large, since ambiguity between nominative, accusative and genitive is so widespread in North Sámi.

The case rule sections that follow, for illative, locative, comitative and essive are much smaller. And at the end of the case rule chapter, there is a section for case rules that have to come after others have worked.

Disambiguating verbs - part 2

This is the main rule block for verb disambiguation.We first select some finite verbs, that is, verb readings that include an Ind, Pot, Imprt, ImprtII, or Cond tag. After that we have a couple of rules that disambiguate between the negation ‘it’ and the abbreviation ‘it’. Then follow a block of rules that select infinitive, and a block of rules that select or remove imperative.

Now it is time for the rules that select verb readings according to person and number. There are a few rules for Sg1, a few for Sg2, lots of rules for Sg3, and some rules for Du1, Du2 and Du3. There are also a couple of rules for Pl1 and for Pl2, and a large section of rules for Pl3. Then, since some verbs will remain ambiguous after the preceding rule blocks, the section for finite rules is rounded off by some rules for infinitive, for finite verbs and for passive.

The last part of the verb rule block contains rules for various non-finite verb forms, another rule block for finite verbs, consisting of rules that work best when as many other verb forms as possible are disambiguated, and finally, lexical rules that disambiguate between verbs that have overlapping paradigms.

Residual cases

Now that all parts of speech are disambiguated, we can apply some rules that make reference to particular forms. The rule block Residual cases contains case rules, other rules for nominals, rules for determiners, and rules for adverbs and adjectives.

Cycle 3

This is meant to be a cycle for disambiguation rules of global scope. At present, it contains only three one-cohort rules.

Cycle 4

This is where syntactic disambiguation is done. That is, the rules in this cycle pick out one syntactic tag in cases where an element has more than one. There are rules for subject, for subject predicate, for object and object predicate, for verbal functions, and for other functions.

Cycle 5

This cycle contains morphological rules that make reference to syntactic functions.

A more thorough documentation, on rule level

Here, we write some stuff that later on will be transferred to separate files..

Syntactic disambiguation rules

This is for the rules concerning @ tags, both the mapping and the disambiguating rules.

Morphological disambiguation rules

Here we look at the rules concerning the tags from the sme.fst transducer.

The Acc/Gen issue

Yes, the Acc/Gen issue…

Morphological and syntactic disambiguation

Interrelation between morphological and syntactic rules.

Semantic disambiguation

So far, we have only 5 semantic tags, related to proper nouns. Here, we say something about how we disambiguate them.