North Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources


Flag diacritics

Introduction

The use of flag diacritics is documented in chapter 8 of the Xerox book. This page documents the flag diacritic format and how the diacritics are used in the parser.

Flag diacritics are used in the Saami morphological parser in order to remove illegal compounds, and in order to handle automatic downcasing of proper names when they are converted to e.g. adjectives.

Flag diacritics format

There are four types of flag diacritics, all of them with the format @operator.feature.value@ or @operator.feature@.
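As a rough illustration of the format, the following Python sketch splits a flag diacritic symbol into its operator, feature and optional value. It is not part of the lexc/xfst toolchain; the function name `parse_flag` and the regular expression are assumptions made for this example only.

```python
import re

# Matches @operator.feature.value@ or @operator.feature@.
# Assumption: the operator is a single uppercase letter.
FLAG_RE = re.compile(r"@([A-Z])\.([^.@]+)(?:\.([^.@]+))?@")


def parse_flag(symbol):
    """Split a flag diacritic into (operator, feature, value).

    The value is None for the two-part form @operator.feature@.
    """
    match = FLAG_RE.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a flag diacritic: {symbol!r}")
    return match.groups()


print(parse_flag("@U.Cap.Obl@"))  # three-part form
print(parse_flag("@R.cmpnd@"))    # two-part form, value is None
```

Ordinary symbols such as `Pariisi` do not match the pattern and raise a `ValueError`.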

Flag diacritics for Saami compounds

Introduction

Flag diacritics are used in the Saami morphological parser in order to remove illegal compounds.

The problem

Without flag diacritics, compounds with derived nouns are erroneously blocked, or, if they are accepted,

A solution

  1. The Nominative and Genitive sublexica of all NounRoot entries are led to the R lexicon (as earlier).
  2. From there, they go to NounRoot again without flag diacritics. They are also led to VerbRoot and AdjectiveRoot, but equipped with a compound flag diacritic. Then, all affixes that turn adjectives and verbs into nouns are equipped with a corresponding diacritic.
  3. Compounds with only an unsaturated diacritic are removed, whereas compounds with a saturated diacritic are accepted.

A sketch

Here, the P and R diacritics are used, as shown with the R lexicon and two lexica for deverbal nouns that take verbal stems as input. The P diacritic sets the value of cmpnd to N, and the R diacritic requires a test.
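The behaviour of the two operators can be simulated outside the parser. The Python sketch below is only an illustration under stated assumptions: it handles the P and R operators alone, the helper `path_is_valid` is hypothetical, and the strings `noun-stem`, `verb-stem` and `nominaliser` are placeholders rather than entries from the Saami lexicon. It shows that a P flag always succeeds and records a value, while an R flag succeeds only if the feature has already been set to a matching value. It does not model how the real lexicon arranges the flags to discard unwanted compounds.

```python
def path_is_valid(symbols):
    """Return True if the P and R flags along one path are consistent."""
    features = {}
    for symbol in symbols:
        if not (symbol.startswith("@") and symbol.endswith("@")):
            continue  # ordinary symbol, not a flag diacritic
        parts = symbol.strip("@").split(".")
        if len(parts) < 2:
            continue  # malformed flag, ignored in this sketch
        operator, feature = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else None
        if operator == "P":
            # Positive setting: always succeeds and records the value.
            features[feature] = value
        elif operator == "R":
            # Require test: the feature must be set, and must match
            # the value if the flag names one.
            current = features.get(feature)
            if current is None or (value is not None and current != value):
                return False
    return True


# Placeholder path: the P flag is set before the R flag tests it.
with_p = ["noun-stem", "@P.cmpnd.N@", "verb-stem", "@R.cmpnd.N@", "nominaliser"]
# Placeholder path: the R flag is reached without any P flag before it.
without_p = ["verb-stem", "@R.cmpnd.N@", "nominaliser"]

print(path_is_valid(with_p))
print(path_is_valid(without_p))
```

The first path is accepted and the second is rejected, because nothing has set cmpnd before the require test is reached.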

Flag diacritics for downcasing of derived proper nouns

The problem

This is fixed, isn’t it?

All proper nouns may be turned into adjectives of the type London > Londoner; a Sámi example is Oslo > oslolaš. In Sámi, the capital letter of the proper name must be downcased in the derived form.

An earlier solution was to write a twol rule that replaced every initial uppercase letter with the corresponding lowercase one if the stem was followed by the right kind of derivational suffix (this rule is still found at the end of the twol-sme.txt file, where it is commented out). This solution was abandoned, since the compilation time was simply too long.

A solution involving flag diacritics

A possible solution is to use flag diacritics, in the same way as we used flag diacritics to fix compounds.

The tags @U.Cap.Obl@ and @U.Cap.Opt@ were introduced (cf. the sme-lext.txt file), but so far, we don’t have a working solution. The problem is documented (for Finnish) in section 8.5.5 of the pre-published version, and on pp. 368-372 of the published version of the B&K book.
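To make the intended effect of the two tags concrete, here is a small Python simulation of U-type (unification) flags. It is a sketch, not the parser itself: the helper `unifies` is hypothetical, and the two lists simply restate the flag and continuation pairs found in the Root and PNSuff lexica of the demo lexicon in the appendix. A U flag sets the feature if it is still unset and otherwise succeeds only when the value agrees with the one already set.

```python
def unifies(flags):
    """Check a sequence of U-type flags for consistency.

    The first flag for a feature sets its value; every later flag for
    the same feature must carry the same value.
    """
    features = {}
    for flag in flags:
        operator, feature, value = flag.strip("@").split(".")
        if operator != "U":
            raise ValueError(f"only U flags are handled here: {flag}")
        if features.setdefault(feature, value) != value:
            return False
    return True


# Flags at the start of the word (LEXICON Root in the demo lexicon).
ROOT = ["@U.Cap.Obl@", "@U.Cap.Opt@"]
# Flag and continuation lexicon after the stem (LEXICON PNSuff).
PNSUFF = [("@U.Cap.Obl@", "PN"), ("@U.Cap.Opt@", "AdjSuff")]

surviving = [
    (root, continuation)
    for root in ROOT
    for flag, continuation in PNSUFF
    if unifies([root, flag])
]
print(surviving)
```

Of the four possible combinations, only the two with matching values survive: the Obl flag pairs with the plain proper noun, and the Opt flag pairs with the adjective derivation.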

Working on it (Trond): I copied the two files demo-lex.txt (the lexc file) and paloscript.xfst (the xfst script). I then compiled the former and saved the result as lex.fst (in lexc), and ran the latter (in xfst). The lexc commands were:

compile-source demo-lex.txt
obey-flags
source-to-result
save-result lex.fst

The sole xfst command was:

source paloscript.xfst

The resulting message from xfst is seen below:

xfst[0]: source paloscript.xfst
Opening file paloscript.xfst...
defined UC: 568 bytes. 2 states, 26 arcs, 26 paths.
5.2 Kb. 6 states, 356 arcs, Circular.
 >>>>This leaves the rule transducer on the stack
0: 5.2 Kb. 6 states, 356 arcs, Circular.
 >>>>Loading lex.fst onto the stack
Opening 'lex.fst'
Closing 'lex.fst'
 >>>>There should now be two networks on the stack
0: 1.5 Kb. 32 states, 35 arcs, 12 paths.
1: 5.2 Kb. 6 states, 356 arcs, Circular.
 >>>>Composing the rules under the lexicon
1.8 Kb. 38 states, 45 arcs, 18 paths.
flex scanner jammed

According to the Book, we should have two networks on the stack, so that is ok. The questions are why we get the final message (flex scanner jammed), how to get the demo to work, and thereafter how to solve our own problem.

Appendix: The files used

These two files were copied from the B&K book. No attempt was made to modify them, as the first goal should be to get them to work.

The lexc file demo-lex.txt

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! demo-lex.txt
!
! Simple Finnish Lexicon with Flag Diacritics
!
! Includes words like "Pariisi" (Paris) "pariisilainen"
! (Parisian), "Palo Alto", "paloaltolainen" (Palo Altan).
! The initial capital is obligatory in "Pariisi",
! optional in "pariisilainen". The internal space in
! "Palo Alto" is not present in "paloaltolainen".

Multichar_Symbols
 @U.Cap.Obl@ @U.Cap.Opt@
 +PN +Adj +Der+

LEXICON Root
 @U.Cap.Obl@ PropNoun ;
 @U.Cap.Opt@ PropNoun ;

LEXICON PropNoun
 Pariisi PNSuff ;
 Grenoble PNSuff ;
 Palo_Alto PNSuff ;

! N.B. that _ denotes the literal space
! character in this grammar

LEXICON PNSuff
 @U.Cap.Obl@ PN ;
 @U.Cap.Opt@ AdjSuff ;

LEXICON PN
+PN:0 # ;

LEXICON AdjSuff
 +Der+:0 LAINEN ;

LEXICON LAINEN
 lainen ADJ ;

LEXICON ADJ
 +Adj:0 # ;

The xfst script paloscript.xfst

clear stack
define UC A | B | C | D | E | F | G | H |
I | J | K | L | M | N | O | P | Q | R | S |
T | U | V | W | X | Y | Z ;

read regex [
# Allow optional initial downcasing after @U.Cap.Opt@
A (->) a, B (->) b, C (->) c, D (->) d, E (->) e,
F (->) f, G (->) g, H (->) h, I (->) i, J (->) j,
K (->) k, L (->) l, M (->) m, N (->) n, O (->) o,
P (->) p, Q (->) q, R (->) r, S (->) s, T (->) t,
U (->) u, V (->) v, W (->) w, X (->) x, Y (->) y,
Z (->) z
                || .#. %@U%.Cap%.Opt%@ _

.o.

# No uppercase in the middle of a downcasable word

A->a, B->b, C->c, D->d, E->e, F->f, G->g, H->h,
I->i, J->j, K->k, L->l, M->m, N->n, O->o, P->p,
Q->q, R->r, S->s, T->t, U->u, V->v, W->w, X->x,
Y->y, Z->z

|| %@U%.Cap%.Opt%@ ?+ _

.o.

# Eliminate internal spaces inside a downcasable word
# Spaces are indicated here with the literal
# underscore character
%_ -> [] || .#. %@U%.Cap%.Opt%@ ?+ _
] ;

echo >>>>This leaves the rule transducer on the stack
print stack

echo >>>>Loading lex.fst onto the stack
load stack lex.fst

echo >>>>There should now be two networks on the stack
print stack

echo >>>>Composing the rules under the lexicon
compose net

echo >>>>Composition complete
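What the three replace rules in the script are meant to do to a word that starts with @U.Cap.Opt@ can be approximated in a few lines of Python. This is a sketch under assumptions, not a replacement for the xfst script: the function name `downcased_surface_forms` is hypothetical, the flag itself is left out of the input string, and only the letters A to Z are handled, as in the script.

```python
import string


def downcased_surface_forms(word):
    """Return the surface forms of a word whose path starts with @U.Cap.Opt@."""
    if not word:
        return set()
    first, rest = word[0], word[1:]
    # Second rule: uppercase A-Z after the first symbol is always lowercased.
    rest = "".join(
        c.lower() if c in string.ascii_uppercase else c for c in rest
    )
    # Third rule: underscores (standing for spaces) after the first
    # symbol are deleted.
    rest = rest.replace("_", "")
    # First rule: the initial letter is optionally lowercased, so both
    # variants are kept.
    variants = {first + rest}
    if first in string.ascii_uppercase:
        variants.add(first.lower() + rest)
    return variants


print(sorted(downcased_surface_forms("Palo_Alto" + "lainen")))
```

For the stem Palo_Alto followed by lainen this yields Paloaltolainen and paloaltolainen, which matches the behaviour described in the header comment of the lexc file.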