The use of flag diacritics is documented in chapter 8 of the Xerox book. The present page documents the flag diacritics format, and the use of them in the parser.
Flag diacritics are used in the Saami morphological parser in order to remove illegal compounds, and in order to handle automatic downcasing of proper names when they are converted to e.g. adjectives.
There are four types of flag diacritics, all of them with the format @operator.feature.value@ or @operator.feature@:
Flag diacritics are used in the Saami morphological parser in order to remove illegal compounds.
Without flag diacritics, compounds with derived nouns are errouneously blocked, or, if they are accepted,
Here, the P and R diacritics are used, as shown with the R lexicon and two lexica for deverbal nouns, that takes verbal stems as input. The P diacritic sets the value of cmpnd to N, and the R diacritic requires a test.
This is fixed, isn’t it?
All proper nouns may be turned into adjectives of the type London > Londoner, in Sámi, Oslo > oslolaš. In Sámi, the capital letter of the proper name must be downcased.
An earlier solution was to write a twol rule that exchanged all initial uppercase letters with an initial lowercase one if the stem was followed by the right kind of derivational suffix (this rule is still found at the end of the twol-sme.txt file, where it is commented out). This solution was abandoned, since the compilation time was simply too long
A possible solution is to use flag diacritics, in the same way as we used flag diacritics to fix compounds.
@U.Cap.Obl@ @U.Cap.Opt@ were introduced (cf. the
sme-lext.txt file), but so far, we don’t have a working solution. The
problem is documented (for Finnish) in section 8.5.5. (of the
pre-published version), and on pp. 368-372 in the published version of
the B&K book.
Working on it (Trond): I copied the two files demo-lex.txt (the lexc file) and lexscript.xfst (the xfst script). I then saved the former as lex.fst (in lexc) and ran the latter (in xfst). The lexc commands were:
compile-source demo-lex.txt obey-flags source-to-result save-result lex.fst
The sole xfst command was:
The resulting message from xfst is seen below:
xfst: source paloscript.xfst Opening file paloscript.xfst... defined UC: 568 bytes. 2 states, 26 arcs, 26 paths. 5.2 Kb. 6 states, 356 arcs, Circular. >>>>This leaves the rule transducer on the stack 0: 5.2 Kb. 6 states, 356 arcs, Circular. >>>>Loading lex.fst onto the stack Opening 'lex.fst' Closing 'lex.fst' >>>>There should now be two networks on the stack 0: 1.5 Kb. 32 states, 35 arcs, 12 paths. 1: 5.2 Kb. 6 states, 356 arcs, Circular. >>>>Composing the rules under the lexicon 1.8 Kb. 38 states, 45 arcs, 18 paths. flex scanner jammed
According to the Book, we should have two networks, so that is ok. The
question is why we get the final message
flex scanner jammed), and how we shall get the demo work,
and thereafter how we can make our own problem work.
These two files were copied from the B&K book. No attempt was done do modify them, as the first goal should be to get them to work.
The lexc file demo-lex.txt
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ! demo-lex.txt ! ! Simple Finnish Lexicon with Flag Diacritics ! ! Includes words like "Pariisi" (Paris) "pariisilainen" ! (Parisian), "Palo Alto", "paloaltolainen" (Palo Altan). ! The initial capital is obligatory in "Pariisi", ! optional in "pariisilainen". The internal space in ! "Palo Alto" is not present in "paloaltolainen". Multichar_Symbols @U.Cap.Obl@ @U.Cap.Opt@ +PN +Adj +Der+ LEXICON Root @U.Cap.Obl@ PropNoun ; @U.Cap.Opt@ PropNoun ; LEXICON PropNoun Pariisi PNSuff ; Grenoble PNSuff ; Palo_Alto PNSuff ; ! N.B. that _ denotes the literal space ! character in this grammar LEXICON PNSuff @U.Cap.Obl@ PN ; @U.Cap.Opt@ AdjSuff ; LEXICON PN +PN:0 # ; LEXICON AdjSuff +Der+:0 LAINEN ; LEXICON LAINEN lainen ADJ ; LEXICON ADJ +Adj:0 # ;
The xfst script paloscript.xfst
clear stack define UC A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z ; read regex [ # Allow optional initial downcasing after @U.Cap.Opt@ A (->) a, B (->) b, C (->) c, D (->) d, E (->) e, F (->) f, G (->) g, H (->) h, I (->) i, J (->) j, K (->) k, L (->) l, M (->) m, N (->) n, O (->) o, P (->) p, Q (->) q, R (->) r, S (->) s, T (->) t, U (->) u, V (->) v, W (->) w, X (->) x, Y (->) y, Z (->) z || .#. %@U%.Cap%.Opt%@ _ .o. # No uppercase in the middle of a downcasable word A->a, B->b, C->c, D->d, E->e, F->f, G->g, H->h, I->i, J->j, K->k, L->l, M->m, N->n, O->o, P->p, Q->q, R->r, S->s, T->t, U->u, V->v, W->w, X->x, Y->y, Z->z || %@U%.Cap%.Opt%@ ?+ _ .o. # Eliminate internal spaces inside a downcasable word # Spaces are indicated here with the literal # underscore character %_ ->  || .#. %@U%.Cap%.Opt%@ ?+ _ ] ; echo >>>>This leaves the rule transducer on the stack print stack echo >>>>Loading lex.fst onto the stack load stack lex.fst echo >>>>There should now be two networks on the stack print stack echo >>>>Composing the rules under the lexicon compose net echo >>>>Composition complete