North Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sme


A flowchart of the sme files for morphological parsing

This flowchart gives an overview of how the sme source files are related. In principle, the files for the other languages are arranged in the same way.

The main lex file            Separate lex files for different POS
                             (parts of speech)
|----------------------|    |------------------|
|     sme-lex.txt      |    | noun-sme-lex.txt |
|                      |    |  viessu GOAHTI ; |  From the Root lexicon, there
|     Root   -------------> |  ...             |  are pointers to each POS.
|                      |    |        |         |  The files for nouns, verbs and
|     LEXICON GOAHTI <---------------|         |  adjectives point back to the
|      +N DEVNVCASE ;  |    |                  |  sme-lex.txt file, and are di-
|      ...             |    |------------------|  rected to their respective
|                      |                          sublexica.
|                      |    |-------------------|
|                     --->  | verb-sme-lex.txt  | (the auxiliary verbs are
|                   <--------- ...              | also found in the verb file)
|                      |    |-------------------|
|                      |
|                      |    |-------------------|
|                     --->  | adj-sme-lex.txt   |
|                   <--------- ...              |
|                      |    |-------------------|
|                      |
|                      |
|                     --->  |-------------------| The other lex files contain
|                  <- - - - - closed-sme-lex.txt| closed classes. They are
|----------------------|    | LEXICON Pronoun   | smaller, and all the sublexica
                            |  Personal ;       | are in the same file, not in
                            |                   | the sme-lex file (though some
                            | LEXICON Personal  | point to sublexica there).
                            |  ...              | Other files are pp-lex.txt,
                            |-------------------| etc. In all there are about
                                                  10 lex files.

This is compiled together with the        ||
twol rules. These rules contain the       ||
(morpho)phonological processes,           ||
consonant gradation, etc.                 \/


|------------|    |------------|      |------------|   The sme.save file is
|twol-sme.txt| => |twol-sme.bin|  =>  |  sme.save  |   compiled in lexc, and
|------------|    |------------|      |------------|   is the merger of the
                                                       lex files and the rule
Here are the      After compi-                         file twol-sme.bin
rules them-       lation in twolc           ||
selves            they are in this          ||
                  binary file               ||
                                            ||
Then come the preprocessor files:           \/

|----------|    |------------|         ||=========||  This is the final morpho-
|case.regex| => |caseconv.fst| ======> || sme.fst ||  logical parser for
|----------|    |------------|         ||=========||  North Sami.

The case.regex file is com-
piled in xfst. The preprocessor
itself, tok.fst, is not shown here.
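
The pointer structure in the top half of the flowchart can be illustrated with a small Python model. This is only a toy sketch, not part of the sme toolchain: the real lexica are compiled with lexc, not Python. The entry `viessu GOAHTI ;` and the lexicon `GOAHTI` with `+N DEVNVCASE ;` are taken from the flowchart. The lexicon name `NounRoot` and the empty content of `DEVNVCASE` are assumptions made for illustration, since the flowchart does not show them.

```python
# Toy model of continuation lexica spread over several files.
# Each dict stands in for one file and maps a lexicon name to a list of
# (form, continuation) entries. "#" marks the end of a word.

SME_LEX = {  # stands in for sme-lex.txt
    "Root": [("", "NounRoot")],        # "NounRoot" is a hypothetical name
    "GOAHTI": [("+N", "DEVNVCASE")],   # from the flowchart
    "DEVNVCASE": [("", "#")],          # placeholder; real content not shown
}

NOUN_SME_LEX = {  # stands in for noun-sme-lex.txt
    "NounRoot": [("viessu", "GOAHTI")],  # entry from the flowchart
}


def merge(*files):
    """Combine the lexica of several files into one lookup table."""
    merged = {}
    for lexfile in files:
        for name, entries in lexfile.items():
            merged.setdefault(name, []).extend(entries)
    return merged


def expand(lexicons, name="Root", prefix=""):
    """Yield every string reachable from `name` by following continuations."""
    if name == "#":
        yield prefix
        return
    for form, continuation in lexicons[name]:
        yield from expand(lexicons, continuation, prefix + form)


lexicons = merge(SME_LEX, NOUN_SME_LEX)
analyses = list(expand(lexicons))
print(analyses)  # ['viessu+N']
```

The walk starts in `Root` (main file), moves to the noun file for the stem, and returns to the main file for the `GOAHTI` sublexicon, which is the round trip the arrows in the flowchart describe.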