Finite state and Constraint Grammar based analysers, proofing tools and other resources
Just as for North Saami, Lule Saami preprocessing was earlier done
with the Xerox tokenize tool and the language-specific file tok.txt.
The code was written as a set of regular expressions, and the
source file (tok.txt) was compiled with xfst. As explained for the sme
preprocessing, this approach has been replaced by a preprocessor script
written in Perl, gt/script/preprocess.
Preprocessing is done by the Perl script gt/script/preprocess, which
is language-independent. The script is documented separately.
The language-dependent part of the preprocessing is handled via
the file smj/bin/abbr.txt.
Lule Saami abbreviations are handled as for North Saami.
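As a rough illustration of why a tokenizer needs an abbreviation list at all, the Python sketch below keeps a listed abbreviation together with its final period and splits trailing punctuation off every other word. It is not the gt/script/preprocess implementation: the function name tokenize, the two sample abbreviations and the in-memory set are assumptions made for the example, and the actual format of smj/bin/abbr.txt is not shown here.

```python
import re

# Hypothetical sample entries; the real list lives in smj/bin/abbr.txt,
# whose format is not reproduced here.
ABBREVIATIONS = {"nr.", "tel."}

def tokenize(text, abbreviations=ABBREVIATIONS):
    """Split text on whitespace, keeping listed abbreviations intact."""
    tokens = []
    for chunk in text.split():
        # A known abbreviation keeps its final period.
        if chunk.lower() in abbreviations:
            tokens.append(chunk)
            continue
        # Otherwise trailing punctuation becomes separate tokens.
        match = re.match(r"^(.*?)([.,;:!?]*)$", chunk)
        word, punct = match.group(1), match.group(2)
        if word:
            tokens.append(word)
        tokens.extend(punct)  # one token per punctuation character
    return tokens

print(tokenize("Tel. 123 nr. 4."))
```

Without the list, the period in "nr." would be split off and read as the end of a sentence, which is the problem an abbreviation file addresses.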
The following feature is common to Lule and South Saami and is not found in North Saami: the letters æ/ä and ø/ö are used interchangeably in Norway and Sweden. The parser accepts either variant.
The xfst file to handle this is the language-independent spellrelax.regex. It contains rules like:
ń (->) ñ, ŋ (->) ñ, æ (->) ä, ø (->) ö ;
The line says that æ may optionally be replaced by ä and ø by ö, and likewise that ń and ŋ may each optionally be replaced by ñ.
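The effect of optional replacement can be sketched in Python: every relaxable letter in a canonical form may either stay as it is or be swapped, so one form yields a set of accepted spellings. This only illustrates the idea and is not how xfst compiles or applies spellrelax.regex. The names RELAX, relaxed_spellings and accepts are made up for the example, and the test strings are arbitrary letter sequences, not Lule Saami words.

```python
import unicodedata
from itertools import product

# Optional replacements taken from the rule quoted above.
RELAX = {"ń": "ñ", "ŋ": "ñ", "æ": "ä", "ø": "ö"}

def relaxed_spellings(word):
    """Return all spellings where each relaxable letter is kept or replaced."""
    # Normalize to precomposed characters so each letter is one code point.
    word = unicodedata.normalize("NFC", word)
    choices = [(ch, RELAX[ch]) if ch in RELAX else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*choices)}

def accepts(canonical, typed):
    """True if the typed form is an accepted variant of the canonical form."""
    return unicodedata.normalize("NFC", typed) in relaxed_spellings(canonical)

print(sorted(relaxed_spellings("æø")))
print(accepts("ŋa", "ña"))
```

As in the quoted rule, the replacements in this sketch go in one direction only: a canonical æ also accepts ä, but a canonical ä does not accept æ.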
We plan to make parts of the spellrelax file language-dependent.
There is a language-independent inituppercase.regex file. See the documentation for initial capitalization written for North Saami.
This has not yet been implemented.