This document describes the different parts of the error models used to create suggestions for the spellers, how they interact, and how one can turn the different parts on and off.
The file tools/spellcheckers/Makefile.mod-desktop-hfst.am looks like this, with default values as given by the und/ template (there is a corresponding file for mobile phone spellers, so that they can be made different from the desktop spellers):
## This is the default weight for all editing operations in the error model:
DEFAULT_WEIGHT=10
## Edit distance for the Levenshtein error model:
EDIT_DISTANCE=2
## Define whether we allow changes to the initial letter(s) in the error model,
## possible values are:
## * no - no longer string edits = only the default, letter-based error model
## * txt - use only the txt file as source
## * regex - use only the regex file as source
## * both - use both the txt and regex files as sources
## NB!!! Setting this to anything but 'no' will greatly increase the size and
## search space of the error model, and thus make it much, much slower. Make sure
## you TEST the resulting error model properly and thoroughly, both for speed
## and suggestion quality.
INITIAL_EDITS=no
## Variable to define whether to enable edits of longer strings (as opposed to
## single letters). Possible values are:
## * no - no longer string edits = only the default, letter-based error model
## * txt - use only the txt file as source
## * regex - use only the regex file as source
## * both - use both the txt and regex files as sources
STRING_EDITS=txt
## Variable to specify the edit distance for the regex
## version of the strings file. The total edit distance for those operations is
## this value multiplied with the value of the DEFAULT_EDIT_DIST variable.
STRING_REGEX_EDIT_DISTANCE=2
## Variable to define whether to enable edits of word-final strings (as opposed
## to single letters). Possible values are:
## * no - no longer string edits = only the default, letter-based error model
## * txt - use only the txt file as source
## * regex - use only the regex file as source
## * both - use both the txt and regex files as sources
FINAL_STRING_EDITS=no
## Variable to define whether to enable whole-word replacements. Possible values:
## - yes
## - no
WORD_REPLACEMENTS=no
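Since a mistyped value in this file is easy to overlook, a small check can help. The following is a minimal sketch, not part of the GiellaLT build system: the function names are hypothetical, the allowed values are copied from the comments in the fragment above, and the numeric variables are assumed to be non-negative integers as in the defaults.

```python
import re

# Allowed values per variable, copied from the comments in the Makefile fragment.
ALLOWED = {
    "INITIAL_EDITS": {"no", "txt", "regex", "both"},
    "STRING_EDITS": {"no", "txt", "regex", "both"},
    "FINAL_STRING_EDITS": {"no", "txt", "regex", "both"},
    "WORD_REPLACEMENTS": {"yes", "no"},
}
# Assumption: these are plain non-negative integers, as in the default values.
NUMERIC = {"DEFAULT_WEIGHT", "EDIT_DISTANCE", "STRING_REGEX_EDIT_DISTANCE"}

ASSIGNMENT = re.compile(r"^([A-Z_]+)\s*=\s*(\S+)\s*$")


def read_settings(text):
    """Collect VAR=value assignments, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = ASSIGNMENT.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_settings(settings):
    """Return a list of problems; an empty list means all known values look valid."""
    problems = []
    for name, value in settings.items():
        if name in ALLOWED and value not in ALLOWED[name]:
            problems.append(f"{name}={value}: expected one of {sorted(ALLOWED[name])}")
        elif name in NUMERIC and not value.isdigit():
            problems.append(f"{name}={value}: expected a non-negative integer")
    return problems
```

Running check_settings(read_settings(...)) on the contents of the file returns an empty list when every known variable has one of the documented values.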
The different options are described above in the comments. In the following discussion only the relevant options are listed. We’ll start with a minimal error model:
DEFAULT_WEIGHT=10
EDIT_DISTANCE=2
INITIAL_EDITS=no
STRING_EDITS=no
FINAL_STRING_EDITS=no
WORD_REPLACEMENTS=no
That is, the error model contains only a Levenshtein edit distance 2 error model with no additional components. It can be illustrated like this (the multiplication factor 2 is taken from the Makefile.am variable EDIT_DISTANCE, and .#. marks the beginning and end of the word):
(Strictly speaking, the error model could have been even simpler, by specifying an edit distance of one. But that will in most cases produce a very bad speller, so we stick to the default edit distance of 2.)
The file used to specify the letters of the error model is:
tools/spellcheckers/editdist.default.txt
In that file you specify the whole alphabet used for the error model (that is, all and only the letters you want to be used when generating correction suggestions). The default weight for each modification of the input misspelling is specified in the Makefile.am variable:
DEFAULT_WEIGHT=10
That is, every letter change is given a default weight of 10, in addition to whatever weight is already present, e.g. from the corpus (see below). One can change this default for individual letters in the alphabet in the editdist.default.txt file (which will then change the weight for all pairs involving that letter), or for specific transitions:
### Inclusions: this is the real alphabet definition:
a
á 5
b
c
č 6
### Transition pairs + weight - section separator:
@@
### Transition pair specifications:
a á 4
á a 4
In the above fragment, the letters a, b and c will have a default weight of 10 for all changes involving these letters, whereas changes involving á and č will have a non-default weight as specified (5 and 6, respectively). In addition, the change from a to á (and the other way around) is given a weight of 4.
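How the three levels of weights relate can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code the build uses: the function names are hypothetical, lines starting with # are assumed to be comments (as in the fragment), and when both letters of a pair carry their own weight the sketch arbitrarily takes the lower one, since the text does not say how that case is resolved.

```python
def parse_editdist(text):
    """Parse an editdist-style file into (alphabet, letter_overrides, pair_weights)."""
    alphabet = []            # every letter listed before the @@ separator
    letter_overrides = {}    # letter -> weight, only for letters with an explicit weight
    pair_weights = {}        # (source, target) -> weight, from the section after @@
    in_pairs = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue         # assumption: '#' starts a comment line
        if line == "@@":
            in_pairs = True  # section separator
            continue
        fields = line.split()
        if in_pairs:
            source, target, weight = fields
            pair_weights[(source, target)] = float(weight)
        else:
            alphabet.append(fields[0])
            if len(fields) > 1:
                letter_overrides[fields[0]] = float(fields[1])
    return alphabet, letter_overrides, pair_weights


def substitution_weight(source, target, letter_overrides, pair_weights, default_weight=10.0):
    """Weight for changing source into target: pair > letter > default."""
    if (source, target) in pair_weights:
        return pair_weights[(source, target)]
    explicit = [letter_overrides[c] for c in (source, target) if c in letter_overrides]
    if explicit:
        # Assumption: if both letters have their own weight, use the lower one.
        return min(explicit)
    return default_weight
```

With the fragment above as input, the change a to á resolves to 4, a to b to the default 10, and b to á to 5.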
The STRING_EDITS variable governs whether longer stretches than single characters (i.e. strings) can be changed in one editing operation. It has four possible values:
* no - no STRING_EDITS operations
* txt - STRING_EDITS taken from a txt file
* regex - STRING_EDITS taken from a regex file
* both - STRING_EDITS taken from both a txt and a regex file
Using a txt file as the input file for STRING_EDITS operations, you edit a very simple data structure:
gi:giija -2
riikka:rihká -2
rg:rgg -2
rgg:rg -2
The format is: the misspelled string, a colon, the replacement string, and then a weight separated by whitespace.
The intended use is to replace sequences of characters that typically get spelled wrongly with their correct counterpart, such that the expected suggestions appear on top or among the top 5. This can be useful also in cases where the actual editing distance between input and output is only one, e.g. when the error is part of a regular but context-restricted pattern.
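The line format shown in the example above is simple enough to read with a few lines of Python. The sketch below is an illustration only: parse_string_pairs is a hypothetical name, the separator before the weight is assumed to be any whitespace, and skipping lines that start with # is an assumption rather than documented behaviour.

```python
def parse_string_pairs(text):
    """Read 'wrong:right weight' lines into (wrong, right, weight) tuples."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are ignored
        pair, weight = line.rsplit(None, 1)   # the weight is the last field
        wrong, right = pair.split(":", 1)     # split on the first colon only
        pairs.append((wrong, right, float(weight)))
    return pairs


sample = """gi:giija -2
riikka:rihká -2
rg:rgg -2
rgg:rg -2
"""
string_pairs = parse_string_pairs(sample)
```

A listing like this makes it easy to spot duplicates or pairs that undo each other (such as rg:rgg and rgg:rg above, which is intentional here) before the file is compiled.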
The filename for this file is: strings.default.txt. The default part can be replaced with names for alternative writing systems or orthographies, to be used in spellers for those writing systems or orthographies.
The string pairs in this file are compiled in as an fst parallel to the Levenshtein edit distance model, and the editing distance variable is applied to both. That is, with the following setup:
EDIT_DISTANCE=2
STRING_EDITS=txt
we get an error model that can be illustrated as follows:
EDIT_DISTANCE=2 means that one can correct up to two errors in the input word, each of which can be either a regular Levenshtein operation or a string replacement operation.
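The effect of counting a string replacement as a single operation can be illustrated with a small dynamic-programming sketch. It is not how the error model is compiled (that is done as a weighted fst); it only counts operations, ignores weights, and assumes that operations do not overlap. The function name min_operations is hypothetical.

```python
def min_operations(source, target, string_pairs=()):
    """Fewest operations turning source into target, where one operation is a
    single-letter insert, delete or substitute, or one listed string replacement."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:
                options.append(dist[i - 1][j] + 1)   # delete one letter
            if j > 0:
                options.append(dist[i][j - 1] + 1)   # insert one letter
            if i > 0 and j > 0:
                same = source[i - 1] == target[j - 1]
                options.append(dist[i - 1][j - 1] + (0 if same else 1))  # keep or substitute
            for wrong, right in string_pairs:
                if not (wrong or right):
                    continue                          # skip degenerate empty pairs
                if source[:i].endswith(wrong) and target[:j].endswith(right):
                    # the whole string replacement counts as one operation
                    options.append(dist[i - len(wrong)][j - len(right)] + 1)
            dist[i][j] = min(options)
    return dist[-1][-1]


pairs = [("riikka", "rihká"), ("gi", "giija")]
plain = min_operations("riikka", "rihká")           # letter edits only
with_strings = min_operations("riikka", "rihká", pairs)
```

In this sketch the correction riikka to rihká needs three single-letter edits and is therefore out of reach with EDIT_DISTANCE=2, but it costs only one operation once the pair is listed in the strings file.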
The file for the regex string editing model is: strings.default.regex. The content of that file is a standard Xerox-style regular expression, with an additional HFST weight specification:
{øø} -> {öö}::0 ,
ø -> {ö}::0 ;
With the Makefile.am variables set as follows:
EDIT_DISTANCE=2
STRING_EDITS=regex
STRING_REGEX_EDIT_DISTANCE=2
we get an error model that looks like:
The variable STRING_REGEX_EDIT_DISTANCE regulates how many times the regex file is applied, on top of the EDIT_DISTANCE variable. With the values specified above, you can have four changes (2 * 2) applied to the input word, as long as all changes are covered by the strings.default.regex error model.
When STRING_EDITS is set to both, both the txt and regex files are included. With the following settings:
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
we get the following error model:
Beware that when using both the txt and the regex strings extensions to the Levenshtein model, there is a risk that the total error model becomes too large and powerful. This will be noticeable through sluggish suggestion speed. To avoid this issue, make sure you only include strings and string patterns that are frequent and have a good effect on suggestion quality. Also have a look at the error model file size.
This part of the error model is meant to cover errors in suffixes. It comes in addition to the previous Levenshtein + strings error model, which means that with EDIT_DISTANCE=2, you get two edit operations (Levenshtein or string) plus one suffix operation. This will normally not be a problem since the changes are restricted to the final parts of the word, and thus the search space for the error model does not increase very much.
The possible values for this variable are the same as for STRING_EDITS:
* no - no FINAL_STRING_EDITS operations
* txt - FINAL_STRING_EDITS taken from a txt file
* regex - FINAL_STRING_EDITS taken from a regex file
* both - FINAL_STRING_EDITS taken from both a txt and a regex file
Each of these values has the same meaning and consequence as for STRING_EDITS. The files are named final_strings.default.*.
Suffix edits from a txt file:
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=txt
Suffix edits from a regex file:
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=regex
Suffix edits from both files:
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
The same warning applies here as for STRING_EDITS: if you use both the txt and the regex files, make sure to test for speed and size issues.
NB! This is an experimental feature, and is not guaranteed to work as intended.
The purpose of this variable is to allow for special treatment of the initial letter(s) of the misspellings. This has a huge price, though, in terms of search space and thus speed of the speller. If enabled, consider reducing the editing distance to one, and compensate with more targeted additions in the strings and final_strings files.
Also, as seen below, these edit operations come in addition to the regular Levenshtein model (and final_strings operations), which means that the effective editing distance of an error model with INITIAL_EDITS on, EDIT_DISTANCE=2 and FINAL_STRING_EDITS enabled is four. That is a very powerful model, and one that is likely to be way too slow. Reducing EDIT_DISTANCE to 1 will substantially limit the error model, and thus improve suggestion speed.
The possible values for the INITIAL_EDITS variable are:
* no - no INITIAL_EDITS operations
* txt - INITIAL_EDITS taken from a txt file
* regex - INITIAL_EDITS taken from a regex file
* both - INITIAL_EDITS taken from both a txt and a regex file
Each of these values has the same meaning and consequence as for STRING_EDITS. The files to edit are initial_letters.default.*.
Initial-letter edits from a txt file:
EDIT_DISTANCE=2
INITIAL_EDITS=txt
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
Initial-letter edits from a regex file:
EDIT_DISTANCE=2
INITIAL_EDITS=regex
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
Initial-letter edits from both files:
EDIT_DISTANCE=2
INITIAL_EDITS=both
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
Enabling WORD_REPLACEMENTS might not be a bad idea. It does not really add to the complexity of the error model, but it allows targeted promotion of individual words on the suggestion list, namely words with known and frequent misspellings. To that end you can add misspelled words and their corrections to the file words.default.txt, in the following format:
oahppiin:ohppiin -10
váiloje:váilo -10
maŋge:mange -10
The format is: the misspelled word, a colon, the correction, and then a weight separated by whitespace.
The possible values for the WORD_REPLACEMENTS variable are:
* no - no WORD_REPLACEMENTS operations
* yes - WORD_REPLACEMENTS enabled
Expanding on the settings fragment used throughout, we get the following:
EDIT_DISTANCE=2
INITIAL_EDITS=both
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
WORD_REPLACEMENTS=yes
When enabled, the file is compiled into an fst that is applied outside the rest of the error model:
As discussed below, the settings above are not a good idea. The maximum editing distance is actually six (1 + (2*2) + 1 = 6, that is, one initial edit, EDIT_DISTANCE times STRING_REGEX_EDIT_DISTANCE regular edits, and one final edit), which is way too much. But it serves to illustrate the use of the settings in Makefile.am.
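The upper bounds quoted in this document (two for the minimal model, four with the regex strings file, six with everything enabled) follow one pattern, which can be written down as a small helper. This is a sketch derived from the figures given in the text, not from the build scripts; the function name is hypothetical, and the handling of INITIAL_EDITS and FINAL_STRING_EDITS as one extra operation each is taken from the descriptions above.

```python
def max_edit_operations(settings):
    """Upper bound on edit operations implied by a dict of Makefile settings."""
    edit_distance = int(settings.get("EDIT_DISTANCE", 2))
    # The regex strings file multiplies the edit distance; txt-only does not.
    regex_factor = 1
    if settings.get("STRING_EDITS", "no") in ("regex", "both"):
        regex_factor = int(settings.get("STRING_REGEX_EDIT_DISTANCE", 2))
    # Initial and final edits each add one operation outside the core model.
    initial = 1 if settings.get("INITIAL_EDITS", "no") != "no" else 0
    final = 1 if settings.get("FINAL_STRING_EDITS", "no") != "no" else 0
    return initial + edit_distance * regex_factor + final


everything = {
    "EDIT_DISTANCE": "2",
    "INITIAL_EDITS": "both",
    "STRING_EDITS": "both",
    "STRING_REGEX_EDIT_DISTANCE": "2",
    "FINAL_STRING_EDITS": "both",
    "WORD_REPLACEMENTS": "yes",
}
bound = max_edit_operations(everything)
```

A quick calculation like this before compiling gives an early warning that a combination of settings is likely to produce a slow speller.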
It is possible to add a corpus of (preferably) correctly spelled text. The largest corpus in use here is for North Sámi, 3.3M words of running text. When compiling the spellers, we get three values (here, the example is from South Sámi):
*** Weight for most frequent corpus wordform: jïh 3.220384
*** Weight for least frequent corpus wordform: BÅETIEH 11.495081
*** Default weight for out-of-corpus wordforms: 12.495081
Each suggested word gets a penalty derived from (the logarithmic value of) its frequency in the speller corpus, with the values of the most and least common words as lower and upper boundaries, and an even higher value for words outside the speller corpus. These values are added to the penalty points for going from error to suggestion.
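To make the idea of a logarithmic frequency penalty concrete, here is a minimal sketch. It rests on two assumptions that are not confirmed by the build scripts: that the weight is the negative natural logarithm of the relative frequency, and that the out-of-corpus weight is the highest in-corpus weight plus one (which mirrors the gap of exactly 1 in the sample output above). The function name is hypothetical.

```python
import math
from collections import Counter


def corpus_weights(tokens):
    """Return (per-word weights, out-of-corpus weight) for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Assumption: weight = -ln(relative frequency); frequent words get low weights.
    weights = {word: -math.log(count / total) for word, count in counts.items()}
    # Assumption: unseen words are penalised one point more than the rarest seen word.
    out_of_corpus = max(weights.values()) + 1.0
    return weights, out_of_corpus


# Toy corpus, invented for illustration only.
weights, unseen = corpus_weights(["jïh", "jïh", "jïh", "båetieh"])
```

The most frequent word ends up with the lowest penalty, the rarest with the highest, and any word missing from the corpus is ranked below both.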
The corpus weight of each word can be looked up as follows:
hfst-lookup tools/spellcheckers/analyser-desktopspeller-gt-norm.hfst
In case of several values, the relevant value is the lowest one.
Divvunspell adds penalty points to letter positions in the word, in a camel fashion:
The file governing this is mod.rs in the divvunspell repository.
This function may be turned off. Here are two divvunspell commands. The first includes handling of capital and small letters, and it includes the position-sensitive weighting. In the second command, the flag --no-case-handling turns off both of these features.
echo väsi|divvunspell suggest -a fit.zhfst
echo väsi|divvunspell suggest --no-case-handling -a fit.zhfst
For each correction suggestion, its value is calculated as the value of the suggestion mechanisms, as shown above, plus the corpus weight of the target form plus the position-dependent value.
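The sum described above can be sketched as follows. The class and field names are hypothetical, the numbers in the usage example are invented, and ranking the lowest total first is an assumption based on the weights being described as penalty points.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    form: str
    error_model_weight: float   # cost of getting from the misspelling to this form
    corpus_weight: float        # frequency-based penalty of the target form
    position_weight: float      # position-dependent penalty added by divvunspell

    @property
    def total(self):
        # The three components are simply added together.
        return self.error_model_weight + self.corpus_weight + self.position_weight


def rank(suggestions):
    """Order suggestions by total penalty; assumption: lowest total comes first."""
    return sorted(suggestions, key=lambda s: s.total)


# Invented figures, for illustration only.
candidates = [
    Suggestion("first", error_model_weight=20.0, corpus_weight=3.2, position_weight=0.0),
    Suggestion("second", error_model_weight=10.0, corpus_weight=11.5, position_weight=1.0),
    Suggestion("third", error_model_weight=10.0, corpus_weight=3.2, position_weight=2.0),
]
ordered = [s.form for s in rank(candidates)]
```

The example shows that a cheap edit to a rare word can lose to an equally cheap edit to a frequent word, which is the point of adding the corpus weight.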
DO NOT ENABLE EVERYTHING! That will very, very likely make the error model size explode, and make the speller so slow that it can’t be used. Exactly which files and what features are needed will vary from language to language, and has to be tested on a case by case basis.
The goal of a good speller is to always suggest the correct thing, or something sensible and close to the correct thing, but do not overdo this: it is better to suggest nothing than to need several seconds to produce a suggestion.