This page first documents the two approaches to hyphenation (the two tools), and then their integration into word processor software. Warning: this is work in progress, so the page also documents a cumbersome workaround to use until working solutions are available.
For each language, there are (or should be) two hyphenators: the pattern hyphenator and the fst-based hyphenator.
The pattern hyphenator consists of patterns generated by patgen, which takes a
large list of pre-hyphenated words as input. The resulting pattern files are used
in TeX and LibreOffice.
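
To illustrate how such patterns are applied to a word, here is a minimal Python sketch of Liang-style pattern matching, the scheme used by TeX. The function names and the two patterns are made up for illustration; they are not taken from any generated pattern file, and real implementations differ in detail.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as '2na.' into its letters and inter-letter scores."""
    letters = re.sub(r"\d", "", pattern)
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)  # score for the gap before letters[pos]
        else:
            pos += 1
    return letters, scores

def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return `word` with '-' at every gap whose highest score is odd."""
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    best = [0] * (len(padded) + 1)     # best[i]: score of the gap before padded[i]
    for letters, scores in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, score in enumerate(scores):
                best[start + offset] = max(best[start + offset], score)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # Only gaps at least left_min from the start and right_min from the end.
    for k in range(left_min, len(word) - right_min + 1):
        if best[k + 1] % 2 == 1:  # odd allows a break, even forbids it
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)

print(hyphenate("banana", ["1na"]))          # ba-na-na
print(hyphenate("banana", ["1na", "2na."]))  # ba-nana
```

The second call shows how a higher even score from a more specific pattern overrides a break that a more general pattern would allow.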
The hyphenated word list is generated from the lexical hyphenation fst. The size
of the generated word list can be adjusted in
tools/hyphenators/Makefile.modification-pattern.am by changing the variable
PATTERN_WORD_LIST (default: 15 000 words). A larger list gives better
hyphenation patterns, but takes longer to build.
More details here.
The fst-based hyphenator is in
The compiled fst-based hyphenator itself is
hyphenator-gt-desc.hfst. It contains both lexicon-based hyphenation (full morphology) and generic, syllable-based rules (the pattern hyphenation above, used for unknown words).
The file is composed of these files:
where the former is a full analyser and the latter contains syllable-based rules with added weights.
The linguistic source code for the syllabification rules is in `lang-xxx/src/hyphenation`. The script is `hyphenation.xfscript`, written in the xfst formalism.
Example usage (`-b 0` gives only the best weight):
hfst-tokenise tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst |\
hfst-lookup -b 0 tools/hyphenators/hyphenator-gt-desc.hfstol
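
The output of this pipeline can be post-processed in a script, for instance when the command is run without `-b 0` and several weighted candidates are printed per token. The following is a minimal Python sketch, assuming the default hfst-lookup output of three tab-separated columns (input, output, weight) with blank lines between tokens. The function name `best_analyses` and the sample lines, including the `-` hyphenation marker, are hypothetical and only illustrate the parsing.

```python
def best_analyses(lookup_output):
    """Map each input token to its lowest-weight output string."""
    best = {}
    for line in lookup_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank separator lines and anything unexpected
        token, output, weight_text = fields
        try:
            weight = float(weight_text)  # also accepts "inf"
        except ValueError:
            continue  # not a weight column, ignore the line
        if token not in best or weight < best[token][1]:
            best[token] = (output, weight)
    return {token: output for token, (output, _weight) in best.items()}

# Hypothetical sample in the assumed three-column format.
sample = (
    "giella\tgiel-la\t0.000000\n"
    "giella\tgi-el-la\t10.000000\n"
    "\n"
)
print(best_analyses(sample))  # {'giella': 'giel-la'}
```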