GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
This page first documents the two approaches to hyphenation (the two tools), and then their integration in word processor software. Warning: this is work in progress, so the page also documents a cumbersome workaround to use until working solutions are available.
For each language, there are (or should be) two hyphenators: the pattern hyphenator and the fst-based hyphenator.
For compilation: ./compile --enable-pattern-hyphenators
The pattern hyphenation is made of patterns generated by `patgen`, which takes a large list of pre-hyphenated words as input. The resulting pattern files are used in TeX and LibreOffice.
The hyphenated word list is generated from the lexical hyphenation fst. One can adjust the size of the generated word list in `tools/hyphenators/Makefile.modification-pattern.am` by changing the variable `PATTERN_WORD_LIST` (default is 15 000 words). The larger the list, the better the quality of the hyphenation patterns, but the longer it takes to build.
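Before changing the word-list size, it can help to see where the variable is currently set. The following is a minimal sketch, assuming it is run from the language directory (`lang-xxx`); the helper name `show_pattern_word_list` is hypothetical and not part of the GiellaLT infrastructure, and the exact assignment syntax in the Makefile fragment is not assumed, only that the variable name appears on the line.

```shell
#!/bin/sh
# Hypothetical helper: print every line that mentions PATTERN_WORD_LIST
# in the given Makefile fragment, prefixed with its line number.
# The default path is the one named in the text, relative to lang-xxx.
show_pattern_word_list() {
    makefile="${1:-tools/hyphenators/Makefile.modification-pattern.am}"
    if [ ! -f "$makefile" ]; then
        echo "not found: $makefile" >&2
        return 1
    fi
    grep -n 'PATTERN_WORD_LIST' "$makefile"
}
```

Calling `show_pattern_word_list` without arguments inspects the default file; after editing the value there, the patterns have to be rebuilt for the change to take effect.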
For compilation: ./compile --enable-fst-hyphenator
The fst-based hyphenator is in `lang-xxx/tools/hyphenators/`.
The compiled fst-based hyphenator itself is `hyphenator-gt-desc.hfst`. It contains both lexicon-based hyphenation (full morphology) and generic, syllable-based rules (the pattern hyphenation above, used for unknown words).
The file is composed of these files:
hyphenator-gt-desc-no_fallback.hfst
hyphenator-rules-desc-weighted.hfst
where the former is a full analyser and the latter contains syllable-based rules with added weights.
The linguistic source code for the syllabification rules is in `lang-xxx/src/hyphenation`. The script is `hyphenation.xfscript`, written in the `xfst` formalism.
Usage (where `-b 0` keeps only the analyses with the best weight):
... |\
hfst-tokenise tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst |\
hfst-lookup -b 0 tools/hyphenators/hyphenator-gt-desc.hfstol
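The elided first stage of the pipeline (`...`) is whatever produces the text to hyphenate. As a minimal sketch, the two remaining stages can be wrapped in a function that reads text on standard input. The function name `hyphenate_text` is hypothetical, and the paths assume the command is run from `lang-xxx` after a build with the fst hyphenator enabled.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline shown above.
# Reads text on stdin; paths are relative to the lang-xxx directory.
hyphenate_text() {
    tokeniser=tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst
    hyphenator=tools/hyphenators/hyphenator-gt-desc.hfstol
    # Fail early if the HFST command-line tools are not installed.
    for tool in hfst-tokenise hfst-lookup; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 1
        }
    done
    # Fail early if the compiled transducers have not been built.
    for file in "$tokeniser" "$hyphenator"; do
        [ -f "$file" ] || {
            echo "missing file: $file" >&2
            return 1
        }
    done
    # -b 0 keeps only the analyses with the best weight.
    hfst-tokenise "$tokeniser" | hfst-lookup -b 0 "$hyphenator"
}
```

With the function defined, text can be piped in, for example `echo "some text" | hyphenate_text`. The early checks are only a convenience: they turn a missing tool or an unbuilt hyphenator into a short error message instead of a failure in the middle of the pipeline.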