GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.


Hyphenators

This page first documents the two approaches to hyphenation (the two tools), and then their integration in word processing software. Warning: the integration is work in progress, so the page also documents a cumbersome workaround to use until working solutions are in place.

The hyphenation tools

For each language, there are (or should be) two hyphenators: the pattern hyphenator and the FST-based hyphenator.

Pattern hyphenation

For compilation: ./configure --enable-pattern-hyphenators

The pattern hyphenator consists of patterns generated by patgen, which takes a large list of pre-hyphenated words as input. The resulting pattern files are used in TeX and LibreOffice.
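To illustrate how such patterns work once they are generated, here is a minimal Python sketch of applying TeX-style (Liang) patterns to a word. In this pattern format, digits sit between letters, a dot marks a word boundary, the highest digit at each position wins, and an odd value allows a break while an even value forbids it. The function names, the toy patterns and the default minimum fragment lengths are assumptions made for the example; they are not taken from any GiellaLT pattern file.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'a1n' into its letters ('an') and the
    values around them ([0, 1, 0])."""
    letters = []
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)      # digit applies to the current gap
        else:
            letters.append(ch)
            values.append(0)          # open a new gap after this letter
    return "".join(letters), values


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Apply the patterns to one word and return it with '-' at the
    permitted break points. left_min/right_min are assumed defaults."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)  # scores[i]: gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values is None:
                continue
            for offset, value in enumerate(values):
                pos = start + offset
                scores[pos] = max(scores[pos], value)
    pieces = []
    last = 0
    # A break before word[k] corresponds to scores[k + 1] (leading dot).
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


# Toy patterns, invented for illustration only.
print(hyphenate("banana", ["a1n"]))          # break allowed after every 'a' before 'n'
print(hyphenate("banana", ["a1n", "2na."]))  # '2na.' forbids the word-final break
```

The second call shows why patgen emits both permitting and inhibiting patterns: the even value in the hypothetical pattern 2na. overrides the odd value from a1n at the same position.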

The hyphenated word list is generated from the lexical hyphenation fst. One can adjust the size of the generated word list in tools/hyphenators/Makefile.modification-pattern.am, by changing the variable PATTERN_WORD_LIST (default is 15 000 words). The larger the list, the better the quality of the hyphenation patterns, but the longer it takes to build.

More details here.

FST hyphenation

For compilation: ./configure --enable-fst-hyphenator

The fst-based hyphenator is in lang-xxx/tools/hyphenators/.

The compiled fst-based hyphenator itself is hyphenator-gt-desc.hfst. It contains both lexicon-based hyphenation (full morphology) and generic, syllable-based rules (the pattern hyphenation above, used for unknown words).

The file is composed of these files:

hyphenator-gt-desc-no_fallback.hfst
hyphenator-rules-desc-weighted.hfst

where the former is a full analyser and the latter contains syllable-based rules, with added weights.
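The interplay between the two components can be pictured with a small Python sketch. It assumes that the added weights exist to make rule-based candidates lose against lexicon-based ones whenever the lexicon knows the word (in HFST, a lower weight is better). The lexicon entries, the penalty value and the naive syllable rule below are all hypothetical and only stand in for the real transducers.

```python
VOWELS = set("aeiouy")


def rule_hyphenate(word):
    """Hypothetical stand-in for the syllable rules: break before a
    consonant that is directly followed by a vowel."""
    out = []
    for i, ch in enumerate(word):
        if 0 < i < len(word) - 1 and ch not in VOWELS and word[i + 1] in VOWELS:
            out.append("-")
        out.append(ch)
    return "".join(out)


def hyphenate_with_fallback(word, lexicon, rule_penalty=10.0):
    """Collect weighted candidates and return the one with the lowest
    weight. rule_penalty is an assumed value, not a GiellaLT setting."""
    candidates = []
    if word in lexicon:
        candidates.append((0.0, lexicon[word]))            # lexicon-based
    candidates.append((rule_penalty, rule_hyphenate(word)))  # fallback
    return min(candidates)[1]


# Invented lexicon entry: a compound whose boundary the rules would miss.
lexicon = {"sunrise": "sun-rise"}
print(hyphenate_with_fallback("sunrise", lexicon))  # known word: lexicon wins
print(hyphenate_with_fallback("banana", lexicon))   # unknown word: rules apply
```

In the compiled hyphenator this selection is not done in a script; the sketch only mirrors the idea that both sources propose candidates and the weights decide between them.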

The linguistic source code for the syllabification rules is in lang-xxx/src/hyphenation. The script is hyphenation.xfscript, written in the xfst formalism.

Usage (where -b 0 keeps only the results with the best weight):

... |\
hfst-tokenise tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst |\
hfst-lookup -b 0 tools/hyphenators/hyphenator-gt-desc.hfstol
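If the lookup is run without -b 0, a script can do the same selection afterwards. The Python sketch below assumes the default hfst-lookup output of one tab-separated line per result (input, output, weight), with blank lines between inputs. The sample text, including the hyphenated forms and weights, is invented for the example and is not real hyphenator output.

```python
def best_outputs(lookup_text):
    """Return {input: output} keeping only the lowest-weight result
    for each input found in hfst-lookup style text."""
    best = {}
    for line in lookup_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue                  # skip blank or malformed lines
        surface, output, weight_text = fields
        weight = float(weight_text)   # also accepts 'inf'
        if surface not in best or weight < best[surface][0]:
            best[surface] = (weight, output)
    return {surface: output for surface, (weight, output) in best.items()}


# Invented sample in the assumed three-column format.
sample = (
    "giella\tgi-el-la\t10.000000\n"
    "giella\tgiel-la\t0.000000\n"
    "\n"
    "banana\tba-na-na\t10.000000\n"
    "\n"
)
print(best_outputs(sample))
```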

Integrating hyphenators in software

LibreOffice/TeX hyphenation

Ad hoc solutions while waiting for hyphenation in word processors


Very old (2007) meetings