GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.


TeX / LibreOffice Hyphenators

The TeX / LibreOffice hyphenation system uses trigram patterns to identify the hyphenation points. The patterns are extracted automatically from a given list of hyphenated words. In the GiellaLT infrastructure, this list is generated from an FST with all hyphenation points in place. More information about this FST can be found on the FST hyphenation page.

The build requires patgen to be installed on your system. patgen is part of the TeX Live distribution, which can be installed in several ways.

If patgen is not found by ./configure, search for it as follows:

find /usr/local -name 'patgen'  
/usr/local/Cellar/texlive/20220321_4/bin/patgen
/usr/local/texlive/2022/bin/universal-darwin/patgen

find /opt/homebrew -name 'patgen'   
/opt/homebrew/bin/patgen
/opt/homebrew/Cellar/texlive/20220321_4/bin/patgen

then specify the preferred path to ./configure as follows:

./configure --enable-pattern-hyphenators --with-patgen=/usr/local/texlive/2022/bin/universal-darwin/

(It may be enough to open a new terminal window after installing TeX Live.)
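
As a quick check before searching the file system, the following sketch asks the shell where patgen is and prints a matching ./configure line. The function name patgen_configure_cmd is made up for this illustration, and the sketch assumes, as in the ./configure example above, that --with-patgen expects the directory containing the binary.

```shell
# Print a ./configure line for whichever patgen is first on PATH.
# Assumption: --with-patgen takes the directory that holds the binary.
patgen_configure_cmd() {
    patgen_path=$(command -v patgen) || {
        echo "patgen not found on PATH" >&2
        return 1
    }
    printf './configure --enable-pattern-hyphenators --with-patgen=%s/\n' \
        "$(dirname "$patgen_path")"
}
```

Running patgen_configure_cmd in a fresh terminal after installing TeX Live shows whether the new PATH entry has been picked up. If it reports that patgen is missing, fall back to the find commands above.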

Continue as below.

Configuration and Build

./configure --enable-pattern-hyphenators
make

This will create the following files:

XX_hyph.tex # For use with TeX processors
hyph_XX.dic # For use with LibreOffice and compatible systems
XX.pat      # Existing hyphenation patterns, if any

XX is the language code, which varies from language to language: the ISO 639-1 code if one exists, otherwise the ISO 639-3 code.
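
To make the naming scheme concrete, the sketch below expands the three file names for a given language code. The helper name expected_hyph_files is hypothetical, and the codes in the usage note (se for North Sámi, which has an ISO 639-1 code, and sma for South Sámi, which has only an ISO 639-3 code) are chosen purely as examples.

```shell
# List the file names the build is expected to produce for one language code.
expected_hyph_files() {
    code=$1
    printf '%s\n' "${code}_hyph.tex" "hyph_${code}.dic" "${code}.pat"
}
```

For example, expected_hyph_files se prints se_hyph.tex, hyph_se.dic and se.pat, while expected_hyph_files sma prints sma_hyph.tex, hyph_sma.dic and sma.pat.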

The *.pat file

This file is empty the first time you build the hyphenation files. On subsequent runs, an existing *.tex is used as input. Running make several times can therefore improve the hyphenation files, as the patterns are adjusted to accommodate the new data generated by each run.

For this reason, the *.dic and *.tex files should be stored under version control.
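
The repeated-build workflow could be scripted roughly as follows. This is a minimal sketch: the function name rebuild_patterns and the default of three passes are arbitrary choices made for illustration, not something the build system prescribes.

```shell
# Run the build several times so each pass can refine the patterns.
# MAKE can be overridden; it defaults to plain "make".
rebuild_patterns() {
    passes=${1:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "pass $i of $passes"
        "${MAKE:-make}" || return 1
        i=$((i + 1))
    done
}

# Afterwards, record the result (XX stands for the language code):
#   git add XX_hyph.tex hyph_XX.dic
#   git commit -m "Update hyphenation patterns"
```

Stopping at the first failing pass keeps a broken build from being repeated, and committing after a successful run preserves the improved patterns for the next one.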

The *.tra file

Contains mappings from upper to lower case. The default file is usually fine, but check it to make sure the mapping is correct for your language. The file can be empty if no mapping beyond the default ASCII is needed. Some further documentation can be found at the following places:

Makefile variables

The file tools/hyphenation/Makefile.modification-pattern.am contains some variables that can be used to fine-tune the building process. They are (with default values provided):

PATTERN_WORD_LIST=15000
HYPH_START_FINISH="1 2"
PATR_START_FINISH="2 4"
GOOD_BAD_THRESHLD="1 1 1"
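
Besides editing the file, make in general lets a variable assigned in a makefile be overridden on the command line. The sketch below demonstrates this with a throwaway Makefile that is unrelated to the GiellaLT build. The function name show_override and the value 20000 are illustrative only, and it is an assumption that the hyphenation Makefile responds to such overrides in the same way.

```shell
# Demonstrate command-line overrides with a throwaway Makefile
# (not the GiellaLT one) that only echoes a variable.
show_override() {
    demo=$(mktemp -d)
    printf 'PATTERN_WORD_LIST=15000\nshow:\n\t@echo $(PATTERN_WORD_LIST)\n' \
        > "$demo/Makefile"
    make -s -f "$demo/Makefile" show                          # value from the file
    make -s -f "$demo/Makefile" show PATTERN_WORD_LIST=20000  # command-line override
    rm -rf "$demo"
}
```

Calling show_override prints 15000 followed by 20000. If the override works for the real build, a one-off experiment can be run without touching the file under version control.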

Explanations:

patgen errors

Bad character

Bad representation