TeX / LibreOffice Hyphenators
The TeX / LibreOffice hyphenation system uses trigram patterns to identify the hyphenation points. The patterns are extracted automatically from a given list of hyphenated words. In the GiellaLT infrastructure, this list is generated from an FST with all hyphenation points in place. More information about this FST can be found on the FST hyphenation page.
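As a rough illustration of how such patterns mark hyphenation points, the following Python sketch applies Liang-style patterns the way TeX does: digits in a pattern are scores for the gaps between letters, the highest score per gap wins, and an odd score allows a hyphen. The patterns are the well-known English ones for the word "hyphenation", not patterns produced by the GiellaLT build. The function names and the left_min / right_min defaults are assumptions made for this example.

```python
import re

# Well-known English example patterns (not GiellaLT output).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and per-gap scores."""
    letters = re.sub(r"[0-9]", "", pattern)
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch in "0123456789":
            scores[pos] = int(ch)  # score for the gap before letter number `pos`
        else:
            pos += 1
    return letters, scores


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return `word` with '-' inserted at every gap that ends up with an odd score."""
    parsed = [parse_pattern(p) for p in patterns]
    dotted = "." + word.lower() + "."  # dots mark the word boundaries
    gaps = [0] * (len(dotted) + 1)     # gaps[k] is the gap before dotted[k]
    for start in range(len(dotted)):
        for letters, scores in parsed:
            if dotted.startswith(letters, start):
                for offset, value in enumerate(scores):
                    gaps[start + offset] = max(gaps[start + offset], value)
    pieces, last = [], 0
    # Only consider splits that leave at least left_min / right_min letters.
    for split in range(left_min, len(word) - right_min + 1):
        if gaps[split + 1] % 2 == 1:
            pieces.append(word[last:split])
            last = split
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("hyphenation", PATTERNS))  # hy-phen-ation
```

patgen does the reverse job: it derives such patterns from a list of already hyphenated words.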
The build requires that patgen is installed on your system. patgen is
part of the TeX Live distribution, which can be installed in several ways:
- Using MacPorts:
  sudo port install texlive-bin-extra
- Using Brew:
  brew install texlive
- By installing MacTeX.
If patgen is not found by ./configure, search for it as follows:
find /usr/local -name 'patgen'
/usr/local/Cellar/texlive/20220321_4/bin/patgen
/usr/local/texlive/2022/bin/universal-darwin/patgen
find /opt/homebrew -name 'patgen'
/opt/homebrew/bin/patgen
/opt/homebrew/Cellar/texlive/20220321_4/bin/patgen
then specify the preferred path to ./configure as follows:
./configure --enable-pattern-hyphenators --with-patgen=/usr/local/texlive/2022/bin/universal-darwin/
(But it might be enough to open a new terminal window after installing texlive.)
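The search above can also be scripted. The sketch below is one possible way to do it in Python, assuming that patgen is either on PATH or somewhere below /usr/local or /opt/homebrew. The function names are hypothetical and not part of the GiellaLT infrastructure.

```python
import os
import shutil


def search_roots(roots, name="patgen"):
    """Yield executables called `name` found anywhere below the given roots."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                candidate = os.path.join(dirpath, name)
                if os.access(candidate, os.X_OK):
                    yield candidate


def find_patgen(roots=("/usr/local", "/opt/homebrew")):
    """Return a path to patgen: PATH first, then a search below `roots`."""
    on_path = shutil.which("patgen")
    if on_path:
        return on_path
    return next(search_roots(roots), None)


def configure_command(patgen_path):
    """Build the ./configure call, passing the directory that contains patgen."""
    directory = os.path.dirname(patgen_path)
    return f"./configure --enable-pattern-hyphenators --with-patgen={directory}/"
```

Calling configure_command(find_patgen()) then gives a command line of the same shape as the one shown above, provided find_patgen() did not return None.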
Continue as below.
Configuration and Build
./configure --enable-pattern-hyphenators
make
This will create the following files:
XX_hyph.tex ## For use with TeX processors
hyph_XX.dic ## For use with LibreOffice and compatible systems
XX.pat ## Existing hyphenation patterns, if any
XX is a language code, and will vary from language
to language. It will be ISO 639-1 if available, ISO 639-3 if not.
The *.pat file
This file is empty the first time you make the hyphenation files. On subsequent
runs, an existing *.tex is used as input. Thus, by running make several
times, the hyphenation files can be improved, as the patterns are adjusted to
accommodate new data generated by each run.
For this reason, the *.dic and *.tex files
should be stored under version control.
The *.tra file
This file contains mappings from upper case to lower case. The default file should be fine, but check it to make sure the mapping is correct. The file can be empty if no mapping beyond the default ASCII is needed. Some further documentation can be found at the following places:
Makefile variables
The file tools/hyphenation/Makefile.modification-pattern.am contains some
variables that can be used to fine-tune the building process. They are (with
default values provided):
PATTERN_WORD_LIST=15000
HYPH_START_FINISH="1 2"
PATR_START_FINISH="2 4"
GOOD_BAD_THRESHLD="1 1 1"
Explanations:
- PATTERN_WORD_LIST = size of the generated word list used for extracting patterns. A larger list gives better patterns, but also a longer build time.
- HYPH_START_FINISH, PATR_START_FINISH, GOOD_BAD_THRESHLD = various settings for the patgen tool, see the links to patgen documentation above.
patgen errors
“Bad character”
- diagnostics: patgen prints a problematic input string, then this message.
- solution: one of the following
  - have a look at the problematic string, and see if there are unexpected symbols. That includes checking for combining diacritics; these should be fixed in the lexicon/FST.
  - if all symbols/letters are ok, see if one of them is missing from the *.tra file - add it if that’s the case.
- example: hämit-teht-uv-sun - the letter ä is not a precomposed ä, but a + combining diacritic. Find the source in the LexC files, and correct it there. If no precomposed letter exists for a certain combination of base letter and combining diacritic, add it to the *.tra file (NB! This has not been tested!).
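Combining diacritics are hard to spot by eye. A small Python sketch like the following can help with the check. It uses the standard unicodedata module; the function names are made up for this example, and the script is not part of the build.

```python
import unicodedata


def combining_marks(text):
    """Return (index, Unicode name) for every combining mark in `text`."""
    return [
        (index, unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(text)
        if unicodedata.category(ch).startswith("M")  # Mn, Mc and Me are marks
    ]


def precompose(text):
    """Return the NFC form of `text`, which uses precomposed letters where they exist."""
    return unicodedata.normalize("NFC", text)


word = "ha\u0308mit-teht-uv-sun"  # a + COMBINING DIAERESIS, as in the example above
print(combining_marks(word))
print(combining_marks(precompose(word)))
```

Marks still reported after precompose() have no precomposed form in Unicode. Those are the cases where an entry in the *.tra file might be needed. The actual fix still belongs in the LexC source.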
“Bad representation”
- diagnostics: patgen prints a problematic input string, then this message.
- solution: one of the letters in the input string is missing from the *.tra file, and should be added
- example: Gło-wac-kan-ges - the letter ł could be missing (it is missing from the default coming from the template)
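To find such letters before running patgen, the word list can be compared with the *.tra file. The Python sketch below is only an approximation and rests on several assumptions: that the first line of the *.tra file holds settings rather than letters, that every letter known to patgen appears literally on one of the following lines, and that ASCII letters are covered by default. The function name is hypothetical.

```python
import string


def missing_from_tra(words, tra_text):
    """Return the characters used in `words` that do not occur in the *.tra text."""
    known = set(string.ascii_letters)       # assumed to be covered by default
    for line in tra_text.splitlines()[1:]:  # assumption: line 1 holds settings only
        known.update(ch for ch in line if not ch.isspace())
    used = {
        ch
        for word in words
        for ch in word
        if ch != "-" and not ch.isspace()   # hyphens mark hyphenation points
    }
    return sorted(used - known)
```

With the word list entry Gło-wac-kan-ges and a *.tra file that lacks ł, the function returns a list containing ł. Once the letter has been added to the file, the list is empty.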