GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
The TeX / LibreOffice hyphenation system uses trigram patterns to identify the hyphenation points. The patterns are extracted automatically from a given list of hyphenated words. In the GiellaLT infrastructure, this list is generated from an FST with all hyphenation points in place. More information about this FST can be found on the FST hyphenation page.
The build requires that you have patgen installed on your system. patgen is part of the TeX Live package, which can be installed in several ways:
sudo port install texlive-bin-extra
brew install texlive
If patgen is not found by ./configure, search for it as follows:
find /usr/local -name 'patgen'
/usr/local/Cellar/texlive/20220321_4/bin/patgen
/usr/local/texlive/2022/bin/universal-darwin/patgen
find /opt/homebrew -name 'patgen'
/opt/homebrew/bin/patgen
/opt/homebrew/Cellar/texlive/20220321_4/bin/patgen
Then specify the preferred path to ./configure as follows:
./configure --enable-pattern-hyphenators --with-patgen=/usr/local/texlive/2022/bin/universal-darwin/
(But it might be enough to open a new terminal window after installing texlive.)
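The search-and-configure steps above can be sketched as a small shell helper. This is only an illustration: find_patgen_dir is a hypothetical function name, not part of the GiellaLT infrastructure, and the trailing slash on the printed directory simply mirrors the --with-patgen example above.

```shell
#!/bin/sh
# Sketch: print the directory (with trailing slash) that contains patgen.
# find_patgen_dir is a hypothetical helper, not part of GiellaLT.
# Usage: find_patgen_dir [SEARCH_ROOT...]
find_patgen_dir() {
    # Prefer a patgen that is already on PATH.
    p=$(command -v patgen 2>/dev/null)
    if [ -z "$p" ]; then
        # Otherwise search the given roots, as with the find commands above.
        for root in "$@"; do
            [ -d "$root" ] || continue
            p=$(find "$root" \( -type f -o -type l \) -name patgen 2>/dev/null | head -n 1)
            [ -n "$p" ] && break
        done
    fi
    [ -n "$p" ] || return 1
    printf '%s/\n' "$(dirname "$p")"
}

# Possible use (the search roots are the ones from the examples above):
#   dir=$(find_patgen_dir /usr/local /opt/homebrew) &&
#       ./configure --enable-pattern-hyphenators --with-patgen="$dir"
```

If several TeX Live installations are present, the helper just takes the first match, so check the printed path before passing it on.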
Continue as below.
./configure --enable-pattern-hyphenators
make
This will create the following files:
XX_hyph.tex ## For use with TeX processors
hyph_XX.dic ## For use with LibreOffice and compatible systems
XX.pat ## Existing hyphenation patterns, if any
XX is a language code, and will vary from language to language. It will be ISO 639-1 if available, ISO 639-3 if not.
*.pat file
This file is empty the first time you make the hyphenation files. On subsequent runs, an existing *.tex is used as input. Thus, by running make several times, the hyphenation files can be improved, as the patterns are adjusted to accommodate new data generated by each run.
For this reason, the *.dic and *.tex files should be stored under version control.
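Repeated builds can be scripted. The following is a minimal sketch, assuming you pass it the generated files for your language; repeat_build and the BUILD_CMD variable are hypothetical names introduced here for illustration. It runs the build a given number of times and prints a checksum of the listed files after each run, so you can see whether the output is still changing before you commit it.

```shell
#!/bin/sh
# Sketch: run the build several times and print a checksum after each run.
# repeat_build and BUILD_CMD are hypothetical, not part of GiellaLT.
# Usage: repeat_build RUNS FILE...
repeat_build() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        # BUILD_CMD defaults to plain "make".
        ${BUILD_CMD:-make} || return 1
        printf 'run %s: %s\n' "$i" "$(cat "$@" | cksum)"
        i=$((i + 1))
    done
}

# Possible use, with XX replaced by your language code:
#   repeat_build 3 hyph_XX.dic XX_hyph.tex
```

The sketch does not assume that the files ever stop changing; it only reports what each run produced.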
*.tra file
Contains mappings from upper to lower case. The default file should be ok, but have a look to ensure proper mapping. The file can be empty if no mapping beyond the default ASCII is needed. Some further documentation can be found at the following places:
The file tools/hyphenation/Makefile.modification-pattern.am contains some variables that can be used to fine-tune the building process. They are (with default values provided):
PATTERN_WORD_LIST=15000
HYPH_START_FINISH="1 2"
PATR_START_FINISH="2 4"
GOOD_BAD_THRESHLD="1 1 1"
Explanations:
PATTERN_WORD_LIST = size of the generated word list used for extracting patterns. The larger the list, the better the generated patterns, but the longer the build takes.
HYPH_START_FINISH, PATR_START_FINISH, GOOD_BAD_THRESHLD = various settings for the patgen tool; see the links to patgen documentation above.
patgen errors
“Bad character”
patgen prints a problematic input string, then this message. Possible causes:
A letter is missing from the *.tra file - add it if that’s the case.
A letter looks correct but is not, as in hämit-teht-uv-sun - the letter ä is not a precomposed ä, but a + combining diacritic. Find the source in the LexC files, and correct it there. If there exists no precomposed letter for a certain combination of base letter and combining diacritic, add it to the *.tra file (NB! This has not been tested!).
“Bad representation”
patgen prints a problematic input string, then this message. A letter is probably missing from the *.tra file, and should be added.
Example: Gło-wac-kan-ges - the letter ł could be missing (it is missing from the default coming from the template).