GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
HfstTokenizer can be compiled together with OmegaT and bundled into the Mac app. Follow these instructions:

- JavaAppLauncher and jre-mac-root need to be defined in the OMEGAT_ASSETS_DIR folder, which is looked up from the environment variables. If they are not found in this folder, the build process looks one folder down from where you installed the OmegaT sources.
- jre-mac-root is a soft link to the folder where the Java Runtime libraries are found.
- Copy hfst-ol.jar to OMEGAT_SRC_FOLDER/lib, where OMEGAT_SRC_FOLDER is the folder where you just installed the OmegaT source files.
1- Copy HfstTokenizer.java and HfstStemFilter.java to OMEGAT_SRC_FOLDER/src/org/omegat/tokenizer, where OMEGAT_SRC_FOLDER is the folder where you just installed the OmegaT source files.
   - Modify the files' package name if needed.
   - Remove throws IOException from the getTokenStream method and correct the StandardTokenizer constructor call.
   - diff HfstTokenizer.java against the 4.x HfstTokenizer.java (see diffs below).
2- Add hfst-ol.jar to manifest-template.mf (details below).
3- Add a lib/hfst-ol.jar entry to manifest.mf’s Class-Path variable.
4- Run ant mac in the OmegaT source folder, the one where you installed OmegaT.

Diffs:

HfstTokenizer.java:
1c1
< package org.omegat.tokenizer;
---
> package no.divvun.tokenizer;
16a17
> import org.omegat.tokenizer.BaseTokenizer;
17a19
> import org.omegat.tokenizer.Tokenizer;
60,63c62,64
< final boolean stopWordsAllowed) {
< StandardTokenizer tokenizer = new StandardTokenizer(getBehavior(),
< new StringReader(strOrig));
< // tokenizer.setReader(new StringReader(strOrig));
---
> final boolean stopWordsAllowed) throws IOException {
> StandardTokenizer tokenizer = new StandardTokenizer();
> tokenizer.setReader(new StringReader(strOrig));
71,72c72
< return new HfstStemFilter(new StandardTokenizer(getBehavior(),
< new StringReader(strOrig)), transducer);
---
> return new HfstStemFilter(tokenizer, transducer);
HfstStemFilter.java:
1c1
< package org.omegat.tokenizer;
---
> package no.divvun.tokenizer;
11a12
> import org.apache.lucene.util.AttributeSource.State;
47,49c48,49
< for (String s : res) {
< // res.forEach(anal -> {
< String stem = s.substring(0, s.indexOf("+"));
---
> res.forEach(anal -> {
> String stem = anal.substring(0, anal.indexOf("+"));
53c53
< }
---
> });
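The stem-extraction step that both sides of the second diff share can be tried in isolation. The following is a minimal sketch, not the actual filter: it assumes analysis strings of the form stem+TAG+TAG, uses made-up sample analyses, and introduces a hypothetical helper `stemOf` that falls back to the whole string when no `+` is present (the code in the diff would throw in that case). It needs neither Lucene nor hfst-ol.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: mirrors the stem-extraction loop from the diff above,
// without Lucene or hfst-ol on the classpath.
final class StemSketch {

    // Returns the text before the first '+' of an analysis string.
    // Assumption: an analysis without '+' is treated as a bare stem;
    // substring(0, indexOf("+")) as written in the diff would throw
    // StringIndexOutOfBoundsException for such input.
    static String stemOf(String analysis) {
        int plus = analysis.indexOf('+');
        return plus < 0 ? analysis : analysis.substring(0, plus);
    }

    // Plain for-each loop, the form used on the "<" side of the diff.
    static List<String> stems(List<String> analyses) {
        List<String> out = new ArrayList<>();
        for (String s : analyses) {
            out.add(stemOf(s));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical analysis strings, for illustration only.
        List<String> analyses = Arrays.asList("guolli+N+Sg+Nom", "boahtit+V+Inf");
        System.out.println(stems(analyses)); // prints [guolli, boahtit]
    }
}
```

The loop and the `forEach` lambda on the two sides of the diff compute the same stems; only the iteration style differs.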
Add the following for hfst-ol.jar to manifest-template.mf:
Name: org.omegat.tokenizer.HfstTokenizer
OmegaT-Plugin: tokenizer
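One way to sanity-check the two manifest changes (the plugin entry above and the lib/hfst-ol.jar Class-Path entry) is to parse the manifest with the standard `java.util.jar.Manifest` class. This is a sketch under assumptions: the class `ManifestCheck` and its helpers are hypothetical names, and the manifest text in `main` is a made-up minimal example, not the real OmegaT manifest.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Sketch only: checks a manifest for the two entries described in the steps.
final class ManifestCheck {

    // Parses manifest text; wraps IOException so callers need no throws clause.
    static Manifest parse(String text) {
        try {
            return new Manifest(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the main section's Class-Path lists the given jar.
    static boolean classPathHas(Manifest mf, String jar) {
        String cp = mf.getMainAttributes().getValue("Class-Path");
        return cp != null && Arrays.asList(cp.trim().split("\\s+")).contains(jar);
    }

    // True if a named section marks the class as an OmegaT tokenizer plugin.
    static boolean isTokenizerPlugin(Manifest mf, String className) {
        Attributes attrs = mf.getAttributes(className);
        return attrs != null && "tokenizer".equals(attrs.getValue("OmegaT-Plugin"));
    }

    public static void main(String[] args) {
        // Hypothetical manifest content; a real manifest has many more entries.
        String text = "Manifest-Version: 1.0\n"
                + "Class-Path: lib/hfst-ol.jar\n"
                + "\n"
                + "Name: org.omegat.tokenizer.HfstTokenizer\n"
                + "OmegaT-Plugin: tokenizer\n"
                + "\n";
        Manifest mf = parse(text);
        System.out.println(classPathHas(mf, "lib/hfst-ol.jar"));
        System.out.println(isTokenizerPlugin(mf, "org.omegat.tokenizer.HfstTokenizer"));
    }
}
```

To check a manifest on disk, the same helpers could be given a `Manifest` built from a `FileInputStream` instead of the inline string.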