GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
How to use the hfst-tokenise pipeline to tokenise-as-you-analyse, using giella-sme on Mac as an example:
First off, update your HFST+vislcg3 by running
wget http://apertium.projectjj.com/osx/install-nightly.sh
sudo bash install-nightly.sh
This should give you the most recent SVN versions (as of last night) of HFST and vislcg3.
(Packages exist for pretty much all Unix operating systems; the Prerequisites links under http://wiki.apertium.org/wiki/Installation#If_you_want_to_add_language_data_.2F_do_more_advanced_stuff should give the right URLs.)
For now, you’ll also need to get the program cg-mwesplit (which will later be included in vislcg3). To compile cg-mwesplit, first ensure you have a recent version of Xcode with the Command Line Tools (e.g. 7.3, available from https://developer.apple.com/services-account/download?path=/Developer_Tools/Xcode_7.3/Xcode_7.3.dmg ). Then do
export CXX=clang++
export CC=clang
git clone https://github.com/unhammer/cg-mwesplit
cd cg-mwesplit
./autogen.sh
./configure
make
sudo make install
Now, svn up in giella-core and langs/sme, and run ./configure in langs/sme with the option --enable-tokenisers; e.g. if you want both the CG rules and the tokenisers, you would do
./configure --enable-tokenisers --enable-syntax
(If you use Apertium, you’d also want --enable-apertium --with-hfst, etc.)
Finally, run “make” (currently, this requires >8GB of RAM).
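Before trying the pipelines, it can help to confirm that the tools are on your PATH and that the build produced the tokeniser. The following is a minimal sketch, not part of the GiellaLT tooling: the helper name missing_tools is made up for illustration, and the tokeniser path is the one used in the commands on this page.

```shell
# Hypothetical helper (not part of GiellaLT): print each of the given
# commands that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command can be found
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The three programs used in the pipelines on this page
missing=$(missing_tools hfst-tokenise vislcg3 cg-mwesplit)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All pipeline tools found."
fi

# The tokeniser that make should have built (path as used on this page)
tokeniser="${GTHOME:-}/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst"
[ -f "$tokeniser" ] || echo "Tokeniser not built yet: $tokeniser"
```

If a tool is reported missing, re-run the corresponding install step above. If only the tokeniser is missing, check that ./configure was run with --enable-tokenisers.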
To run just the raw tokenisation+morphological analysis:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst
To include disambiguation of ambiguous multiwords:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3
To include splitting disambiguated multiwords into their own cohorts:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
| cg-mwesplit
To include regular morphological disambiguation:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
| cg-mwesplit \
| vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3
To include regular syntax tagging:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
| cg-mwesplit \
| vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3 \
| vislcg3 -g $GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3
etc.
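To see what each step contributes, one option is to keep a copy of the stream after every stage and compare the copies with diff. The sketch below chains the same commands as above and uses tee for the copies. The function name sme_stages and the output file names are made up for illustration, and it assumes $GTHOME is set as in the examples above.

```shell
# Hypothetical helper: run the full pipeline on stdin and save the stream
# after each stage in OUTDIR (default: the current directory).
sme_stages() {
    outdir=${1:-.}
    sme="$GTHOME/langs/sme"
    hfst-tokenise --giella-cg "$sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst" \
    | tee "$outdir/1-tokenised.txt" \
    | vislcg3 -g "$sme/tools/tokenisers/mwe-dis.cg3" \
    | tee "$outdir/2-mwe.txt" \
    | cg-mwesplit \
    | tee "$outdir/3-split.txt" \
    | vislcg3 -g "$sme/src/syntax/disambiguation.cg3" \
    | tee "$outdir/4-disambiguated.txt" \
    | vislcg3 -g "$GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3"
}
```

For example, echo 'sánit, jna. Leago' | sme_stages /tmp prints the syntax-tagged result and leaves the intermediate files in /tmp, so that diff /tmp/2-mwe.txt /tmp/3-split.txt shows what cg-mwesplit changed.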
If you use these steps often, you’ll probably want to make an alias. Open ~/.bashrc in your editor and enter for example
alias hsme='hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst'
alias hsmemwe='hsme | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 --trace'
alias hsmesplit='hsmemwe | cg-mwesplit'
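Note that bash only expands aliases in interactive shells by default, so the aliases above will not work inside a script. If you need the same shortcuts there, shell functions are one alternative. This is a sketch that reuses the alias names and commands from above; the "$@" (an addition here) passes any extra arguments on to the same program that would receive them with the alias.

```shell
# Function equivalents of the aliases above; unlike aliases, these also
# work in non-interactive scripts once they have been defined or sourced.
hsme() {
    hfst-tokenise --giella-cg "$GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst" "$@"
}
hsmemwe() {
    hsme | vislcg3 -g "$GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3" --trace "$@"
}
hsmesplit() {
    hsmemwe | cg-mwesplit "$@"
}
```

Usage is the same as with the aliases, e.g. echo 'sánit, jna. Leago' | hsmesplit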