hfst-tokenise
How to use the hfst-tokenise pipeline to tokenise-as-you-analyse, using giella-sme on Mac as an example:
Prerequisites for Mac
First off, update your HFST+vislcg3 by running
wget http://apertium.projectjj.com/osx/install-nightly.sh
sudo bash install-nightly.sh
This should give you the most recent SVN versions (as of last night) of HFST and vislcg3.
(Packages exist for pretty much all Unix operating systems; the Prerequisites links under http://wiki.apertium.org/wiki/Installation#If_you_want_to_add_language_data_.2F_do_more_advanced_stuff should give the right URLs.)
For now, you’ll also need to get the program cg-mwesplit (which will later be included in vislcg3). To compile cg-mwesplit, first ensure you have a recent version of Xcode with the Command Line Tools (e.g. 7.3, available from https://developer.apple.com/services-account/download?path=/Developer_Tools/Xcode_7.3/Xcode_7.3.dmg ). Then do
export CXX=clang++
export CC=clang
git clone https://github.com/unhammer/cg-mwesplit
cd cg-mwesplit
./autogen.sh
./configure
make
sudo make install
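Before building sme, it can help to confirm that all three tools ended up on your PATH. The following is only a minimal sketch: check_tools is a hypothetical helper name, not something shipped with HFST or vislcg3.

```shell
# Report any pipeline tool that is not on PATH.
# check_tools is a hypothetical helper name (an assumption for illustration).
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The three programs the pipeline in this guide relies on.
check_tools hfst-tokenise vislcg3 cg-mwesplit \
    || echo "Install the missing tools before continuing." >&2
```

If nothing is printed, all three programs were found.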
Build sme
Now, svn up in giella-core and langs/sme, and run ./configure in
langs/sme with the option --enable-tokenisers; e.g. if you want both the
CG rules and the tokenisers, you would do
./configure --enable-tokenisers --enable-syntax
(If you use Apertium, you’d also want --enable-apertium --with-hfst, etc.)
Finally, run “make” (currently, this requires >8GB of RAM).
Test
To run just the raw tokenisation+morphological analysis:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst
To include disambiguation of ambiguous multiwords:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3
To include splitting disambiguated multiwords into their own cohorts:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
| cg-mwesplit
To include regular morphological disambiguation:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
| cg-mwesplit \
| vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3
To include regular syntax tagging:
echo 'sánit, jna. Leago' \
| hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
| vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
| cg-mwesplit \
| vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3 \
| vislcg3 -g $GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3
etc.
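Since each step above just appends one more command to the same pipeline, the whole series can be wrapped in a single function that stops after a chosen stage. This is a rough sketch, not part of the Giella infrastructure: the name sme_pipeline and the stage numbering (1 = tokenise only, 5 = everything up to syntax tagging) are assumptions made for illustration. The paths are the ones used in the commands above, and GTHOME must be set.

```shell
# Hypothetical helper: run the sme pipeline up to a given stage.
#   1 = tokenise + analyse      2 = + multiword disambiguation
#   3 = + cg-mwesplit           4 = + morphological disambiguation
#   5 = + syntax tagging (default)
sme_pipeline() {
    stage=${1:-5}
    tok="$GTHOME/langs/sme/tools/tokenisers"
    hfst-tokenise --giella-cg "$tok/tokeniser-gramcheck-gt-desc.pmhfst" |
    if [ "$stage" -ge 2 ]; then vislcg3 -g "$tok/mwe-dis.cg3"; else cat; fi |
    if [ "$stage" -ge 3 ]; then cg-mwesplit; else cat; fi |
    if [ "$stage" -ge 4 ]; then
        vislcg3 -g "$GTHOME/langs/sme/src/syntax/disambiguation.cg3"
    else cat; fi |
    if [ "$stage" -ge 5 ]; then
        vislcg3 -g "$GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3"
    else cat; fi
}
```

For example, echo 'sánit, jna. Leago' | sme_pipeline 3 would stop after splitting the multiwords. Stages that are skipped are replaced by cat, which passes the stream through unchanged.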
If you use these steps often, you’ll probably want to make an alias. Open ~/.bashrc in your editor and enter for example
alias hsme='hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst'
alias hsmemwe='hsme | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 --trace'
alias hsmesplit='hsmemwe | cg-mwesplit'
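Note that bash only expands aliases in interactive shells by default, so the aliases above will not work inside a script. If you need the same shortcuts in scripts, shell functions are one option. The sketch below uses different, hypothetical names (sme_tok, sme_mwe, sme_split) so that it does not clash with the aliases if both are defined in the same file.

```shell
# Function equivalents of the aliases; the names are assumptions for illustration.
# GTHOME is read when a function is called, not when it is defined.
sme_tok() {
    hfst-tokenise --giella-cg \
        "$GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst"
}
sme_mwe() {
    sme_tok | vislcg3 -g "$GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3" --trace
}
sme_split() {
    sme_mwe | cg-mwesplit
}
```

These are used the same way as the aliases, e.g. echo 'sánit, jna. Leago' | sme_split.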