GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

hfst-tokenise

How to use the hfst-tokenise pipeline to tokenise-as-you-analyse, using giella-sme on Mac as an example:

Prerequisites for Mac

First off, update your HFST+vislcg3 by running

  wget http://apertium.projectjj.com/osx/install-nightly.sh
  sudo bash install-nightly.sh

This should give you the most recent SVN versions (as of last night) of HFST and vislcg3.

(Packages exist for pretty much all Unix operating systems; the Prerequisites links under http://wiki.apertium.org/wiki/Installation#If_you_want_to_add_language_data_.2F_do_more_advanced_stuff should give the right URLs.)

For now, you’ll also need to get the program cg-mwesplit (which will later be included in vislcg3). To compile cg-mwesplit, first ensure you have a recent version of Xcode with the Command Line Tools (e.g. 7.3, available from https://developer.apple.com/services-account/download?path=/Developer_Tools/Xcode_7.3/Xcode_7.3.dmg ). Then do

  export CXX=clang++
  export CC=clang
  git clone https://github.com/unhammer/cg-mwesplit
  cd cg-mwesplit
  ./autogen.sh
  ./configure
  make
  sudo make install
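
Before moving on to the build, it can be useful to confirm that the three commands used later on this page are actually on your PATH. The following is a minimal sketch, not part of the GiellaLT tooling; the function name check_tools is a hypothetical helper introduced here for illustration.

```shell
# check_tools (hypothetical helper): report which of the given commands
# can be found on PATH. Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok:      $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For the setup described here you would call it as

  check_tools hfst-tokenise vislcg3 cg-mwesplit

and rerun the install steps above for anything reported as missing.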

Build sme

Now, svn up in giella-core and langs/sme, and run ./configure in langs/sme with the option --enable-tokenisers; e.g. if you want both the CG rules and the tokenisers, you would do

  ./configure --enable-tokenisers --enable-syntax

(If you use Apertium, you’d also want --enable-apertium --with-hfst, etc.)

Finally, run “make” (currently, this requires >8GB of RAM).
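
Since the build is long and memory-hungry, it may be worth checking that it actually produced the tokeniser before running the tests. The sketch below assumes GTHOME points at the root of your checkout and that the tokeniser ends up at the path used in the test commands on this page; check_tokeniser is a hypothetical helper name, not an existing GiellaLT script.

```shell
# check_tokeniser (hypothetical helper): verify that GTHOME is set and that
# the compiled sme tokeniser exists at the path used in the test commands.
check_tokeniser() {
    if [ -z "${GTHOME:-}" ]; then
        echo "GTHOME is not set" >&2
        return 1
    fi
    pmhfst="$GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst"
    if [ -f "$pmhfst" ]; then
        echo "found $pmhfst"
    else
        echo "not built yet: $pmhfst" >&2
        return 1
    fi
}
```

If it reports "not built yet", check that ./configure was run with --enable-tokenisers and that make finished without running out of memory.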

Test

To run just the raw tokenisation+morphological analysis:

echo 'sánit, jna. Leago' \
|  hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst

To include disambiguation of ambiguous multiwords:

echo 'sánit, jna. Leago' \
|  hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
|  vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3

To include splitting disambiguated multiwords into their own cohorts:

echo 'sánit, jna. Leago' \
|  hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
|  vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
|  cg-mwesplit

To include regular morphological disambiguation:

echo 'sánit, jna. Leago' \
|  hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
|  vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
|  cg-mwesplit \
|  vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3

To include regular syntax tagging:

echo 'sánit, jna. Leago' \
|  hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
|  vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
|  cg-mwesplit \
|  vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3 \
|  vislcg3 -g $GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3

etc.
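
To see what each extra step does to the ambiguity, you can count cohorts at the end of any of the pipelines above. This is only a sketch: it assumes the usual CG stream layout, where each cohort starts with a line beginning with "< and each top-level reading is a line starting with exactly one tab followed by a double quote (sub-readings are indented further and are not counted). The name cg_count is a hypothetical helper, not part of HFST or vislcg3.

```shell
# cg_count (hypothetical helper): read a CG-format stream on stdin and print
# the number of cohorts and how many of them have more than one reading.
cg_count() {
    awk '
        /^"</  { if (n > 1) amb++; cohorts++; n = 0; next }  # a new cohort starts
        /^\t"/ { n++ }                                       # top-level reading
        END    { if (n > 1) amb++
                 printf "cohorts=%d ambiguous=%d\n", cohorts, amb }
    '
}
```

Append "| cg_count" to two of the pipelines above (for example with and without the disambiguation.cg3 step) and compare the "ambiguous" figures.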

If you use these steps often, you’ll probably want to make an alias. Open ~/.bashrc in your editor and enter for example

alias hsme='hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst'
alias hsmemwe='hsme | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 --trace'
alias hsmesplit='hsmemwe | cg-mwesplit'
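
As an alternative to aliases (which bash only expands in interactive shells by default), the five pipelines from the Test section can be wrapped in a single shell function that takes the number of stages to run. This is a sketch under the same assumptions as the commands above (GTHOME set, everything built); sme_analyse is a hypothetical name chosen for illustration.

```shell
# sme_analyse (hypothetical helper): run the first N stages of the pipeline
# from the Test section on stdin. N defaults to 5 (all stages).
#   1 = tokenise + analyse      2 = + multiword disambiguation
#   3 = + cg-mwesplit           4 = + morphological disambiguation
#   5 = + syntactic functions
sme_analyse() {
    tok="$GTHOME/langs/sme/tools/tokenisers"
    case "${1:-5}" in
        1) hfst-tokenise --giella-cg "$tok/tokeniser-gramcheck-gt-desc.pmhfst" ;;
        2) sme_analyse 1 | vislcg3 -g "$tok/mwe-dis.cg3" ;;
        3) sme_analyse 2 | cg-mwesplit ;;
        4) sme_analyse 3 | vislcg3 -g "$GTHOME/langs/sme/src/syntax/disambiguation.cg3" ;;
        5) sme_analyse 4 | vislcg3 -g "$GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3" ;;
        *) echo "usage: sme_analyse [1-5]" >&2; return 2 ;;
    esac
}
```

With this in ~/.bashrc, the third example above becomes

  echo 'sánit, jna. Leago' | sme_analyse 3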
