Tokeniser for eus
Usage:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
- Punct contains ASCII punctuation marks
 - The symbol after m-dash is soft-hyphen U+00AD
 - The symbol following {•} is byte-order-mark / zero-width no-break space U+FEFF
Whitespace contains ASCII white space, and the List contains some Unicode white space characters:
- En Quad U+2000 to Zero-Width Joiner U+200D
 - Narrow No-Break Space U+202F
 - Medium Mathematical Space U+205F
 - Word joiner U+2060
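
As a rough illustration of the set described above, the following Python sketch collects the listed code points into one set. The names `ASCII_WHITESPACE`, `UNICODE_WHITESPACE` and `is_whitespace` are made up for this example, and the exact ASCII members are an assumption; the authoritative definition is the one in the pmscript.

```python
# Assumed ASCII members; the pmscript may define a different set.
ASCII_WHITESPACE = frozenset({" ", "\t", "\n", "\r"})

# Code points taken from the list above.
UNICODE_WHITESPACE = frozenset(
    {chr(cp) for cp in range(0x2000, 0x200D + 1)}  # En Quad .. Zero-Width Joiner
    | {
        "\u202f",  # Narrow No-Break Space
        "\u205f",  # Medium Mathematical Space
        "\u2060",  # Word Joiner
    }
)

WHITESPACE = ASCII_WHITESPACE | UNICODE_WHITESPACE


def is_whitespace(ch: str) -> bool:
    """Return True if the single character ch is in the sketched set."""
    return ch in WHITESPACE
```

An explicit set is used because Python's `str.isspace()` is not a substitute here: it returns `False` for U+200B to U+200D and for U+2060, which the list above includes.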
 
Apart from what’s in our morphology, there are
1. unknown word-like forms, and
2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by
hfst-tokenise -a

Unknowns are made of:
- lower-case ASCII
 - upper-case ASCII
 - select extended Latin symbols
 - ASCII digits
 - select symbols
 - combining diacritics as individual symbols
 - various symbols from the Private Use Area (probably Microsoft), so far:
   - U+F0B7 for “x in box”
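
The character classes listed above could be sketched in Python roughly as follows. This is only an illustration: the function name `unknown_char_class` is hypothetical, the "select" extended Latin characters and symbols are not spelled out here and are therefore left as caller-supplied parameters, and using a non-zero Unicode combining class as the test for "combining diacritic" is an approximation.

```python
import string
import unicodedata

# U+F0B7 is the only private-use symbol the list above names so far.
PRIVATE_USE_SYMBOLS = frozenset({"\uf0b7"})


def unknown_char_class(ch, extended_latin=frozenset(), symbols=frozenset()):
    """Return the name of the class a single character falls into, or None.

    extended_latin and symbols are placeholders for the "select" sets,
    whose members are defined in the pmscript, not here.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if ch in string.ascii_lowercase:
        return "lower-case ASCII"
    if ch in string.ascii_uppercase:
        return "upper-case ASCII"
    if ch in string.digits:
        return "ASCII digit"
    if ch in extended_latin:
        return "extended Latin"
    if ch in symbols:
        return "symbol"
    if unicodedata.combining(ch) != 0:  # approximation, see lead-in
        return "combining diacritic"
    if ch in PRIVATE_USE_SYMBOLS:
        return "private-use symbol"
    return None
```

A character for which the function returns `None` would, in this sketch, not be part of an unknown word-like form.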
 
 
Unknown handling
Unknowns are tagged ?? and treated specially with hfst-tokenise.
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG; they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
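
The rule described above can be sketched in Python on a simplified, hypothetical data model in which each token carries a list of analysis strings and an empty string stands for an empty analysis. This is not hfst-tokenise's actual code or output format; the function name and the example tag string are invented for illustration.

```python
from typing import List, Tuple


def resolve_empty_analyses(analyses: List[str]) -> Tuple[List[str], bool]:
    """Apply the empty-analysis rule to one token.

    Returns (remaining_analyses, is_unknown):
    - if at least one non-empty analysis exists, the empty ones are dropped
      and the token is not unknown;
    - otherwise the token is flagged as unknown.
    """
    non_empty = [a for a in analyses if a != ""]
    if non_empty:
        return non_empty, False
    return [], True


# Hypothetical usage: "ja+CC" is an invented analysis string.
print(resolve_empty_analyses(["", "ja+CC"]))
print(resolve_empty_analyses([""]))
```

Doing this in the tokeniser means the later CG step never sees a reading without tags, which is the situation the paragraph above advises avoiding.
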
Finally, we mark as a token any sequence making up one of the following:
- known word in context
 - unknown (OOV) token in context
 - sequence of word and punctuation
 - URL in context
 
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript