Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-mhr
Requires a recent version of HFST (3.10.0 / git revision >= 3aecdbc). Then just:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Issues:
More usage examples:

    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.
select symbols
TODO: Could use something like this, but the built-ins don’t include šžđčŋ:
Unknowns are tagged ?? and treated specially by hfst-tokenise.
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG; they get a default baseform equal to the wordform, but
no tag to check, so it’s safer to let hfst-tokenise handle them.
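
To see which word forms ended up as unknowns, the CG stream can be filtered after tokenisation. The following is a minimal sketch, not part of the project: `list_unknowns` is a hypothetical helper. It assumes the usual CG stream layout (a cohort line of the form `"<wordform>"` followed by indented reading lines) and that unknown readings carry a marker tag as a separate whitespace-delimited token. The exact spelling of that tag in the output is an assumption, so it is passed as an argument.

```shell
# list_unknowns TAG
# Hypothetical helper: reads a CG stream on stdin and prints each word form
# whose cohort has at least one reading containing TAG as a separate token.
list_unknowns() {
  awk -v tag="$1" '
    BEGIN { seen = 1 }           # ignore reading lines before the first cohort
    /^"<.*>"$/ {                 # cohort line: "<wordform>"
      form = substr($0, 3, length($0) - 4)   # strip leading "< and trailing >"
      seen = 0
      next
    }
    /^[ \t]/ && !seen {          # indented line: a reading of the current cohort
      for (i = 1; i <= NF; i++)
        if ($i == tag) { print form; seen = 1; break }
    }
  '
}
```

It could be used at the end of the pipeline shown earlier, for example `... | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst | list_unknowns '?'`, replacing `'?'` with whatever tag the unknown readings actually carry in your output.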
This needs hfst-tokenise to output things differently depending on the tag they get.
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.thirties.pmscript