Tornedalen Finnish NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-fit

Tokeniser for fit

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters that have analyses in the lexicon but can appear with no spaces before or after them, that is, with no context conditions and directly adjacent to words:

Whitespace contains ASCII white space, and the List contains some Unicode white space characters.

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

Unknowns are made of:
    • lower-case ASCII
    • upper-case ASCII
    • select extended latin symbols
    • ASCII digits
    • select symbols
    • combining diacritics as individual symbols
    • various symbols from the Private Use Area (probably Microsoft), so far:
        • U+F0B7 for “x in box”
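
The character classes above can be sketched as a simple membership test. This is only an illustration in Python, not the pmatch definition used by the tokeniser: the function name is_unknown_wordlike and the parameters extra_latin and extra_symbols are hypothetical, and the "select" extended latin characters and symbols are left as caller-supplied sets because this page does not enumerate them.

```python
import string
import unicodedata

# Private Use Area code point named in the text ("x in box").
PRIVATE_USE = {"\uf0b7"}


def is_unknown_wordlike(text, extra_latin=frozenset(), extra_symbols=frozenset()):
    """Return True if every character of `text` is in the unknown-word alphabet.

    `extra_latin` and `extra_symbols` are hypothetical stand-ins for the
    "select extended latin symbols" and "select symbols" classes.
    """
    if not text:
        return False
    for ch in text:
        if ch in string.ascii_lowercase or ch in string.ascii_uppercase:
            continue  # lower-case / upper-case ASCII
        if ch in string.digits:
            continue  # ASCII digits
        if unicodedata.combining(ch) != 0:
            continue  # combining diacritic treated as an individual symbol
        if ch in PRIVATE_USE or ch in extra_latin or ch in extra_symbols:
            continue
        return False  # anything else makes the string an unmatched string
    return True
```

With the default (empty) extra sets, is_unknown_wordlike("jaja") holds, while a Cyrillic "ц" or a string containing a space does not qualify and would fall under the unmatched strings of point 2.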

Unknown handling

Unknowns are tagged ?? and treated specially by hfst-tokenise: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG, where they get a default baseform equal to the wordform but no tag to check, so it’s safer to let hfst-tokenise handle them.
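
The filtering rule described here can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the hfst-tokenise implementation: resolve_readings is a hypothetical name, an analysis is modelled as a plain string with the empty string standing for an empty analysis, and the "wordform ??" layout of the returned unknown reading is purely illustrative rather than the tool's real output format.

```python
UNKNOWN_TAG = "??"  # tag for unknowns, as named in the text


def resolve_readings(wordform, analyses):
    """Drop empty analyses; if nothing is left, report the token as unknown.

    `analyses` is a list of strings; "" (or whitespace only) models an
    empty analysis. The returned string layout is an assumption made for
    this sketch only.
    """
    kept = [a for a in analyses if a.strip()]
    if kept:
        # At least one real analysis: empty ones are simply removed.
        return kept
    # Only empty analyses (or none at all): treat the token as unknown.
    return [f"{wordform} {UNKNOWN_TAG}"]
```

For example, resolve_readings("ja", ["ja CC", ""]) keeps only the made-up analysis "ja CC", while resolve_readings("ukjend", [""]) falls back to a single unknown reading.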

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript
