Kalaallisut NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

Kalaallisut language model documentation

All doc-comment documentation in one large file.


src-cg3-dependency.cg3.md

West Greenlandic Dependency Parser


This (part of) documentation was generated from src/cg3/dependency.cg3


src-cg3-disambiguator.cg3.md

W E S T   G R E E N L A N D I C   D I S A M B I G U A T O R

Delimiters, tags and sets

Tags and sets

Positions and household

The following tags: BOS, EOS, CLB, Symbol, iSymbol, PUNCT, LEFT, RIGHT, COMMONAFFSTRING, EXCLMARK

Parts of speech with tags declared as single-membered LISTs

Grammar tags

Gram/… and different specifications

Orthographic error tags

Heur Prop case tags

Heur der tags for iCase to block Abs Pl Heur/Prop analysis

Heur Verb tags to block Abs Sg Heur/Prop analysis

Heur GL final

Heur FOREIGN final prop

Heur FOREIGN initial prop

Heur scan err

Heur Excl tags to block Abs Sg Heur/Prop analysis

Grammatical tags

Sg, Du, Pl, iSg, iDu, iPl, ALL_Sg, ALL_Pl, case forms, verbal inflection

Miscellaneous tags defined in kal-pre2 (though a couple of hybrids are here in the disambiguator)

Derivatives

Sets

All parts of speech

Verb

Different verb types.

TRANSVERB = 1SgO, 2SgO, 3SgO, 4SgO, 1PlO, 2PlO, 3PlO, 4PlO

Nominals

Sets for case, possessum, appellative, and different noun types

Combinations of verbs and nominals

Object sets introduced 20170416 - did not work

Unification sets for SUBJ with the corresponding TRANSVERB. Reformulation with objects begun 20190519

SUBJTRANSVERB is all of the above

Unification sets for CONT and subjects, only for intransitive CONT.

Unification sets for CONT and agreeing V, for intransitive V

Particles

Lexical sets

Lexical classes of verbs

Semantic tags

Gram/… and different specifications

Rule section

BEFORE-SECTIONS

Disambiguating morpheme combinations

Start of Judithe's section, begun 20231006: filter out impossible morpheme combinations


This (part of) documentation was generated from src/cg3/disambiguator.cg3


src-cg3-functions.cg3.md

W E S T   G R E E N L A N D I C   S Y N T A C T I C   F U N C T I O N S

New table of contents begun 20201015

Started commenting out REMOVE and SELECT rules, which should only be in the disambiguator. They are commented out with ‘#DATE’, e.g. ‘#20210704’. Completed 20230726. Started deleting the REMOVE and SELECT rules that are marked with #

Rule section

Delimiter


This (part of) documentation was generated from src/cg3/functions.cg3


src-fst-morphology-affixes-derivations-inflections.lexc.md

File for generating the central morphological processes in our Greenlandic analyser

How should these be included in IV-mod_C?

PL 20180718. The following is a special lexicon for ip, replacing the earlier LEXICON IV_k_stem with default flex-iv, which produced enormous overgeneration. At the same time, transitive ip has been moved from TV to flex-tv, but a few transitive ip will probably need to be added as upper-lower entries as they are documented.

A special lexicon for affixes (tilhæng) such as RIANNGUAR+Der/vv

In Num2 to Num10, the NNGUR flag is cleared with a C flag because of clock-time expressions

flex-tv ;

Words such as PFAS

New pass-through lexica specifically for +UTE+Der/vv, added 20180118 (PL)

New pass-through lexica specifically for +UTE+Der/vv, added 20180928 (PL)

New pass-through lexica specifically for +UTE+Der/vv, added 20180928 (PL)

New pass-through lexica for TAR with the following morphemes

Pass-through lexicon for TAR and others

Pass-through lexicon for TAR and others

Copy of IV-mod_C except for the default

Pass-through lexicon for TAR and others

Pass-through lexicon for TAR and others after /i/, e.g. GUMALLIR GUNNAIR LIR LLAQQIP PASIP QQAMMIR QQIP RIIR UMMIR VIP

Pass-through lexicon for TAR and others after /i/, e.g. GUMALLIR GUNNAIR LIR LLAQQIP PASIP QQAMMIR QQIP RIIR UMMIR VIP

Pass-through lexicon for TAR and others

Pass-through lexicon for TAR and others

Pass-through lexicon for TAR and others; added 20170501; not for QE+Der/vv, which can only take +NIR

Pass-through lexicon for TAR and others; added 20170501; not for QE+Der/vv, which can only take +NIR

Pass-through lexicon for TAR and others; added 20170501; for QE+Der/vv, which can only take +NIR

Pass-through lexicon for TAR and others after LAAR

Pass-through lexicon for TAR and others after NAR+Der/vv

Pass-through lexicon for TAR and others; added 20170501

Pass-through lexicon for TAR and others

Pass-through lexicon for TAR and others

Pass-through lexicon for RUJUUR+Der/vv

Pass-through lexicon for RUJUUR+Der/vv

Pass-through lexicon for TAR and others; added 20170501

After GUMALLIR GUNNAIR LIR QQAMMIR RIIR UMMIR etc.; new default + post_-lexica, PL 20180416

TIR and TITIR

Commented-out lexica - moved to derivations-inflections.bak20200319 on Per's Mac


This (part of) documentation was generated from src/fst/morphology/affixes/derivations-inflections.lexc


src-fst-morphology-affixes-noun_to_noun.lexc.md

Pass-through lexicon for up-stems that require replacive sandhi


This (part of) documentation was generated from src/fst/morphology/affixes/noun_to_noun.lexc


src-fst-morphology-affixes-numerals.lexc.md

Arabic numerals

Inflection and derivation.

** Lexicon num_C for number morphology for words ending in a consonant

** Lexicon num_V for number morphology for words ending in a vowel

** Lexicon num_C_sub for number morphology for words ending in a consonant, substandard forms

** Lexicon num_V_sub for number morphology for words ending in a vowel, substandard forms

** Lexicon ord_V for ordinal morphology for words ending in a vowel

** Lexicon ord_C for ordinal morphology for words ending in a consonant


This (part of) documentation was generated from src/fst/morphology/affixes/numerals.lexc


src-fst-morphology-affixes-propernouns.lexc.md

The morphology of proper nouns in Greenlandic

New 20210303: insertion of a hyphen and hyphen+i for proper nouns that are also acronyms

** Lexicon Vprop_connector DMI

** Lexicon Cprop_connector DHL

The name lexicon is in ateq-kal-lex.txt.

** Lexicon Z1geo_oqs-bestZ

** Lexicon ZcitationsformZ A special lexicon for literals in quotation marks with the placeholder QuotedHyphen (defined in acronyms.lexc)

** Lexicon ZcitationsformZ_Num Copy of ZcitationsformZ, but yielding Num

** Lexicon Bogtitel New lexicon for book titles without quotation marks

** Lexicon Z1ateqZ_infl

** Lexicon Z1ateq_tptZ

** Lexicon Z1ateq_atZ

** Lexicon Z1ateq_gaqZ

** Lexicon Z1ateq_ngaqZ

** Lexicon Z1ateq_goqZ

** Lexicon Z1ateq_qoqZ

** Lexicon Z1ateq_taqZ

** Lexicon Z1ateq_seqZ

** Lexicon Z1ateq_leqZ

** Lexicon Z1ateq_saqZ

** Lexicon Z1ateq+qaZ

** Lexicon Z1ateq+ĸaZ

** Lexicon Zateq_tptZ Atassut

** Lexicon Z1ateqPZ

** Lexicon Z1ateqPZ-suf

** Lexicon Z2-ateqZ

** Lexicon Z2-ateqZ-suf

** Lexicon Z1ateqpropVZ

** Lexicon Z1instpropVZ

** Lexicon Z1ateqpropCZ

** Lexicon Z2ateq_niqZ

** Lexicon Z2ateq_neqZ

** Lexicon Z2suaq_ateqZ

** Lexicon Z2-ateq_specielSZ Siumut – Siumummi, Siumumi

** Lexicon Z1nnguaq_ateqZ

** Lexicon Z1nnguaq_possessumZ PL20220201 lexicon for the sequence UTE=NNGUAQ + POSSESSUM

** Lexicon Zateq_oqsZ
** Lexicon Zateq_oqsZ-suf

** Lexicon Zateq_oqsPZ PL20210224 for Kalaallit Airports and similar names with uncertain number. For now without derivation and person endings

** Lexicon Zateq_iaqZ

** Lexicon Zateq_iaĸZ

** Lexicon Zateq_ioqZ

** Lexicon Zateq_naqZ

** Lexicon Zateq_noqZ

** Lexicon Zateq_meqZ

** Lexicon Z1geoSZmorf Narsaq

** Lexicon Z1geoPZmorf Paamiut og Ivittuut

** Lexicon Z1geo_nnguaqZmorf Quassunnguaq NEW 20100410 (PL)

** Lexicon Z1geo_nnguaqPZmorf Kangilinnguit NEW 20100319 (PL)

** Lexicon Z1geo+ssPZmorf Ilulissat

** Lexicon Z1geo_oqsZmorf Finland

** Lexicon Z1geo_oqsZmorf Finland

** Lexicon Z1geo_oqs-nbestZmorf Special lexicon for countries in the definite form, such as Spanien

** Lexicon Z1geo_oqs-tbestZmorf Special lexicon for countries in the definite form, such as Tyrkiet

** Lexicon Z1geo_oqseZmorf Thule

** Lexicon Z2-geoSZmorf Ikerasaarsuk; Korea% Kujalleq

** Lexicon Z2+lgeoSZmorf Nanortalik

** Lexicon Z2-geo_uukSZmorf only for Nuuk

** Lexicon Z2-geo_specielSZmorf for geographical places with inflection such as Qinngorput – Qinngorpummi, Qinngorpormiu

** Lexicon Z2geo_aqSZmorf Nuussuaq

** Lexicon Z2-geoqPZmorf Saqqarliit:Saqqarleq

** Lexicon Z2-geolikPZmorf Kapisillit:Kapisi

** Lexicon Z1ateqpropVZmorf

** Lexicon Z1instpropVZmorf

** Lexicon Z1ateqpropCZmorf

** Lexicon Z2-ateqZmorf

** Lexicon Z2ateq_niqZmorf

** Lexicon Z2ateq_neqZmorf

** Lexicon Z2suaq_ateqZmorf

** Lexicon Z1ateqZmorf_all

** Lexicon Z1ateq_tptZmorf

** Lexicon Z1ateq_atZmorf

** Lexicon Z1ateq_taqZmorf

** Lexicon Z1ateq_saqZmorf

** Lexicon Z1ateq_seqZmorf

** Lexicon Z1ateq_leqZmorf

** Lexicon Z1ateq_gaqZmorf

** Lexicon Z1ateq_ngaqZmorf

** Lexicon Z1ateq_goqZmorf

** Lexicon Z1ateq_qoqZmorf

** Lexicon Z1ateq+qaZmorf

** Lexicon Z1ateq+ĸaZmorf

** Lexicon Zateq_tptZmorf

** Lexicon Z1ateqPZmorf

** Lexicon Z2-ateq_specielSZmorf

** Lexicon Z1nnguaq_ateqZmorf

** Lexicon Zateq_oqsZmorf

** Lexicon Zateq_numCZmorf New 20191010 for proper nouns such as DR1 and Peugeot 206. Sem/Hum retained for now, since DR and cars can after all do something actively???

** Lexicon Zateq_numVZmorf New 20191010 for proper nouns such as DR2 and Peugeot 208

** Lexicon Z1ateq_iaqZmorf

** Lexicon Z1ateq_iaĸZmorf

** Lexicon Z1ateq_ioqZmorf

** Lexicon Z1ateq_naqZmorf

** Lexicon Z1ateq_noqZmorf

** Lexicon Z1ateq_meqZmorf

** Lexicon Z1geopropZ

** Lexicon Z1geopropPZ

** Lexicon Z1ateqpropZ

** Lexicon Z1ateqpropPZ De Konservative

** Lexicon Z1Fem_ateqZ Test of the feminine tag with stems ending in -e. Formerly Z1ateqpropZ and Z1ateqZmorf

** Lexicon Z1Mask_ateqZ

** Lexicon Z1Mask_GrlateqZ

** Lexicon Z1Fem_tptZ

** Lexicon Z1Mask_tptZ

** Lexicon Z1Mask_atZ

** Lexicon Z1Mask_taqZ

** Lexicon Z1Fem_taqZ

** Lexicon Z1Mask_saqZ

** Lexicon Z1Mask_seqZ

** Lexicon Z1Mask_leqZ

** Lexicon Z1Fem_leqZ

** Lexicon ZMask_oqsZ

** Lexicon ZFem_oqsZ

** Lexicon Z1Fem_nnguaqZ
Test of gender-split first names ending in NNGUAQ. Formerly Z1nnguaq_ateqZmorf

** Lexicon Z1Mask_nnguaqZ

** Lexicon Z1Mask_araqZ

** Lexicon Z1Fem_araqZ

** Lexicon Z1Fem_araĸZ

** Lexicon Z1Mask_gaqZ

** Lexicon Z1Fem_ngaqZ

Formerly Z2-ateqZmorf

New lexicon 20180615 Z2suaq_ateqZmorf


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-root.lexc.md

Greenlandic morphological analyser

File for generating the central morphological processes in our Greenlandic analyser

Multicharacter symbols

Tags for POS (primary tags)

Main Word Classes

Secondary tags

Tags for Verbs

Tags for Pronouns

Tags for Other Word Classes

Semantics

anatomical. Distinguishes qiteq+Sem/an+3SgPoss = qitia from qiteq+3SgPoss = qeqqa

Grammar

Derivation

Dialect

Phonetic / morphophonological

Tags to mark loan word entries with a diverting orthography

That is, they need special treatment in e.g. speech synthesis.

Orthography

Usage/error

Extra concerning LG

Tags for Inflection

Number

Case

Special 3rd/4th person cases with DivPron (Gram/Cong)

Mood

Verb person-number

Possessive tags - possessor marking on the possessum

Flag diacritics for Greenlandic

Flag diacritics for plurale tantum subjects

Flag diacritics for verbs that take only plural objects

Test of a boolean variable for ad hoc blocking

Test of a boolean variable for ad hoc blocking of Gram/Exclm. Stems are set Off and derivation On

The Off flag is set in verbs on transitive verbs with an unlikely Refl. The On flag is on the tag Gram/Refl in the pass-through lexica

Off flag on verbs such as akuaa that must not undergo metathesis with NIQ

New flag 20211214 to prevent *taakkuunngitsoq and *taannaanngitsut

Off flag on nominals that MUST behave replacively, such as pilersaarusiorpoq and aqqusinniorpoq

Off flag in nouns and Off flag in der-inf when TUR and TUGAQ must not be assibilated, and On flag when they must be assibilated. Also to prevent assibilation after HTR in nnip

Flag specifically to ensure additive p-inflection of ulloq in Trm@

Flag to prevent missing assibilation. P is set in the stem files and C in der-infl

Ad hoc, for testing alternative flag diacritics for prefixes. Remember also the commented-out line ‘Kingumoorutit ;’ in LEXICON Root

Test of P and D flags to prevent recursion with TIP

and is blocked by

Test 20210504 of P and R flags to generate both takornariat and takornarissat+Err/Sub

Flags for loan words, which must not go to N+Abs+Sg without derivation.

30.10.23: Trond removed the tags that were declared several times (probably former tag strings A=B=C) and instead made a list in which each tag appears once (below): docs/tagstrings.md

List of the so-called Greenlandic tilhæng, i.e., derivational affixes

Boundary symbol

Symbols that need to be escaped on the lower side (towards twolc)

Our morphophonemes

Our magic symbols

Language-independent flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics, so that compounds with verbs are only allowed if the verb is further derived into a noun again:

| Flag | Explanation |
| --- | --- |
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
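
To make the interplay of the P, D and C flags more concrete, here is a minimal Python sketch of the standard flag-diacritic operations (P = set, D = disallow, C = clear, U = unify, R = require). It is only an illustration of the general mechanism, not the HFST implementation, and the plain symbols `VERB` and `NOMINALISER` are hypothetical placeholders; where the NeedNoun flags actually sit in the lexicon is not stated here.

```python
import re

# Matches @OP.FEATURE@ and @OP.FEATURE.VALUE@ for the operations used in this file.
FLAG_RE = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(symbols):
    """Return True if the flag diacritics along a path are mutually consistent."""
    features = {}  # feature name -> currently assigned value
    for sym in symbols:
        m = FLAG_RE.fullmatch(sym)
        if m is None:
            continue  # ordinary symbol: no effect on the flags
        op, feat, val = m.groups()
        cur = features.get(feat)
        if op == "P":    # positive set
            features[feat] = val
        elif op == "C":  # clear
            features.pop(feat, None)
        elif op == "D":  # disallow: fail if set to val (or set at all when no val)
            if cur is not None and (val is None or cur == val):
                return False
        elif op == "R":  # require: fail unless set to val (or set at all when no val)
            if cur is None or (val is not None and cur != val):
                return False
        elif op == "U":  # unify: set if unset, otherwise the values must agree
            if cur is None:
                features[feat] = val
            elif cur != val:
                return False
    return True


# Hypothetical paths: a verb in a compound is blocked unless the flag is cleared.
blocked = ["@P.NeedNoun.ON@", "VERB", "@D.NeedNoun.ON@"]
rescued = ["@P.NeedNoun.ON@", "VERB", "NOMINALISER", "@C.NeedNoun@", "@D.NeedNoun.ON@"]
print(path_is_allowed(blocked), path_is_allowed(rescued))
```

In this sketch the P flag marks that a noun is still needed, the C flag (assumed to sit on a nominalising affix) removes that mark, and the D flag at the end rejects any path on which the mark is still present.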

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| Flag | Explanation |
| --- | --- |
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| Flag | Explanation |
| --- | --- |
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: København > københavner. |

LEXICON Root pointing to main parts of speech


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-nouns.lexc.md

Greenlandic nouns

Documentation of lexicon names:

The most common noun lexica:

The lexicon Nomen contains the noun stems.

xxx 20170522 for forms that cannot be read but need an analysis in the CG. Plural and singular lexica are not included.

Guidelines for loanwords

* The base form of the word is the same as in the source language (without i): emblem N_Loan_GEM ; (and not emblemi)
* If there are alternative, non-approved spellings of the ending, they are added in the next lexicon (e.g. emblemmi)
* If there are alternative, non-approved spellings elsewhere in the word, they are added with +OLang/xxx+Err/Sub: roman+OLang/DAN+Err/Sub:romaani Z1VZmorf ;
* If there is an approved form of the loanword, it is also a base form, and it does not get +OLang/xxx:

septembari Z1VZmorf ;
septembari+Orth/Arch:sivtimpari Z1VZmorf ;
september N_Loan ;

* Units are sent to a special unit loanword lexicon
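
The loanword entries above all follow the usual lexc shape `upper:lower Continuation ;`, where the `:lower` part is optional. As a rough illustration of how such lines break down, here is a small Python sketch; the helper names `parse_lexc_entry` and `is_substandard` are hypothetical, and `%`-escaped spaces or colons are deliberately not handled.

```python
def parse_lexc_entry(line):
    """Split one lexc entry line into (upper, lower, continuation).

    Simplifying assumption: no %-escaped spaces or colons in the entry.
    """
    body = line.strip()
    if not body.endswith(";"):
        raise ValueError("a lexc entry ends with ';'")
    parts = body[:-1].split()
    if not parts:
        raise ValueError("empty entry")
    continuation = parts[-1]
    if len(parts) == 1:
        # Entry with only a continuation lexicon, e.g. "flex-tv ;"
        return "", "", continuation
    form = " ".join(parts[:-1])
    upper, sep, lower = form.partition(":")
    # Without a colon, the upper and lower sides are identical.
    return upper, (lower if sep else upper), continuation


def is_substandard(upper):
    """True if the analysis side carries the +Err/Sub tag."""
    return "+Err/Sub" in upper


for entry in [
    "septembari Z1VZmorf ;",
    "septembari+Orth/Arch:sivtimpari Z1VZmorf ;",
    "roman+OLang/DAN+Err/Sub:romaani Z1VZmorf ;",
]:
    upper, lower, cont = parse_lexc_entry(entry)
    print(upper, lower, cont, is_substandard(upper))
```

Read this way, an approved loanword form has a bare upper side, while non-approved spellings carry `+OLang/xxx+Err/Sub` on the upper side and the deviating spelling on the lower side.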

* aaffaffak Z2-Zmorf ;              
* aaffaffak+N+Abs+Sg:aaffaffaq Krestr ; 
* aaffaq Z2-qZmorf ;                 
* ...

* * *

<small>This (part of) documentation was generated from [src/fst/morphology/stems/nouns.lexc](https://github.com/giellalt/lang-kal/blob/main/src/fst/morphology/stems/nouns.lexc)</small>

---

# src-fst-morphology-stems-propernouns.lexc.md 



xxx 20170522 for forms that cannot be read but need an analysis in the CG.

* * *

<small>This (part of) documentation was generated from [src/fst/morphology/stems/propernouns.lexc](https://github.com/giellalt/lang-kal/blob/main/src/fst/morphology/stems/propernouns.lexc)</small>

---

# src-fst-morphology-stems-verbs.lexc.md 



xxx 20170524 for verb stems that cannot be read but need an analysis in the CG. Plurale tantum lexica are not included. Is it also necessary to include stems from the derivation lexica?

* * *

<small>This (part of) documentation was generated from [src/fst/morphology/stems/verbs.lexc](https://github.com/giellalt/lang-kal/blob/main/src/fst/morphology/stems/verbs.lexc)</small>

---

# src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md 



We describe here how abbreviations in Kalaallisut are read out, e.g.
for text-to-speech systems.

For example:

* s.:syntynyt # ;  
* os.:omaa% sukua # ;  
* v.:vuosi # ;  
* v.:vuonna # ;  
* esim.:esimerkki # ; 
* esim.:esimerkiksi # ; 
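
As a rough picture of what these entries encode, the following Python sketch loads the example lines into a lookup table from abbreviation to possible readings. It is an illustration only: the real mapping is compiled from the lexc file into a transducer, and the helper names `unescape` and `load_expansions` are hypothetical.

```python
from collections import defaultdict

ENTRIES = """\
s.:syntynyt # ;
os.:omaa% sukua # ;
v.:vuosi # ;
v.:vuonna # ;
esim.:esimerkki # ;
esim.:esimerkiksi # ;
"""


def unescape(text):
    """Drop lexc % quoting, so that '% ' becomes a literal space."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "%" and i + 1 < len(text):
            i += 1  # skip the quote character, keep the quoted one
        out.append(text[i])
        i += 1
    return "".join(out)


def load_expansions(source):
    """Map each abbreviation to the list of readings given for it."""
    table = defaultdict(list)
    for line in source.splitlines():
        line = line.strip()
        if not line.endswith(";"):
            continue
        entry = line[:-1].strip()
        # The last whitespace-separated token is the continuation (here '#').
        form, _continuation = entry.rsplit(None, 1)
        abbr, _, reading = form.partition(":")
        table[unescape(abbr)].append(unescape(reading))
    return dict(table)


expansions = load_expansions(ENTRIES)
print(expansions["v."])
```

An abbreviation may have several readings (as `v.` does here), so choosing the right one for a given sentence has to happen in a later step.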

* * *

<small>This (part of) documentation was generated from [src/fst/transcriptions/transcriptor-abbrevs2text.lexc](https://github.com/giellalt/lang-kal/blob/main/src/fst/transcriptions/transcriptor-abbrevs2text.lexc)</small>

---

# src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md 



% komma% :,      Root ;
% tjuohkkis% :%. Root ;
% kolon% :%:     Root ;
% sárggis% :%-   Root ; 
% násti% :%*     Root ; 

* * *

<small>This (part of) documentation was generated from [src/fst/transcriptions/transcriptor-numbers-digit2text.lexc](https://github.com/giellalt/lang-kal/blob/main/src/fst/transcriptions/transcriptor-numbers-digit2text.lexc)</small>

---

# tools-grammarcheckers-grammarchecker.cg3.md 



#      G R E E N L A N D I C   G R A M M A R   C H E C K E R

In the directory for kal, do:

$ ./autogen.sh
$ ./configure --enable-grammarchecker --enable-spellers
$ make -j
$ cd tools/grammarcheckers
$ make dev

Then test as follows:

$ echo "e Nerisassiornermut soqutigisaqarpit?" | sh modes/trace-kalgram.mode # from the terminal

Or eventually, write make check

Tag declaration

Import tag declarations

We import tag declaration from ../../src/cg3/disambiguator.cg3

Rule section

Speller suggestions rule

Add &SUGGESTWF to any spelling suggestion that we actually want to suggest to the user. The simplest is to just add it to all spelled words:

@OUTSIDE RULES@

Grammatical rules

Verb valency rules

@OUTSIDE RULES@

ADD:msyn-arg-ins-trm

ADD:msyn-arg-ins-trm

ADD:msyn-arg-abs-rel

ADD:msyn-arg-abs-rel

ADD:msyn-subj-rel-abs

ADD:msyn-subj-rel-abs

Simple punctuation rules

Rules for quotation marks.


This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


tools-grammarcheckers-liststemplates.cg3.md

Grammarchecker tags


This (part of) documentation was generated from tools/grammarcheckers/liststemplates.cg3


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for kal

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

Unknowns are made of:
    • lower-case ASCII
    • upper-case ASCII
    • select extended latin symbols
    • ASCII digits
    • select symbols
    • combining diacritics as individual symbols
    • various symbols from Private area (probably Microsoft), so far:
    • U+F0B7 for “x in box”

Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
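
The split between unknown word-like forms and unmatched strings can be pictured with a small Python sketch. This is not how the pmatch script is written; it only mirrors the character classes listed for unknowns, and the concrete contents of `EXTENDED_LATIN` and `SELECT_SYMBOLS` are hypothetical stand-ins, since the actual selections are not spelled out here.

```python
import string
import unicodedata

ASCII_LETTERS = set(string.ascii_letters)
ASCII_DIGITS = set(string.digits)
# Hypothetical selections, for illustration only.
EXTENDED_LATIN = set("æøåÆØÅáéíóúâêîôû")
SELECT_SYMBOLS = set("-'")
# Private-area symbol mentioned in the list above ("x in box").
PRIVATE_AREA = {"\uf0b7"}


def is_wordlike_char(ch):
    """True if ch belongs to one of the classes that unknowns are made of."""
    return (
        ch in ASCII_LETTERS
        or ch in ASCII_DIGITS
        or ch in EXTENDED_LATIN
        or ch in SELECT_SYMBOLS
        or ch in PRIVATE_AREA
        # Combining diacritics count as individual symbols (categories Mn, Mc, Me).
        or unicodedata.category(ch).startswith("M")
    )


def classify(token):
    """Return 'unknown' for word-like forms, 'unmatched' for everything else."""
    if token and all(is_wordlike_char(ch) for ch in token):
        return "unknown"    # would receive an empty reading, i.e. the ?? tag
    return "unmatched"      # left for hfst-tokenise to treat specially


for tok in ["ukjend", "e\u0301", "ц"]:
    print(repr(tok), classify(tok))
```

In this sketch a token is word-like only if every character falls into one of the listed classes, so the Cyrillic letter ends up as an unmatched string rather than an unknown word.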

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for kal

Requires a recent version of HFST (3.10.0 / git revision >= 3aecdbc). Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for smj

Requires a recent version of HFST (3.10.0 / git revision >= 3aecdbc). Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript