Finite state and Constraint Grammar based analysers, proofing tools and other resources
All doc-comment documentation in one large file.
This file contains Traveller Romani adjective inflection.
TODO: The grammar lists -e and -t forms. These should get morphosyntactic tags (here marked ad hoc as +Pl and +Der/Adv+Adv).
LEXICON a1 is the default lexicon. It gives -Ø, -e, -t and points to lexicon comp.
LEXICON a2 gives an alternative plural form in -a and redirects to a1.
LEXICON comp gives -are, -ast, -aste, the latter marked +Def.
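The interplay of a1, a2 and comp can be sketched in Python as plain suffix tables. This is only an illustration of the continuation-lexicon idea, not the lexc source: the stem `xyz` is a placeholder, and the tags +A, +Comp and +Superl are assumptions (only +Pl, +Der/Adv+Adv and +Def are named in the documentation).

```python
# Suffix tables mirroring the lexica described above.
# Assumed tags: +A, +Comp, +Superl. Documented tags: +Pl, +Der/Adv+Adv, +Def.
COMP = [("are", "+A+Comp"), ("ast", "+A+Superl"), ("aste", "+A+Superl+Def")]
A1 = [("", "+A"), ("e", "+A+Pl"), ("t", "+A+Der/Adv+Adv")]
A2_EXTRA = [("a", "+A+Pl")]  # alternative plural, then continue as a1


def inflect(stem, lexicon="a1"):
    """Return (surface form, analysis) pairs for a stem in lexicon a1 or a2."""
    entries = list(A1) + COMP          # a1 gives its own forms and points to comp
    if lexicon == "a2":
        entries = A2_EXTRA + entries   # a2 adds -a and redirects to a1
    return [(stem + suffix, stem + tags) for suffix, tags in entries]


if __name__ == "__main__":
    for surface, analysis in inflect("xyz", "a2"):  # "xyz" is a placeholder stem
        print(surface, analysis)
```

The only difference between the two lexica in this sketch is the extra -a entry, which matches the description of a2 as a thin wrapper around a1.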
This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc
This file contains the noun inflection for Traveller Romani. The nouns inflect in number and definiteness.
TODO: The lexica m1 - m4 may be unified and their inflection differences handled in twolc. f1, f2 and n have different suffixes and should be kept.
LEXICON f1m2 split in two
LEXICON f1m1 split in two
LEXICON f goes to f2
LEXICON pl with no gender info, just gives +N+Pl
LEXICON mx for uninflected (for now)
LEXICON m1 now points to m2, let us test this
LEXICON m1pl pl only
LEXICON m2 is now the sole masculine lexicon, with suffixes -en, -ar, -ane. The adjustments of m1, m3, m4 suffixes are being taken care of by morphophonology.
LEXICON m3 now points to m2, let us test this
LEXICON m4 now points to m2, let us test this
LEXICON f1 is now the sole feminine lexicon (let us test this), with suffixes -a, -a2r, -ane; the variation is handled in the morphophonology.
LEXICON f2 now points to f1
LEXICON n is the sole n lexicon, suffixes -e, -Ø, -a, -ane.
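The redirects between the noun lexica can be pictured as a small lookup chain. The sketch below is an assumption-laden illustration in Python, not the lexc code: it only records the redirects and suffix sets stated above (m1, m3, m4 to m2; f to f2; f2 to f1) and leaves out tags, since the mapping from suffix to tag is not given here.

```python
# Redirects and suffix sets as described in the documentation above.
REDIRECTS = {"m1": "m2", "m3": "m2", "m4": "m2", "f2": "f1", "f": "f2"}
SUFFIXES = {
    "m2": ["en", "ar", "ane"],
    "f1": ["a", "a2r", "ane"],   # "a2" is kept exactly as written in the source
    "n": ["e", "", "a", "ane"],  # "" stands for the null suffix
}


def resolve(lexicon):
    """Follow redirects until a lexicon that carries its own suffixes is reached."""
    seen = set()
    while lexicon in REDIRECTS:
        if lexicon in seen:
            raise ValueError(f"redirect cycle at {lexicon}")
        seen.add(lexicon)
        lexicon = REDIRECTS[lexicon]
    return lexicon


def suffixes_for(lexicon):
    """Return the suffix list a stem in the given lexicon ends up with."""
    return SUFFIXES[resolve(lexicon)]


if __name__ == "__main__":
    print(resolve("f"), suffixes_for("m4"))
```

Any morphophonological adjustment of the suffixes would still have to happen afterwards; the sketch stops at the lexical level.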
This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc
This file assigns tags to the Traveller Romani proper nouns. It does not contain any inflection.
LEXICON prop-fem
LEXICON prop-mal
LEXICON prop-plc
LEXICON prop-obj
This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc
This is a language-independent file.
LEXICON Noun_symbols_possibly_inflected
LEXICON Noun_symbols_never_inflected
LEXICON SYMBOL_connector
LEXICON SYMBOL_NO_suff
LEXICON SYMBOL_suff (dummy lexicon for now)
This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc
Traveller Romani verbs inflect for tense and participle. The lexica v1, v2, v3 are taken from the grammar. The distinction lies in the past tense suffix (null for v1 vs. -dde for v2).
LEXICON ASJA is ad hoc while waiting for more info
LEXICON v1
LEXICON v2
LEXICON v3
This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å Á É Ó Ú Í À È Ò Ù Ì Ä Ë Ö Ü Ï Â Ê Ô Û Î Ã Ý þ Ñ Ð
puia>0n
grei>en
RULE: Deleting stem-internal a before r in bisyllabic stems
This (part of) documentation was generated from src/fst/morphology/phonology.twolc
INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Traveller Romani
The morphological analyses of wordforms for Traveller Romani are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags).
The parts-of-speech are:
Gender:
The parts of speech are further split up into:
The nominals are inflected in the following (Case and) Number
The comparative forms are:
Numerals are classified under:
Other verb forms are
+PrsPrt
The verbs are syntactically split according to transitivity: (well, not yet)
Numeral subgroups
Non-dictionary words can be recognised with:
Question and Focus particles:
The usage extents are marked using the following tags:
+Use/-Spell
+Err/SpaceCmp
Semantics are classified with (so far the 4 first only)
+Sem/Clth
Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.
Morphophonology
To represent phonological variations in word forms we use the following symbols in the lexicon files: (still no such)
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: only allow compounds with verbs if the verb is further derived into a noun again.

| Flag diacritic | Explanation |
| :------------- |:----------- |
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| Flag diacritic | Explanation |
| :------------- |:----------- |
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| Flag diacritic | Explanation |
| :------------- |:----------- |
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
| Flag diacritic | Explanation |
| :------------- |:----------- |
| @U.number.one@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.two@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.three@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.four@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.five@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.six@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.seven@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.eight@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.nine@ | Flag used to give Arabic numerals in smj different cases |
| @U.number.zero@ | Flag used to give Arabic numerals in smj different cases |
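How the P, D, C and U flags listed above interact along a path can be sketched with a small checker. This is a simplified Python model of the usual flag-diacritic semantics (P sets a value, D blocks a value, C clears it, U unifies, R requires), written for illustration only; the function name `path_allowed` is hypothetical and the real evaluation happens inside the FST runtime.

```python
import re

# @OP.FEATURE.VALUE@ or @OP.FEATURE@
FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def path_allowed(symbols):
    """Return True if the flag diacritics met along a path are consistent."""
    state = {}
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if m is None:
            continue                      # ordinary symbol, no flag logic
        op, feat, val = m.groups()
        cur = state.get(feat)
        if op == "P":                     # positive set: always succeeds
            state[feat] = val
        elif op == "C":                   # clear the feature
            state.pop(feat, None)
        elif op == "D":                   # disallow: fail if the value is set
            if cur is not None and (val is None or cur == val):
                return False
        elif op == "R":                   # require: fail unless the value is set
            if cur is None or (val is not None and cur != val):
                return False
        elif op == "U":                   # unify: set if free, else must match
            if cur is None:
                state[feat] = val
            elif cur != val:
                return False
    return True


if __name__ == "__main__":
    print(path_allowed(["@P.NeedNoun.ON@", "verb", "@D.NeedNoun.ON@"]))
    print(path_allowed(["@P.NeedNoun.ON@", "verb", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```

In this model a verb that sets NeedNoun is blocked at a later @D.NeedNoun.ON@ unless a nominalising suffix has cleared the flag with @C.NeedNoun@ in between.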
LEXICON Root is where it all begins. Word forms in the Traveller Romani language start from the lexeme roots of the basic word classes, or optionally from prefixes. The basic lexica are: Adjectives, Adverbs, Conjunctions, Interjections, Nouns, Numerals, Prefixes, Prepositions, Pronouns, Propernouns, Punctuation, Subjunctions, Symbols, Verbs.
This (part of) documentation was generated from src/fst/morphology/root.lexc
Adjectives in the Traveller Romani language have two sublexica, a1 and a2.
LEXICON Adjectives
This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc
For now, this file contains not only adverbs (they should stay), but also a rest category of things to be moved to their respective stem files (one file for each part-of-speech).
LEXICON adv adds the tag +Adv
LEXICON Adverbs lists the adverbs themselves (as well as the rest category, for now)
alonom adv “aleine” ;
This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc
LEXICON cc adds the tag +CC
LEXICON Conjunctions contains the conjunctions (2 so far)
This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc
LEXICON ij adds the tag +Interj
LEXICON Interjections contains the interjections (one so far)
This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc
This is the noun stem file for Traveller Romani (romani rakkripa). Nouns in Traveller Romani are divided into m, f and n.
TODO Conflate m1-m4.
LEXICON Nouns
This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc
This is a list of whatever was found in the dictionary.
The file contains a first draft of a systematic setup for generating all numerals.
LEXICON num just adds the tag +Num.
LEXICON Numerals splits the layers of numerals into sublexica
LEXICON TEENS where dypansj- is redirected to 1to9.
LEXICON 1to9 contains the basic numerals
LEXICON TENS contains 10, 20, .. 90.
LEXICON TENSsplit splits 20, … 90 into 20 vs 21, 22, …
LEXICON numeralcompounds is a lexicon to be looked into.
This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc
Prefixes in the Traveller Romani language …
Nothing has been done on this yet; it is a dummy file. The intention is to add prefixes, such as Norwegian u-, if they turn out to be needed.
This (part of) documentation was generated from src/fst/morphology/stems/prefixes.lexc
LEXICON pr adds the tag +Pr
LEXICON Prepositions contains the prepositions (10 so far)
This (part of) documentation was generated from src/fst/morphology/stems/prepositions.lexc
This is a list of pronouns in the Traveller Romani language.
TODO: The list should be completed and given morphosyntactic tags when needed.
LEXICON pers adds the tags +Pron+Pers
LEXICON Pronouns lists personal pronouns
This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc
Here, we should use the nob file. The tags are added in the affixes/propernouns.lexc file.
LEXICON Propernouns
This (part of) documentation was generated from src/fst/morphology/stems/propernouns.lexc
LEXICON cs adds the tag +CS
LEXICON im adds the tag +IM
LEXICON Subjunctions lists the subjunctions (4 so far)
This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc
This is the list of verbs in the Traveller Romani language.
LEXICON Verbs
This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc
retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628
bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614
alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\
labiodental approximant P (or v\)
alveolar approximant r\
retroflex approximant r\`
velar approximant M\
retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L\
Clicks
bilabial O\ (O = capital letter)
dental |\
(post)alveolar !\
palatoalveolar =\
alveolar lateral |\|\
Ejectives, implosives
ejective _> e.g. ejective p p_>
implosive _< e.g. implosive b b_<
Vowels
close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U
close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7
schwa ə @
open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O
ash (ae digraph) {
open schwa (turned a) 6
open front rounded &
open back unrounded A
open back rounded Q
Other symbols
voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\
alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _
Suprasegmentals
primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -\
Tones and word accents
level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)
contour, rising _R
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F
(NB Instead of being written as diacritics with _, all prosodic
marks can alternatively be placed in a separate tier, set off
by < >, as recommended for the next two symbols.)
global rise <R>
global fall <F>
Diacritics
voiceless _0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=) e.g. n= (or n_=)
non-syllabic _^
rhoticity `
breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j) e.g. t' (or t_j)
velarized _G
pharyngealized _?\
dental _d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A_~)
nasal release _n
lateral release _l
no audible release _}
velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
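A symbol table like the one above is typically applied with greedy longest-match replacement, so that two-character symbols such as N\ win over the one-character N. The Python sketch below illustrates that technique on a small subset of the pairs listed above; it is an assumption about how such a mapping can be applied, not a description of what txt2ipa.xfscript itself does.

```python
# Subset of the X-SAMPA to IPA pairs from the table above.
XSAMPA_TO_IPA = {
    "z`": "ʐ", "N\\": "ɴ", "B\\": "ʙ", "R\\": "ʀ", "p\\": "ɸ",
    "F": "ɱ", "J": "ɲ", "N": "ŋ", "4": "ɾ", "B": "β", "T": "θ",
    "D": "ð", "S": "ʃ", "Z": "ʒ", "C": "ç", "G": "ɣ", "X": "χ", "R": "ʁ",
}
MAX_LEN = max(len(key) for key in XSAMPA_TO_IPA)


def to_ipa(text):
    """Convert X-SAMPA to IPA, always trying the longest symbol first."""
    out = []
    i = 0
    while i < len(text):
        for n in range(MAX_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])   # characters without a mapping pass through
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_ipa("SiN"))
```

Trying the longest candidate first is what keeps N\ (uvular nasal) from being read as N (velar nasal) followed by a stray backslash.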
This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript
We describe here how abbreviations in Traveller Norwegian are read out, e.g. for text-to-speech systems.
For example:
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc
% komma% :, Root ;
% tjuohkkis% :%. Root ;
% kolon% :%: Root ;
% sárggis% :%- Root ;
% násti% :%* Root ;
This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc
Usage:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
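Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.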
Unknowns are made of:
Unknowns are tagged ?? and treated specially with hfst-tokenise
hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and
remove empty analyses from other readings. Empty readings are also
legal in CG; they get a default baseform equal to the wordform but
no tag to check, so it is safer to let hfst-tokenise handle them.
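The handling of empty analyses described above can be pictured with a few lines of Python. This is a rough sketch under stated assumptions, not the hfst-tokenise implementation: readings are modelled as hypothetical (baseform, tags) pairs, an empty reading is `("", [])`, and the function name `clean_readings` is invented for illustration.

```python
def clean_readings(wordform, readings):
    """Drop empty readings; mark the token unknown if nothing else is left.

    readings is a list of (baseform, tags) pairs, where an empty reading
    is represented as ("", []).
    """
    non_empty = [r for r in readings if r[0] or r[1]]
    if not non_empty:
        # Only empty analyses: treat the whole token as unknown.
        return {"wordform": wordform, "unknown": True, "readings": []}
    # Mixed case: keep the real analyses and discard the empty ones.
    return {"wordform": wordform, "unknown": False, "readings": non_empty}


if __name__ == "__main__":
    print(clean_readings("ukjend", [("", [])]))
    print(clean_readings("ja", [("ja", ["CC"]), ("", [])]))
```

Doing this before the CG stage means the grammar never sees a reading that has a baseform but no tag to test.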
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision >= 3aecdbc). Then just:
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.
TODO: Could use something like this, but the built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform but no tag to check, so it is safer to let hfst-tokenise handle them.
Finally we mark as a token any sequence making up a:
This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript
Requires a recent version of HFST (3.10.0 / git revision >= 3aecdbc). Then just:
make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
More usage examples:
echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:
U+00AD
U+FEFF
.Whitespace contains ASCII white space and the List contains some unicode white space characters
Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.
TODO: Could use something like this, but the built-ins don’t include šžđčŋ:
Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform but no tag to check, so it is safer to let hfst-tokenise handle them.
Needs hfst-tokenise to output things differently depending on the tag they get
This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript