Traveller Norwegian NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-rmg

Traveller Norwegian language model documentation

All doc-comment documentation in one large file.


src-fst-morphology-affixes-adjectives.lexc.md

Adjective inflection

This file contains Traveller Romani adjective inflection.

TODO! The grammar lists -e and -t forms. These should get morphosyntactic tags (here marked ad hoc as +Pl and +Der/Adv+Adv).

LEXICON a1 is the default lexicon. It gives -Ø, -e, -t and points to lexicon comp.

LEXICON a2 gives an alternate plural form in -a and redirects to a1.

LEXICON comp gives -are, -ast, -aste, the latter marked +Def.
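The chaining of a2, a1 and comp described above can be illustrated with a minimal Python sketch. It is not the lexc source: the stem `X` is a placeholder, only the tags named in this section are attached, and the `+Pl` tag on the -a form of a2 is an assumption based on the phrase "alternate plural form".

```python
# Each lexicon entry is (suffix, tags, continuation lexicon or None).
# Suffixes follow the description above; tags are limited to those the
# text mentions (+Pl, +Der/Adv+Adv, +Def). "+Pl" on a2's -a is assumed.
LEXICONS = {
    "a1": [
        ("", "", None),               # bare form (-Ø)
        ("e", "+Pl", None),           # -e, ad hoc +Pl
        ("t", "+Der/Adv+Adv", None),  # -t, ad hoc +Der/Adv+Adv
        ("", "", "comp"),             # continue to comparison forms
    ],
    "a2": [
        ("a", "+Pl", None),           # alternate plural in -a (tag assumed)
        ("", "", "a1"),               # redirect to a1
    ],
    "comp": [
        ("are", "", None),
        ("ast", "", None),
        ("aste", "+Def", None),       # the latter marked +Def
    ],
}


def expand(stem, lexicon, tags=""):
    """Follow continuation lexicons and collect (form, tags) pairs."""
    forms = []
    for suffix, tag, nxt in LEXICONS[lexicon]:
        if nxt is None:
            forms.append((stem + suffix, tags + tag))
        else:
            forms.extend(expand(stem + suffix, nxt, tags + tag))
    return forms


if __name__ == "__main__":
    for form, tags in expand("X", "a2"):
        print(form, tags)
```

A stem entered in a2 receives the extra -a form and then everything a1 provides, which is why a2 only needs a single entry of its own.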


This (part of) documentation was generated from src/fst/morphology/affixes/adjectives.lexc


src-fst-morphology-affixes-nouns.lexc.md

Noun inflection

This file contains the noun inflection for Traveller Romani. The nouns inflect in number and definiteness.

TODO: The lexica m1 - m4 may be unified and their inflection differences handled in twolc. f1, f2 and n have different suffixes and should be kept.

The lexicons

Lexicons pointing to other lexicons

LEXICON f1m2 split in two

LEXICON f1m1 split in two

LEXICON f goes to f2

LEXICON pl with no gender info, just gives +N+Pl

Lexicons for masculine nouns

LEXICON mx for uninflected (for now)

LEXICON m1 now points to m2, let us test this

LEXICON m1pl pl only

LEXICON m2 is now the sole masculine lexicon, with suffixes -en, -ar, -ane. The adjustments of m1, m3, m4 suffixes are being taken care of by morphophonology.

LEXICON m3 now points to m2, let us test this

LEXICON m4 now points to m2, let us test this

Lexicons for feminine nouns

LEXICON f1 is now the sole lexicon, let us test this, -a, -a2r, -ane suffixes, and the variation handled in the morphophonology

LEXICON f2 now points to f1

Lexicons for neuter nouns

LEXICON n is the sole n lexicon, suffixes -e, -Ø, -a, -ane.


This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


src-fst-morphology-affixes-propernouns.lexc.md

Proper noun inflection

This file assigns tags to the Traveller Romani proper nouns. It does not contain any inflection.


This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


src-fst-morphology-affixes-symbols.lexc.md

Symbol affixes

This is a language-independent file.

LEXICON Noun_symbols_possibly_inflected

LEXICON Noun_symbols_never_inflected

LEXICON SYMBOL_connector

LEXICON SYMBOL_NO_suff

LEXICON SYMBOL_suff (dummy lexicon for now)


This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


src-fst-morphology-affixes-verbs.lexc.md

Traveller Norwegian verb inflection

The Traveller Romani language verbs inflect in tense and participle. The lexica v1, v2, v3 are taken from the grammar. The distinction lies in the past tense suffix (null for v1 vs. dde for v2).

LEXICON ASJA is ad hoc while waiting for more info

LEXICON v1

LEXICON v2

LEXICON v3


This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


src-fst-morphology-phonology.twolc.md

The Traveller Romani morphophonological/twolc rules file

Alphabet

Sets

Rules

RULE: Deleting stem-internal a before r in bisyllabic stems


This (part of) documentation was generated from src/fst/morphology/phonology.twolc


src-fst-morphology-root.lexc.md

Traveller Romani morphological analyser

INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Traveller Romani

Definitions for Multichar_Symbols

Analysis symbols

The morphological analyses of wordforms for Traveller Romani are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags).

The parts-of-speech are:

Gender:

The parts of speech are further split up into:

The nominals are inflected in the following (Case and) Number

The comparative forms are:

Numerals are classified under:

Other verb forms are

The verbs are syntactically split according to transitivity: (well, not yet)

Numeral subgroups

Non-dictionary words can be recognised with:

Question and Focus particles:

The Usage extents are marked using following tags:

Paradigm choice

Semantics are classified with (so far the 4 first only)

Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.

Morphophonology

To represent phonological variations in word forms we use the following symbols in the lexicon files: (still no such)

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

| Flag diacritic | Explanation |
|:---------------|:------------|
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
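How the three NeedNoun flags could interact is sketched below in Python. This is a simplified simulation of the standard flag-diacritic operations (P sets a feature, D blocks a path when the feature has the given value, C clears it, U unifies), not the HFST implementation. The placement of the flags on the example paths (P after the verb, C after the nominalising suffix, D at the end of the word) is an assumption made for illustration, and `VERB`, `NMLZ` and `NOUN` are hypothetical placeholder symbols.

```python
import re

# Matches @OP.FEATURE@ and @OP.FEATURE.VALUE@ for the operations used here.
FLAG = re.compile(r"@([PDCU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(symbols):
    """Return True if the flag diacritics along a path are consistent."""
    state = {}
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if m is None:
            continue  # ordinary symbol, no effect on flag state
        op, feat, val = m.groups()
        if op == "P":    # positive set
            state[feat] = val
        elif op == "C":  # clear
            state.pop(feat, None)
        elif op == "D":  # disallow if feature is set (to this value)
            if feat in state and (val is None or state[feat] == val):
                return False
        elif op == "U":  # unify: must agree with any earlier value
            if feat in state and state[feat] != val:
                return False
            state[feat] = val
    return True


# Hypothetical paths: a verb inside a compound sets NeedNoun, and the word
# end refuses any path where NeedNoun is still ON.
verb_compound = ["VERB", "@P.NeedNoun.ON@", "NOUN", "@D.NeedNoun.ON@"]
nominalised = ["VERB", "@P.NeedNoun.ON@", "NMLZ", "@C.NeedNoun@",
               "NOUN", "@D.NeedNoun.ON@"]

if __name__ == "__main__":
    print(path_is_allowed(verb_compound))
    print(path_is_allowed(nominalised))
```

Under these assumptions the first path is rejected and the second is accepted, because only the nominalising step clears the requirement before the word ends.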

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| Flag diacritic | Explanation |
|:---------------|:------------|
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| Flag diacritic | Explanation |
|:---------------|:------------|
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |

| Flag diacritic | Explanation |
|:---------------|:------------|
| @U.number.one@ | Flag used to give arabic numerals in smj different cases |
| @U.number.two@ | Flag used to give arabic numerals in smj different cases |
| @U.number.three@ | Flag used to give arabic numerals in smj different cases |
| @U.number.four@ | Flag used to give arabic numerals in smj different cases |
| @U.number.five@ | Flag used to give arabic numerals in smj different cases |
| @U.number.six@ | Flag used to give arabic numerals in smj different cases |
| @U.number.seven@ | Flag used to give arabic numerals in smj different cases |
| @U.number.eight@ | Flag used to give arabic numerals in smj different cases |
| @U.number.nine@ | Flag used to give arabic numerals in smj different cases |
| @U.number.zero@ | Flag used to give arabic numerals in smj different cases |

Compound tags

Language tags

LEXICON Root is where it all begins. The word forms in the Romany language start from the lexeme roots of basic word classes, or optionally from prefixes. The basic lexica are: Adjectives ; Adverbs ; Conjunctions ; Interjections ; Nouns ; Numerals ; Prefixes ; Prepositions ; Pronouns ; Propernouns ; Punctuation ; Subjunctions ; Symbols ; Verbs ;


This (part of) documentation was generated from src/fst/morphology/root.lexc


src-fst-morphology-stems-adjectives.lexc.md

Adjectives

Adjectives in the Traveller Romani language have two sublexica, a1 and a2.

LEXICON Adjectives


This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


src-fst-morphology-stems-adverbs.lexc.md

Traveller Romani adverbs

For now, this file contains not only adverbs (they should stay), but also a rest category of things to be moved to their respective stem files (one file for each part-of-speech).

LEXICON adv adds the tag +Adv

LEXICON Adverbs lists the adverbs themselves (as well as the rest category, for now)


This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


src-fst-morphology-stems-conjunctions.lexc.md

Traveller Norwegian conjunctions

LEXICON cc adds the tag +CC

LEXICON Conjunctions contains the conjunctions (2 so far)


This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


src-fst-morphology-stems-interjections.lexc.md

Traveller Norwegian interjection file

LEXICON ij adds the tag +Interj

LEXICON Interjections contains the interjections (one so far)


This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


src-fst-morphology-stems-nouns.lexc.md

Traveller Norwegian nouns

This is the noun stem file for Traveller Romani (romani rakkripa). Nouns in Traveller Romani are divided into m, f, n.

TODO Conflate m1-m4.

LEXICON Nouns


This (part of) documentation was generated from src/fst/morphology/stems/nouns.lexc


src-fst-morphology-stems-numerals.lexc.md

Traveller Norwegian numerals

This is a list of whatever was found in the dictionary.

The file contains a first draft of a systematic setup for generating all numerals.

LEXICON num just adds the tag +Num.

LEXICON Numerals splits layers of numerals into sublexica

LEXICON TEENS where dypansj- is redirected to 1to9.

LEXICON 1to9 contains the basic numerals

LEXICON TENS contains 10, 20, .. 90.

LEXICON TENSsplit splits 20, … 90 into 20 vs 21, 22, …

LEXICON numeralcompounds is a lexicon to be looked into.


This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


src-fst-morphology-stems-prefixes.lexc.md

Prefixes

Prefixes in the Traveller Romani language …

Nothing has been done on this; it is a dummy file. The intention is to add any prefixes, such as Norwegian u-.


This (part of) documentation was generated from src/fst/morphology/stems/prefixes.lexc


src-fst-morphology-stems-prepositions.lexc.md

Traveller Norwegian prepositions

LEXICON pr adds the tag +Pr

LEXICON Prepositions contains the prepositions (10 so far)


This (part of) documentation was generated from src/fst/morphology/stems/prepositions.lexc


src-fst-morphology-stems-pronouns.lexc.md

Traveller Norwegian pronouns

This is a list of pronouns in the Traveller Romani language.

TODO The list should be completed and given morphosyntactic tags when needed.

LEXICON pers adds the tags +Pron+Pers

LEXICON Pronouns lists personal pronouns


This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


src-fst-morphology-stems-propernouns.lexc.md

Traveller Romani propernouns

Here, we should use the nob file. The tags are added in the affixes/propernouns.lexc file.

LEXICON Propernouns


This (part of) documentation was generated from src/fst/morphology/stems/propernouns.lexc


src-fst-morphology-stems-subjunctions.lexc.md

Traveller Norwegian subjunction file

LEXICON cs adds the tag +CS

LEXICON im adds the tag +IM

LEXICON Subjunctions lists the subjunctions (4 so far)


This (part of) documentation was generated from src/fst/morphology/stems/subjunctions.lexc


src-fst-morphology-stems-verbs.lexc.md

Traveller Norwegian verbs

This is the list of verbs in the Traveller Romani language.

LEXICON Verbs


This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


src-fst-phonetics-txt2ipa.xfscript.md

retroflex plosive, voiceless t` ʈ 0288, 648 (` = ASCII 096)
retroflex plosive, voiced d` ɖ 0256, 598
labiodental nasal F ɱ 0271, 625
retroflex nasal n` ɳ 0273, 627
palatal nasal J ɲ 0272, 626
velar nasal N ŋ 014B, 331
uvular nasal N\ ɴ 0274, 628

bilabial trill B\ ʙ 0299, 665
uvular trill R\ ʀ 0280, 640
alveolar tap 4 ɾ 027E, 638
retroflex flap r` ɽ 027D, 637
bilabial fricative, voiceless p\ ɸ 0278, 632
bilabial fricative, voiced B β 03B2, 946
dental fricative, voiceless T θ 03B8, 952
dental fricative, voiced D ð 00F0, 240
postalveolar fricative, voiceless S ʃ 0283, 643
postalveolar fricative, voiced Z ʒ 0292, 658
retroflex fricative, voiceless s` ʂ 0282, 642
retroflex fricative, voiced z` ʐ 0290, 656
palatal fricative, voiceless C ç 00E7, 231
palatal fricative, voiced j\ ʝ 029D, 669
velar fricative, voiced G ɣ 0263, 611
uvular fricative, voiceless X χ 03C7, 967
uvular fricative, voiced R ʁ 0281, 641
pharyngeal fricative, voiceless X\ ħ 0127, 295
pharyngeal fricative, voiced ?\ ʕ 0295, 661
glottal fricative, voiced h\ ɦ 0266, 614

alveolar lateral fricative, vl. K
alveolar lateral fricative, vd. K\

labiodental approximant P (or v\)
alveolar approximant r\
retroflex approximant r\`
velar approximant M\

retroflex lateral approximant l`
palatal lateral approximant L
velar lateral approximant L\

Clicks

bilabial O\ (O = capital letter)
dental |\
(post)alveolar !\
palatoalveolar =\
alveolar lateral |\|\

Ejectives, implosives

ejective _> e.g. ejective p p_>
implosive _< e.g. implosive b b_<

Vowels

close back unrounded M
close central unrounded 1
close central rounded }
lax i I
lax y Y
lax u U

close-mid front rounded 2
close-mid central unrounded @\
close-mid central rounded 8
close-mid back unrounded 7

schwa ə @

open-mid front unrounded E
open-mid front rounded 9
open-mid central unrounded 3
open-mid central rounded 3\
open-mid back unrounded V
open-mid back rounded O

ash (ae digraph) {
open schwa (turned a) 6

open front rounded &
open back unrounded A
open back rounded Q

Other symbols

voiceless labial-velar fricative W
voiced labial-palatal approx. H
voiceless epiglottal fricative H\
voiced epiglottal fricative <\
epiglottal plosive >\

alveolo-palatal fricative, vl. s\
alveolo-palatal fricative, voiced z\
alveolar lateral flap l\
simultaneous S and x x\
tie bar _

Suprasegmentals

primary stress "
secondary stress %
long :
half-long :\
extra-short _X
linking mark -\

Tones and word accents

level extra high _T
level high _H
level mid _M
level low _L
level extra low _B
downstep !
upstep ^ (caret, circumflex)

contour, rising _R
contour, falling _F
contour, high rising _H_T
contour, low rising _B_L
contour, rising-falling _R_F

(NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

global rise <R>
global fall <F>

Diacritics

voiceless _0 (0 = figure), e.g. n_0
voiced _v
aspirated _h
more rounded _O (O = letter)
less rounded _c
advanced _+
retracted _-
centralized _"
syllabic = (or _=) e.g. n= (or n_=)
non-syllabic _^
rhoticity `

breathy voiced _t
creaky voiced _k
linguolabial _N
labialized _w
palatalized ' (or _j) e.g. t' (or t_j)
velarized _G
pharyngealized _?\

dental _d
apical _a
laminal _m
nasalized ~ (or _~) e.g. A~ (or A_~)
nasal release _n
lateral release _l
no audible release _}

velarized or pharyngealized _e
velarized l, alternatively 5
raised _r
lowered _o
advanced tongue root _A
retracted tongue root _q
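A table like the one above is typically applied with longest-match replacement, so that a two-character symbol such as `R\` is not read as `R` followed by a stray backslash. The Python sketch below illustrates that idea with a small subset of the listed symbols. It is not the txt2ipa.xfscript itself, and the function and variable names are hypothetical.

```python
# A small subset of the X-SAMPA to IPA pairs listed above.
XSAMPA_TO_IPA = {
    "t`": "ʈ",   # retroflex plosive, voiceless
    "B\\": "ʙ",  # bilabial trill
    "R\\": "ʀ",  # uvular trill
    "R": "ʁ",    # uvular fricative, voiced
    "S": "ʃ",    # postalveolar fricative, voiceless
    "Z": "ʒ",    # postalveolar fricative, voiced
    "N": "ŋ",    # velar nasal
    "T": "θ",    # dental fricative, voiceless
    "D": "ð",    # dental fricative, voiced
    "@": "ə",    # schwa
}

# Try longer symbols first so "R\" wins over "R".
SYMBOLS_BY_LENGTH = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text):
    """Convert an X-SAMPA string to IPA, passing unknown characters through."""
    out = []
    i = 0
    while i < len(text):
        for sym in SYMBOLS_BY_LENGTH:
            if text.startswith(sym, i):
                out.append(XSAMPA_TO_IPA[sym])
                i += len(sym)
                break
        else:
            out.append(text[i])  # not in the table: keep as is
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa("SiN"))
    print(xsampa_to_ipa("R\\a"), xsampa_to_ipa("Ra"))
```

Characters without an entry in the table are copied unchanged, which matches the fact that many X-SAMPA symbols are identical to their IPA counterparts.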


This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

We describe here how abbreviations in Traveller Norwegian are read out, e.g. for text-to-speech systems.

For example:


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

% komma% :, Root ;
% tjuohkkis% :%. Root ;
% kolon% :%: Root ;
% sárggis% :%- Root ;
% násti% :%* Root ;


This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

Tokeniser for rmg

Usage:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

Unknowns are made of:

  • lower-case ASCII
  • upper-case ASCII
  • select extended latin symbols
  • ASCII digits
  • select symbols
  • Combining diacritics as individual symbols,
  • various symbols from Private area (probably Microsoft), so far:
    • U+F0B7 for “x in box”
Unknown handling

Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

Grammar checker tokenisation for rmg

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Finally we mark as a token any sequence making up a:


This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

TTS tokenisation for rmg

Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

make
echo "ja, ja" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

More usage examples:

echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
boasttu olmmoš, man mielde lahtuid." \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
echo "márffibiillagáffe" \
| hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

Whitespace contains ASCII white space and the List contains some unicode white space characters

Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

TODO: Could use something like this, but built-ins don’t include šžđčŋ:

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

Needs hfst-tokenise to output things differently depending on the tag they get.


This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript