Irish NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-gle

Irish language model documentation

    All doc-comment documentation in one large file.


    src-cg3-functions.cg3.md

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    The set NOT-NPMOD is used to find barriers between NPs. Typical usage: … (*1 N BARRIER NOT-NPMOD) …, meaning: scan to the first noun, ignoring anything that can be part of the noun phrase of that noun (i.e., “scan to the next NP head”).
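The barrier scan described above can be sketched in Python. This is only an illustration of the idea behind `(*1 N BARRIER NOT-NPMOD)`, not the CG-3 implementation. The token representation (a dict with a `tags` set), the `PREMODIFIERS` inventory and the function names are assumptions made for this example.

```python
# Hypothetical inventory of tags that may occur in front of an NP head.
PREMODIFIERS = {"Art", "Det", "Num"}


def is_not_npmod(token):
    """NOT-NPMOD = WORD - premodifiers: a word carrying no premodifier tag."""
    return not (token["tags"] & PREMODIFIERS)


def scan_to_np_head(tokens, start):
    """Scan right from start + 1 for the first noun.

    Returns the index of that noun, or None if a non-premodifier
    (the barrier) is met before any noun is found.
    """
    for i in range(start + 1, len(tokens)):
        if "N" in tokens[i]["tags"]:
            return i      # target found: the NP head
        if is_not_npmod(tokens[i]):
            return None   # barrier reached before a noun
    return None


sentence = [
    {"form": "ar", "tags": {"Prep"}},
    {"form": "an", "tags": {"Art"}},
    {"form": "mbosca", "tags": {"N"}},
    {"form": "agus", "tags": {"CC"}},
    {"form": "cat", "tags": {"N"}},
]

head = scan_to_np_head(sentence, 0)        # skips the article, stops at the noun
blocked = scan_to_np_head(sentence, 2)     # the conjunction acts as a barrier
```

In this sketch the target is tested before the barrier, so the noun itself (which is also a non-premodifier) ends the scan successfully rather than blocking it.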

    These were the set types.

    HABITIVE MAPPING

    sma object

    SUBJ MAPPING - leftovers

    OBJ MAPPING - leftovers

    HNOUN MAPPING


    This (part of) documentation was generated from src/cg3/functions.cg3


    src-fst-morphology-affixes-nouns.lexc.md

    Moirfeolaíocht na nAinmfhocail Gaeilge (Morphology of Irish Nouns)

    FEMININE NOUN continuation classes

    Weak Plurals : Broad singular is made slender; plural already broad

    Weak Plurals : Broaden

    Singular already slender; plural is made broad

    Weak Plurals :

    Weak Plurals :

    STRONG PLURALS

    STRONG PLURALS

    STRONG PLURALS

    STRONG PLURALS

    STRONG PLURALS

    STRONG PLURALS

    3rd Declension Strong Plurals : +aí

    an bheannacht -> na beannachtaí

    gamhain - gamhna (gs), midheamhain - midheamhna (gs)

    Strong Plurals : +(e)anna

    tóin -> tóineanna
    scoth -> scothanna
    EXCEPTION: an chuid -> na codanna (see FIX file)
    EXCEPTION: an raith -> na rathanna (see FIX file)
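The choice between +anna and +eanna can be sketched as follows. This is a simplified illustration that works on surface strings, not the lexc continuation classes themselves; the function name and the vowel sets are assumptions for the example, and the exceptions listed above (an chuid, an raith) are deliberately not handled.

```python
SLENDER_VOWELS = set("eiéí")
BROAD_VOWELS = set("aouáóú")


def plural_anna(noun):
    """Add -eanna after a slender final vowel, -anna after a broad one."""
    for ch in reversed(noun):
        if ch in SLENDER_VOWELS:
            return noun + "eanna"
        if ch in BROAD_VOWELS:
            return noun + "anna"
    # No vowel found: fall back to the broad suffix.
    return noun + "anna"


forms = [plural_anna(n) for n in ("tóin", "scoth")]
```

The function looks at the last vowel of the stem, so the suffix agrees in quality with the final consonant.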

    Strong Plurals : Broaden +anna

    an chuid -> na codanna (see FIX file)
    an raith -> na rathanna
    an laith -> na lathanna
    an luaith -> na luathanna

    Strong Plurals : +í

    an bhearna -> na bearnaí
    an eala -> na healaí

    Strong Plurals : Athrú e -> í

    an aicme -> na haicmí (classes)
    an táille -> na táillí (fees)

    Strong Plurals :

    various nouns ending in a vowel; plurals +nna

    Strong Plurals : Leathnú +acha

    an bheoir -> na beoracha (beers)

    Gen Sg : Coim + ach
    Strong Plurals : Coimriú +eacha

    an chathaoir -> na cathaoireacha (chairs) (Note: the long vowel aoi is not syncopated.)
    an chathair -> na cathracha

    Gen Sg : Coim + a
    Strong Plurals : Coimriú +(e)acha
    samhail -> samhla
    anacair -> anacra

    Gen Sg : Coim + Slen + e
    Strong Plurals : Coimriú +(e)acha
    crithir - critre
    fothair - foithre

    tarraingt - tarraingthe - tarraingtí

    MASCULINE NOUN continuation classes

    WEAK PLURALS (i.e. where the nominative and genitive plurals are different)

    TYPE 1 Nom pl. ends in a consonant, e.g. cat : cait, fear : fir, marcach : marcaigh

    TYPE 2 Nom pl. formed by adding -a, e.g. cos : cosa, úll : úlla

    (TYPE 3) Nom pl. formed by adding -ta eg

    2nd Declension sliabh -> na sléibhte

    3rd Declension Strong Plurals : +í as in Nm7 but singular are different

    eg. bádóir -> na bádóirí

    Strong Plurals : +anna

    eg. an bláth -> na bláthanna

    Strong Plurals : +aí

    gen briocht -> breachta
    Strong Plurals : +aí
    briocht -> briochtaí

    Strong Plurals : +anna eg. an bláth -> na bláthanna

    ^Lea (broadening) is required in the gen sg: io -> ea (bior, crios); this is done using ^Ath (change). Pl: bior -> bioranna

    Strong Plurals : Athrú +anna (io->ea) eg. an cith -> na ceathanna

    ^Lea (broadening) is required in the gen sg: cith -> ceatha, greim -> greama; this is done using ^Ath (change). The pl is also broadened: cith -> ceathanna

    sliocht - sleachta gs & pl

    Strong Plurals : +í

    (A) nouns ending in -ín (a diminutive)
    smidiríní (smithereens): no singular
    eg. an cailín -> na cailíní (girls)
    eg. an báidín -> na báidíní (small boats)

    (B) nouns ending in -a eg. an balla -> na ballaí (walls)

    01/04/08

    Strong Plurals : +idí
    an fiche -> na fichidí (the twenties)
    eidí needs correcting
    an caoga -> na caogaidí (the fifties)

    GS +the

    GS +te

    GS +tha PL +thaí bascadh - basctha - bascthaí

    GS +ta

    moladh / gs = molta / pl = moltaí

    INITIAL MUTATIONS NOMINATIVE SINGULAR

    ^IM = initial mutation, e.g. with prepositions and possession
    Singular: e.g. ar an bhosca, ar an mbosca
    possessive markers on vowels: ár n-athair, a (f) hathair
    Plural: e.g. ar bhoscaí, i mboscaí
    possessive: e.g. ár n-aithreacha - our fathers (^C)

    adds ^h to vowel-initial words … but adds the +hPref to all words … see fix file
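The surface effect of the initial mutations mentioned above (lenition, eclipsis and h-prefixation) can be sketched in Python. This is an illustration on plain strings only. In the analyser the mutations are triggered by symbols such as ^IM and realised by the phonological rules, which this sketch does not model. All function and variable names are assumptions for the example.

```python
VOWELS = set("aeiouáéíóú")
LENITABLE = set("bcdfgmpst")
# Letters written in front of the initial consonant under eclipsis.
ECLIPSIS_PREFIX = {"b": "m", "c": "g", "d": "n", "f": "bh",
                   "g": "n", "p": "b", "t": "d"}


def lenite(word):
    """Insert h after a lenitable initial consonant (bosca -> bhosca)."""
    first, second = word[0].lower(), word[1:2].lower()
    if first not in LENITABLE or second == "h":
        return word
    # s is not lenited before c, f, m, p, t or v.
    if first == "s" and second in set("cfmptv"):
        return word
    return word[0] + "h" + word[1:]


def eclipse(word):
    """Apply eclipsis (bosca -> mbosca, athair -> n-athair)."""
    first = word[0]
    if first.lower() in VOWELS:
        # n- before a lower-case vowel, bare n before a capital.
        return ("n-" if first.islower() else "n") + word
    return ECLIPSIS_PREFIX.get(first.lower(), "") + word


def h_prefix(word):
    """Prefix h to a vowel-initial word (athair -> hathair)."""
    return "h" + word if word[0].lower() in VOWELS else word


examples = (lenite("bosca"), eclipse("bosca"), eclipse("athair"), h_prefix("athair"))
```

The examples reproduce the forms quoted above: ar an bhosca, ar an mbosca, ár n-athair and a hathair.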

    GENITIVE SINGULAR

    VOCATIVE SINGULAR Since this is trivial (always ^Sé) it is included with Final Mutations in Voc-sg-0 and Voc-sg-1.

    ALL PLURALS Note: Vocative Plural does not require Def & Idf but it is easier to generate them and remove all Voc Pl Idfs at the end (the Def form is correct although the Def marker is unnecessary)

    FINAL MUTATIONS NOMINATIVE SINGULAR

    GENITIVE SINGULAR

    VOCATIVE SINGULAR

    ALL PLURALS

    When it is a place name, as well as the usual inflections for proper nouns (4 classes), we want to generate an adjectival form, e.g. Beilg - Beilgeach.

    new 5-6-2024 Place and Personal name files both use Nf1-Prop and Nm1-Prop etc.

    masc nouns - slenderise

    fem nouns - slenderise and add e

    fem nouns - broaden and add a

    fem nouns - no change

    masc nouns - no change

    fem nouns - Albain/na hAlban


    This (part of) documentation was generated from src/fst/morphology/affixes/nouns.lexc


    src-fst-morphology-affixes-prefixes.lexc.md

    Prefixes

    Prefixes in the Irish language are bound to the beginning of other words.


    This (part of) documentation was generated from src/fst/morphology/affixes/prefixes.lexc


    src-fst-morphology-affixes-propernouns.lexc.md

    Proper noun inflection

    Irish proper nouns inflect in the same cases as regular nouns, but with a colon (‘:’) as separator.


    This (part of) documentation was generated from src/fst/morphology/affixes/propernouns.lexc


    src-fst-morphology-affixes-symbols.lexc.md

    Symbol affixes


    This (part of) documentation was generated from src/fst/morphology/affixes/symbols.lexc


    src-fst-morphology-affixes-verbs.lexc.md

    Inserted +Len and +Uru to distinguish between a bhíonn and a mbíonn (direct/indirect relative clauses), Dec 2004.


    This (part of) documentation was generated from src/fst/morphology/affixes/verbs.lexc


    src-fst-morphology-phonology.nounadj.xfscript.md

    a d h -> [%^FC ]   [d n t l s] %^X _ %^Ath (%^Caol) t

    This (part of) documentation was generated from src/fst/morphology/phonology.nounadj.xfscript


    src-fst-morphology-phonology.twolc.md

    The Irish morphophonological/twolc rules file


    This (part of) documentation was generated from src/fst/morphology/phonology.twolc


    src-fst-morphology-phonology.verb.xfscript.md

    Verbal Noun Gen


    This (part of) documentation was generated from src/fst/morphology/phonology.verb.xfscript


    src-fst-morphology-root-adj.lexc.md

    INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Irish LANGUAGE.

    Multichar_Symbols definitions

    Analysis symbols

    The morphological analyses of wordforms of the Irish language are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags.)

    Subj is used for subjunctive


    This (part of) documentation was generated from src/fst/morphology/root-adj.lexc


    src-fst-morphology-root-noun-all.lexc.md

    INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Irish LANGUAGE.

    Multichar_Symbols definitions

    Analysis symbols

    The morphological analyses of wordforms of the Irish language are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags.)

    Subj is used for subjunctive


    This (part of) documentation was generated from src/fst/morphology/root-noun-all.lexc


    src-fst-morphology-root-others.lexc.md

    INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Irish LANGUAGE.

    Multichar_Symbols definitions

    Analysis symbols

    The morphological analyses of wordforms of the Irish language are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags.)

    Subj is used for subjunctive


    This (part of) documentation was generated from src/fst/morphology/root-others.lexc


    src-fst-morphology-root-verb-all.lexc.md

    INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Irish LANGUAGE.

    Multichar_Symbols definitions

    Analysis symbols

    The morphological analyses of wordforms of the Irish language are presented in this system in terms of the following symbols. (It is highly suggested to follow existing standards when adding new tags.)

    Subj is used for subjunctive


    This (part of) documentation was generated from src/fst/morphology/root-verb-all.lexc


    src-fst-morphology-root.lexc.md

    Irish morphological analyser

    INTRODUCTION TO MORPHOLOGICAL ANALYSER OF Irish LANGUAGE.

    Definitions for Multichar_Symbols

    Tag symbols for analysis

    The morphological analyses of wordforms for the Irish language are presented in this system in terms of the following symbols.

    Tag list:

    Flag diacritics

    We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: only allow compounds with verbs if the verb is further derived into a noun again.

    | @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
    | @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
    | @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |

    For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

    | @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
    | @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
    | @P.CmpPref.FALSE@ | Block these words from making further compounds |
    | @D.CmpLast.TRUE@ | Block such words from entering R |
    | @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
    | @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
    | @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
    | @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

    Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

    | @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
    | @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
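How flag diacritics such as those above constrain a path through the lexicon can be sketched with a small interpreter. This is a simplified illustration of the commonly documented P (set), D (disallow), C (clear), R (require) and U (unify) operators, not the HFST implementation; the N operator and negative values are left out. The function name and the example paths are assumptions for the example.

```python
import re

# @OP.FEATURE@ or @OP.FEATURE.VALUE@
FLAG = re.compile(r"^@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@$")


def path_is_allowed(symbols):
    """Return True if the flag diacritics along a path are all satisfied."""
    features = {}
    for sym in symbols:
        m = FLAG.match(sym)
        if m is None:
            continue  # ordinary symbol, no effect on the flags
        op, feat, val = m.groups()
        cur = features.get(feat)
        if op == "P":                      # set the feature
            features[feat] = val
        elif op == "C":                    # clear the feature
            features.pop(feat, None)
        elif op == "D":                    # fail if set (to this value)
            if cur is not None and (val is None or cur == val):
                return False
        elif op == "R":                    # fail unless set (to this value)
            if cur is None or (val is not None and cur != val):
                return False
        elif op == "U":                    # set if unset, else values must agree
            if cur is None:
                features[feat] = val
            elif cur != val:
                return False
    return True


blocked = path_is_allowed(["@P.NeedNoun.ON@", "x", "@D.NeedNoun.ON@"])
cleared = path_is_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"])
```

In the first path the flag set by `@P.NeedNoun.ON@` is still active when `@D.NeedNoun.ON@` is reached, so the path is rejected; clearing the flag first lets the path through.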

    The Root lexicon etc.


    This (part of) documentation was generated from src/fst/morphology/root.lexc


    src-fst-morphology-stems-abbreviations.lexc.md

    Abbreviations


    This (part of) documentation was generated from src/fst/morphology/stems/abbreviations.lexc


    src-fst-morphology-stems-adjectives.lexc.md

    SEE PREP/NUM etc dá Adj3-1; ! do or de +

    IRREGULAR ADJECTIVES

    The following always come at the end of the noun/pron/adj and cannot be intermingled with other adjectives. Have moved to Demonstrative Determiners.


    This (part of) documentation was generated from src/fst/morphology/stems/adjectives.lexc


    src-fst-morphology-stems-adpositions.lexc.md

    Prepositions:

    SIMPLE PREPOSITIONS
    These are not preps, only copula or conj. This “is” looks like “agus” to me … removing the prep reading …
    Should be subst except in Prep Cmpd (see below): maidir+Prep+Simp:maidir #;
    Should be subst: maille+Prep+Simp:maille #; ! maille le = along with

    COMPOUNDS: PREP + ARTICLE (an/na)

    le does not combine with art: but becomes leis before “an”

    trí does not combine with art: but becomes tríd before “an”

    Historical forms (bardic)

    COMPOUNDS: PREP + POSS PRON (a/ár)

    COMPOUNDS: PREP + REL. PART. (a/ar)

    COMPOUND PREPOSITIONS: PREP + NOUN


    This (part of) documentation was generated from src/fst/morphology/stems/adpositions.lexc


    src-fst-morphology-stems-adverbs.lexc.md

    Adverbs

    word for word, i.e. literally (2019)

    MOVED TO ADJ annamh+Adv+Gn:annamh #; what about chomh mór/hálainn etc. etc.

    see PART-LEX.TXT (etc.) for following


    This (part of) documentation was generated from src/fst/morphology/stems/adverbs.lexc


    src-fst-morphology-stems-articles.lexc.md

    Common Functional Words - Articles


    This (part of) documentation was generated from src/fst/morphology/stems/articles.lexc


    src-fst-morphology-stems-bardic.lexc.md


    This (part of) documentation was generated from src/fst/morphology/stems/bardic.lexc


    src-fst-morphology-stems-conjunctions.lexc.md

    CONJUNCTIONS
    E. Uí Dhonnchadha
    New Irish Grammar by The Christian Brothers, etc.

    Removed items (subord. conjs) which are pre-verbal (which often have past tense inflection), e.g. go/gur, a/ar, nach/nár, and which often follow (or attach to) a conjunction, e.g. cé go, nuair nach.
    Remaining subordinating conjunctions can be followed by verb or copula: go mb’fhusa an obair …, go dtógfadh sé, e.g. má bhíonn, más.
    Some still have tense marking as they are combined forms, e.g. sula, sular, murar etc.

    LEXICON Conjunctions

    gur NOT moved to Verb Part as a) always precede a verb b) have tense c) preceded by conjs like nuair, cé - SHOULD BE REMOVED I THINK ???? NO NEED FOR CONJ AS WELL AS SUBORD COP AND SUBORD VERB PART …


    This (part of) documentation was generated from src/fst/morphology/stems/conjunctions.lexc


    src-fst-morphology-stems-determiners.lexc.md

    DETERMINERS E. Uí Dhonnchadha

    Determiners: Possessives

    Determiners: INTERROGATIVES

    (definite & indefinite amounts)


    This (part of) documentation was generated from src/fst/morphology/stems/determiners.lexc


    src-fst-morphology-stems-english.lexc.md

    @dm discourse marker added to distinguish this from Irish so=seo=here

    as in Air France, Air India etc.


    This (part of) documentation was generated from src/fst/morphology/stems/english.lexc


    src-fst-morphology-stems-interjections.lexc.md

    INTERJECTIONS

    Interjections

    FILLED PAUSES Fillers in speech

    Communicators in speech

    Events in speech transcripts, e.g. cough, sneeze etc.

    Anonymisation in transcripts/exam scripts


    This (part of) documentation was generated from src/fst/morphology/stems/interjections.lexc


    src-fst-morphology-stems-numerals.lexc.md

    NUMBERS E. Uí Dhonnchadha For Personal Numerals (duine, beirt, triúr) SEE NOUNS

    CARDINAL Numbers

    ORDINAL Numbers

    Number Operators


    This (part of) documentation was generated from src/fst/morphology/stems/numerals.lexc


    src-fst-morphology-stems-particles.lexc.md

    PARTICLES E. Uí Dhonnchadha Preverbal Unique Membership classes

    tense distinction is unnecessary

    relative if can be translated as “who/which/whose/to,on,of etc. whom etc.” or “that”

    not relative if can’t be translated as “who/which/whose/to,on,of etc. whom etc.” ??? i.e. complementiser “that” …

    Reflexive (or emphatic) ‘féin’ moved from pronouns file


    This (part of) documentation was generated from src/fst/morphology/stems/particles.lexc


    src-fst-morphology-stems-pronouns.lexc.md

    Na Forainmneacha Pearsanta - The Personal Pronouns (mé,tú, sé, sí..) Na Forainmneacha Éiginnte - Indefinite Pronouns (ceachtar, cibé …) Pronominals - words which act like pronouns

    Personal Pronouns

    Emphatic/Contrastive Pronouns

    this is not an independent pronoun - it accompanies a pronoun or noun

    Indefinite Pronouns

    Interrogative Pronouns (added Feb 05)
    Removed Pro from cén, as a noun complement is needed, unlike cé; also include Det Art Sg in det-lex for “a shonrú cén dáta” = which

    Copular DEMONSTRATIVE See also Determiners

    PREPOSITIONAL PRONOUNS (CONJUGATED PREPOSITIONS)


    This (part of) documentation was generated from src/fst/morphology/stems/pronouns.lexc


    src-fst-morphology-stems-propernouns.lexc.md

    Moirfeolaíocht na nAinmfhocail Gaeilge (Morphology of Irish Nouns)

    South Africa

    Mar 2012 Mar 2012

    Added. Most popular names. Male

    Female


    This (part of) documentation was generated from src/fst/morphology/stems/propernouns.lexc


    src-fst-morphology-stems-punctuations.lexc.md

    Punctuation


    This (part of) documentation was generated from src/fst/morphology/stems/punctuations.lexc


    src-fst-morphology-stems-tags.lexc.md

    Multichar_Symbols

    +XMLTag !

    LEXICON Root XMLTags;


    This (part of) documentation was generated from src/fst/morphology/stems/tags.lexc


    src-fst-morphology-stems-tobar.lexc.md

    Tobar - ac Grianna

    PLACENAMES


    This (part of) documentation was generated from src/fst/morphology/stems/tobar.lexc


    src-fst-morphology-stems-verbalnouns.lexc.md

    NOTE: ‘druideadh’ is commented out since it was not found as a verbal noun in the corpus, yet chances are that it would get mixed up with ‘druideadh’ as independent form of ‘druid’, i.e. ‘ó druideadh an scoil’


    This (part of) documentation was generated from src/fst/morphology/stems/verbalnouns.lexc


    src-fst-morphology-stems-verbs.lexc.md

    First Conjugation Verb Stems

    Second Conjugation Verb Stems

    DEFECTIVE VERBS

    SOME COMMON COMPOUNDS
    leave out _fios from lemma as it prevents some bí CG rules applying

    IRREGULAR VERBS

    Irregular Verbs

    auto does not lenite

    variant

    auto does not lenite

    Bardic - historical nonstandard spellings

    FORMS NOT LENITED IN POSITIVE PAST TENSE incl IMPERFECT

    Mixed Verb Stems

    NEEDS FURTHER TESTING OF -X WORDS and TEST


    This (part of) documentation was generated from src/fst/morphology/stems/verbs.lexc


    src-fst-orthography-urucaps.xfscript.md

    NOW COMPOSED IN LOOKUP.SCRIPT


    This (part of) documentation was generated from src/fst/orthography/urucaps.xfscript


    src-fst-phonetics-txt2ipa.xfscript.md

    | Description | X-SAMPA | IPA | Unicode (hex, decimal) |
    |---|---|---|---|
    | retroflex plosive, voiceless | t` | ʈ | 0288, 648 |
    | retroflex plosive, voiced | d` | ɖ | 0256, 598 |
    | labiodental nasal | F | ɱ | 0271, 625 |
    | retroflex nasal | n` | ɳ | 0273, 627 |
    | palatal nasal | J | ɲ | 0272, 626 |
    | velar nasal | N | ŋ | 014B, 331 |
    | uvular nasal | N\ | ɴ | 0274, 628 |
    | bilabial trill | B\ | ʙ | 0299, 665 |
    | uvular trill | R\ | ʀ | 0280, 640 |
    | alveolar tap | 4 | ɾ | 027E, 638 |
    | retroflex flap | r` | ɽ | 027D, 637 |
    | bilabial fricative, voiceless | p\ | ɸ | 0278, 632 |
    | bilabial fricative, voiced | B | β | 03B2, 946 |
    | dental fricative, voiceless | T | θ | 03B8, 952 |
    | dental fricative, voiced | D | ð | 00F0, 240 |
    | postalveolar fricative, voiceless | S | ʃ | 0283, 643 |
    | postalveolar fricative, voiced | Z | ʒ | 0292, 658 |
    | retroflex fricative, voiceless | s` | ʂ | 0282, 642 |
    | retroflex fricative, voiced | z` | ʐ | 0290, 656 |
    | palatal fricative, voiceless | C | ç | 00E7, 231 |
    | palatal fricative, voiced | j\ | ʝ | 029D, 669 |
    | velar fricative, voiced | G | ɣ | 0263, 611 |
    | uvular fricative, voiceless | X | χ | 03C7, 967 |
    | uvular fricative, voiced | R | ʁ | 0281, 641 |
    | pharyngeal fricative, voiceless | X\ | ħ | 0127, 295 |
    | pharyngeal fricative, voiced | ?\ | ʕ | 0295, 661 |
    | glottal fricative, voiced | h\ | ɦ | 0266, 614 |

    ` = ASCII 096

    alveolar lateral fricative, vl.: K
    alveolar lateral fricative, vd.: K\

    labiodental approximant: P (or v\)
    alveolar approximant: r\
    retroflex approximant: r\`
    velar approximant: M\

    retroflex lateral approximant: l`
    palatal lateral approximant: L
    velar lateral approximant: L\

    Clicks

    bilabial: O\ (O = capital letter)
    dental: |\
    (post)alveolar: !\
    palatoalveolar: =\
    alveolar lateral: |\|\

    Ejectives, implosives

    ejective: _> e.g. ejective p: p_>
    implosive: _< e.g. implosive b: b_<

    Vowels

    close back unrounded: M
    close central unrounded: 1
    close central rounded: }
    lax i: I
    lax y: Y
    lax u: U

    close-mid front rounded: 2
    close-mid central unrounded: @\
    close-mid central rounded: 8
    close-mid back unrounded: 7

    schwa ə: @

    open-mid front unrounded: E
    open-mid front rounded: 9
    open-mid central unrounded: 3
    open-mid central rounded: 3\
    open-mid back unrounded: V
    open-mid back rounded: O

    ash (ae digraph): {
    open schwa (turned a): 6

    open front rounded: &
    open back unrounded: A
    open back rounded: Q

    Other symbols

    voiceless labial-velar fricative: W
    voiced labial-palatal approx.: H
    voiceless epiglottal fricative: H\
    voiced epiglottal fricative: <\
    epiglottal plosive: >\

    alveolo-palatal fricative, vl.: s\
    alveolo-palatal fricative, voiced: z\
    alveolar lateral flap: l\
    simultaneous S and x: x\
    tie bar: _

    Suprasegmentals

    primary stress: "
    secondary stress: %
    long: :
    half-long: :\
    extra-short: _X
    linking mark: -\

    Tones and word accents

    level extra high: _T
    level high: _H
    level mid: _M
    level low: _L
    level extra low: _B
    downstep: !
    upstep: ^ (caret, circumflex)

    contour, rising: _R
    contour, falling: _F
    contour, high rising: _H_T
    contour, low rising: _B_L
    contour, rising-falling: _R_F

    (NB Instead of being written as diacritics with _, all prosodic marks can alternatively be placed in a separate tier, set off by < >, as recommended for the next two symbols.)

    global rise: <R>
    global fall: <F>

    Diacritics

    voiceless: _0 (0 = figure), e.g. n_0
    voiced: _v
    aspirated: _h
    more rounded: _O (O = letter)
    less rounded: _c
    advanced: _+
    retracted: _-
    centralized: _"
    syllabic: = (or _=), e.g. n= (or n_=)
    non-syllabic: _^
    rhoticity: `

    breathy voiced: _t
    creaky voiced: _k
    linguolabial: _N
    labialized: _w
    palatalized: ' (or _j), e.g. t' (or t_j)
    velarized: _G
    pharyngealized: _?\

    dental: _d
    apical: _a
    laminal: _m
    nasalized: ~ (or _~), e.g. A~ (or A_~)
    nasal release: _n
    lateral release: _l
    no audible release: _}

    velarized or pharyngealized: _e
    velarized l, alternatively: 5
    raised: _r
    lowered: _o
    advanced tongue root: _A
    retracted tongue root: _q
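A symbol table like the one above can be applied with a greedy longest-match lookup. The following Python sketch is only an illustration and is not taken from txt2ipa.xfscript. It uses a subset of the X-SAMPA/IPA pairs listed in the table, and the function name and the pass-through behaviour for unlisted characters are assumptions for the example.

```python
# Subset of the X-SAMPA -> IPA pairs listed in the table above.
XSAMPA_TO_IPA = {
    "F": "ɱ", "J": "ɲ", "N": "ŋ", "N\\": "ɴ",
    "B\\": "ʙ", "R\\": "ʀ", "4": "ɾ", "p\\": "ɸ", "B": "β",
    "T": "θ", "D": "ð", "S": "ʃ", "Z": "ʒ", "z`": "ʐ", "C": "ç",
    "j\\": "ʝ", "G": "ɣ", "X": "χ", "R": "ʁ", "X\\": "ħ",
    "?\\": "ʕ", "h\\": "ɦ", "@": "ə",
}

# Try longer symbols first so that "N\" wins over "N".
_KEYS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text):
    """Convert listed X-SAMPA symbols to IPA; copy everything else unchanged."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


converted = xsampa_to_ipa("N\\aN")
```

Sorting the keys by length matters because several symbols share a prefix (for example `N` and `N\`, or `X` and `X\`).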


    This (part of) documentation was generated from src/fst/phonetics/txt2ipa.xfscript


    src-fst-transcriptions-transcriptor-abbrevs2text.lexc.md

    We describe here how abbreviations in Irish are read out, e.g. for text-to-speech systems.

    For example:


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-abbrevs2text.lexc


    src-fst-transcriptions-transcriptor-numbers-digit2text.lexc.md

    % komma% :, Root ;
    % tjuohkkis% :%. Root ;
    % kolon% :%: Root ;
    % sárggis% :%- Root ;
    % násti% :%* Root ;


    This (part of) documentation was generated from src/fst/transcriptions/transcriptor-numbers-digit2text.lexc


    tools-grammarcheckers-grammarchecker.cg3.md

    IRISH GRAMMAR CHECKER

    DELIMITERS

    TAGS AND SETS

    Tags

    This section lists all the tags inherited from the fst and used as tags in the syntactic analysis. The next section, Sets, contains sets defined on the basis of the tags listed here; those set names are not visible in the output.

    Beginning and end of sentence

    BOS EOS

    Parts of speech tags

    Art Noun Prep

    Subst Check what it is

    N A Adv V Pron CS CC CC-CS Po Pr Pcle Num Interj ABBR ACR CLB LEFT RIGHT WEB PPUNCT Det PUNCT

    COMMA ¶

    Tags for POS sub-categories

    Simp Sbj

    Pers Dem Interr Indef Recipr Refl Rel Coll NomAg Prop Allegro Arab Romertall

    Tags for morphosyntactic properties

    DefArt Art Def Fem Masc Len hPref, for h prefixation Ecl Poss, possessive

    Nom Acc Gen Dat Loc Com Ess Par Voc Sg Pl Cmp/SplitR Cmp/SgNom Cmp/SgGen Cmp/SgGen

    Comp Superl Attr Ord Qst IV TV VD VTI Prt Prs Ind Pot Cond Imprt ImprtII Sg1 Sg2 Sg3 Du1 Du2 Du3 Pl1 Pl2 Pl3 Inf ConNeg Neg PrfPrc VGen PrsPrc Ger Sup Actio VAbess

    Err/Orth

    Semantic tags

    Sem/Act Sem/Ani Sem/Atr Sem/Body Sem/Clth Sem/Domain Sem/Feat-phys Sem/Fem Sem/Group Sem/Lang Sem/Mal Sem/Measr Sem/Money Sem/Obj Sem/Obj-el Sem/Org Sem/Perc-emo Sem/Plc Sem/Sign Sem/State-sick Sem/Sur Sem/Time Sem/Txt

    HUMAN

    HAB-ACTOR HAB-ACTOR-NOT-HUMAN

    PROP-ATTR PROP-SUR

    TIME-N-SET

    Noun errors (Len vs. not Len) after prepositions

    The following prepositions cause the following noun to be eclipsed and there are different rules for each preposition.

    These prepositions always cause the nouns after them to be lenited:

    Noun errors (Ecl vs. not Ecl) after prepositions

    Syntactic tags

    @+FAUXV @+FMAINV @-FAUXV @-FMAINV @-FSUBJ> @-F<OBJ @-FOBJ> @-FSPRED<OBJ @-F<ADVL @-FADVL> @-F<SPRED @-F<OPRED @-FSPRED> @-FOPRED> @>ADVL @ADVL< @<ADVL @ADVL> @ADVL @HAB> @<HAB @>N @Interj @N< @>A @P< @>P @HNOUN @INTERJ @>Num @Pron< @>Pron @Num< @OBJ @<OBJ @OBJ> @OPRED @<OPRED @OPRED> @PCLE @COMP-CS< @SPRED @<SPRED @SPRED> @SUBJ @<SUBJ @SUBJ> SUBJ SPRED OPRED @PPRED @APP @APP-N< @APP-Pron< @APP>Pron @APP-Num< @APP-ADVL< @VOC @CVP @CNP OBJ <OBJ OBJ> <OBJ-OTHERS OBJ>-OTHERS

    SYN-V @X

    Sets containing sets of lists and tags

    This part of the file lists a large number of sets based partly upon the tags defined above, and partly upon lexemes drawn from the lexicon. See the source file itself to inspect the sets; what follows here is an overview of the set types.

    Sets for Single-word sets

    Sets for word or not

    WORD any word

    Case sets

    ADLVCASE

    CASE-AGREEMENT CASE

    NOT-NOM NOT-GEN NOT-ACC

    Verb sets

    Verbs and their complements

    NOT-V

    Sets for finiteness and mood

    REAL-NEG

    MOOD-V

    NOT-PRFPRC

    Sets for person

    SG1-V SG2-V SG3-V PL1-V PL2-V PL3-V

    Set for your, my and his

    Note that imperative verbs are not included in these sets!

    Some subsets of the VFIN sets

    Pronoun sets

    Adjectival sets and their complements

    Adverbial sets and their complements

    Sets of elements with common syntactic behaviour

    NP sets defined according to their morphosyntactic features

    The PRE-NP-HEAD family of sets

    These sets model noun phrases (NPs). The idea is to first define whatever can occur in front of the head of the NP, and thereafter negate that with the expression WORD - premodifiers.

    Border sets and their complements

    Morphophonological sets

    Grammarchecker sets

    Here ends the list and set section

    BEFORE-SECTIONS

    SECTION

    spellchecking

    Gender errors in adjectives

    RULE: msyn-adj-gender to change Masculine adjective to Feminine if it modifies a feminine noun !!IT WORKS!!

    Prepositions

    ADD:msyn-prep-pron RULE TO CHANGE A PREPOSITION AND A PRONOUN INTO A PREPOSITIONAL PRONOUN (e.g., AG MÉ = AGAM, ROIMH SIBH = ROMHAIBH) !!IT WORKS!!
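The effect of such a rule can be sketched as a lookup over adjacent tokens. This Python sketch is an illustration only, not the CG-3 rule itself. It contains just the two pairs mentioned above, and the function name and the suggestion format (start index, end index, replacement) are assumptions for the example.

```python
# Preposition + pronoun pairs taken from the rule description above.
PREP_PRON = {
    ("ag", "mé"): "agam",
    ("roimh", "sibh"): "romhaibh",
}


def suggest_prepositional_pronouns(tokens):
    """Return (start, end, replacement) for each prep + pronoun sequence."""
    suggestions = []
    for i in range(len(tokens) - 1):
        pair = (tokens[i].lower(), tokens[i + 1].lower())
        if pair in PREP_PRON:
            suggestions.append((i, i + 2, PREP_PRON[pair]))
    return suggestions


found = suggest_prepositional_pronouns(["Tá", "leabhar", "ag", "mé"])
```

Each suggestion covers both tokens, so a caller can replace the span `tokens[start:end]` with the single prepositional pronoun.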

    ADD:msyn-h-after-fem-possessive-adjective: rule to add h to noun following possessor

    ADD:msyn-len-after-prep:

    ADD:msyn-len-after-prep: rule to add lenition to determiners following prepositions

    ADD:msyn-ecl-after-prep: A rule to correct eclipse errors without an intervening article. !!!IT WORKS!!!

    ADD:msyn-ecl-after-prep-sfem: Eclipse after preposition … (sfem?)

    Rules for lenition

    ADD:msyn-teastaigh-ó: exchange prep “mé” with “ó” when following “teastaigh”

    ADD:msyn-inis-do

    ADD:msyn-ar-an-aonach: A rule to correct the error “ag an aonach” to the correct form “ar an aonach”.

    ADD:msyn-ar-a-haon-a-chlog

    ADD:msyn-fada-ó

    ADD:msyn-beag-is-fiú-de “beag is fiú de” A + “de”

    ADD:msyn-cúpla-plnoun-sgnoun ..

    ADD:msyn-gen-case-nouns

    Definiteness errors in nouns

    A RULE TO CHANGE THE NOUN AFTER A NOUN AND A POSSESSIVE ADJECTIVE TO THE GENITIVE CASE. !!IT WORKS!!

    ADD:use-guillemets: Simple punctuation rules showing how to change the lemma in the suggestions:

    ADD:use-ellipsis

    ADD:msyn-ar-an-tae: This rule is for when people put milk in tea. In Irish, the correct way to say it is that milk is put on tea.

    This rule is for when people put milk in coffee. In Irish, the correct way to say it is that milk is put on coffee. As it stands, the rule works for Ulaidh Irish. It needs to be changed to work for standard Irish.

    ADD:msyn-ar-an-gcaife

    ADD:msyn-tóin-poill

    ADD:msyn-ie.i.


    This (part of) documentation was generated from tools/grammarcheckers/grammarchecker.cg3


    tools-tokenisers-tokeniser-disamb-gt-desc.pmscript.md

    Tokeniser for gle

    Usage:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are

    1. unknown word-like forms, and
    2. unmatched strings

    We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    Unknowns are made of:
      • lower-case ASCII
      • upper-case ASCII
      • select extended latin symbols
      • ASCII digits
      • select symbols
      • Combining diacritics as individual symbols,
      • various symbols from Private area (probably Microsoft), so far:
      • U+F0B7 for “x in box”
    Unknown handling

    Unknowns are tagged ?? and treated specially with hfst-tokenise. hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.
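The handling of empty analyses described above can be sketched as a small filter. This Python sketch is an illustration of the stated behaviour only and does not reproduce the output format of hfst-tokenise. Representing readings as strings, and returning `None` for an unknown token, are assumptions for the example.

```python
def filter_readings(readings):
    """Drop empty analyses; return None if nothing else remains (unknown)."""
    non_empty = [r for r in readings if r]
    return non_empty if non_empty else None


known = filter_readings(["", "ja+CC"])   # the empty reading is removed
unknown = filter_readings([""])          # only an empty reading: unknown token
```

A token is treated as unknown only when every analysis is empty; otherwise the empty analyses are simply discarded.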

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-disamb-gt-desc.pmscript


    tools-tokenisers-tokeniser-gramcheck-gt-desc.pmscript.md

    Grammar checker tokenisation for gle

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    $ make
    $ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    $ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    $ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Finally we mark as a token any sequence making up a:


    This (part of) documentation was generated from tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript


    tools-tokenisers-tokeniser-tts-cggt-desc.pmscript.md

    TTS tokenisation for gle

    Requires a recent version of HFST (3.10.0 / git revision>=3aecdbc). Then just:

    make
    echo "ja, ja" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    More usage examples:

    echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa \
    boasttu olmmoš, man mielde lahtuid." \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    echo "márffibiillagáffe" \
    | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
    

    Pmatch documentation: https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch

    Characters which have analyses in the lexicon, but can appear without spaces before/after, that is, with no context conditions, and adjacent to words:

    Whitespace contains ASCII white space and the List contains some unicode white space characters

    Apart from what’s in our morphology, there are 1) unknown word-like forms, and 2) unmatched strings. We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a.

    TODO: Could use something like this, but built-ins don’t include šžđčŋ:

    Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG: they get a default baseform equal to the wordform, but no tag to check, so it’s safer to let hfst-tokenise handle them.

    Needs hfst-tokenise to output things differently depending on the tag they get


    This (part of) documentation was generated from tools/tokenisers/tokeniser-tts-cggt-desc.pmscript
