Irish morphological analyser

INTRODUCTION TO THE MORPHOLOGICAL ANALYSER OF THE IRISH LANGUAGE.

Definitions for Multichar_Symbols

Tag symbols for analysis

The morphological analyses of wordforms for the Irish language are presented in this system in terms of the following symbols.

**+Corr ** = correction in Learner corpus
**+Error ** = error in Learner corpus
**+Start ** = start of error/correction
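
An analysis built from these symbols can be taken apart mechanically. The following is a minimal sketch, assuming the usual layout of a lemma followed by a sequence of `+Tag` symbols. The example string `cat+Noun+Masc+Com+Sg` and the helper names `split_analysis` and `describe` are illustrative assumptions, not output or API of the analyser itself; the glosses are copied from the tag list in this section.

```python
# Hypothetical helper: split an analysis string into lemma and tags.
# Assumes the lemma itself contains no "+" character.

# A few glosses copied from the tag list in this section.
GLOSSES = {
    "+Noun": "Noun (common, proper, verbal, substantive)",
    "+Masc": "Masculine gender",
    "+Com": "Common case (nominative/accusative/dative case)",
    "+Sg": "Singular",
}


def split_analysis(analysis):
    """Return (lemma, [tags]) for a string such as 'cat+Noun+Masc+Com+Sg'."""
    lemma, *tags = analysis.split("+")
    # Put the leading "+" back so the tags match the symbols in the tag list.
    return lemma, ["+" + tag for tag in tags]


def describe(analysis):
    """Pair each tag with its gloss, or None if the tag is not in GLOSSES."""
    lemma, tags = split_analysis(analysis)
    return lemma, [(tag, GLOSSES.get(tag)) for tag in tags]


if __name__ == "__main__":
    lemma, tagged = describe("cat+Noun+Masc+Com+Sg")
    print(lemma)
    for tag, gloss in tagged:
        print(tag, "=", gloss)
```

Tags that contain a slash, such as +Err/Orth, survive the split unchanged because only "+" is treated as a separator.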

Tag list:

**+1P ** = first person inflection
**+2P ** = second person inflection
**+3P ** = third person inflection
**+A ** = XXX check
**+ABBR ** = Abbreviation
**+ACR ** = Acronym
**+Abr ** = “Abbreviation”
**+Ad ** = Adverbial particle: go
**+Adj ** = Adjective
**+Adv ** = Adverb
**+Anon ** = Anonymisation in transcribed speech
**+Arab ** = Arabic numerals (1, 2, …)
**+Art ** = Article determiner (an/na)
**+Attr ** = Attribute, element preceding head
**+Auto ** = Autonomous verb form
**+Bar ** = hyphen, underscore, dash etc.
**+Bare ** = bare number form used after number particle “a”
**+Base ** = Base form of adjective (changed from +Pos to +Base 10/09/03)
**+Brack ** = round, square and curly brackets
**+CC ** = Canúint Chonnachta, Connaught dialect
**+CLB ** = Clause boundary
**+CLBfinal ** = Final clause boundary
**+CM ** = Canúint na Mumhan, Munster dialect
**+CU ** = Canúint Uladh, Ulster dialect
**+Card ** = Cardinal number (one, two, three, …)
**+Cmc ** = Communicator (yeah, y’know) in transcribed speech
**+CmpNP/First ** =
**+CmpNP/None ** =
**+Cmpd ** = Compound preposition
**+CmpdNoGen ** = Compound preposition which does not require genitive case on object NP
**+Cmpl ** = Complementizer: go, gur, nach, nár
**+Com ** = Common case (nominative/accusative/dative case)
**+Comp ** = Comparative adjective (c)
**+Conj ** = Conjunction
**+Cond ** = Conditional mood
**+Coord ** = Coordinating conjunction
**+Cop ** = Copula
**+Curr ** = Currency symbols
**+Dat ** = Dative case (e.g. chois) fossilised forms
**+DeNom ** = Adjectives derived from proper nouns, e.g. Albanach (Scottish adjective), not the same as Albanais (Scottish language noun)
**+Def ** = Definite article
**+DefArt ** = noun/number form that follows a definite article (an)
**+Deg ** = degree particle with Adj/Abstract Noun (so loud, so sharp, etc.)
**+Dem ** = Demonstrative determiner (also combined with copula, e.g. Seo)
**+Dep ** = Dependent forms of verbs
**+Det ** = Determiner, e.g. possessive determiner: mo, do
**+Dir ** = Directional adverb
**+Direct ** = Direct relative particle
**+Ecl ** = Eclipsis (+Urú) initial mutation, e.g. ar an gcat
**+Emph ** = Emphatic (Contrastive) form of personal pronoun e.g. ár dteachsa, do theachsa, a teachsa
**+End ** = end bracket, quote etc
**+English ** = English language words
**+Err/Hyph ** =
**+Err/Lex ** =
**+Err/MissingSpace ** =
**+Err/Orth ** = Orthographical error
**+Err/SpaceCmp ** = Compound space error
**+Event ** = Simple Event (laugh, sneeze etc.) in transcribed speech
**+Fam ** = Family Name - proper noun
**+Fem ** = Feminine gender
**+Filler ** = Filled Pause (eh, em, etc.) in transcribed speech
**+Fin ** = sentence final punctuation
**+Foreign ** = words from other languages, mainly English, some Latin
**+Fut ** = Future tense verbal particle
**+FutInd ** = Future Indicative verb
**+Gen ** = Genitive case
**+Acc ** = Accusative case
**+Cp ** = ?
**+Wh ** = wh word?
**+Gn ** = General adverb
**+Guess ** = Morphological guesser
**+hPref ** = h prefixed to a vowel-initial word
**+Idf ** = Indefinite quantifier/pronoun e.g. aon (any), cibé (whoever), ceachtar/neachtar (either/neither) etc.
**+Ill ** = n/a
**+Imp ** = Imperative particle (negative)
**+Imper ** = Imperative mood
**+Indirect ** = Indirect relative particle
**+Inf ** = Infinitival particle
**+Int ** = Sentence internal punctuation
**+Itj ** = Interjection
**+Its ** = Intensifier of adjective e.g. sách, ró- etc.
**+Latin ** = Latin language words
**+LEFT ** = Left side of pairwise symbol (parenthesis or quotation mark)
**+Len ** = Lenited forms
**+Loc ** = Locative adverb
**+MIDDLE ** = Middle punctuation
**+MWE ** = Multi word expression
**+Masc ** = Masculine gender
**+N ** = n/a (Noun is used) – The +N tag is in use, TODO: change it
**+NER ** = Named Entity Recognition
**+NG ** = Don’t generate non-standard form
**+NStem ** = De-nominal verbal noun
**+Neg ** = Negative particle (n)
**+NegQ ** = Negative interrogative verbal particle (q)
**+Nm ** = Number particle (m)
**+Nom ** =
**+NotSlen ** = Adj qualifies pl noun with non-slender ending
**+Noun ** = Noun (common, proper, verbal, substantive)
**+Num ** = Numeral
**+OLang/ENG ** = - English language words
**+OLang/FIN ** = -
**+OLang/HUN ** = -
**+OLang/LAT ** = - Latin language words
**+OLang/NNO ** = -
**+OLang/NOB ** = -
**+OLang/RUS ** = -
**+OLang/SMA ** = -
**+OLang/SME ** = -
**+OLang/SMJ ** = -
**+OLang/SWE ** = -
**+OLang/UND ** = -
**+Obj ** = Object e.g. á = “do a” when obj of VN
**+Op ** = Number Operators, e.g. +, -, \*, / etc.
**+Ord ** = Ordinal (first, second, third..) i.e. mo dhá lámh, an chéad dhá theach
**+Part ** = Particle (not +Vb) (U)
**+Past ** = Past tense verbal particle
**+PastImp ** = Past Habitual - Gnáthchaite (Imperfect Indicative)
**+PastInd ** = Past Indicative tense
**+PastSubj ** = Past Subjunctive tense
**+Pat ** = Patronymic particle (p) (e.g. Ó, Ní, Uí, le, de ..)
**+Pers ** = Personal pronoun
**+PersName ** = Personal name - proper noun
**+Pl ** = Plural number
**+Place ** = Place name - proper noun
**+Poss ** = Possessive pronoun (can be attached to a prep, e.g. im’, dá, faoina)
**+Pref ** = Prefix; separated prefixes in historical texts
**+Prep ** = Preposition
**+Pres ** = Copula present & future tense
**+PresImp ** = Present Habitual - Gnáthláithreach (verb bí only, and deireann (abair))
**+PresInd ** = Present Indicative
**+PresSubj ** = Present Subjunctive
**+Pro ** = Pronoun with copula or relative particle
**+Pron ** = Pronoun
**+Prop ** = Proper noun
**+Punct ** = Abbreviation
**+PUNCT ** = Abbreviation (it seems several languages have two tags :-/)
**+Q ** = Interrogative particle (q)
**+Qty ** = Quantifier
**+Quo ** = all quotation marks double, single etc.
**+Ref ** = Reflexive particle
**+Rel ** = Relative particle
**+RelInd ** = rel. indirect
**+RIGHT ** = Right side of pairwise symbol (parenthesis or quotation mark)
**+Rom ** =
**+Sbj ** = Subject pronouns: sí, sé and siad are used only when pron follows predicate verb in subject position, otherwise í, é and iad are used.
**+Sem/Amount ** =
**+Sem/Build ** =
**+Sem/Build-room ** =
**+Sem/Cat ** =
**+Sem/Curr ** =
**+Sem/Date ** =
**+Sem/Domain ** =
**+Sem/Domain_Hum ** =
**+Sem/Dummytag ** =
**+Sem/Edu_Hum ** =
**+Sem/Event ** =
**+Sem/Food-med ** =
**+Sem/Group_Hum ** =
**+Sem/Hum ** =
**+Sem/ID ** =
**+Sem/Lang ** =
**+Sem/Mal ** =
**+Sem/Mat ** =
**+Sem/Measr ** =
**+Sem/Money ** =
**+Sem/Obj ** =
**+Sem/Obj-el ** =
**+Sem/Obj-ling ** =
**+Sem/Org ** =
**+Sem/Org_Prod-audio ** =
**+Sem/Org_Prod-vis ** =
**+Sem/Part ** =
**+Sem/Plc ** =
**+Sem/Prod-vis ** =
**+Sem/Route ** =
**+Sem/Rule ** =
**+Sem/Sign ** =
**+Sem/State ** =
**+Sem/State-sick ** =
**+Sem/Substnc ** =
**+Sem/Sur ** =
**+Sem/Time ** =
**+Sem/Time-clock ** =
**+Sem/Title ** =
**+Sem/Tool-it ** =
**+Sem/Txt ** =
**+Sem/Veh ** =
**+Sem/Year ** =
**+Sg ** = Singular
**+Short+ ** = Short determiner, e.g. m’, d’
**+Simp ** = Simple preposition
**+Slender ** = Adj qualifies a plural noun ending in a slender consonant
**+Span ** =
**+St ** = start bracket, quote etc
**+Strong ** = same plural form for all cases
**+Subj ** = Subjunctive mood/particle
**+Subord ** = Subordinating conjunction
**+Subst ** = substantive noun, functions like a noun, but lacks full inflectional paradigm
**+Suf ** = -s vern suffix e.g. a bhíonns
**+Sup ** = Superlative particle (s), e.g. is
**+Symbol ** = independent symbols in the text stream, like £, €, ©
**+Temp ** = Temporal e.g. inniu, amárach etc.
**+Typo ** = Typos, e.g. ta/ata instead of tá/atá
**+Use/-GC ** =
**+Use/-PLX ** =
**+Use/-PMatch ** =
**+Use/-Spell ** =
**+Use/-TTS ** =
**+Use/Circ ** =
**+Use/GC ** =
**+Use/NG ** =
**+Use/PMatch ** =
**+Use/SpellNoSugg ** =
**+Use/TTS ** =
**+V ** = n/a (Verb is used)
**+VD ** = ditransitive verb
**+VF ** = - form used before a word starting with a vowel or f+vowel
**+VI ** = intransitive verb
**+VT ** = transitive verb
**+VTI ** = transitive & intransitive verb
**+Var ** = variant spelling e.g. rabh instead of raibh or dheachaidh
**+Vb ** = Verbal particle (Q)
**+Verb ** = Verb
**+Verbal ** = Verbal noun
**+Voc ** = Vocative case
**+Vc ** = Vocative particle
**+Vow ** = Vowel-initial: used to allow past-tense Len, e.g. d’ith
**+Weak ** = Weak plural (noun)
**+XMLTag ** = XML tags in the text, e.g. <p>, etc.
**+Xxx ** = Indecipherable speech (in transcribed speech)
**+v1 ** = n/a
**+v2 ** = n/a
**^Adj ** = Adjective- used in initial mutations
**^Ath ** = Athrú (Change) - in certain plurals the ending changes: “e” -> “í”, “each” -> “í”, “ach” -> “aí”, etc., e.g. gealach -> gealaí (of the moon)
**^C ** = nominative, genitive & vocative : initial mutations of plural nouns
**^CB ** = compound boundary
**^Caol ** = Caolú (slenderise) - Attenuation, i.e. slenderise the end of the word, usually by adding an “i” after the last broad vowel
**^Coim ** = Coimriú - Syncopation - the last unstressed vowel is dropped, e.g. saghas (type) -> saghs+anna; solas (light) -> soils+e, with attenuation also
**^Def ** = dntls rule after definite article
**^Do ** = d’ before Past Imperfect (gnáthchaite) and conditional
**^Emph ** = emphatic forms
**^F ** = feminine: initial mutations of singular nouns depend on whether the noun is masculine or feminine
**^Fr ** = Fréamh (root): use root, i.e. don’t syncopate in these cases
**^G ** = genitive case
**^GUESSNOUN ** = n/a - superseded by guesser FSTs
**^IM ** = initial mutation marker e.g. mo chat, ar an mballa
**^LC ** = Leathan/Caol (broad/slender): Leathnaítear an tús mura dtosnaíonn an foirceann le “t” (the beginning is broadened unless the ending starts with “t”)
**^Lea ** = Leathnú - Broadening, e.g. an “i” is removed: súil (eye); radharc na súl (eyesight)
**^LeaS ** = Leathnaítear an tús mura dtosnaíonn an foirceann le “t” (the beginning is broadened unless the ending starts with “t”)
**^M ** = masculine: initial mutations of singular nouns depend on whether the noun is masculine or feminine
**^Sé ** = Lenition (Séimhiú - softening)- h added after certain initial consonants (bcdfgmpst)
**^Urú ** = Eclipsis (Urú)- a letter placed before word initial letter (bcdfgpt), e.g. “g” before “c” - “an cat” in gen. pl. becomes “bia na gcat”, (the cats’ food)
**^V ** = Verb root
**^VH ** = Maintains vowel harmony of broad and slender vowels. Motto: “leathan le leathan agus caol le caol” (broad with broad and slender with slender)
**^VN ** = verbal noun
**^aigh ** = remove -aigh ending
**^hv ** = “h” before a vowel (e.g. éan: Nom. Pl. Masc. na héin - the birds)
**^igh ** = remove -igh ending
**^ts ** = “t” before “s”, e.g. sagart: Gen. Sg. Masc. teach an tsagairt - the priest’s house
**^tv ** = “t-” before a vowel (e.g. éan: Nom. Sg. Masc. an t-éan - the bird)
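
The initial mutations named in the list above (^Sé, ^Urú, ^hv, ^ts, ^tv) can be illustrated with a small sketch. This is not how the analyser implements them: the real rules live in the FST sources. The function names are hypothetical, only lower-case input is handled, and the conditions are deliberately simplified (for example, lenition of “s” and “t” before “s” depend on the following sound in real Irish, which is ignored here).

```python
# Simplified illustration of Irish initial mutations; lower-case input only.

VOWELS = set("aeiouáéíóú")

# Lenition (Séimhiú): "h" is added after these initial consonants.
LENITABLE = set("bcdfgmpst")

# Eclipsis (Urú): a letter (or letters) is placed before the initial consonant.
ECLIPSIS = {
    "b": "mb", "c": "gc", "d": "nd", "f": "bhf",
    "g": "ng", "p": "bp", "t": "dt",
}


def lenite(word):
    """cat -> chat (as in 'mo chat'); leaves already lenited words alone."""
    if word[:1] in LENITABLE and word[1:2] != "h":
        return word[0] + "h" + word[1:]
    return word


def eclipse(word):
    """cat -> gcat (as in 'ar an gcat'); other initials are unchanged."""
    first = word[:1]
    if first in ECLIPSIS:
        return ECLIPSIS[first] + word[1:]
    return word


def h_prefix(word):
    """éin -> héin (as in 'na héin'): 'h' before a vowel-initial word."""
    return "h" + word if word[:1] in VOWELS else word


def t_prefix(word):
    """éan -> t-éan, sagairt -> tsagairt (simplified: any initial 's')."""
    if word[:1] in VOWELS:
        return "t-" + word
    if word[:1] == "s":
        return "t" + word
    return word


if __name__ == "__main__":
    for form in (lenite("cat"), eclipse("balla"), h_prefix("éin"), t_prefix("éan")):
        print(form)
```

The example words are the ones used in the tag descriptions (mo chat, ar an mballa, na héin, an t-éan, teach an tsagairt).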

Flag diacritics

**@P.Pmatch.Loc ** = XXX

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are only allowed if the verb is further derived into a noun again.

| Flag diacritic | Function |
| --- | --- |
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| Flag diacritic | Function |
| --- | --- |
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| Flag diacritic | Function |
| --- | --- |
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
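
The flags above use the P, D, C and U operators of xfst/HFST-style flag diacritics. The sketch below shows, under simplifying assumptions, how such flags accept or block a path through the lexicon. It is not the FST runtime: the names `apply_flag` and `path_allowed` are hypothetical, the N (negative set) operator is left out, and the example paths are made up for illustration rather than taken from the Irish lexicon.

```python
import re

# @OP.FEATURE@ or @OP.FEATURE.VALUE@, with OP one of P, D, C, U, R.
FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def apply_flag(flag, state):
    """Return the new feature state, or None if the flag blocks the path."""
    match = FLAG.fullmatch(flag)
    if match is None:
        raise ValueError("not a supported flag diacritic: " + flag)
    op, feature, value = match.groups()
    current = state.get(feature)  # None means the feature is unset
    new_state = dict(state)
    if op == "P":  # Positive set: feature := value
        new_state[feature] = value
    elif op == "C":  # Clear: unset the feature
        new_state.pop(feature, None)
    elif op == "D":  # Disallow: block if set to value (or set at all)
        blocked = current is not None if value is None else current == value
        if blocked:
            return None
    elif op == "R":  # Require: block unless set to value (or set at all)
        ok = current is not None if value is None else current == value
        if not ok:
            return None
    elif op == "U":  # Unify: set if unset, otherwise values must agree
        if current is None:
            new_state[feature] = value
        elif current != value:
            return None
    return new_state


def path_allowed(flags):
    """True if the sequence of flags can be traversed without being blocked."""
    state = {}
    for flag in flags:
        state = apply_flag(flag, state)
        if state is None:
            return False
    return True


if __name__ == "__main__":
    # Setting NeedNoun and then hitting the D flag blocks the path ...
    print(path_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
    # ... unless the feature is cleared first.
    print(path_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```

Under these semantics, a D flag on a feature that was never set lets the path through, which matches the note that unused flags do no harm.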

The Root lexicon etc.

**LEXICON Root ** =
** Abbrev; ** =
** Prepositions; ** = Adpositions = Prepositions in
** Adverb; ** =
** Articles; ** =
** Conjunctions; ** =
** Determiners; ** =
** Interjections; ** =
** Fillers; ** =
** Communicators; ** =
** Events; ** =
** Anonymous; ** =
** Numerals; ** =
** Particles; ** =
** Personal_Pronouns; ** =
** Englishlex; ** = English lexicon including all parts of speech
** Communicators-English; ** = English multi word communicators, e.g. d’ya know
** Bardiclex; ** = classical Irish lexicon from TCD Bardic corpus -
** Latinlex; ** = Latin lexicon from RIA historical corpus
** !Tobar; ** = omitting this (non-standard older forms)
** Punctuation; ** =
** Punctuation_ga; ** =
** Symbols; ** =
** XMLTags; ** = XML tags e.g. <p>, etc.
** AdjA; ** = ORIGINAL TEST LEXICON
** AdjIrregular; ** = ORIGINAL TEST LEXICON
** Adj-BaseOnly; !AdjBASE; ** = ORIGINAL TEST LEXICON
** Adj-IrregComp; !AI-COMP; ** = ORIGINAL TEST LEXICON
** AdjB; ** = punk adjs
** AdjC; ** = FP adjs - auto
** AdjDath; ** = colours
** AdjE; ** = FP adjs - manual
** Adj-FGB1; ** = Foclóir Gaeilge Béarla Uí Dhónaill
** Adj-FGB2; ** = Foclóir Gaeilge Béarla Uí Dhónaill
** AdjVariants; ** = Adj Variants in FGB
** AdjEqualVariants; ** = Adj Variants with Equal Sign in FGB
** AdjF; ** = Nationalities
** AdjG; ** = additions from gaois.ie bitex
** Nouns; ** = ORIGINAL TEST LEXICON
** NounsG; ** = Proper Nouns - MOVED from Nouns TO Proper Nouns
** NP-LEX-FAM; ** = Family Names (Irish)
** NP-LEX-FAM-EN; ** = Family Names (English)
** NP-LEX-PERS; ** = Personal Names (Irish)
** NP-LEX-PERS-EN; ** = Personal Names (English)
** NP-LEX-EIRE; ** = Ireland - Counties, Cities and Towns (Irish)
** NP-LEX-EIRE-EN; ** = Ireland - Counties, Cities and Towns (English)
** NP-LEX-TIR; ** = Countries (Irish)
** NP-LEX-TIR-EN; ** = Countries (English)
** NP-Irregular; ** = Various Irregular Proper Nouns
** NP-LEX-ORG; ** = Organisations
** NP-LEX-LOGAINM; ** = Placenames - sample from logainm.ie
** NP-LEX-RIACORPAS1; ** = Various Proper nouns from RIA Historical Corpus of Irish
** VerbalNounsV; ** = Verbal nouns derived from verb roots
** VerbalNounsN; ** = Verbal nouns derived from nouns
** VerbalAdjs; ** = Verbal adjectives derived from verb roots
** VerbalNounsGenV; ** = Verbal nouns (genitive case) derived from verb roots
** VerbalNounsGenN; ** = Verbal nouns (genitive case) derived from nouns
** VN-Variants; ** = FGB VN variants (VN, VNG & VA included)
** VNEqualVariants; ** = FGB VN = variants (VN, VNG & VA included)
** Verbs; ** = Irregular verbs (11)
** VerbsC1A; ** = ORIGINAL TEST LEXICON
** VerbsC2A; ** = ORIGINAL TEST LEXICON
** VerbsB; ** = verbs
** VerbsC; ** = FP verbs
** VerbsD; ** = FP verbs
** Verbs-FGB1; ** = FGB verbs
** Verbs-FGB2; ** = FGB verbs
** Verb-Variants; ** = FGB verb variants
** VerbsEqualVariants; ** = FGB verb = variants

This (part of) documentation was generated from src/fst/morphology/root.lexc
