Irish morphological analyser
INTRODUCTION TO THE MORPHOLOGICAL ANALYSER OF THE IRISH LANGUAGE.
Definitions for Multichar_Symbols
Tag symbols for analysis
The morphological analyses of wordforms for the Irish language are presented in this system in terms of the following symbols.
- **+Corr ** = correction in Learner corpus
- **+Error ** = error in Learner corpus
- **+Start ** = start of error/correction
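As a rough illustration of how an analysis built from these symbols can be read, the sketch below splits an analysis string into its lemma and its tag symbols. It assumes the usual `lemma+Tag+Tag` layout; the function name `split_analysis` and the sample string are hypothetical and are not taken from the analyser's actual output.

```python
from typing import List, Tuple


def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'lemma+Tag1+Tag2' into the lemma and a list of '+Tag' symbols."""
    # Assumption: the lemma itself contains no literal '+' character.
    lemma, *tags = analysis.split("+")
    return lemma, ["+" + tag for tag in tags]


# Hypothetical analysis string, built only from tags defined in the tag list.
lemma, tags = split_analysis("cat+Noun+Masc+Com+Sg")
print(lemma, tags)
```

Tags containing a slash, such as +Err/Orth, pass through unchanged because only `+` is treated as a separator.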
Tag list:
- **+1P ** = first person inflection
- **+2P ** = second person inflection
- **+3P ** = third person inflection
- **+A ** = XXX check
- **+ABBR ** = Abbreviation
- **+ACR ** = Acronym
- **+Abr ** = “Abbreviation”
- **+Ad ** = Adverbial particle: go
- **+Adj ** = Adjective
- **+Adv ** = Adverb
- **+Anon ** = Anonymisation in transcribed speech
- **+Arab ** = Arabic numerals (1, 2, …)
- **+Art ** = Article determiner (an/na)
- **+Attr ** = Attribute, element preceding head
- **+Auto ** = Autonomous verb form
- **+Bar ** = hyphen, underscore, dash etc.
- **+Bare ** = bare number form used after number particle “a”
- **+Base ** = Base form of adjective (changed from +Pos to +Base 10/09/03)
- **+Brack ** = round, square and curly brackets
- **+CC ** = Canúint Chonnachta, Connaught dialect
- **+CLB ** = Clause boundary
- **+CLBfinal ** = Final clause boundary
- **+CM ** = Canúint na Mumhan, Munster dialect
- **+CU ** = Canúint Uladh, Ulster dialect
- **+Card ** = Cardinal number(one two three …)
- **+Cmc ** = Communicator (yeah, y’know) in transcribed speech
- **+CmpNP/First ** =
- **+CmpNP/None ** =
- **+Cmpd ** = Compound preposition
- **+CmpdNoGen ** = Compound preposition which does not require genitive case on object NP
- **+Cmpl ** = Complementizer: go, gur, nach, nár
- **+Com ** = Common case (nominative/accusative/dative case)
- **+Comp ** = Comparative adjective (c)
- **+Conj ** = Conjunction
- **+Cond ** = Conditional mood
- **+Coord ** = Coordinating conjunction
- **+Cop ** = Copula
- **+Curr ** = Currency symbols
- **+Dat ** = Dative case (e.g. chois) fossilised forms
- **+DeNom ** = Adjectives derived from proper nouns, e.g. Albanach (Scottish adjective), not the same as Albanais (Scottish language noun)
- **+Def ** = Definite article
- **+DefArt ** = noun/number form that follows a definite article (an)
- **+Deg ** = degree particle with Adj/Abstract Noun (so loud, so sharp, etc.)
- **+Dem ** = Demonstrative determiner (also combined with copula, e.g. Seo)
- **+Dep ** = Dependent forms of verbs
- **+Det ** = Determiner, e.g. possessive determiner: mo, do
- **+Dir ** = Directional adverb
- **+Direct ** = Direct relative particle
- **+Ecl ** = Eclipsis (+Urú) initial mutation, e.g. ar an gcat
- **+Emph ** = Emphatic (Contrastive) form of personal pronoun e.g. ár dteachsa, do theachsa, a teachsa
- **+End ** = end bracket, quote etc
- **+English ** = English language words
- **+Err/Hyph ** =
- **+Err/Lex ** =
- **+Err/MissingSpace ** =
- **+Err/Orth ** = Orthographical error
- **+Err/SpaceCmp ** = Compound space error
- **+Event ** = Simple Event (laugh, sneeze etc.) in transcribed speech
- **+Fam ** = Family Name - proper noun
- **+Fem ** = Feminine gender
- **+Filler ** = Filled Pause (eh, em, etc.) in transcribed speech
- **+Fin ** = sentence final punctuation
- **+Foreign ** = words from other languages, mainly English, some Latin
- **+Fut ** = Future tense verbal particle
- **+FutInd ** = Future Indicative verb
- **+Gen ** = Genitive case
- **+Acc ** = Accusative case
- **+Cp ** = ?
- **+Wh ** = wh word?
- **+Gn ** = General adverb
- **+Guess ** = Morphological guesser
- **+hPref ** = h prefixed to a vowel-initial word
- **+Idf ** = Indefinite quantifier/pronoun e.g. aon (any), cibé (whoever), ceachtar/neachtar (either/neither) etc.
- **+Ill ** = n/a
- **+Imp ** = Imperative particle (negative)
- **+Imper ** = Imperative mood
- **+Indirect ** = Indirect relative particle
- **+Inf ** = Infinitival particle
- **+Int ** = Sentence internal punctuation
- **+Itj ** = Interjection
- **+Its ** = Intensifier of adjective e.g. sách, ró- etc.
- **+Latin ** = Latin language words
- **+LEFT ** = Left side of pairwise symbol (parenthesis or quotation mark)
- **+Len ** = Lenited forms
- **+Loc ** = Locative adverb
- **+MIDDLE ** = Middle punctuation
- **+MWE ** = Multi word expression
- **+Masc ** = Masculine gender
- **+N ** = n/a (Noun is used) – The +N tag is in use, TODO: change it
- **+NER ** = Named Entity Recognition
- **+NG ** = Don’t generate non-standard form
- **+NStem ** = De-nominal verbal noun
- **+Neg ** = Negative particle (n)
- **+NegQ ** = Negative interrogative verbal particle(q)
- **+Nm ** = Number particle (m)
- **+Nom ** =
- **+NotSlen ** = Adj qualifies pl noun with non-slender ending
- **+Noun ** = Noun (common, proper, verbal, substantive)
- **+Num ** = Numeral
- **+OLang/ENG ** = - English language words
- **+OLang/FIN ** = -
- **+OLang/HUN ** = -
- **+OLang/LAT ** = - Latin language words
- **+OLang/NNO ** = -
- **+OLang/NOB ** = -
- **+OLang/RUS ** = -
- **+OLang/SMA ** = -
- **+OLang/SME ** = -
- **+OLang/SMJ ** = -
- **+OLang/SWE ** = -
- **+OLang/UND ** = -
- **+Obj ** = Object e.g. á = “do a” when obj of VN
- **+Op ** = Number operators, e.g. +, -, *, / etc.
- **+Ord ** = Ordinal (first, second, third, …), e.g. mo dhá lámh, an chéad dhá theach
- **+Part ** = Particle (not +Vb) (U)
- **+Past ** = Past tense verbal particle
- **+PastImp ** = Past Habitual - Gnáthchaite (Imperfect Indicative)
- **+PastInd ** = Past Indicative tense
- **+PastSubj ** = Past Subjunctive tense
- **+Pat ** = Patronymic particle (p) (e.g. Ó, Ní, Uí, le, de ..)
- **+Pers ** = Personal pronoun
- **+PersName ** = Personal name - proper noun
- **+Pl ** = Plural number
- **+Place ** = Place name - proper noun
- **+Poss ** = Possessive pronoun (can be attached to a prep, e.g. im’, dá, faoina)
- **+Pref ** = Prefix; separated prefixes in historical texts
- **+Prep ** = Preposition
- **+Pres ** = Copula present & future tense
- **+PresImp ** = Present Habitual - Gnáthláithreach (verb bí only, and deireann (abair))
- **+PresInd ** = Present Indicative
- **+PresSubj ** = Present Subjunctive
- **+Pro ** = Pronoun with copula or relative particle
- **+Pron ** = Pronoun
- **+Prop ** = Proper noun
- **+Punct ** = Punctuation
- **+PUNCT ** = Punctuation (it seems several languages have two tags :-/)
- **+Q ** = Interrogative particle(q)
- **+Qty ** = Quantifier
- **+Quo ** = all quotation marks double, single etc.
- **+Ref ** = Reflexive particle
- **+Rel ** = Relative particle
- **+RelInd ** = rel. indirect
- **+RIGHT ** = Right side of pairwise symbol (parenthesis or quotation mark)
- **+Rom ** =
- **+Sbj ** = Subject pronouns: sí, sé and siad are used only when pron follows predicate verb in subject position, otherwise í, é and iad are used.
- **+Sem/Amount ** =
- **+Sem/Build ** =
- **+Sem/Build-room ** =
- **+Sem/Cat ** =
- **+Sem/Curr ** =
- **+Sem/Date ** =
- **+Sem/Domain ** =
- **+Sem/Domain_Hum ** =
- **+Sem/Dummytag ** =
- **+Sem/Edu_Hum ** =
- **+Sem/Event ** =
- **+Sem/Food-med ** =
- **+Sem/Group_Hum ** =
- **+Sem/Hum ** =
- **+Sem/ID ** =
- **+Sem/Lang ** =
- **+Sem/Mal ** =
- **+Sem/Mat ** =
- **+Sem/Measr ** =
- **+Sem/Money ** =
- **+Sem/Obj ** =
- **+Sem/Obj-el ** =
- **+Sem/Obj-ling ** =
- **+Sem/Org ** =
- **+Sem/Org_Prod-audio ** =
- **+Sem/Org_Prod-vis ** =
- **+Sem/Part ** =
- **+Sem/Plc ** =
- **+Sem/Prod-vis ** =
- **+Sem/Route ** =
- **+Sem/Rule ** =
- **+Sem/Sign ** =
- **+Sem/State ** =
- **+Sem/State-sick ** =
- **+Sem/Substnc ** =
- **+Sem/Sur ** =
- **+Sem/Time ** =
- **+Sem/Time-clock ** =
- **+Sem/Title ** =
- **+Sem/Tool-it ** =
- **+Sem/Txt ** =
- **+Sem/Veh ** =
- **+Sem/Year ** =
- **+Sg ** = Singular
- **+Short ** = Short determiner, e.g. m’, d’
- **+Simp ** = Simple preposition
- **+Slender ** = Adj qualifies a plural noun ending in a slender consonant
- **+Span ** =
- **+St ** = start bracket, quote etc
- **+Strong ** = same plural form for all cases
- **+Subj ** = Subjunctive mood/particle
- **+Subord ** = Subordinating conjunction
- **+Subst ** = substantive noun, functions like a noun, but lacks full inflectional paradigm
- **+Suf ** = -s verb suffix, e.g. a bhíonns
- **+Sup ** = Superlative particle (s), e.g. is
- **+Symbol ** = independent symbols in the text stream, like £, €, ©
- **+Temp ** = Temporal e.g. inniu, amárach etc.
- **+Typo ** = Typos, e.g. ta/ata instead of tá/atá
- **+Use/-GC ** =
- **+Use/-PLX ** =
- **+Use/-PMatch ** =
- **+Use/-Spell ** =
- **+Use/-TTS ** =
- **+Use/Circ ** =
- **+Use/GC ** =
- **+Use/NG ** =
- **+Use/PMatch ** =
- **+Use/SpellNoSugg ** =
- **+Use/TTS ** =
- **+V ** = n/a (Verb is used)
- **+VD ** = ditransitive verb
- **+VF ** = - form used before a word starting with a vowel or f+vowel
- **+VI ** = intransitive verb
- **+VT ** = transitive verb
- **+VTI ** = transitive & intransitive verb
- **+Var ** = variant spelling e.g. rabh instead of raibh or dheachaidh
- **+Vb ** = Verbal particle (Q)
- **+Verb ** = Verb
- **+Verbal ** = Verbal noun
- **+Voc ** = Vocative case
- **+Vc ** = Vocative particle
- **+Vow ** = Vowel-initial: used to allow past-tense Len, e.g. d’ith
- **+Weak ** = Weak plural (noun)
- **+XMLTag ** = XML tags in the text, e.g. <p>, etc.
- **+Xxx ** = Indecipherable speech (in transcribed speech)
- **+v1 ** = n/a
- **+v2 ** = n/a
- **^Adj ** = Adjective- used in initial mutations
- **^Ath ** = Athrú (Change) - in certain plurals the ending changes: “e” -> “í”, “each” -> “í”, “ach” -> “aí” etc., e.g. gealach -> gealaí (of the moon)
- **^C ** = nominative, genitive & vocative : initial mutations of plural nouns
- **^CB ** = compound boundary
- **^Caol ** = Caolú (slenderise) - Attenuation, i.e. slenderise the end of the word, usually by adding an “i” after the last broad vowel
- **^Coim ** = Coimriú - Syncopation: the last unstressed vowel is dropped, e.g. saghas (type) -> saghs+anna, solas -> soils+e (light), with attenuation also
- **^Def ** = DNTLS rule after definite article
- **^Do ** = d’ before Past Imperfect (gnáthchaite) and conditional
- **^Emph ** = emphatic forms
- **^F ** = feminine: initial mutations of singular nouns depend on whether the noun is masculine or feminine
- **^Fr ** = Fréamh (root): use the root, i.e. don’t syncopate in these cases
- **^G ** = genitive case
- **^GUESSNOUN ** = n/a - superseded by guesser FSTs
- **^IM ** = initial mutation marker e.g. mo chat, ar an mballa
- **^LC ** = Leathan/Caol (broad/slender): the part before the ending is broadened unless the ending begins with “t”
- **^Lea ** = Leathnú - Broadening, e.g. an “i” is removed: súil (eye); radharc na súl (eyesight)
- **^LeaS ** = The part before the ending is broadened unless the ending begins with “t”
- **^M ** = masculine: initial mutations of singular nouns depend on whether the noun is masculine or feminine
- **^Sé ** = Lenition (Séimhiú - softening)- h added after certain initial consonants (bcdfgmpst)
- **^Urú ** = Eclipsis (Urú)- a letter placed before word initial letter (bcdfgpt), e.g. “g” before “c” - “an cat” in gen. pl. becomes “bia na gcat”, (the cats’ food)
- **^V ** = Verb root
- **^VH ** = Maintains vowel harmony of broad and slender vowels. Motto: “leathan le leathan agus caol le caol” (broad with broad and slender with slender)
- **^VN ** = verbal noun
- **^aigh ** = remove -aigh ending
- **^hv ** = “h” before a vowel (e.g. éan: Nom. Pl. Masc. na héin - the birds)
- **^igh ** = remove -igh ending
- **^ts ** = “t” before “s”, e.g. sagart: Gen. Sg. Masc. teach an tsagairt - the priest’s house
- **^tv ** = “t-” before a vowel (e.g. éan: Nom. Sg. Masc. an t-éan - the bird)
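The two initial mutations described under ^Sé and ^Urú can be sketched in a few lines. This is a simplified illustration, not the analyser's rule set: the helper names `lenite` and `eclipse` are hypothetical, the eclipsis table is the standard Irish one rather than something read from the source files, and known exceptions (for instance "s" not leniting before certain consonants, or the DNTLS rule) are deliberately ignored.

```python
# Consonants that take lenition, as listed under ^Sé.
LENITABLE = set("bcdfgmpst")

# Eclipsis (urú): the letter(s) placed before each eclipsable initial consonant.
ECLIPSIS = {"b": "m", "c": "g", "d": "n", "f": "bh", "g": "n", "p": "b", "t": "d"}


def lenite(word: str) -> str:
    """Insert 'h' after a lenitable initial consonant (simplified)."""
    if word and word[0].lower() in LENITABLE and word[1:2].lower() != "h":
        return word[0] + "h" + word[1:]
    return word


def eclipse(word: str) -> str:
    """Prefix the eclipsing letter(s) to an eclipsable initial consonant."""
    prefix = ECLIPSIS.get(word[:1].lower())
    return prefix + word if prefix else word


print(lenite("cat"), eclipse("cat"), eclipse("balla"))
```

The sample words come from the examples in the tag list (mo chat, ar an gcat, ar an mballa); vowel-initial words such as éan are returned unchanged, since the ^hv and ^tv prefixes are separate triggers.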
Flag diacritics
- **@P.Pmatch.Loc ** = XXX
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics: compounds with verbs are allowed only if the verb is further derived into a noun again.

| Flag diacritic | Effect |
| --- | --- |
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
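To make the interplay of the P, D and C operators concrete, here is a minimal sketch of how a sequence of flag diacritics along one path could be evaluated. It follows the general flag-diacritic semantics (P sets, D disallows, C clears, R requires, U unifies) and leaves out the N operator; the function `path_allowed` is hypothetical, and the ordering of flags in the usage lines is one plausible reading of the NeedNoun mechanism, not something confirmed by the source files.

```python
import re
from typing import Dict, Iterable

FLAG_RE = re.compile(r"^@([PNRDCU])\.([^.@]+)(?:\.([^.@]+))?@$")


def path_allowed(flags: Iterable[str]) -> bool:
    """Return True if a path carrying these flags, in order, is accepted."""
    state: Dict[str, str] = {}
    for flag in flags:
        match = FLAG_RE.match(flag)
        if match is None:
            raise ValueError(f"not a flag diacritic: {flag!r}")
        op, feature, value = match.groups()
        current = state.get(feature)
        if op in "PU" and value is None:
            raise ValueError(f"{flag!r} needs a value")
        if op == "P":      # positive set: always succeeds
            state[feature] = value
        elif op == "C":    # clear: forget the feature
            state.pop(feature, None)
        elif op == "D":    # disallow: fail if the feature has this (or any) value
            if current is not None and (value is None or current == value):
                return False
        elif op == "R":    # require: fail unless the feature has this (or any) value
            if current is None or (value is not None and current != value):
                return False
        elif op == "U":    # unify: set if unset, otherwise values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
        else:              # "N" (negative set) is not modelled in this sketch
            raise NotImplementedError(flag)
    return True


# Verb in a compound, never nominalised: blocked.
print(path_allowed(["@P.NeedNoun.ON@", "@D.NeedNoun.ON@"]))
# Verb nominalised (flag cleared) before the check: allowed.
print(path_allowed(["@P.NeedNoun.ON@", "@C.NeedNoun@", "@D.NeedNoun.ON@"]))
```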
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

| Flag diacritic | Effect |
| --- | --- |
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

| Flag diacritic | Effect |
| --- | --- |
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
The Root lexicon etc.
- **LEXICON Root ** =
- ** Abbrev; ** =
- ** Prepositions; ** = Adpositions = Prepositions in
- ** Adverb; ** =
- ** Articles; ** =
- ** Conjunctions; ** =
- ** Determiners; ** =
- ** Interjections; ** =
- ** Fillers; ** =
- ** Communicators; ** =
- ** Events; ** =
- ** Anonymous; ** =
- ** Numerals; ** =
- ** Particles; ** =
- ** Personal_Pronouns; ** =
- ** Englishlex; ** = English lexicon including all parts of speech
- ** Communicators-English; ** = English multi word communicators, e.g. d’ya know
- ** Bardiclex; ** = classical Irish lexicon from TCD Bardic corpus
- ** Latinlex; ** = Latin lexicon from RIA historical corpus
- ** !Tobar; ** = omitting this (non-standard older forms)
- ** Punctuation; ** =
- ** Punctuation_ga; ** =
- ** Symbols; ** =
- ** XMLTags; ** = XML tags, e.g. <p>, etc.
- ** AdjA; ** = ORIGINAL TEST LEXICON
- ** AdjIrregular; ** = ORIGINAL TEST LEXICON
- ** Adj-BaseOnly; !AdjBASE; ** = ORIGINAL TEST LEXICON
- ** Adj-IrregComp; !AI-COMP; ** = ORIGINAL TEST LEXICON
- ** AdjB; ** = punk adjs
- ** AdjC; ** = FP adjs - auto
- ** AdjDath; ** = colours
- ** AdjE; ** = FP adjs - manual
- ** Adj-FGB1; ** = Foclóir Gaeilge Béarla Uí Dhónaill
- ** Adj-FGB2; ** = Foclóir Gaeilge Béarla Uí Dhónaill
- ** AdjVariants; ** = Adj Variants in FGB
- ** AdjEqualVariants; ** = Adj Variants with Equal Sign in FGB
- ** AdjF; ** = Nationalities
- ** AdjG; ** = additions from gaois.ie bitex
- ** Nouns; ** = ORIGINAL TEST LEXICON
- ** Dative; ** = ORIGINAL TEST LEXICON
- ** Other; ** = ORIGINAL TEST LEXICON
- ** NounsB; ** = nouns
- ** NounsC; ** = FP nouns (automatic)
- ** NounsD; ** = FP nouns (manual Decl 1-3)
- ** NounsE; ** = FP nouns (manual Decs 4-5)
- ** NounsF; ** = FP nouns (manual Irregular)
- ** NounsH; ** = Various from corpora
- ** NounsIrregular; ** =
- ** Substantive; ** =
- ** NounsFGB1; ** = FGB (O Donaill) automatic (in NCI corpus)
- ** NounsFGB2; ** = FGB (O Donaill) automatic (additional)
- ** NounsVariants; ** = Variants extracted from FGB
- ** NounsEqualVariants; ** = Variants extracted from FGB (2011 EUD)
- ** NounsG; ** = Proper Nouns - MOVED from Nouns TO Proper Nouns
- ** NP-LEX-FAM; ** = Family Names (Irish)
- ** NP-LEX-FAM-EN; ** = Family Names (English)
- ** NP-LEX-PERS; ** = Personal Names (Irish)
- ** NP-LEX-PERS-EN; ** = Personal Names (English)
- ** NP-LEX-EIRE; ** = Ireland - Counties, Cities and Towns (Irish)
- ** NP-LEX-EIRE-EN; ** = Ireland - Counties, Cities and Towns (English)
- ** NP-LEX-TIR; ** = Countries (Irish)
- ** NP-LEX-TIR-EN; ** = Countries (English)
- ** NP-Irregular; ** = Various Irregular Proper Nouns
- ** NP-LEX-ORG; ** = Organisations
- ** NP-LEX-LOGAINM; ** = Placenames - sample from logainm.ie
- ** NP-LEX-RIACORPAS1; ** = Various Proper nouns from RIA Historical Corpus of Irish
- ** VerbalNounsV; ** = Verbal nouns derived from verb roots
- ** VerbalNounsN; ** = Verbal nouns derived from nouns
- ** VerbalAdjs; ** = Verbal adjectives derived from verb roots
- ** VerbalNounsGenV; ** = Verbal nouns (genitive case) derived from verb roots
- ** VerbalNounsGenN; ** = Verbal nouns (genitive case) derived from nouns
- ** VN-Variants; ** = FGB VN variants (VN, VNG & VA included)
- ** VNEqualVariants; ** = FGB VN = variants (VN, VNG & VA included)
- ** Verbs; ** = Irregular verbs (11)
- ** VerbsC1A; ** = ORIGINAL TEST LEXICON
- ** VerbsC2A; ** = ORIGINAL TEST LEXICON
- ** VerbsB; ** = verbs
- ** VerbsC; ** = FP verbs
- ** VerbsD; ** = FP verbs
- ** Verbs-FGB1; ** = FGB verbs
- ** Verbs-FGB2; ** = FGB verbs
- ** Verb-Variants; ** = FGB verb variants
- ** VerbsEqualVariants; ** = FGB verb = variants
This (part of) documentation was generated from src/fst/morphology/root.lexc