Somali morphological analyser
INTRODUCTION TO THE MORPHOLOGICAL ANALYSER OF SOMALI.
Multichar_Symbols definitions
Analysis symbols
The morphological analyses of Somali wordforms are presented in this system in terms of the following symbols. (It is highly recommended to follow existing standards when adding new tags.)
The parts-of-speech are:
- +N
- +V
- +A
- +Adp
- +Q
- +Pr
- +Adv
- +CC
- +CS
- +Interj
- +Pron
- +Num
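As an illustration of how these multicharacter symbols appear in an analysis string, the following minimal Python sketch splits an analysis into a lemma and its tags. The example analysis string and the helper name `split_analysis` are hypothetical and not taken from the analyser. The sketch assumes that every tag starts with `+` and that the lemma itself contains no `+`.

```python
def split_analysis(analysis: str) -> tuple[str, list[str]]:
    """Split 'lemma+Tag1+Tag2' into ('lemma', ['+Tag1', '+Tag2'])."""
    lemma, *tags = analysis.split("+")
    # Put the leading '+' back so the tags match the symbols listed in this file.
    return lemma, ["+" + tag for tag in tags]


# Hypothetical analysis string, used only to show the tag layout.
lemma, tags = split_analysis("buug+N+Masc+Sg+Abs")
print(lemma, tags)
```

Keeping the `+` on each tag makes it easy to compare the result directly against the symbol lists in this document.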
Fusional adpositions
- +Adp/u Fusional ú
- +Adp/ka Fusional ká
- +Adp/la Fusional lá
- +Adp/ku Fusional kú
Object pronouns in adpositional+pronoun
- +1SgObj/i E.g.: la + i + lá -> laylá
- +2SgObj/ku E.g.: la + ku + lá -> lagulá
Focus
- +Foc/L Focus markers baa and ayaa
- +Foc/R Focus marker waxaa
- +Foc/V ??2023-12-07 validating Jaska
- +Foc/baa ??2023-12-07 validating Jaska
The parts of speech are further split up into:
- +Interr
- +Interr/ma
- +Attr
- +Short
- +Cmp
- +Der/sho
- +Pfx
- +PP
- +Sep
- +Com
- +Appos
- +Impers
- +Inch
- +Recit
- +Restr
- +Pers
- +Dem
- +Coll
- +Mass
- +Acr
- +Abstr
- +Abbr
- +Prop
- +Null ??2023-12-07 validating Jaska
- +V/ah ??2023-12-07 validating Jaska
- +V/ ??2023-12-07 validating Jaska
Verb and noun declensions, for the analysers that want to know about them. NOTE: We probably do not want to tag these; this is morphological and not morphosyntactic info.
- +Decl/1
- +Decl/2
- +Decl/2A
- +Decl/2B
- +Decl/3
- +Decl/3A
- +Decl/3B
- +Decl/4
- +Decl/5
- +Decl/6
- +Decl/7
The Usage extents are marked using the following tags:
- +Err/Orth
- +Use/-Spell
- +Use/Circ
- +Use/CircN
- +Err/Lex
- +Use/Marg
- +Use/NG
- +Use/Ped
- +Use/SpellNoSugg+Prog
Nominals are inflected for number, case, and definiteness:
- +Sg
- +Pl
- +Nom
- +Abs
- +Gen
- +Indef
- +Def
Nominals are also inflected for gender:
- +Masc
- +Fem
Nominals marked for gender undergo gender polarity changes in the plural. We want to mark +Masc and +Fem so that disambiguation is easier, but it is also useful to know the gender of the lemma, since it is not predictable from a given plural form.
- +M→M
- +M→F
- +F→M
- +F→F
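The polarity tags can be unpacked as in the following minimal Python sketch. It assumes that the symbol before the arrow gives the gender of the lemma and the symbol after it gives the gender of the plural form; this reading is an assumption, and the names `POLARITY_TAGS`, `lemma_gender` and `plural_gender` are hypothetical.

```python
# Assumed reading: left of the arrow is the lemma (singular) gender,
# right of the arrow is the gender of the plural form.
POLARITY_TAGS = {
    "+M→M": ("+Masc", "+Masc"),
    "+M→F": ("+Masc", "+Fem"),
    "+F→M": ("+Fem", "+Masc"),
    "+F→F": ("+Fem", "+Fem"),
}


def lemma_gender(polarity_tag: str) -> str:
    """Return the gender tag of the lemma for a polarity tag."""
    return POLARITY_TAGS[polarity_tag][0]


def plural_gender(polarity_tag: str) -> str:
    """Return the gender tag of the plural form for a polarity tag."""
    return POLARITY_TAGS[polarity_tag][1]


print(lemma_gender("+M→F"), plural_gender("+M→F"))
```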
Nominals also have affixed demonstratives
- +Prox -0
- +Dist -ii
- +Near -aas / -aasi
- +Far -eer / -eeri
- +Farther -oo / -ooyi
- +Close -an / -anu / -ani
Are these in use?
- +Adc
- +Apr
- +Prl
- +Cns
- +Ord
Possession is marked as follows:
- +PxSg1
- +PxSg2
- +PxSg3F
- +PxSg3M
- +PxPl1
- +PxPl1Incl
- +PxPl1Excl
- +PxPl2
- +PxPl3
The comparative forms are:
- +Comp
- +Superl
Numerals are classified under:
- +Attr
- +Card
- +Ord
Verb moods are:
- +Ind
- +Opt
- +Imprt
- +Neg
- +Imper
Verb tenses
- +Past
- +Pres
Verb aspects are:
- +Prog
Verb personal forms are (NB: no inclusive/exclusive):
- +1Sg
- +2Sg
- +3Sg
- +3SgM
- +3SgF
- +1Pl
- +2Pl
- +3Pl
Verbs also mark some non-agreement syntactic information
- +Red occurs often with subjects that are focused
- +Rel the verb is within a relative clause, and is also case marked.
Other verb forms are
- +Inf
- +Ger
- +ConNeg
- +ConNegII
- +Neg
- +ImprtII
- +PrsPrc
- +PrfPrc
- +Sup
- +VGen
- +VAbess
Abbreviated words are classified with:
- +ABBR
- +Symbol = independent symbols in the text stream, like £, €, ©
- +ACR
Special symbols are classified with:
- +CLB
- +PUNCT
- +LEFT
- +RIGHT
- +MIDDLE
The verbs are syntactically split according to transitivity:
- +TV
- +IV
- +DV
Special multiword units are analysed with:
- +Multi
Non-dictionary words can be recognised with:
- +Guess
- ^GUESSNOUNROOT
Question and Focus particles:
- +Qst
- +Foc
Semantics are classified with
- +Sem/Plc
Derivations are classified under the morphophonetic form of the suffix, the source and target part-of-speech.
- +V→N
- +V→V
- +V→A
- +Der/xxx
- +Incl
- +Excl
Syntactic tags. We do not want to use +Acc, because it is not relevant for nouns:
- +Subj
- +Sem/Obj
Nominal MSP
- +Rel
Derivation
- +Der/A
- +Der/V
- +Der/N
Clitics
- +Clit/ba
- +Clit/se
- +Clit/na
- +Clit/oo
- +Clit/CS
- +Clit/Without
Style
- +Use/NG
- +Sty
- +Sty/TODO
- +Sty/i
- +Sty/D
- +Sty/R
- +TODO
Morphophonology
To represent phonological variations in word forms we use the following symbols in the lexicon files:
- {N} For tagging certain twolc rules as nominal-only
Going to try to replace these with flag diacritics if possible.
And the following triggers to control variation:
- {#} # -
TODO: no need for , but needs to be removed in all files
- {m} in nouns: for marking m~n alternations
- {mm} in nouns: rare instance of mm ~ n
- {C2} in nouns: consonant reduplication in noun declension 4. (yaab ~ yaabab)
- {X} in nouns: insertion of some kind in noun definiteness. TODO: twolc rule no longer exists?
- {ae} in verbs: umlaut of a~i in some verb stems (seems restricted to specific lexemes, not productive)
- {e} in nouns: -e- variation in declension 7 (waraabe ~ waraabaha), not 100% predictable
- {-e} in nouns: delete final -e, often used in conjunction with {a}, possible room for cleaning up.
- {a} in verbs: Mostly V3B: has alternation between o ~ a. (sigo ~ sigaday)
- {-V} in verbs: deletion of specific vowel, used only in affixes, to make stems prettier? room for cleaning
- {-I} in verbs: -i- deletions in V3A and -san adjectives
- {-a} used specifically in -sho derivations. TODO: change to rule with » ?
- {E} part of cliticized ee (CS+Appos)
- {y} in verbs: -y- deletion in certain parts of V2
Tone
- ´´
Symbols that need to be escaped on the lower side (towards twolc):
- »7: Literal »
- «7: Literal «
- %[%>%]: Literal >
- %[%<%]: Literal <
Flag diacritics
We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics. They allow compounds with verbs only if the verb is further derived into a noun again:
| @P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
| @C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.
| @P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
| @D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
| @P.CmpPref.FALSE@ | Block these words from making further compounds |
| @D.CmpLast.TRUE@ | Block such words from entering R |
| @D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
| @U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
| @P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
| @D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.
| @U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
| @U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
- @P.VCLASS.V1@ @R.VCLASS.V1@
- @P.VCLASS.V1ow@ @R.VCLASS.V1ow@
- @P.VCLASS.V1ow2@ @R.VCLASS.V1ow2@
- @P.VCLASS.V2A@ @R.VCLASS.V2A@
- @P.VCLASS.V2B@ @R.VCLASS.V2B@
- @P.VCLASS.V3A@ @R.VCLASS.V3A@
- @P.VCLASS.V3B@ @R.VCLASS.V3B@
- @P.VCLASS.V3B_ADel@ @R.VCLASS.V3B_ADel@
- @P.VCLASS.V3B_ADelPart@ @R.VCLASS.V3B_ADelPart@
- @P.VCLASS.PREFIXING@
- @U.VCLASS.V1@
- @U.VCLASS.V2A@
- @U.VCLASS.V2B@
- @U.VCLASS.V3A@
- @U.VCLASS.V3B@
- @U.VCLASS.UREFIXING@
- @P.ATR.True@
- @R.ATR.True@
Person flags
- @U.Pers.1Sg@
- @U.Pers.2Sg@
- @U.Pers.3SgM@
- @U.Pers.3SgF@
- @U.Pers.1Pl@
- @U.Pers.2Pl@
- @U.Pers.3Pl@
- @P.Pers.1Sg@
- @P.Pers.2Sg@
- @P.Pers.3SgM@
- @P.Pers.3SgF@
- @P.Pers.1Pl@
- @P.Pers.2Pl@
- @P.Pers.3Pl@
- @R.Pers.1Sg@
- @R.Pers.2Sg@
- @R.Pers.3SgM@
- @R.Pers.3SgF@
- @R.Pers.1Pl@
- @R.Pers.2Pl@
- @R.Pers.3Pl@
- @R.Gender.Masc@
- @P.Gender.Masc@
- @R.Gender.Fem@
- @P.Gender.Fem@
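The way flag diacritics of the kinds listed above restrict a path can be sketched with a small Python simulation. This is a simplified illustration of the usual P (set), R (require), D (disallow), U (unify) and C (clear) operators, not the implementation used by the FST tools. The function name `path_is_allowed` and the example flag sequences are hypothetical, and the N (negative set) operator is left out.

```python
import re

# Matches @OP.FEATURE@ or @OP.FEATURE.VALUE@ for the operators handled here.
FLAG = re.compile(r"@([PRDUC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_allowed(flags: list[str]) -> bool:
    """Return True if the flag sequence can be traversed without a clash."""
    state: dict[str, str] = {}
    for flag in flags:
        match = FLAG.fullmatch(flag)
        if match is None:
            raise ValueError(f"not a supported flag diacritic: {flag}")
        op, feature, value = match.groups()
        if op in "PU" and value is None:
            raise ValueError(f"{flag} needs a value")
        current = state.get(feature)
        if op == "P":      # positive set: always succeeds
            state[feature] = value
        elif op == "C":    # clear: forget the feature
            state.pop(feature, None)
        elif op == "R":    # require: feature must be set (to value, if given)
            if current is None or (value is not None and current != value):
                return False
        elif op == "D":    # disallow: fail if feature is set (to value, if given)
            if current is not None and (value is None or current == value):
                return False
        elif op == "U":    # unify: set if unset, otherwise values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# Hypothetical flag sequences, for illustration only.
print(path_is_allowed(["@P.VCLASS.V1@", "@R.VCLASS.V1@"]))   # True
print(path_is_allowed(["@P.VCLASS.V1@", "@R.VCLASS.V2A@"]))  # False
print(path_is_allowed(["@U.Pers.1Sg@", "@U.Pers.2Sg@"]))     # False
```

In this sketch a P/R pair acts as a long-distance agreement check: the stem sets a class, and only suffix lexica that require the same class stay on a valid path.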
The continuation lexica
The word forms in Somali start from the lexeme roots of basic word classes, or optionally from prefixes:
- LEXICON Root
- Abbreviations ;
- Nouns ;
- ProperNouns ;
- Numerals ;
- Pronouns ;
- Verbs ;
- IrregularVerbs ;
- VerbalPrefixes ; Certain VP elements often get combined with the verbs in writing.
- Adjectives ; Some have verb morphology, and some view them to just be a 4th declension of verbs.
- Adverbs ;
- Conjunctions ;
- Subjunctions ;
- Adpositions ;
- Determiners ;
- Interjections ;
- Punctuation ;
- Symbols ;
The following come from som-lex.txt:
- IrregularAdjective ;
- Prefixes ;
- LEXICON FINAL_NG just adds the +Use/NG tag to lower ##
- LEXICON FINAL just adds lower ##
These lexica are dummy lexica to make the source compile; they contain only #.
- LEXICON Proper
- LEXICON Unknown_Declensions
- LEXICON Obj_Pron
- LEXICON SemiReducedPerson
This (part of) documentation was generated from src/fst/morphology/root.lexc