Faroese disambiguator
Usage, in lang-fao:
cat text.txt | hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g src/cg3/disambiguator.cg3
This file documents the Faroese disambiguator file.
Delimiters, tags and sets
Delimiters
Tags and sets
Tags
Declaring all tags from the FST.
- LIST NAGD = Nom Acc Gen Dat ;
- LIST AGD = Acc Gen Dat ;
- LIST AD = Acc Dat ;
- LIST ND = Nom Dat ;
- LIST GENDER = Msc Fem Neu ;
- LIST NUMBER = Sg Pl ;
Sets
Combining tags into useful sets
Noun sets
Adjective sets
Nominal sets
Verb sets
Noun-Verb sets
Number sets
Preposition sets, taking different cases
- Union sets: ACCPREP, DATPREP, ACCDATPREP, ACCGENPREP, ACCDATGENPREP
- Intersection sets: SOMEACCPREP, SOMEDATPREP, SOMEGENPREP, SOMEACCDATPREP
Boundary sets
Case sets
These are sets of cases, not sets of prepositions choosing them.
NOTNOM, NOTDAT, etc.: Some case, but not…
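As a rough illustration of how case sets of this kind can be built from the tag lists, here is a minimal CG-3 sketch. The single-tag lists and the two SET definitions are assumptions made for illustration; the actual definitions in disambiguator.cg3 may differ (NOTNOM could, for instance, simply reuse AGD).

```cg3
# Tag list quoted from the Tags section of this page.
LIST NAGD = Nom Acc Gen Dat ;

# Single-tag lists; assumed here so the sets below have something to subtract.
LIST Nom = Nom ;
LIST Dat = Dat ;

# "Some case, but not Nom" and "some case, but not Dat":
# the Except operator (-) excludes readings that match the right-hand set.
SET NOTNOM = NAGD - Nom ;
SET NOTDAT = NAGD - Dat ;
```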
Word sets
- Sem/ID when preceded by § or currency (more should be added here)
- sub3, sub2, sub1 remove non-lexicalised compounds, the most complex ones first
Test: Go for minimal weight. This rule gives priority to lexicalised forms.
- Guess, Err/Guess remove Guess readings when other readings are available
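The idea behind removing guessed readings can be sketched with a single unconditional rule. The rule name below is hypothetical and the rule is not copied from disambiguator.cg3; only the tag Guess comes from the description above. By default, vislcg3 does not let a REMOVE rule delete the last remaining reading of a cohort, so a rule like this only takes effect when other readings are available.

```cg3
# Hypothetical rule name; a sketch, not the rule from the grammar.
# Removes readings tagged Guess. Cohorts where every reading is a
# Guess reading are left untouched, since the last reading is protected.
REMOVE:sketchGuess (Guess) ;
```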
Early and popular rules
Infinitive
- r50, IM1, IM2, InfAndInf choose Inf and at when co-occurring
- r36, r37, ImForV choose CS over IM and Pr for at (Pr, CS, IM)
- r2 selects Pr for á when subcategorising for Dat, Acc.
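To give an impression of what a rule of the last kind might look like, here is a minimal CG-3 sketch. The rule name and the context condition are assumptions made for illustration; the real r2 rule in disambiguator.cg3 is not reproduced here and may use different or additional conditions. The list AD is the one declared in the Tags section.

```cg3
LIST Pr = Pr ;
LIST AD = Acc Dat ;

# Hypothetical sketch: choose the preposition reading of "á"
# when the following word has an Acc or Dat reading.
SELECT:sketchPrForA Pr IF (0 ("á")) (1 AD) ;
```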
A or N
Disambiguate A and N based on context.
TAD, Pron or Det
Adjective disambiguation in NP
Case disambiguation
Noun disambiguation
Conjunctions
No rules so far
Subjunctions
Verbs
Passive
-st
- bera
- hava
- muna
- síggja
vera
verða
kunna
koma
Plural
PP disambiguation
Preposition or not?
á
av
millum
móti
til
tíður
um
undir
við
Case within PP phrases
POS disambiguation
Adjectives
kalur
Pronouns
Pron Pers or Det
Det
Pron not N
Proper nouns
Specific lexemes, words
- aftan
- allur
- at
- á
- ár
- bara
- eg
- ein
- eingin
- hava
- hann
- her
- hetta
- hon
- húsi
- ið
- innan
- liggja
- men
- munandi
- niðan
- nú
- ruður
- seg
- sjalvur
- skal
- tann
- tá and tá ið
- frammanfyri and others
- um
- unglingi
- inni
- á
- ver “ver” CmpNP/None N Neu Sg Acc Def = verið
Adverbs
General adverb
Specific adverbs
akkurát
- bara
- mikið
- næstan
- her
Lexicalised adverbs.
- bara
- heldur
- líka
- væl
- saman (not samur)
Adverb verbs
Idioms
Numerals and number symbols
- NumRom in beginning of sentence
NP internal constraints
Determiner disambiguation
Specific determiners
- tað: Nothing so far.
- summi
Postnominal determiner disambiguation
- Possessor number…
- Possessor case…
Definiteness disambiguation
Define definiteness based upon case concordance.
Case disambiguation
Noun disambiguation
Poss disambiguation
Ensuring case concordance within poss phrases
Number disambiguation
Coordination
Embedded clause V topicalisation
Elliptic AP as NP
P chains or not
Pronoun disambiguation
NP Coordination
VP disambiguation
V or A
V or Adv
V or N
Infinitive
Imperative
The best would be to make a corpus of imperative sentences, identify all the imperatives, and then just remove the rest.
Here come all rules selecting Imp. (so far only one)
- ImpNotSup
- CoordImp if part of Imp coordination
Then we remove the remaining ones.
- RemoveImp removes all Imp
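The select-first, remove-the-rest strategy described above can be sketched in CG-3 as follows. The rule names and the context of the SELECT rule are hypothetical and are not taken from disambiguator.cg3; the sketch only shows how rule order and the protection of the last reading interact.

```cg3
LIST Imp = Imp ;

# Step 1 (hypothetical context): keep Imp for a sentence-initial word.
# (>>>) is the CG-3 tag marking the beginning of the window.
SELECT:sketchSelectImp Imp IF (-1 (>>>)) ;

# Step 2: remove Imp everywhere else. Cohorts where step 1 left only
# Imp readings are unaffected, because a REMOVE rule does not delete
# the last remaining reading(s) of a cohort by default.
REMOVE:sketchRemoveImp Imp ;
```

Because the rules are applied in the order they are written, every SELECT rule for Imp has to come before the final REMOVE rule.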
Present participle
Supine
Present singular
Present plural
V + Refl
Past indicative
Perfect participle
Case disambiguation
Nominative
Predicative
- TadCopNom, TadCopSg for tað + copula + nominative
Subject
- NotTopAcc when sentence-initial NP cannot be Acc
- NotTopAcc2 when sentence-initial NP cannot be Acc
- NomV3 Nom, not Acc in 3rd position when Adv is topicalised
- atNom gives Nom between CC and vera.
Miscellaneous
Accusative
- AccNotDatObj gives Acc not Dat for TV objects
Mær dámar
Genitive
Pronoun disambiguation
seg
Verb disambiguation
A or V
Person
Number disambiguation
Postverbal subject
Gender disamb of adjectives
Gender disamb of numerals
Case disamb of numerals
Ordinals
Coordination
- PrtNotPtc, PrtNotA give Prt instead of participle. TODO: Find counterexamples and write rules (still not found).
Adjective disambiguation outside NP
Substituting tags
Titles
CC Coordinate NPs
AFTER-section
For Apertium
- NotDat removes Dat when we did not find Dat assigners
MAPPING OF CC AND CS
Mostly we map both @CNP and @CVP, then select @CNP, and after that remove the remaining @CNP so that @CVP remains.
- CCasCNPCVP Map (@CNP @CVP) to CC
- killAllahtenotCS All occurrences of “at” are CSs.
- CS removes CS between CS
- CmpNPNone removes CmpNPNone
- Kill Sem/ID
- killAllCNP removes all remaining @CNP
- ErrOrth goes for correct forms
- NomFragment chooses Nom when no case-assigners present for Acc, Dat
- X removes readings with no syntax
This (part of) documentation was generated from src/cg3/disambiguator.cg3