The main introduction to CG-2 is Tapanainen 1996. Karlsson & al 1992 gives a good introduction to CG-1, and also the most thorough presentation of the philosophy behind the constraint grammar framework.
The project uses the CG-2 formalism, which is presented below. The concrete implementation is vislcg.
The disambiguation file has the suffix .rle; in our case the files are called sme-dis.rle, smj-dis.rle, etc. The file consists of the following sections (an additional section, CORRECTIONS, may also be used; it then follows the CONSTRAINTS sections):
There are four delimiters: “.”, “?”, “¶” and “!”. The section thus contains the following line only:
DELIMITERS = "<.>" "<!>" "<?>" "<¶>";
The last delimiter is inserted by the corpus processor in order to single out titles and other headings.
This section is introduced with the heading SETS. Note that this heading must be removed in order to run the parser with Connexor’s parser mdis.
The tags are introduced to the parser as lists and sets. Tags or tag combinations that are not introduced here must be referred to within parentheses. Lists and sets are defined according to the following principles:
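As a minimal sketch of what such declarations can look like in CG-2 syntax (the tag names N, A, Pron, Sg, Nom and Adv and the list and set names are hypothetical and are not taken from the project's files):

```
# A list names one or more tags; any one of them matches.
LIST NOMINAL = N A Pron ;
# A tag combination inside a list is written in parentheses.
LIST SG-NOM = (Sg Nom) ;
# A set is built from lists or sets that are already declared.
SET NOMINAL-OR-SG-NOM = NOMINAL | SG-NOM ;
# In the CONSTRAINTS section: Adv is not declared above,
# so the rule refers to it within parentheses.
REMOVE (Adv) IF (1 NOMINAL) ;
```

The rule on the last line is purely illustrative and makes no linguistic claim; it only shows the difference between a declared name (NOMINAL) and an undeclared tag in parentheses.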
Barriers are used to constrain the scope of rule contexts. By now there are so many complex barrier sets that rule writers may need systematic documentation of the barriers, their linguistic implications and the consequences of their use. Often enough, barriers have turned out to implement linguistic boundaries other than those originally intended.
Another fact is that, so far, complex barriers do not exist. So we are really talking about signal words rather than barriers that denote a particular construction.
CC
SET S-BOUNDARY = CP | CS | SEMICOL | COL ;
# remember that (“,”) and CC are potential sentence boundaries, too
SET NP-BOUNDARY = CC | COMMA ;
# remember that those are potential sentence boundaries, too
SET BOUNDARY = S-BOUNDARY OR NP-BOUNDARY ;
SET CRD = COMMA | CC | NEGFOC | XGO ;
# coordinators
As a result, we get problems caused by the formalism, such as inserted (parenthetical) phrases that themselves fulfill other barrier criteria, with the consequence that a certain construction is not recognized.
SET INTR = REL | MO | PUNCT-LEFT ;
# interrupters
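A rule that uses one of the sets above as a barrier could look like the following sketch. The rules, the tag Inf and the set AUX are hypothetical and serve only to show where the barrier goes; S-BOUNDARY and BOUNDARY are the sets defined above.

```
# Scan leftwards (*-1) for a word matching AUX; stop scanning as
# soon as a word matching S-BOUNDARY is met.
SELECT (Inf) IF (*-1 AUX BARRIER S-BOUNDARY) ;

# Alternative with the wider BOUNDARY set: a comma or CC now also
# stops the scan, even where it does not mark a clause boundary.
SELECT (Inf) IF (*-1 AUX BARRIER BOUNDARY) ;
```

The two variants illustrate the trade-off noted in the comments on the sets: a narrow barrier lets the scan run across commas and coordinators that may in fact be sentence boundaries, while a wide barrier stops the scan at every comma or coordinator.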
SET NPNH:
This set, a negation of the set PRE-NP-HEAD, originally denotes words
that cannot be modifiers of an NP. However, NPNH is sometimes used as
a BARRIER at a point in the rule file (or rather, in the rule order)
where Gen words still carry the Acc option. If NPNH is used as a BARRIER
there, the rule does not work, since Acc is not a member of the set
PRE-NP-HEAD.
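The effect can be sketched with a hypothetical rule. The rule and the set N are assumptions made for illustration; Gen, Acc, NPNH and PRE-NP-HEAD are the names used above.

```
# Choose Gen if a noun follows somewhere to the right, scanning
# only across words that can still belong to the same NP.
SELECT (Gen) IF (*1 N BARRIER NPNH) ;
```

If a word between the target and the noun is still ambiguous between Gen and Acc when this rule applies, its Acc reading falls outside PRE-NP-HEAD and therefore, on this reading of the description above, counts as NPNH. The scan stops at that word and the rule does not apply, although the word would have been a valid genitive modifier once disambiguated.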
The constraints of the North Saami file are documented here.
tbw.