Finite state and Constraint Grammar based analysers, proofing tools and other resources
View the project on GitHub giellalt/lang-sme
This file documents a preprocessor that is no longer in use. As
explained below, the current preprocessor (called preprocess) is
documented separately. The present document is kept
partly because it contains material that is independent of the choice of
preprocessor method, and partly because we might return to this
preprocessor at some point.
Within the Xerox framework, preprocessing is done with the tokenize tool. The
code itself is written as a set of regular expressions, and the source
file (tok.txt) is compiled by xfst into a preprocessor, tok.fst.
In the present project, we have temporarily abandoned the preprocessor
tok.fst (from spring 2004 on) and replaced it with a Perl-based
preprocessor (see the documentation for the file gt/script/preprocess).
The main reason why we abandoned the Xerox preprocessor documented here
is that its compilation time exceeded half an hour on victorio, and
several hours on local machines. Compilation time exploded because of
our use of the Replace operator @-> (see below): it looks for the
longest match, which takes time.
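To make "longest match" concrete, here is a minimal Python sketch of the matching behaviour only. It is not the xfst implementation and says nothing about compile time. The function name and the sample entries ("in spite", "in spite of") are hypothetical, chosen for illustration.

```python
def longest_match_tokens(text, entries):
    """Scan left to right; at each position prefer the longest listed entry."""
    # Try longer entries before their own prefixes (the longest-match idea).
    ordered = sorted(entries, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == " ":
            i += 1
            continue
        for entry in ordered:
            end = i + len(entry)
            # Accept an entry only if it ends at a space or at the end of the text.
            if text.startswith(entry, i) and (end == len(text) or text[end] == " "):
                tokens.append(entry)
                i = end
                break
        else:
            # No listed entry starts here: take everything up to the next space.
            j = text.find(" ", i)
            j = len(text) if j == -1 else j
            tokens.append(text[i:j])
            i = j
    return tokens
```

With both hypothetical entries listed, the text "in spite of rain" yields the multiword token "in spite of" followed by "rain", rather than stopping at the shorter entry "in spite".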
When the abbreviation list was part of the tok.txt file, every added abbreviation or multiword expression meant a pause of more than half an hour. Thus, we moved to a non-compiled solution, Perl, during the development phase. In a stable, finished parser, tokenize is probably faster than Perl, and we should consider migrating back. This documentation will be important if and when we migrate back, but in the meantime it also contains program-independent documentation on abbreviation handling that is worth reading.
The starting point for the preprocessor was the tok1.fst preprocessor file, written by Anne Schiller, and printed in the Karttunen/Beesley book (cf. the first cvs versions of the tok.txt file). This file has been revised several times. The leading idea behind the file is the following:
The tokenizer has two purposes: It cuts text into sentences, and it cuts sentences into words. Thus, symbols that are not letters or numbers are separated from words and numbers. Sentence delimiter symbols (.?!) are treated as separate tokens. In the morphological parser itself, these symbols are given the tag ‘+CLB’, for clause boundary.
The file is an xfst source file. It defines sets and joins them together as either words, symbols, abbreviations, initials, or numerals (all being referred to by the variable ‘Token’). Then, a newline (NL) is introduced after each token (Token + NL is called TOK1), and all spaces are replaced by newlines (TOK2). The abbreviations get a separate treatment, as described below. At the end of the tok.txt file, the different token types (TOK1, TOK2, and the different classes of abbreviations) are composed together into one regular expression.
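The effect of the TOK1 and TOK2 steps (one token per line, spaces gone) can be sketched in Python as follows. This is an illustration under stated assumptions, not a translation of tok.txt: the token classes in the regular expression are a guess at "numbers, words, and single symbols", and abbreviations are deliberately not handled.

```python
import re

# Assumed token classes: a number (optionally with internal . or ,),
# a word (optionally with internal hyphens), or any single other symbol.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Return the text with one token per line (each token followed by NL)."""
    tokens = TOKEN.findall(text)
    # Whitespace never matches TOKEN, so spaces vanish and only the
    # newline added after each token remains.
    return "".join(token + "\n" for token in tokens)
```

For example, `tokenize("Jeg kjøpte 3 epler.")` puts the final period on its own line, separate from "epler".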
The challenge is to handle abbreviations, like e.g. this one. Even though e.g. contains a final period, it shall not end a sentence. Then there are other abbreviations, like “Ltd.”, that may end sentences. The preprocessor thus divides the abbreviations into four groups, according to whether they take objects or not (i.e. according to whether there is an obligatory word or numeral following them or not):
Here is the rule set that lies behind the treatment of abbreviations:
We thus have four groups: TRANSABBR, INTRANSNUMABBR, INTRANSCAPABBR, and INTRANSABBR.
In other words:
TRANSABBR / INTRANSNUMABBR / INTRANSCAPABBR / INTRANSABBR + small -> no sentence boundary
TRANSABBR + capital -> no sentence boundary
TRANSABBR + number -> no sentence boundary
INTRANSNUMABBR + capital -> no sentence boundary
INTRANSNUMABBR + number -> a sentence boundary
INTRANSCAPABBR + capital -> a sentence boundary
INTRANSCAPABBR + number -> no sentence boundary
INTRANSABBR + capital -> a sentence boundary
INTRANSABBR + number -> a sentence boundary
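The table above can be written as a small lookup. The following Python sketch is only an illustration: the function names and the "other" category for tokens that start with neither a letter nor a digit are assumptions, not part of the original preprocessor.

```python
# For each abbreviation class: the kinds of following token that
# trigger a sentence boundary, exactly as listed in the table above.
BOUNDARY_AFTER = {
    "TRANSABBR": set(),
    "INTRANSNUMABBR": {"number"},
    "INTRANSCAPABBR": {"capital"},
    "INTRANSABBR": {"capital", "number"},
}

def next_kind(token):
    """Classify the token that follows the abbreviation by its first character."""
    if not token:
        return "other"
    first = token[0]
    if first.isdigit():
        return "number"
    if first.isupper():
        return "capital"
    if first.islower():
        return "small"
    return "other"  # assumption: anything else never triggers a boundary

def is_sentence_boundary(abbr_class, next_token):
    """True if the abbreviation's period also ends the sentence."""
    return next_kind(next_token) in BOUNDARY_AFTER[abbr_class]
```

A following small letter never appears in any of the sets, so it never yields a boundary, which matches the first line of the table.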
It is better to have too few sentence boundaries than too many. Of the four
sets listed above, the first never invokes a sentence boundary, and the
following ones invoke an increasing number of them. Thus, when in doubt,
choose the set for the abbreviation in question in this order of preference:
TRANSABBR is better than INTRANSNUMABBR, which is better than
INTRANSCAPABBR, which is better than INTRANSABBR.
Jeg kjøpte epler. de var dyre. ("I bought apples. they were expensive.") A sentence boundary, rule 1. A sentence
beginning with a small letter will be found in the grammar checker.
Siv.ing. Pia Aho stakk innom. ("Siv.ing. Pia Aho dropped by.") No sentence boundary, rule 3.
Siv.ing. og kunstner Pia Aho stakk innom. ("Siv.ing. and artist Pia Aho dropped by.") No sentence boundary, rule