Language Independent Tags In The Giella Infra
There are a number of classes of tags where the classes are language independent, but the actual tags are language specific. Some examples of such classes of tags are:
- Error tags: tags describing parts of the language outside the established norm
- Dialect tags: tags describing variation (in the written language) based on dialect
- Derivation tags: tags describing derivational morphology
All such classes of tags are described below. New classes will probably be added in the future, but we’ll try to keep the document updated. See also the documentation for each language.
Each class is recognised by having a tag prefix, a short string starting
with “+” (for suffix tags; prefix tags for prefixing languages end with
+ as their last character) and ending with “/”. Examples of such tag
prefixes are: +Err/
, +Dial/
etc.
It is assumed — and required — that all tags described here (and all other tags,
for that matter) are declared as multichar symbols in the root.lexc
file of
each language.
Error tags
The error tag class is defined as follows:
- Tag prefix:
+Err/
- Definition: tags describing parts of the language outside the established norm
- FST implication: all strings containing one or more such tags are removed from all normative transducres
Dialect tags
The dialect tag class is defined as follows:
- Tag prefix:
+Dial/
- Definition: tags describing (written) variation based on dialect
- FST implication: when the
DIALECTS
variable is set inconfigure.ac
, one filter for each dialect defined there is built automatically. Each filter will remove all strings tagged with a dialect different from the one specific to the filter. Untagged strings will be left as is. The dialect tags are presently only made use of in Oahpa generators.
Other notes:
- The first character after the / must be one of
+
or–
, denoting either inclusion (the entry/form is valid for the specified dialect) or exclusion (the entry/form is NOT valid for the specified dialect - but for all others) - The string following / and +/– must be one of the strings
specified in
configure.ac
for the variableDIALECTS
.
Area tags
The area/country tag class is defined as follows:
- Tag prefix:
+Area/
- Definition: tags describing (written) variation based on country or another geographical unit, as per the ISO 3166 standard.
- FST implication: not yet actively used, but will be used to build proofing tools and possibly other normative fst’s where strings for other areas than the specified one will be removed. This will e.g. make the Lule Sámi speller for Sweden a better tool, as all strings with Norwegian æ (except for in names) will be removed: smaller and faster, and with less irrelevant suggestions.
Other notes:
- The tag prefix must be followed by an ISO 3166 string.
Semantic tags
The semantic tag class is defined as follows:
- Tag prefix:
+Sem/
- Definition: tags describing semantic properties of the lexeme
- FST implication: all semantic tags are automatically identified, and a couple of filters for manipulating them are built and applied, see further notes below.
Other notes:
- the raw fst: the semantic tags are moved relative to the POS tag, to ensure consistent tag ordering
- all fst’s except disambiguators and grammar checkers: the semantic tags are removed.
- disambiguators and grammar checkers: the tags are kept (i.e. they are untouched)
Derivation tags
The derivation tag class is defined as follows:
- Tag prefix:
+Der/
- Definition: tags describing derivational morphology
- FST implication: there is no language-independent processing of these tags ATM
Originating language tags
The originating language tag class is defined as follows:
- Tag prefix:
+OLang/
- Definition: tags describing originating language for loan words in cases where such information is required to get proper pronunciation in speech synthesis
- FST implication: there is no language-independent processing of these tags ATM,
and they are removed from all fst’s; for North Sámi there is some
language-specific processing to split the lexical fst into separate
fst’s for each defined
+OLang/
language, after which it is possible to apply OLang-specific phonetic rules
Other notes:
So far the only speech synthesis system we have built is for North Sámi. It was furthermore built without using our text processing technology, and the features being made possible with these tags (ie pronouncing «u» as /ʉː/ instead of the default /uː/) has so far not been put to use. But we expect that to change in the future, as the text processing is applied to open-source speech synthesis systems such as Festival and Simple4All.