This document describes the main tasks involved in editing proper nouns, first laying out all steps in as much detail as possible, then trying to generalise and abstract somewhat. The end result will be used as a guideline for developing the web interface for our proper noun lexicon in risten.no.
Simple editing of single entries will not be covered, as we already have a basic model and interface for that case.
Starting point: we have a list of names for a language
rev + sort + rev & manual check/sort
1. cons-final, heavy syll => BERN
1. cons-final, light syll => LONDON
1. … (cf. the not entirely up-to-date documentation)
1. manually tagging cont-lex (with some help from the phonol./metr. structure)
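The `rev + sort + rev` step above groups names with the same ending next to each other, which makes the manual cont-lex check easier. A minimal sketch of the same effect in Python, assuming plain code-point ordering (the Unix `sort` command may order differently depending on locale); the function name `reverse_sort` is made up for illustration:

```python
def reverse_sort(names):
    """Sort names by their reversed spelling, like `rev | sort | rev`."""
    # Sorting on the reversed string puts names with the same ending together.
    return sorted(names, key=lambda name: name[::-1])


if __name__ == "__main__":
    for name in reverse_sort(["London", "Bern", "Accra", "Marja", "Nystø"]):
        print(name)
```

Names ending in a vowel and names ending in a consonant end up in separate runs of the list, so each run can be checked and tagged in one pass.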
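(see table below)

| Type / StemCoda | CG | IllChange | Loc | Lexicon | Status |
| --- | --- | --- | --- | --- | --- |
| LightVow | no | yes | -s | ACCRA | tmp |
| LightVow | yes | yes | -s | MARJA | |
| HeavyVow | no | no | -as | NYSTØ | tmp |
| HeavyCns | no | no | -as | BERN | |
| LightCns | no | no | -is | LONDON | |
| -nen | no | no | -as/nenis | C-FI-NEN | |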
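The table above can be treated as a lookup from stem properties to a suggested continuation lexicon. The sketch below only illustrates that idea: it assumes that CG stands for consonant gradation and that the stem type plus CG is enough to pick a row, and the names `LexiconRow` and `suggest_lexicon` are hypothetical.

```python
from typing import NamedTuple, Optional


class LexiconRow(NamedTuple):
    stem_coda: str    # first column of the table, e.g. "LightVow"
    cg: bool          # CG column (assumed to mean consonant gradation)
    ill_change: bool  # IllChange column
    loc: str          # Loc column (locative ending)
    lexicon: str      # continuation lexicon
    status: str       # "tmp" or empty


# Rows copied from the table above.
ROWS = [
    LexiconRow("LightVow", False, True, "-s", "ACCRA", "tmp"),
    LexiconRow("LightVow", True, True, "-s", "MARJA", ""),
    LexiconRow("HeavyVow", False, False, "-as", "NYSTØ", "tmp"),
    LexiconRow("HeavyCns", False, False, "-as", "BERN", ""),
    LexiconRow("LightCns", False, False, "-is", "LONDON", ""),
    LexiconRow("-nen", False, False, "-as/nenis", "C-FI-NEN", ""),
]


def suggest_lexicon(stem_coda: str, cg: bool) -> Optional[str]:
    """Return the lexicon of the first matching row, or None if no row matches."""
    for row in ROWS:
        if row.stem_coda == stem_coda and row.cg == cg:
            return row.lexicon
    return None


if __name__ == "__main__":
    print(suggest_lexicon("LightVow", True))
    print(suggest_lexicon("HeavyCns", False))
```

A suggestion produced this way would still need the manual check described above, since the table does not cover every stem type.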
GUOLBBA ! Trisyll. Inanim. Gradating 0-Nouns
DUORTNUS ! Cns-final, cons.grad.
ANAR ! Cns-fin, no cons.grad.
HEANDARAT ! Bisyll. Non-Gradating C-Proper Names
NIILLAS ! Trisyll. Non-Gradating C-Proper Names
GEAVNNIS
Plural names
VARGGAT
ALEUHTAT
SULLOT
EATNAMAT
HEANDARAT
User:
Application:
User:
Application:
+-----------------+
| select language | (pop-up, menu from langmenu.xq)
+-----------------+
+-----------+
| |
| paste |
| names |
| here |
| |
+-----------+
+------------------+
| select cont-lex | (pop-up menu, includes option 'unspecified')
+------------------+
+------------------+
| select sem-tag | (pop-up menu, default option 'unspecified')
+------------------+
+----------+
| SUBMIT |
+----------+
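What the application might do with a submitted form can be sketched as follows. This is only an assumption about the processing, not a description of the existing interface: the class `NameSubmission`, the function `parse_pasted_names`, the one-name-per-line input and the removal of duplicates are all hypothetical.

```python
from dataclasses import dataclass

UNSPECIFIED = "unspecified"  # default option in both pop-up menus


@dataclass
class NameSubmission:
    lang: str
    name: str
    cont_lex: str = UNSPECIFIED
    sem_tag: str = UNSPECIFIED


def parse_pasted_names(lang, pasted, cont_lex=UNSPECIFIED, sem_tag=UNSPECIFIED):
    """Turn the pasted text (one name per line) into one submission per name."""
    seen = set()
    submissions = []
    for line in pasted.splitlines():
        name = line.strip()
        # Skip blank lines and names already seen in this paste.
        if not name or name in seen:
            continue
        seen.add(name)
        submissions.append(NameSubmission(lang, name, cont_lex, sem_tag))
    return submissions


if __name__ == "__main__":
    for item in parse_pasted_names("sme", "Piera\n\n  Niillas \nPiera\n", cont_lex="NIILLAS"):
        print(item)
```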
User:
Application:
Normally changing is done on single entries, but occasionally groups of names can be changed.
Something similar to what we have for SD-terms now
Similar to the “Add info to group” above (adding and changing are conceptually different actions, but identical when it comes to implementation), but populates the fields in the form with values taken from elements that have only one unique value across all found entries (that is, the fields corresponding to elements with several values across entries are left empty).
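The pre-filling rule can be sketched like this, assuming each found entry is represented as a simple dictionary from element name to value (a simplification of the real XML entries; the function name `common_field_values` is made up):

```python
def common_field_values(entries, fields):
    """Pre-fill a field only when all entries agree on its value."""
    form = {}
    for field in fields:
        # Collect the distinct values of this field across all entries.
        values = {entry.get(field, "") for entry in entries}
        # One distinct value: use it. Several (or none): leave the field empty.
        form[field] = next(iter(values)) if len(values) == 1 else ""
    return form


if __name__ == "__main__":
    found = [
        {"lang": "sme", "cont_lex": "MARJA", "sem": "fem"},
        {"lang": "sme", "cont_lex": "ACCRA", "sem": "fem"},
    ]
    print(common_field_values(found, ["lang", "cont_lex", "sem"]))
```

In this example `lang` and `sem` would be pre-filled, while `cont_lex` stays empty because the two entries disagree.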
This one is the one needed for efficient correction of parallel names earlier treated as instances of multiple, monolingual names.
This one requires that it is possible to look at two lists at the same time.
Take a list of parallel names, and check whether they are really linked to the same concept; if not, link them.
This can be done similarly to how it is done in SD-terms.
When we considered the many-to-many relationship between names in different languages (one “concept” can have different names in different languages, and one name in one language can have more than one meaning or “concept”), we did not foresee that the forms from all majority languages may just as well be used in the minority language, alongside the minority-language form. Example:
<entry id="Piera">
<sem>
<mal/>
</sem>
<langentry lang="sme" ref="Piera"/>
<langentry lang="nob" ref="Per"/>
<langentry lang="swe" ref="Pär"/>
<langentry lang="nno" ref="Per"/>
<langentry lang="fin" ref="Piera"/>
</entry>
<!--Petri? Georg/Yrjö/Jyrki Snellman/Virkkunen Genetz/Jännes,
Hällsten/Paasikivi-->
The equivalence between names in different languages can be classified as follows:
Per Klemetsen lea Helssegis.
Piera - Per = weak equivalence = fem, mal
Helsset - Helsinki = strong equivalence = plc
(not absolute, cf. Karasjok Produkter in a Saami text)
Samisk høgskole - Sámi Állaskuvla = Absolute equivalence = org
Strategy summary:

| Equiv. strength <=> sem. tag | compilation action |
| --- | --- |
| Weak <=> mal, fem, sur | Export all langentries to the sme transducer. |
| Strong <=> plc | Export all langentries to sme (but perhaps discard when needed) |
| Absolute <=> org | Export only the explicit sme entries to the sme transducer |
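A minimal sketch of the strategy summary, applied to entries shaped like the `Piera` example above. It assumes that the semantic tags are the child elements of `<sem>`, treats strong equivalence like weak equivalence (the "perhaps discard" step is left out), and lets an `org` tag win if several tags are present. The function name `refs_for_transducer` and the second entry in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET

ABSOLUTE = {"org"}  # absolute equivalence: export only the target-language entries


def refs_for_transducer(entry_xml, target_lang="sme"):
    """Return the name forms of one entry to export to the target transducer."""
    entry = ET.fromstring(entry_xml)
    sem = entry.find("sem")
    tags = {child.tag for child in sem} if sem is not None else set()
    langentries = entry.findall("langentry")
    if tags & ABSOLUTE:
        # Absolute equivalence: keep only the explicit target-language forms.
        langentries = [le for le in langentries if le.get("lang") == target_lang]
    refs = []
    for le in langentries:
        ref = le.get("ref")
        if ref not in refs:  # the same form may be listed for several languages
            refs.append(ref)
    return refs


if __name__ == "__main__":
    piera = """<entry id="Piera">
      <sem><mal/></sem>
      <langentry lang="sme" ref="Piera"/>
      <langentry lang="nob" ref="Per"/>
      <langentry lang="swe" ref="Pär"/>
      <langentry lang="nno" ref="Per"/>
      <langentry lang="fin" ref="Piera"/>
    </entry>"""
    school = """<entry id="SamiAllaskuvla">
      <sem><org/></sem>
      <langentry lang="sme" ref="Sámi Állaskuvla"/>
      <langentry lang="nob" ref="Samisk høgskole"/>
    </entry>"""
    print(refs_for_transducer(piera))
    print(refs_for_transducer(school))
```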
More examples:
Peras / *Pieras ii leat vejolašvuohta...
Nils lea dappe, Niillas lea maid bohtán.
The following points need consideration:
split the names
weak association linkage document
=> linking surnames, linking first names (Pekka appendices)
=> linking synonyms (Jovnna-Ánde Vest)
=> linking hypo- and hyperonyms (Nielsen, last part, Wordnet)
Synonyms in SD-terms are stored as links between entries in the language files.
Conversion from one lg-unspecified list (35000 names) to our future system:
either 1. a. find correspondence sets
          b. multiply the rest
or     2. a. multiply all <=====================
          b. unify and prune when needed
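One possible reading of "multiply all" is to give every name in the lg-unspecified list one `langentry` per language, all with the same form, and to unify and prune the result later. The sketch below follows that reading only; the function name `multiply_name` and the use of the name itself as entry `id` are assumptions, and no `<sem>` element is generated because the list carries no semantic information.

```python
import xml.etree.ElementTree as ET


def multiply_name(name, langs):
    """Build one entry with a langentry for each language, all using the same form."""
    entry = ET.Element("entry", {"id": name})
    for lang in langs:
        ET.SubElement(entry, "langentry", {"lang": lang, "ref": name})
    return entry


if __name__ == "__main__":
    langs = ["sme", "nob", "swe", "nno", "fin"]
    for name in ["Bern", "London"]:
        print(ET.tostring(multiply_name(name, langs), encoding="unicode"))
```

Pruning would then amount to removing the `langentry` elements that turn out to be wrong for a given language, and unifying to merging entries that belong to the same correspondence set.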
Finding correspondence sets
<?xml version='1.1' encoding="UTF-8"?>