GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.
Til stades: BM, Cip, Trond
Saksliste:
Ciprian har laga tre katalogar
2x2-evaluering:
Gt-data (=nobsme-ordboka) nob-del og sme-del mot norsk og samisk freksvensliste (hvilken frekvensliste skal brukes?)
Fad-data (=nye fra fad-prosjektet) nob-del og sme-del mot norsk og samisk freksvensliste
Frekvensdata for norsk:
$GTBIG/langs/nob/nowac/nowac-1-1.1.lemmas.freq
Same fil er tilgjengeleg på nettet, og blir der med ujamne mellomrom oppdatert (i dag har vi siste versjon).
Frekvensdata for samisk: på divvun.no:
../hoavda/Public/corp/analysed/2013-04-11/sme*.dis
Ulike typar overlapping:
Evaluere term-F opp i mot administrative termar i risten.no.
TILTAK
Frekvensinfo:
fordeling i ulike frekvenskohortar? tekniske termar?
Kanskje dei samiske orda i fad-korpuset er henta frå daglegspråket til ein større grad enn dei norske orda
main/words/dicts/nobsme/terms/admin/src_fad-only
** data som kun finnes i fadsrc_fad-only>grep 'mg_c' N_nobsme.xml | sort | uniq -c | sort -nr
1470 <e src="fad" mg_c="2">
357 <e src="fad" mg_c="3">
106 <e src="fad" mg_c="4">
38 <e src="fad" mg_c="5">
18 <e src="fad" mg_c="6">
9 <e src="fad" mg_c="7">
4 <e src="fad" mg_c="8">
1 <e src="fad" mg_c="9">
1 <e src="fad" mg_c="10">
src_fad-only>grep 'mg_c' V_nobsme.xml | sort | uniq -c | sort -nr
58 <e src="fad" mg_c="2">
25 <e src="fad" mg_c="3">
9 <e src="fad" mg_c="4">
3 <e src="fad" mg_c="5">
1 <e src="fad" mg_c="8">
1 <e src="fad" mg_c="6">
1 <e src="fad" mg_c="15">
src_fad-only>grep 'mg_c' A_nobsme.xml | sort | uniq -c | sort -nr
8 <e src="fad" mg_c="3">
4 <e src="fad" mg_c="5">
4 <e src="fad" mg_c="2">
1 <e src="fad" mg_c="7">
main/words/dicts/nobsme/terms/admin/src_fad-gt_commons
** = fad-data som overlapper (nob) med gt-datamain/words/dicts/nobsme/src_gt-fad_commons
** = gt-data som overlapper med fad-datamain/words/dicts/nobsme/src_gt-only
** = data som kun finnes i gtDigging for domain-specific terms in North Saami
In translation, one of the main problems is lexical selection, this is even more the case for translating specific terms. Such kind of terms can be looked up in special dictionaries. Yet unless majority languages, minority languages lack special dictionaries.
We report on the onging work of building terminology resources for North Saami by using Norwegian Bokmål-North Saami parallel texts provided by the Saami Parliament. Using established computational linguistics methods such as sentence and word alignment, the output of the processing pipeline is then evaluated and manually selected by native speakers. Comparing the current data in Giellatekno’s Norwegian Bokmål-North Saami dictionary with the data acquired from the parallel texts, both new lemma pairs and new meanings of the already existing lexical material can be discovered. How much new and/or specific material is found is a question of the very final evaluation.
As for the approach, we are aware that terms are the result of normative work but our methods are definitely descriptive. Nevertheless, the resources developed in our project are useful both for users interested in looking up a specific term and for professional translators using Computer-Assisted Translation methods.
Authors: