GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

View GiellaLT on GitHub

Page Content

FAD-møte 20.6.

Til stades: BM, Cip, Trond

Saksliste:

Status quo

Ciprian har laga tre katalogar

Enare

Moment til artikkelen

2x2-evaluering:

Gt-data (=nobsme-ordboka) nob-del og sme-del mot norsk og samisk freksvensliste (hvilken frekvensliste skal brukes?)

Fad-data (=nye fra fad-prosjektet) nob-del og sme-del mot norsk og samisk freksvensliste

Frekvensdata

Frekvensdata for norsk:

$GTBIG/langs/nob/nowac/nowac-1-1.1.lemmas.freq

Same fil er tilgjengeleg på nettet, og blir der med ujamne mellomrom oppdatert (i dag har vi siste versjon).

nowac-1-1.1.lemmas.freq

Frekvensdata for samisk: på divvun.no:

../hoavda/Public/corp/analysed/2013-04-11/sme*.dis

Moment til artikkelen

Ulike typar overlapping:

felles lemma ulik omsetjing

felles lemma og minst ei omsetjing er lik

felles lemma og felles omsetjing (identitet)

Evaluere term-F opp i mot administrative termar i risten.no.

TILTAK

kva kan vi seie om nob-orda?

Frekvensinfo:

fordeling i ulike frekvenskohortar? tekniske termar?

Samiskdelen

Kanskje dei samiske orda i fad-korpuset er henta frå daglegspråket til ein større grad enn dei norske orda

Unifisere

src_fad-only>grep 'mg_c' N_nobsme.xml | sort | uniq -c | sort -nr
1470    <e src="fad" mg_c="2">
 357    <e src="fad" mg_c="3">
 106    <e src="fad" mg_c="4">
  38    <e src="fad" mg_c="5">
  18    <e src="fad" mg_c="6">
   9    <e src="fad" mg_c="7">
   4    <e src="fad" mg_c="8">
   1    <e src="fad" mg_c="9">
   1    <e src="fad" mg_c="10">

   
   src_fad-only>grep 'mg_c' V_nobsme.xml | sort | uniq -c | sort -nr
  58    <e src="fad" mg_c="2">
  25    <e src="fad" mg_c="3">
   9    <e src="fad" mg_c="4">
   3    <e src="fad" mg_c="5">
   1    <e src="fad" mg_c="8">
   1    <e src="fad" mg_c="6">
   1    <e src="fad" mg_c="15">

   
   src_fad-only>grep 'mg_c' A_nobsme.xml | sort | uniq -c | sort -nr
   8    <e src="fad" mg_c="3">
   4    <e src="fad" mg_c="5">
   4    <e src="fad" mg_c="2">
   1    <e src="fad" mg_c="7">

For referanse: Abstractet

Digging for domain-specific terms in North Saami

In translation, one of the main problems is lexical selection, this is even more the case for translating specific terms. Such kind of terms can be looked up in special dictionaries. Yet unless majority languages, minority languages lack special dictionaries.

We report on the onging work of building terminology resources for North Saami by using Norwegian Bokmål-North Saami parallel texts provided by the Saami Parliament. Using established computational linguistics methods such as sentence and word alignment, the output of the processing pipeline is then evaluated and manually selected by native speakers. Comparing the current data in Giellatekno’s Norwegian Bokmål-North Saami dictionary with the data acquired from the parallel texts, both new lemma pairs and new meanings of the already existing lexical material can be discovered. How much new and/or specific material is found is a question of the very final evaluation.

As for the approach, we are aware that terms are the result of normative work but our methods are definitely descriptive. Nevertheless, the resources developed in our project are useful both for users interested in looking up a specific term and for professional translators using Computer-Assisted Translation methods.

Authors: