GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.


Shared resources

Some resources can be shared across languages. There are two types:

Build instructions

The build instructions are found in giella-core, and they are required for the infrastructure to function properly. giella-core is always cloned automatically if it does not already exist in the default or specified location.

Linguistic data

There are two types of shared linguistic data:

Resources common to many languages

By default, all language repositories get data from shared-mul, which contains some lexical data believed to be useful to all languages, like symbols and emojis. The setup consists of two parts:

An inclusion specification in configure.ac:

gt_USE_SHARED([common], [shared-mul], [giella-shared-mul])
AM_CONDITIONAL([HAVE_SHARED_COMMON], [test x$gt_SHARED_common != xfalse])

and processing instructions in relevant Makefile.am files, e.g. in src/fst/Makefile.am:

# change handling of shared lexical data here:
if HAVE_SHARED_COMMON
url.tmp.lexc: $(gt_SHARED_common)/src/fst/url.lexc
    $(AM_V_CP)cp -f $< $@

generated_files/mul-$(GLANG)-%.lexc: $(gt_SHARED_common)/src/fst/stems/%.lexc
    $(AM_V_at)$(MKDIR_P) generated_files
    $(AM_V_CP)cp -f $< $@
else
# this is "safe" fallback (compiles but you miss everything)
url.tmp.lexc:
    echo "LEXICON Root" > $@
    echo "< h t t p (s) %: %/ %/ ?*> ## ;" >> $@

generated_files/mul-$(GLANG)-%.lexc:
    $(AM_V_at)$(MKDIR_P) generated_files
    echo "! Missing shared common data" > $@
endif
# add other lexical shared data handling here

Note the else clause: it provides a safe fallback in case the shared resource is unavailable. With the fallback the build still compiles, but without the shared data.

Add more sections like the above if you need or want to include more shared resources. A list of repositories with shared linguistic resources can be found here.
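
As a sketch of such an additional section, assume a hypothetical shared repository called shared-xxx that keeps its stem files in src/fst/stems/, like shared-mul. The name xxx, the repository and the third macro argument are placeholders chosen by analogy with the shared-mul setup, not real resources. The configure.ac part could then look like this:

gt_USE_SHARED([xxx], [shared-xxx], [giella-shared-xxx])
AM_CONDITIONAL([HAVE_SHARED_XXX], [test x$gt_SHARED_xxx != xfalse])

with a matching rule and fallback in src/fst/Makefile.am:

# hypothetical shared resource: replace xxx with the real name
# (recipe lines in a Makefile must start with a tab character)
if HAVE_SHARED_XXX
generated_files/xxx-$(GLANG)-%.lexc: $(gt_SHARED_xxx)/src/fst/stems/%.lexc
	$(AM_V_at)$(MKDIR_P) generated_files
	$(AM_V_CP)cp -f $< $@
else
# fallback: write a comment-only placeholder file so the build still succeeds
generated_files/xxx-$(GLANG)-%.lexc:
	$(AM_V_at)$(MKDIR_P) generated_files
	echo "! Missing shared xxx data" > $@
endif

This assumes that gt_USE_SHARED derives the variable gt_SHARED_xxx from its first argument in the same way that it yields gt_SHARED_common for common. Check the macro definition in giella-core before relying on that.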

Resources in one language used by another language

In addition to sharing resources common to many languages, one language can also share its resources with other languages. The Sámi languages use this to avoid duplicated data and tangled maintenance. For example, place names from all over Sápmi are useful in all the Sámi languages, but maintaining a separate list of these names in each language repo is a waste of time and prone to errors.

Thus, we maintain all SME (North Sámi) names in the SME repo, and then include these names in the other Sámi repositories.

The setup is much like the one described above for shared resources, with one additional step: the included data is processed to fit the setup of the including language. This can involve changing some multichar symbols, continuation lexicons, etc.

In the case of Sámi names, the inclusion is done according to the following algorithm: