GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.


Outline of major goals

More detailed plans and progress

There are some details further down, but the meat of the plan is found on a separate page. The same goes for the progress.

Other goals

Some details

Dir structure

The basic dir structure could be something like this:

$GTHOME/
        gtcore/
                scripts/    ## the old gt/script/ dir
                mk-files/   ## shared core mk-files
                templates/  ## src file templates and dir structure
                shared/     ## old common/ - shared linguistic src files
        gtlangs/
                sme/
                smj/
                sma/
                fao/
                kom/
                langgroups/
                           smi/
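
The layout above could be bootstrapped with a short script. What follows is a minimal sketch in Python, not existing tooling: the function name `create_tree` and its return value are assumptions made for illustration, and only the directory names are taken from the tree above.

```python
from pathlib import Path

# Directory names come from the proposed layout; everything else
# (function name, return value) is hypothetical.
CORE_DIRS = ["scripts", "mk-files", "templates", "shared"]
LANG_DIRS = ["sme", "smj", "sma", "fao", "kom", "langgroups/smi"]


def create_tree(root):
    """Create the proposed gtcore/ and gtlangs/ skeleton under root."""
    root = Path(root)
    created = []
    for parent, children in (("gtcore", CORE_DIRS), ("gtlangs", LANG_DIRS)):
        for child in children:
            path = root / parent / child
            # parents=True builds intermediate dirs; exist_ok makes reruns safe.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `create_tree` on the directory that `$GTHOME` points to would produce the skeleton; running it again leaves existing directories untouched.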

Comments: the dirs in $GTHOME/gtcore/ are intended to be used as follows:

Longer term, one can consider the following additions:

$GTHOME/
        gtcore/         ## as above
        gtlangs/        ## as above
        gtlangpairs/    ## language pairs, typically dictionaries and MT
        gtlanggroups/   ## multilingual resources, typically terminology
                        ## collections and shared name resources

The idea is to gather resources that are specific to the given language pairs within these directories. They should also serve as the starting point for “CS’s Dream” (Cip’s and Sjur’s Dream), where all monolingual information is stored in gtlangs/, and all multilingual information is stored in one of the two dirs indicated above. Language pair names are directional, indicating the source and target languages.

In this scenario, resources for an MT application would then probably be divided among three dir trees: gtlangs/ for the monolingual resources, gtlangpairs/ for the transfer dictionaries, and gtlanggroups/ for terminology resources.
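
To make the three-way split concrete, here is a small Python sketch that lists where an MT application would look for its resources. The pair directory naming (`<source>-<target>`), the function name `mt_resource_dirs`, and the idea of passing the language group explicitly are all assumptions; the text above only states that pair names are directional.

```python
from pathlib import PurePosixPath


def mt_resource_dirs(gthome, source, target, group):
    """Return the directories an MT application would draw on."""
    root = PurePosixPath(gthome)
    return {
        # Monolingual resources for both languages of the pair.
        "monolingual": [root / "gtlangs" / source, root / "gtlangs" / target],
        # Assumption: pair dirs are named "<source>-<target>", so the
        # reverse direction lives in a different directory.
        "transfer": root / "gtlangpairs" / f"{source}-{target}",
        # Terminology shared across a group of languages.
        "terminology": root / "gtlanggroups" / group,
    }
```

With this naming, `mt_resource_dirs(gthome, "sme", "smj", "smi")` and the reverse direction share the monolingual and terminology directories but differ in the transfer directory.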

Filenames and extension

Filenames need to be standardised, as well as the use of filename extensions. The extension should reflect the content type. A possible list of extensions could be:

There are probably other file types we need to handle; add more extensions here as needed.

Language codes

So far we have used ISO 639-2 codes for all languages, both in dir names and as part of file names. We should probably move to (the relevant subparts of) proper locale codes, following the standards used by the rest of the world. This means changing all sme strings to se, nob to nb, etc.
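
The code change can be sketched as a simple lookup in Python. Only the sme and nob mappings come from the text above; the fao, kal and kom entries are the corresponding ISO 639-1 codes, and the fallback to the three-letter code for languages without a two-letter code (such as smj and sma) is an assumption based on common locale practice. The function name `to_locale_code` is hypothetical.

```python
# Three-letter codes currently in use -> two-letter ISO 639-1 codes.
# smj and sma have no two-letter code, so they are deliberately absent.
ISO_TO_LOCALE = {
    "sme": "se",
    "nob": "nb",
    "fao": "fo",
    "kal": "kl",
    "kom": "kv",
}


def to_locale_code(code):
    """Return the two-letter code if one exists, else the code unchanged."""
    return ISO_TO_LOCALE.get(code, code)


def rename_plan(dir_names):
    """Map each existing dir name to its new name, skipping unchanged ones."""
    return {
        name: to_locale_code(name)
        for name in dir_names
        if to_locale_code(name) != name
    }
```

A rename plan built this way lists only the directories that actually need to move, which keeps the change reviewable before anything is renamed.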

Execution plan

“Small” and “big” in the list below refer to the size of the linguistic resources, which, simplifying a bit, is roughly the number of lexc entries.

  1. start small - only one language. First language is fao, which is reasonably big but still not too complex. Create the basic dir tree, and use svn copy to copy over the fao sources, so that the old fao dir remains intact and usable all the time (only when everything is working ok, the old dir will be removed).
  2. get all the basic infrastructure and build functionality to work for the needs of fao
  3. add another language, probably a small one, to test the multilingual behaviour as well as templating system (the small language will most likely not have all features that fao has)
  4. add a third language - a big one this time, e.g. kal (which is using xfst instead of twolc and thus provides a slightly new use case). Make sure all build targets are working as they should, and extend the build system, template files, etc. as needed. kal probably has more requirements than fao.
  5. then add one language at a time, all the time ensuring that everything is working for all languages, and that the small languages automatically pick up new functionality from the big ones as the template dir is expanded to follow the big languages being added
  6. gradually remove the old language dirs as the new location and build infrastructure becomes stable, also forcing the whole group to start using the new infrastructure. This is important to get feedback and correct bugs.

Testing the remake

We need to ensure that nothing changes in terms of the output of the transducers as part of the remake - unless there are some intended changes (e.g. unifying tags across languages). It is probably best to first do the infra remake, and then later do such tag unifying in the output. So what we need for each language is:

The testing then amounts to ensuring that the output is the same from both the old and new transducers. This should guarantee stability in the output, and thus reliability from a linguistic point of view.
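
The comparison itself can be sketched in a few lines of Python. This assumes the output of each transducer has already been collected into a mapping from input word form to its list of analyses; how that output is produced is outside the sketch, and the function name `diff_outputs` is hypothetical.

```python
def diff_outputs(old_output, new_output):
    """Return {word: (old_analyses, new_analyses)} for every word whose
    set of analyses differs between the old and new transducer."""
    differences = {}
    # Include words that only one of the two transducers analysed.
    for word in sorted(set(old_output) | set(new_output)):
        old = set(old_output.get(word, ()))
        new = set(new_output.get(word, ()))
        if old != new:
            differences[word] = (sorted(old), sorted(new))
    return differences
```

An empty result means the two transducers agree on every tested input; any entry points directly at the word form and the analyses that changed.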

There might be problems with this testing scenario in cases where we want to change tags as part of the infrastructure remake. One example could be that we want to standardise some of the compounding tags, to ensure that a compound filter works the same for all languages. Another is that some tags that are visible now will be removed from the output of the new transducers, since they really should not be part of the output even in the old transducers (e.g. the +Der1 tags).
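
One possible way around this is to normalise the old output before comparing, so that intended tag changes do not show up as differences. The Python sketch below assumes analyses have the form lemma+Tag+Tag, with no plus sign inside the lemma; the function name `normalise` and the rename mapping are hypothetical, and only the +Der1 example comes from the text above.

```python
def normalise(analysis, removed=("Der1",), renamed=None):
    """Apply intended tag changes to one analysis string.

    removed: tags that the new transducers no longer output.
    renamed: mapping from old tag name to standardised tag name.
    """
    renamed = renamed or {}
    # Assumption: the first '+'-separated field is the lemma.
    lemma, *tags = analysis.split("+")
    kept = [renamed.get(tag, tag) for tag in tags if tag not in removed]
    return "+".join([lemma, *kept])
```

Working on whole tags rather than substrings avoids accidentally touching a longer tag that merely starts with the same characters.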

Open questions

Applications - top-level or bottom-level?

Should we build end-user applications in a separate dir tree, one tree for each application, or should the applications be included in the regular language dirs? As long as the application basically only involves one technology and a few files, it would probably seem easiest to build the application as part of the other builds for that language. One such example is spell checkers, which basically are an application of normative transducers.

But as soon as the application builds on multiple technologies, requires several installation packages for different platforms and a multitude of files as part of the application, it might get more complicated. In this case it might be easier to maintain separate directory trees for each application. Take Oahpa as an example, which uses several transducers, disambiguators, SQL data, and user interface files.

There is no easy answer to this; we probably have to try both and see how things develop.