GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.


Edmonton presentation

University of Alberta, Edmonton, June 8th & 13th 2015

Sjur Moshagen, UiT The Arctic University of Norway

Overview of the presentation

Background and goals

Background

Goals

General principles

Be explicit (use non-cryptic catalogue and file names)

Be clear (files should be found in non-surprising locations)

Be consistent (identical conventions in all languages as far as possible)

Be modular

Divide language-dependent and language-independent code

Reuse resources

Build all tools for all languages

… but only as much as you want (parametrised build process)

Bird’s Eye View and Down

The House

[../images/hus_eng_2015.png]

The House and the Infra

[../images/hus_eng_2015_with_infra.png]

$GTHOME - directory structure

Some less relevant dirs removed for clarity:

$GTHOME/                     ## root directory, can be named whatever
├── experiment-langs         ## language dirs used for experimentation
├── giella-core              ## $GTCORE - core utilities
├── giella-shared            ## shared linguistic resources
├── giella-templates         ## templates for maintaining the infrastructure
├── keyboards                ## keyboard apps organised roughly as the language dirs
├── langs                    ## The languages being actively developed, such as:
│   ├─[...]                  #
│   ├── crk                  ## Plains Cree
│   ├── est                  ## Estonian
│   ├── evn                  ## Evenki
│   ├── fao                  ## Faroese
│   ├── fin                  ## Finnish
│   ├── fkv                  ## Kven
│   ├── hdn                  ## Northern Haida
│   └─[...]                  #
├── ped                      ## Oahpa etc.
├── prooftools               ## Libraries and installers for spellers and the like
├── startup-langs            ## Directory for languages in their start-up phase
├── techdoc                  ## technical documentation
├── words                    ## dictionary sources
└── xtdoc                    ## external (user) documentation & web pages

Organisation - Dir Structure

.
├── src                  = source files
│   ├── filters          = adjust fst's for special purposes
│   ├── hyphenation      = nikîpakwâtik >  ni-kî-pa-kwâ-tik
│   ├── morphology       =
│   │   ├── affixes      = prefixes, suffixes
│   │   └── stems        = lexical entries
│   ├── orthography      = latin -> syllabics, spellrelax
│   ├── phonetics        = conversion to IPA
│   ├── phonology        = morphophonological rules
│   ├── syntax           = disambiguation, synt. functions, dependency
│   ├── tagsets          = get your tags as you want them
│   └── transcriptions   = convert number expressions to text or v.v.
├── test                 =
│   ├── data             = test data
│   └── src              = tests for the fst's in the src/ dir
└── tools                =
    ├── grammarcheckers  = prototype work, only SME for now
    ├── mt               = machine translation
    │   └── apertium     = ... for certain MT platforms
    ├── preprocess       = split text in sentences and words
    └── spellcheckers    = spell checkers are built here

Technologies

Technology for morphological analysis

We presently use three different finite-state technologies: Xerox (lexc, twolc, xfst), HFST and Foma.

Technology for syntactic parsing

## We like finite verbs:
SELECT:Vfin VFIN ;
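Syntactic analysis is written as Constraint Grammar rules like the SELECT rule above: SELECT keeps a reading, REMOVE discards one, and contextual tests refer to neighbouring words by position. A second, purely illustrative rule (the set names N and Det are hypothetical here):

```cg3
## Discard noun readings immediately after an unambiguous determiner:
REMOVE:DetNoun (N) IF (-1C (Det)) ;
```

Here -1C means "the word to the left, all of whose readings are Det".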

Templated Build Structure And Source Files

[../images/newinfra.png]

Configurable builds

We support many different tools and targets, but in most cases one only wants a handful of them. When running ./configure, you get at the end a summary of the features that are turned on and off:

$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:


  -- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
  -- Xerox is default on, the others off unless they are the only one present --
  * build Xerox fst's: yes
  * build HFST fst's: yes
  * build Foma fst's: no


  -- basic packages (on by default): --
  * analysers enabled: yes
  * generators enabled: yes
  * transcriptors enabled: yes
  * syntactic tools enabled: yes
  * yaml tests enabled: yes
  * generated documentation enabled: yes


  -- proofing tools (off by default): --
  * spellers enabled: no
    * hfst speller fst's enabled: no
    * foma speller enabled: no
    * hunspell generation enabled: no
  * fst hyphenator enabled: no
  * grammar checker enabled: no


  -- specialised fst's (off by default): --
  * phonetic/IPA conversion enabled: no
  * dictionary fst's enabled: no
  * Oahpa transducers enabled: no
    * L2 analyser: no
    * downcase error analyser: no
  * Apertium transducers enabled: no
  * Generate abbr.txt: no


For more ./configure options, run ./configure --help
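A typical build cycle in a language directory then looks roughly as follows (a sketch; the autogen.sh step and the langs/crk path are assumptions based on the directory structure above):

```shell
cd $GTHOME/langs/crk      # enter the language directory
./autogen.sh              # first time only: generate the configure script
./configure --with-hfst   # select tools and targets, cf. the summary above
make                      # build the enabled analysers and generators
make check                # run the test suites
```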

The build - schematic

[../images/new_infra_build_overview.png]

Closer View Of Selected Parts:

Documentation

Testing

From Source To Final Tool: Relation Between Lexicon, Build And Speller

Closer View: Documentation

Background

Implementation

Example cases:

Documentation:

Closer View: Testing

Testing Framework

All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.

All tests are run with a single command:

make check

Autotools gives a PASS or FAIL on each test as it finishes:

[../images/make_check_output.png]

Yaml Tests

These are the most used tests, and are named after the syntax of the test files. The core syntax is:

Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup


Tests:
  Noun - mihkw - ok : ## -m inanimate noun, blood, Wolvengrey
     mihko+N+IN+Sg: mihko
     mihko+N+IN+Sg+Px1Sg: nimihkom
     mihko+N+IN+Sg+Px2Sg: kimihkom
     mihko+N+IN+Sg+Px1Pl: nimihkominân
     mihko+N+IN+Sg+Px12Pl: kimihkominaw
     mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
     mihko+N+IN+Sg+Px3Sg: omihkom
     mihko+N+IN+Sg+Px3Pl: omihkomiwâw
     mihko+N+IN+Sg+Px4Pl: omihkomiyiw

Yaml test output

[../images/make_check_output.png]

In-Source Tests

LexC tests

As an alternative to the yaml tests, one can specify similar test data within the source files:

LEXICON MUORRA !!= @CODE@ Standard even-syllable stems with consonant gradation (note Q1). Note: Nouns with invisible 3>2 consonant gradation (like bus'sa) go to this lexicon.
 +N:   MUORRAInfl ;
 +N:%> MUORRACmp  ;


!!€gt-norm: kárta ## Even-syllable test
!!€ kártta:         kártta+N+Sg+Nom
!!€ kártajn:        kártta+N+Sg+Com

Such tests are very useful as checks that an inflectional lexicon behaves as it should.

The syntax is slightly different from that of the yaml files, as the example above shows.
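The analyses being tested come from entries continuing to lexicons like MUORRA above. A lexc entry generally has the shape upper:lower CONTLEX ; so a stem entry feeding MUORRA could look like this (a hypothetical entry; the quoted gloss is optional):

```lexc
LEXICON Nouns
muorra:muorra MUORRA "tree" ;
```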

Twolc tests

The twolc tests look like the following:

!!€ iemed9#
!!€ iemet#


!!€ gål'leX7tj#
!!€ gål0lå0sj#

The point is to ensure that the rules behave as they should.
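Each test gives the underlying lexical form first and the expected surface form below it, with 0 marking a zero realisation. A minimal twolc rule consistent with the first pair above, purely as a sketch (the alphabet and the trigger symbol 9 are assumptions):

```twolc
Alphabet
 d e i m t %# %9:0 ;

Rules

"d:t before the triggering morphophoneme 9"
d:t <=> _ %9: ;
```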

Other Tests

You can write any test you want, using your favourite programming language. There are a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.

Closer View: From Source To Final Tool:

Relation Between Lexicon, Build And Speller

Tag Conventions

We use certain tag conventions in the infrastructure:

Automatically Generated Filters

Dealing with descriptive vs normative grammars
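One central mechanism can be sketched as follows: forms that are descriptively correct but not part of the norm are tagged in the lexicon (GiellaLT uses error tags such as +Err/Orth), and automatically generated filters remove the tagged strings when building normative tools such as spellers, while descriptive analysers keep them. A filter of roughly this shape (xfst regular expression; the tag is one example):

```xfst
# Accept only strings that do NOT contain the error tag:
~$[ "+Err/Orth" ]
```

Composing this filter with the full lexical transducer yields the normative subset.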

Summary

Giitu (thank you)
