GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
University of Alberta, Edmonton, June 8th & 13th 2015
Sjur Moshagen, UiT The Arctic University of Norway
[../images/hus_eng_2015.png]
[../images/hus_eng_2015_with_infra.png]
*Machine translation: the fst's are built by the infra, the rest is handled by Apertium
*Speech synthesis is not (yet) built by the infra, though conversion to IPA is part of the infrastructure
*Supported: the fst's and syntactic parsers used are built by the infrastructure
Some less relevant dirs removed for clarity:
$GTHOME/ ## root directory, can be named whatever
├── experiment-langs ## language dirs used for experimentation
├── giella-core ## $GTCORE - core utilities
├── giella-shared ## shared linguistic resources
├── giella-templates ## templates for maintaining the infrastructure
├── keyboards ## keyboard apps organised roughly as the language dirs
├── langs ## The languages being actively developed, such as:
│ ├─[...] #
│ ├── crk ## Plains Cree
│ ├── est ## Estonian
│ ├── evn ## Evenki
│ ├── fao ## Faroese
│ ├── fin ## Finnish
│ ├── fkv ## Kven
│ ├── hdn ## Northern Haida
│ └─[...] #
├── ped ## Oahpa etc.
├── prooftools ## Libraries and installers for spellers and the like
├── startup-langs ## Directory for languages in their start-up phase
├── techdoc ## technical documentation
├── words ## dictionary sources
└── xtdoc ## external (user) documentation & web pages
.
├── src = source files
│ ├── filters = adjust fst's for special purposes
│ ├── hyphenation = nikîpakwâtik > ni-kî-pa-kwâ-tik
│ ├── morphology =
│ │ ├── affixes = prefixes, suffixes
│ │ └── stems = lexical entries
│ ├── orthography = latin -> syllabics, spellrelax
│ ├── phonetics = conversion to IPA
│ ├── phonology = morphophonological rules
│ ├── syntax = disambiguation, synt. functions, dependency
│ ├── tagsets = get your tags as you want them
│ └── transcriptions = convert number expressions to text or v.v.
├── test =
│ ├── data = test data
│ └── src = tests for the fst's in the src/ dir
└── tools =
    ├── grammarcheckers = prototype work, only North Saami (sme) for now
├── mt = machine translation
│ └── apertium = ... for certain MT platforms
├── preprocess = split text in sentences and words
└── spellcheckers = spell checkers are built here
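The preprocess step listed above (splitting text into sentences and words) can be sketched with standard shell tools. This is only an illustrative toy, not the actual preprocess program, which also handles abbreviations, multiword expressions and more:

```shell
# Toy tokenizer sketch: one token per line.
# The real tools/preprocess is far more elaborate; this only
# illustrates the basic idea of word and punctuation splitting.
tokenize () {
  printf '%s\n' "$1" |
    sed 's/\([.!?,]\)/ \1/g' |   # detach punctuation from words
    tr -s ' ' '\n'               # one token per line
}

tokenize "This is a test. Another sentence here."
```

Each word and each punctuation mark ends up on its own line, which is the input format the later analysis steps expect.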
We presently use three different technologies:
## We like finite verbs:
SELECT:Vfin VFIN ;
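In a Constraint Grammar stream each word form ("cohort") carries one or more candidate readings, and the SELECT rule above keeps only the Vfin readings in cohorts where one is present. As a toy illustration only (not the actual vislcg3 implementation), the effect can be mimicked like this:

```shell
# Toy stand-in for SELECT:Vfin (not vislcg3): within each cohort,
# if any reading carries Vfin, discard the readings that lack it.
select_vfin () {
  awk '
    /^"</ {                        # cohort head: flush previous cohort
      flush(); head = $0; n = 0; next
    }
    { readings[++n] = $0 }         # indented reading lines
    function flush(   i, hasvfin) {
      if (head == "") return
      print head
      hasvfin = 0
      for (i = 1; i <= n; i++) if (readings[i] ~ /Vfin/) hasvfin = 1
      for (i = 1; i <= n; i++)
        if (!hasvfin || readings[i] ~ /Vfin/) print readings[i]
      head = ""
    }
    END { flush() }
  '
}

select_vfin <<'EOF'
"<boahtá>"
	"boahtit" V Ind Prs Sg3 Vfin
	"boahtit" V Inf
EOF
```

With the ambiguous cohort above, only the finite reading survives; the infinitive reading is removed.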
$GTHOME/core/
[../images/newinfra.png]
We support a lot of different tools and targets, but in most cases one only wants a handful of them. When running ./configure, you get a summary of the things that are turned on and off at the end:
$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:
-- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
-- Xerox is default on, the others off unless they are the only one present --
* build Xerox fst's: yes
* build HFST fst's: yes
* build Foma fst's: no
-- basic packages (on by default): --
* analysers enabled: yes
* generators enabled: yes
* transcriptors enabled: yes
* syntactic tools enabled: yes
* yaml tests enabled: yes
* generated documentation enabled: yes
-- proofing tools (off by default): --
* spellers enabled: no
* hfst speller fst's enabled: no
* foma speller enabled: no
* hunspell generation enabled: no
* fst hyphenator enabled: no
* grammar checker enabled: no
-- specialised fst's (off by default): --
* phonetic/IPA conversion enabled: no
* dictionary fst's enabled: no
* Oahpa transducers enabled: no
* L2 analyser: no
* downcase error analyser: no
* Apertium transducers enabled: no
* Generate abbr.txt: no
For more ./configure options, run ./configure --help
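Some of the off-by-default targets can be switched on with the corresponding --enable flags. The exact flag names may vary between infrastructure versions, so treat the following as a sketch and consult ./configure --help for the authoritative list:

```shell
# Hypothetical example: turn on speller and grammar checker builds.
# Flag names are assumptions; verify with ./configure --help.
./configure --with-hfst --enable-spellers --enable-grammarchecker
```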
[../images/new_infra_build_overview.png]
*Documentation
*Testing
*From Source To Final Tool:
**Relation Between Lexicon, Build And Speller
Example cases:
Documentation:
All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.
All tests are run with a single command:
make check
Autotools gives a PASS or FAIL on each test as it finishes:
[../images/make_check_output.png]
The yaml tests are the most used, and are named after the syntax of the test files. The core syntax is:
Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup

Tests:
  Noun - mihkw - ok : # -m inanimate noun, blood, Wolvengrey
    mihko+N+IN+Sg: mihko
    mihko+N+IN+Sg+Px1Sg: nimihkom
    mihko+N+IN+Sg+Px2Sg: kimihkom
    mihko+N+IN+Sg+Px1Pl: nimihkominân
    mihko+N+IN+Sg+Px12Pl: kimihkominaw
    mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
    mihko+N+IN+Sg+Px3Sg: omihkom
    mihko+N+IN+Sg+Px3Pl: omihkomiwâw
    mihko+N+IN+Sg+Px4Pl: omihkomiyiw
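Conceptually, the yaml test runner feeds each analysis through the generator (and each surface form through the analyser) and compares the result with the expected value. A minimal sketch of that comparison loop, with a stub standing in for a real lookup against generator-gt-norm.hfst:

```shell
# Sketch of the yaml test loop: for each "analysis: surface" pair,
# generate the surface form and compare. 'generate' is a stub, not
# a real hfst/xerox lookup.
generate () {
  case "$1" in
    'mihko+N+IN+Sg')       echo 'mihko'    ;;
    'mihko+N+IN+Sg+Px1Sg') echo 'nimihkom' ;;
    *)                     echo '+?'       ;;  # unknown analysis
  esac
}

run_test () {  # $1 = analysis, $2 = expected surface form
  got=$(generate "$1")
  if [ "$got" = "$2" ]; then echo "PASS $1"; else echo "FAIL $1 (got $got)"; fi
}

run_test 'mihko+N+IN+Sg'       'mihko'
run_test 'mihko+N+IN+Sg+Px1Sg' 'nimihkom'
```

The real harness does the same in both directions, once per configured backend (hfst and xerox).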
[../images/make_check_output.png]
As an alternative to the yaml tests, one can specify similar test data within the source files:
LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon.
+N: MUORRAInfl ;
+N:%> MUORRACmp ;
!!€gt-norm: kárta ! Even-syllable test
!!€ kártta: kártta+N+Sg+Nom
!!€ kártajn: kártta+N+Sg+Com
Such tests are very useful as checks of whether an inflectional lexicon behaves as it should.
The syntax is slightly different from the yaml files:
The twolc tests look like the following:
!!€ iemed9#
!!€ iemet#
!!€ gål'leX7tj#
!!€ gål0lå0sj#
The point is to ensure that the rules behave as they should.
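Each pair shows an underlying (deep) form with morphophonemes above the surface form the rules should produce; e.g. the morphophoneme d9 surfaces as t in iemed9# → iemet#. As a toy illustration only (real twolc rules are declarative two-level constraints with contexts, not string rewrites), that first mapping could be mimicked like this:

```shell
# Toy stand-in for a twolc rule: realise the morphophoneme 'd9' as 't'.
# Real twolc rules are context-sensitive two-level constraints;
# this sed rewrite only illustrates the deep -> surface idea.
surface () { printf '%s\n' "$1" | sed 's/d9/t/g'; }

surface 'iemed9#'   # deep form in, surface form out
```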
You can write any test you want, using your favourite programming language. There are a number of shell scripts for testing speller functionality, and more tests will be added as the infrastructure develops.
We use certain tag conventions in the infrastructure:
*+Err/... (+Err/Orth, +Err/Cmp)
*+Sem/...
Everything carrying +Err/ or +Sem/ tags gets filters for different purposes automatically:
*filters removing the +Err/... tags
*filters removing the +Err/... strings
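The build uses such filters to derive variant fst's from one lexicon; e.g. a normative speller excludes all paths carrying +Err/Orth. A toy sketch of that kind of filtering over analysis strings (in the infrastructure this is done by composing filter fst's from src/filters/ onto the lexicon, not by grepping text):

```shell
# Toy tag filter: drop analyses marked +Err/Orth, keeping normative ones.
# The real filters are regular-expression fst's in src/filters/.
grep -v '+Err/Orth' <<'EOF'
viessu+N+Sg+Nom
viesso+N+Sg+Nom+Err/Orth
EOF
```

Only the normative analysis survives; the same lexicon can thus feed both a descriptive analyser (tags removed) and a normative speller (strings removed).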