Edmonton presentation
University of Alberta, Edmonton, June 8th & 13th 2015
Sjur Moshagen, UiT The Arctic University of Norway
Overview of the presentation
- background and goals
- bird’s eye view
- closer view of selected parts: documentation, testing, from source to final tool
Background and goals
- Background
- Goals
Background
- need for simpler maintenance
- scalability, both for languages and tools and for linguists and other developers
- developing NLP resources is a lot of work, and languages are complex - we need a tool and an infrastructure to handle the complexity in a manageable way
- keep technical details out of the way
- make the daily work as simple as possible
- division of labour
- Recognition: know the basic setup of one language - know the setup of them all
Goals
- easy support for many languages
- easy support for many tools
- keep language independent and language specific code apart
- easily upgradable
- the resources in our infrastructure should live on for decades or more
General principles
Be explicit (use non-cryptic catalogue and file names)
Be clear (files should be found in non-surprising locations)
Be consistent (identical conventions in all languages as far as possible)
Be modular
Divide language-dependent and language-independent code
Reuse resources
Build all tools for all languages
… but only as much as you want (parametrised build process)
Bird’s Eye View and Down
- the house
- organisation - directory structure
- technologies (xerox, hfst, foma + cg)
- templated build structure and source files
- configuration of builds
The House
[../images/hus_eng_2015.png]
The House and the Infra
[../images/hus_eng_2015_with_infra.png]
*Machine translation: fst's built by the infra, the rest handled by Apertium
*Speech synthesis is not (yet) built by the infra, though the conversion to IPA is part of the infrastructure
Supported: the fst's and syntactic parsers used are built by the infrastructure
$GTHOME - directory structure
Some less relevant dirs removed for clarity:
$GTHOME/ ## root directory, can be named whatever
├── experiment-langs ## language dirs used for experimentation
├── giella-core ## $GTCORE - core utilities
├── giella-shared ## shared linguistic resources
├── giella-templates ## templates for maintaining the infrastructure
├── keyboards ## keyboard apps organised roughly as the language dirs
├── langs ## The languages being actively developed, such as:
│ ├─[...] #
│ ├── crk ## Plains Cree
│ ├── est ## Estonian
│ ├── evn ## Evenki
│ ├── fao ## Faroese
│ ├── fin ## Finnish
│ ├── fkv ## Kven
│ ├── hdn ## Northern Haida
│ └─[...] #
├── ped ## Oahpa etc.
├── prooftools ## Libraries and installers for spellers and the like
├── startup-langs ## Directory for languages in their start-up phase
├── techdoc ## technical documentation
├── words ## dictionary sources
└── xtdoc ## external (user) documentation & web pages
Organisation - Dir Structure
.
├── src = source files
│ ├── filters = adjust fst's for special purposes
│ ├── hyphenation = nikîpakwâtik > ni-kî-pa-kwâ-tik
│ ├── morphology =
│ │ ├── affixes = prefixes, suffixes
│ │ └── stems = lexical entries
│ ├── orthography = latin -> syllabics, spellrelax
│ ├── phonetics = conversion to IPA
│ ├── phonology = morphophonological rules
│ ├── syntax = disambiguation, synt. functions, dependency
│ ├── tagsets = get your tags as you want them
│ └── transcriptions = convert number expressions to text or vice versa
├── test =
│ ├── data = test data
│ └── src = tests for the fst's in the src/ dir
└── tools =
├── grammarcheckers = prototype work, only SME for now
├── mt = machine translation
│ └── apertium = ... for certain MT platforms
├── preprocess = split text in sentences and words
└── spellcheckers = spell checkers are built here
Technologies
- All technologies are rule-based, as opposed to statistical and similar technologies.
- This allows us to write grammars that are precise descriptions of the languages - reference grammars, in a way
- Goal: the documentation of your grammar - with suitable examples etc. - could be the next published grammar for your language (we’ll return to that shortly)
Technology for morphological analysis
We presently use three different technologies:
- Xerox - closed source, not properly maintained, fast, no weights
- Hfst - open source, actively maintained, used in our proofing tools
- Foma - open source, actively maintained, fast (newly added, not yet available for all fst’s)
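Since the same source compiles into a parallel binary for each toolkit, the results can be compared directly on the command line. A hedged sketch (the file names follow the analyser-gt-norm pattern used in the yaml test configuration later; the Foma file extension is an assumption):

echo mihko | lookup src/analyser-gt-norm.xfst        ## Xerox
echo mihko | hfst-lookup src/analyser-gt-norm.hfst   ## HFST
echo mihko | flookup src/analyser-gt-norm.foma       ## Foma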
Technology for syntactic parsing
- CG (VISL CG-3, from the University of Southern Denmark)
- used for syntactic parsing
- also for grammar checking
- Basic idea: remove unwanted readings or select wanted ones based on the morphosyntactic context (= output of the morphological analysis)
- Example:
## We like finite verbs:
SELECT:Vfin VFIN ;
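REMOVE rules work the same way, with the context given after IF; a minimal illustrative rule (the rule and set names are hypothetical, not taken from an actual grammar):

## Remove infinitive readings immediately after a finite verb:
REMOVE:noInfAfterVfin (Inf) IF (-1 VFIN) ;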
Templated Build Structure And Source Files
- Common resources in $GTHOME/giella-core/ ($GTCORE)
- Template for new languages, including build instructions
- The template is merged (using svn merge) with each language when updated
[../images/newinfra.png]
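A hedged sketch of such a merge (the template URL is illustrative only - the actual repository layout differs):

cd langs/crk                                  ## any language dir
svn merge ^/giella-templates/langs-templates .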
Configurable builds
We support a lot of different tools and targets, but in most cases one only wants a handful of them. When running ./configure, you get a summary at the end of the things that are turned on and off:
$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:
-- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
-- Xerox is default on, the others off unless they are the only one present --
* build Xerox fst's: yes
* build HFST fst's: yes
* build Foma fst's: no
-- basic packages (on by default): --
* analysers enabled: yes
* generators enabled: yes
* transcriptors enabled: yes
* syntactic tools enabled: yes
* yaml tests enabled: yes
* generated documentation enabled: yes
-- proofing tools (off by default): --
* spellers enabled: no
* hfst speller fst's enabled: no
* foma speller enabled: no
* hunspell generation enabled: no
* fst hyphenator enabled: no
* grammar checker enabled: no
-- specialised fst's (off by default): --
* phonetic/IPA conversion enabled: no
* dictionary fst's enabled: no
* Oahpa transducers enabled: no
* L2 analyser: no
* downcase error analyser: no
* Apertium transducers enabled: no
* Generate abbr.txt: no
For more ./configure options, run ./configure --help
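To turn more targets on, pass the corresponding flags. A hedged example (the flag names beyond --with-hfst are assumptions based on the summary above - verify against ./configure --help):

$ ./configure --with-hfst --enable-spellers --enable-grammarchecker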
The build - schematic
[../images/new_infra_build_overview.png]
Closer View Of Selected Parts:
- Documentation
- Testing
- From Source To Final Tool: Relation Between Lexicon, Build And Speller
Closer View: Documentation
- Background
- Implementation
Background
- Documentation is always out-of-date
- It tends to be much more out-of-date when heavily separated from the thing to be documented, and vice versa
- How to improve: make it possible to write documentation within the source code
- This is similar to JavaDoc, Doxygen and many other such systems
- Ultimate goal: document the source code so that it can be published as the next reference grammar!
Implementation
- The infrastructure will automatically extract comments of a certain type, and convert them into html
- One can cite portions of the source code, as well as test data.
- The syntax of the comments must follow the JSPWiki syntax
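A minimal sketch of such an extractable comment in a LexC source file (the !!= marker and the @CODE@ placeholder mirror the MUORRA lexicon cited under LexC tests below; the lexicon itself is hypothetical):

LEXICON NOUNSTEM !!= @CODE@ Hypothetical lexicon for even-syllable noun stems.
!! Free-standing lines starting with !! are extracted as-is into the html.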
Example cases:
- [https://giellalt.uit.no/lang/fin/root-morphology.html]
- [https://giellalt.uit.no/lang/smj/nouns-affixes.html]
Documentation:
- [https://giellalt.uit.no/infra/infraremake/In-sourceDocumentation.html]
Closer View: Testing
- testing framework
- yaml tests
- in-source tests
- other tests
Testing Framework
All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.
All tests are run with a single command:
make check
Autotools gives a PASS or FAIL on each test as it finishes:
[../images/make_check_output.png]
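Since these are ordinary Automake tests, the standard Automake mechanisms apply - for example, a run can be restricted to a single test by overriding TESTS (the test file name here is hypothetical, and whether an individual file is addressable this way depends on how the Makefile wires the tests):

make check TESTS=noun-gt-norm.yaml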
Yaml Tests
These are the most used tests, and are named after the syntax of the test files. The core syntax is:
- a header
- test sets: test name, test data
- syntax requirements: indents using spaces, multiple choices as lists within brackets, colons after everything except the word forms
Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup

Tests:
  Noun - mihkw - ok :   ## -m inanimate noun, blood, Wolvengrey
    mihko+N+IN+Sg: mihko
    mihko+N+IN+Sg+Px1Sg: nimihkom
    mihko+N+IN+Sg+Px2Sg: kimihkom
    mihko+N+IN+Sg+Px1Pl: nimihkominân
    mihko+N+IN+Sg+Px12Pl: kimihkominaw
    mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
    mihko+N+IN+Sg+Px3Sg: omihkom
    mihko+N+IN+Sg+Px3Pl: omihkomiwâw
    mihko+N+IN+Sg+Px4Pl: omihkomiyiw
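When several surface forms are acceptable for one analysis, they are given as a bracketed list (a schematic entry with placeholder forms, not actual Plains Cree data):

    lemma+N+IN+Pl: [wordformA, wordformB]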
Yaml test output
[../images/make_check_output.png]
- each yaml test file has its own line of output with PASS / FAIL / TOTAL
- at the end of each yaml test run (= all yaml files for the same fst) there is a summary of the total results for that yaml test run
- … followed by the Automake PASS / FAIL message
In-Source Tests
- LexC tests
- Twolc tests
LexC tests
As an alternative to the yaml tests, one can specify similar test data within the source files:
LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon.
+N: MUORRAInfl ;
+N:%> MUORRACmp ;
### €gt-norm: kárta ## Even-syllable test
### € kártta: kártta+N+Sg+Nom
### € kártajn: kártta+N+Sg+Com
Such tests are very useful as checks of whether an inflectional lexicon behaves as it should.
The syntax is slightly different from the yaml files:
- the word form comes first
- multiple alternative word forms go on separate lines
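Schematically (placeholder strings, not real data), two acceptable word forms for the same analysis look like this:

### € wordformA: lemma+N+Sg+Loc
### € wordformB: lemma+N+Sg+Loc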
Twolc tests
The twolc tests look like the following:
### € iemed9#
### € iemet#
### € gål'leX7tj#
### € gål0lå0sj#
The strings come in pairs: the underlying (lexical) form first, then the expected surface form. The point is to ensure that the rules map the one onto the other as they should.
Other Tests
You can write any test you want, using your favourite programming language. There are a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.
Closer View: From Source To Final Tool:
- Relation Between Lexicon, Build And Speller
- Fst’s And Dictionaries
Relation Between Lexicon, Build And Speller
- tag conventions
- automatically generated filters
- spellers and different writing systems / alternative orthographies
Tag Conventions
We use certain tag conventions in the infrastructure:
- +Err/... (e.g. +Err/Orth, +Err/Cmp)
- +Sem/...
- and more…
Automatically Generated Filters
- Many of these clusters of tags are used for specific purposes, and are removed from other fst’s.
- tags using a common prefix (like +Err/ or +Sem/) automatically get filters for different purposes - there are filters for: removing the tags themselves; removing strings / words containing the tags
- by adhering to these conventions, you get a lot of functionality for free
- this system is used when…
Dealing with descriptive vs normative grammars
- the normative grammar is a subset of the descriptive one
- tag the non-normative forms using +Err/... tags
- write your grammar as descriptive
- remove the +Err/... strings => normative fst!
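A minimal sketch of such a filter in Xerox regex notation (schematic, not the actual generated filter):

! Accept only strings containing no +Err/ tags:
define removeErrorStrings ~$[ "+Err/Orth" | "+Err/Cmp" ] ;
! Schematically: normative = removeErrorStrings .o. descriptive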
Summary
- scalability
- division of labour
- language independence
- … but still flexible wrt the needs of each language
Giitu
- Thank you!