Edmonton presentation
University of Alberta, Edmonton, June 8th & 13th 2015
Sjur Moshagen, UiT The Arctic University of Norway
Overview of the presentation
- background and goals
- bird’s eye view
- closer view of selected parts: documentation, testing, from source to final tool
Background and goals
- Background
- Goals
Background
- need for simpler maintenance
- scalability, both for languages and tools and for linguists and other developers
- developing NLP resources is a lot of work, and languages are complex - we need a tool and an infrastructure to handle the complexity in a manageable way
- keep technical details out of the way
- make the daily work as simple as possible
- division of labour
- Recognition: know the basic setup of one language - know the setup of them all
Goals
- easy support for many languages
- easy support for many tools
- keep language independent and language specific code apart
- easily upgradable
- the resources in our infrastructure should live on for decades or more
General principles
Be explicit (use non-cryptic catalogue and file names)
Be clear (files should be found in non-surprising locations)
Be consistent (identical conventions in all languages as far as possible)
Be modular
Divide language-dependent and language-independent code
Reuse resources
Build all tools for all languages
… but only as much as you want (parametrised build process)
Bird’s Eye View and Down
- the house
- organisation - directory structure
- technologies (xerox, hfst, foma + cg)
- templated build structure and source files
- configuration of builds
The House
[../images/hus_eng_2015.png]
The House and the Infra
[../images/hus_eng_2015_with_infra.png]
*Machine translation: fst's built by the infra, the rest handled by Apertium
*Speech synthesis is not (yet) built by the infra, though the conversion to IPA is part of the infrastructure
Supported: the fst's and syntactic parsers used are built by the infrastructure
$GTHOME - directory structure
Some less relevant dirs removed for clarity:
$GTHOME/ ## root directory, can be named whatever
├── experiment-langs ## language dirs used for experimentation
├── giella-core ## $GTCORE - core utilities
├── giella-shared ## shared linguistic resources
├── giella-templates ## templates for maintaining the infrastructure
├── keyboards ## keyboard apps organised roughly as the language dirs
├── langs ## The languages being actively developed, such as:
│ ├─[...] #
│ ├── crk ## Plains Cree
│ ├── est ## Estonian
│ ├── evn ## Evenki
│ ├── fao ## Faroese
│ ├── fin ## Finnish
│ ├── fkv ## Kven
│ ├── hdn ## Northern Haida
│ └─[...] #
├── ped ## Oahpa etc.
├── prooftools ## Libraries and installers for spellers and the like
├── startup-langs ## Directory for languages in their start-up phase
├── techdoc ## technical documentation
├── words ## dictionary sources
└── xtdoc ## external (user) documentation & web pages
Organisation - Dir Structure
.
├── src = source files
│ ├── filters = adjust fst's for special purposes
│ ├── hyphenation = nikîpakwâtik > ni-kî-pa-kwâ-tik
│ ├── morphology =
│ │ ├── affixes = prefixes, suffixes
│ │ └── stems = lexical entries
│ ├── orthography = latin -> syllabics, spellrelax
│ ├── phonetics = conversion to IPA
│ ├── phonology = morphophonological rules
│ ├── syntax = disambiguation, synt. functions, dependency
│ ├── tagsets = get your tags as you want them
│ └── transcriptions = convert number expressions to text or vice versa
├── test =
│ ├── data = test data
│ └── src = tests for the fst's in the src/ dir
└── tools =
├── grammarcheckers = prototype work, only SME for now
├── mt = machine translation
│ └── apertium = ... for certain MT platforms
├── preprocess = split text in sentences and words
└── spellcheckers = spell checkers are built here
Technologies
- All technologies are rule-based, as opposed to statistical and similar technologies.
- This allows us to write grammars that are precise descriptions of the languages - reference grammars, in a way
- Goal: the documentation of your grammar - with suitable examples etc. - could be the next published grammar for your language (we’ll return to that shortly)
Technology for morphological analysis
We presently use three different technologies:
- Xerox - closed source, not properly maintained, fast, no weights
- Hfst - open source, actively maintained, used in our proofing tools
- Foma - open source, actively maintained, fast (newly added, not yet available for all fst’s)
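Since the same source compiles into a parallel binary for each toolkit, the results can be compared directly on the command line. A hedged sketch (the file names follow the analyser-gt-norm pattern used in the yaml test configuration later; the Foma file extension is an assumption):

echo mihko | lookup src/analyser-gt-norm.xfst        ## Xerox
echo mihko | hfst-lookup src/analyser-gt-norm.hfst   ## HFST
echo mihko | flookup src/analyser-gt-norm.foma       ## Foma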
Technology for syntactic parsing
- CG (VISL CG-3, from the University of Southern Denmark)
- used for syntactic parsing
- also for grammar checking
- Basic idea: remove unwanted readings or select wanted ones based on the morphosyntactic context (= output of the morphological analysis)
- Example:
## We like finite verbs:
SELECT:Vfin VFIN ;
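REMOVE rules work the same way, with the context given after IF; a minimal illustrative rule (the rule and set names are hypothetical, not taken from an actual grammar):

## Remove infinitive readings immediately after a finite verb:
REMOVE:noInfAfterVfin (Inf) IF (-1 VFIN) ;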
Templated Build Structure And Source Files
- Common resources in $GTHOME/giella-core/ ($GTCORE)
- Template for new languages, including build instructions
- The template is merged (using svn merge) with each language when updated
[../images/newinfra.png]
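A hedged sketch of such a merge (the template URL is illustrative only - the actual repository layout differs):

cd langs/crk                                  ## any language dir
svn merge ^/giella-templates/langs-templates .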
Configurable builds
We support a lot of different tools and targets, but in most cases one only wants a handful of them. When running ./configure, you get a summary at the end of the things that are turned on and off:
$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:
-- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
-- Xerox is default on, the others off unless they are the only one present --
* build Xerox fst's: yes
* build HFST fst's: yes
* build Foma fst's: no
-- basic packages (on by default): --
* analysers enabled: yes
* generators enabled: yes
* transcriptors enabled: yes
* syntactic tools enabled: yes
* yaml tests enabled: yes
* generated documentation enabled: yes
-- proofing tools (off by default): --
* spellers enabled: no
* hfst speller fst's enabled: no
* foma speller enabled: no
* hunspell generation enabled: no
* fst hyphenator enabled: no
* grammar checker enabled: no
-- specialised fst's (off by default): --
* phonetic/IPA conversion enabled: no
* dictionary fst's enabled: no
* Oahpa transducers enabled: no
* L2 analyser: no
* downcase error analyser: no
* Apertium transducers enabled: no
* Generate abbr.txt: no
For more ./configure options, run ./configure --help
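To turn more targets on, pass the corresponding flags. A hedged example (the flag names beyond --with-hfst are assumptions based on the summary above - verify against ./configure --help):

$ ./configure --with-hfst --enable-spellers --enable-grammarchecker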
The build - schematic
[../images/new_infra_build_overview.png]
Closer View Of Selected Parts:
- Documentation
- Testing
- From Source To Final Tool: Relation Between Lexicon, Build And Speller
Closer View: Documentation
- Background
- Implementation
Background
- Documentation is always out-of-date
- It tends to be much more out-of-date when heavily separated from the thing to be documented, and vice versa
- How to improve: make it possible to write documentation within the source code
- This is similar to JavaDoc, Doxygen and many other such systems
- Ultimate goal: document the source code so that it can be published as the next reference grammar!
Implementation
- The infrastructure will automatically extract comments of a certain type, and convert them into html
- One can cite portions of the source code, as well as test data.
- The syntax of the comments must follow the JSPWiki syntax
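A minimal sketch of such an extractable comment in a LexC source file (the !!= marker and the @CODE@ placeholder mirror the MUORRA lexicon cited under LexC tests below; the lexicon itself is hypothetical):

LEXICON NOUNSTEM !!= @CODE@ Hypothetical lexicon for even-syllable noun stems.
!! Free-standing lines starting with !! are extracted as-is into the html.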
Example cases:
- [https://giellalt.uit.no/lang/fin/root-morphology.html]
- [https://giellalt.uit.no/lang/smj/nouns-affixes.html]
Documentation:
- [https://giellalt.uit.no/infra/infraremake/In-sourceDocumentation.html]
Closer View: Testing
- testing framework
- yaml tests
- in-source tests
- other tests
Testing Framework
All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.
All tests are run with a single command:
make check
Autotools gives a PASS or FAIL on each test as it finishes:
[../images/make_check_output.png]
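Since these are ordinary Automake tests, the standard Automake mechanisms apply - for example, a run can be restricted to a single test by overriding TESTS (the test file name here is hypothetical, and whether an individual file is addressable this way depends on how the Makefile wires the tests):

make check TESTS=noun-gt-norm.yaml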
Yaml Tests
These are the most used tests, and are named after the syntax of the test files. The core syntax is:
- a header
- test sets: test name, test data
- syntax requirements: indents using spaces, multiple choices as lists within brackets, colons after everything except the word forms
Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup

Tests:
  Noun - mihkw - ok :   ## -m inanimate noun, blood, Wolvengrey
    mihko+N+IN+Sg: mihko
    mihko+N+IN+Sg+Px1Sg: nimihkom
    mihko+N+IN+Sg+Px2Sg: kimihkom
    mihko+N+IN+Sg+Px1Pl: nimihkominân
    mihko+N+IN+Sg+Px12Pl: kimihkominaw
    mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
    mihko+N+IN+Sg+Px3Sg: omihkom
    mihko+N+IN+Sg+Px3Pl: omihkomiwâw
    mihko+N+IN+Sg+Px4Pl: omihkomiyiw
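When several surface forms are acceptable for one analysis, they are given as a bracketed list (a schematic entry with placeholder forms, not actual Plains Cree data):

    lemma+N+IN+Pl: [wordformA, wordformB]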
Yaml test output
[../images/make_check_output.png]
- each yaml test file has its own line of output with PASS / FAIL / TOTAL
- at the end of each yaml test run (= all yaml files for the same fst) there is a summary of the total results for that yaml test run
- … followed by the Automake PASS / FAIL message
In-Source Tests
- LexC tests
- Twolc tests
LexC tests
As an alternative to the yaml tests, one can specify similar test data within the source files:
LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon.
+N: MUORRAInfl ;
+N:%> MUORRACmp ;
### €gt-norm: kárta ## Even-syllable test
### € kártta: kártta+N+Sg+Nom
### € kártajn: kártta+N+Sg+Com
Such tests are very useful as checks of whether an inflectional lexicon behaves as it should.
The syntax is slightly different from the yaml files:
- the word form comes first
- multiple alternative word forms go on separate lines
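Schematically (placeholder strings, not real data), two acceptable word forms for the same analysis look like this:

### € wordformA: lemma+N+Sg+Loc
### € wordformB: lemma+N+Sg+Loc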
Twolc tests
The twolc tests look like the following:
### € iemed9#
### € iemet#
### € gål'leX7tj#
### € gål0lå0sj#
The strings come in pairs: the underlying (lexical) form first, then the expected surface form. The point is to ensure that the rules map the one onto the other as they should.
Other Tests
You can write any test you want, using your favourite programming language. There are a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.
Closer View: From Source To Final Tool:
- Relation Between Lexicon, Build And Speller
- Fst’s And Dictionaries
Relation Between Lexicon, Build And Speller
- tag conventions
- automatically generated filters
- spellers and different writing systems / alternative orthographies
Tag Conventions
We use certain tag conventions in the infrastructure:
- +Err/... (e.g. +Err/Orth, +Err/Cmp)
- +Sem/...
- and more…
Automatically Generated Filters
- Many of these clusters of tags are used for specific purposes, and are removed from other fst’s.
- tags using a common prefix (like +Err/ or +Sem/) automatically get filters for different purposes - there are filters for: removing the tags themselves; removing strings / words containing the tags
- by adhering to these conventions, you get a lot of functionality for free
- this system is used when…
Dealing with descriptive vs normative grammars
- the normative grammar is a subset of the descriptive one
- tag the non-normative forms using +Err/... tags
- write your grammar as descriptive
- remove the +Err/... strings => normative fst!
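A minimal sketch of such a filter in Xerox regex notation (schematic, not the actual generated filter):

! Accept only strings containing no +Err/ tags:
define removeErrorStrings ~$[ "+Err/Orth" | "+Err/Cmp" ] ;
! Schematically: normative = removeErrorStrings .o. descriptive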
Summary
- scalability
- division of labour
- language independence
- … but still flexible wrt the needs of each language
Giitu
- Thank you!