This document follows the cgi-interface.xsl and cgi-index.xsl documents, which specify the cgi-bin processes each language is subject to. The processes themselves are governed by conf.pl and other cgi-bin scripts in the www directory.
The source files are compiled in the maintainer's home directory on gtweb (and, for oahpa, on gtoahpa). A suitable way to compile the fsts might be to run makeall.sh (and thereafter go for a cup of coffee).
After the fsts are compiled, they should be copied manually or semi-manually to the /opt/smi directory by the person who compiled them.
The reason this is not done automatically (say, by a cron job) is that we want to be sure the fsts are checked before they are put online. With the large number of fsts, we still need to make the process semi-automatic, with scripts along the lines of update mhr. This document paves the way for such a setup.
In short: We want to have a selective button to press, called e.g. web-update LANG.
A (now outdated) script that does this, and that may serve as inspiration, is:
$GTLANG/gt/script/fst2opt
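The "selective button" idea could be sketched roughly as follows. This is only an illustration and not the fst2opt script: the function name `web_update`, its parameters, and the dry-run default are assumptions introduced here. The caller decides which language to update and which files to copy, and nothing is written unless `dry_run` is switched off.

```python
import shutil
from pathlib import Path


def web_update(lang, src_dir, opt_root, pairs, dry_run=True):
    """Copy already-checked files for one language into <opt_root>/<lang>/bin.

    pairs is a list of (path relative to src_dir, target file name).
    Returns two lists: target names that were (or would be) copied,
    and source paths that were not found.
    Hypothetical helper; not part of the existing infrastructure.
    """
    bin_dir = Path(opt_root) / lang / "bin"
    copied, missing = [], []
    for rel_src, target in pairs:
        src = Path(src_dir) / rel_src
        if not src.is_file():
            # The file was never compiled (or has another name): skip it.
            missing.append(rel_src)
            continue
        if not dry_run:
            bin_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, bin_dir / target)
        copied.append(target)
    return copied, missing
```

Running it once with `dry_run=True` shows what would be put online; running it again with `dry_run=False` performs the copy, which keeps the manual check in the loop.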
Source files are in these directories:
$GTHOME/langs/LANG
$GTHOME/gt/sme/src
This list is taken from the sme directory and is thus a maximal version. Other languages contain fewer files.
Name in /opt/smi/LANG/bin           | Name in langs/LANG/src
------------------------------------+----------------------------------------------------------
abbr.txt                            | (not in the new infra yet)
clock-LANG.fst                      | transcriptor-clock2text-desc.xfst
corr.txt                            | (not in the new infra yet)
date-LANG.fst                       | transcriptor-date2text-desc.xfst
dict-iLANG-norm.fst                 | generator-dict-gt-norm.xfst
dict-LANG-norm.fst                  | analyser-dict-gt-norm.xfst
hyphrules-LANG.fst                  | hyphenation/hyphenation.xfst
hyph-LANG.fst                       | (not in the new infra yet)
hyph-LANG.save                      | (not in the new infra yet)
iclock-LANG.fst                     | transcriptor-text2clock-desc.xfst
idate-LANG.fst                      | transcriptor-text2date-desc.xfst
iLANG.fst                           | generator-gt-desc.xfst
iLANG-GG.restr.fst                  | (not in the new infra)
iLANG-norm.fst                      | generator-gt-norm.xfst
korpustags.LANG.txt                 | ../test/data/korpustags.LANG.txt
oahpa-iLANG-norm.fst                | generator-oahpa-gt-norm.xfst
paradigm_full.LANG.txt              | ../test/data/paradigm_full.LANG.txt (temporary place)
paradigm_min.LANG.txt               | ../test/data/paradigm_min.LANG.txt (temporary place)
paradigm.LANG.txt                   | ../test/data/paradigm.LANG.txt (temporary place)
paradigm_standard.LANG.txt          | ../test/data/paradigm_standard.LANG.txt (temporary place)
ped-LANG.fst (input to s/v)         | ???
ped-tol-LANG.fst                    | ???
phon-LANG.fst                       | phonetics/text2ipa.xfst
LANG-dep.bin                        | syntax/dependency.bin
LANG-dep.rle                        | syntax/dependency.cg3
LANG-dis.bin                        | syntax/disambiguation.bin
LANG-dis.rle                        | syntax/disambiguation.cg3
LANG.fst                            | analyser-gt-desc.xfst (see LANG-site.fst!)
LANG-inum.fst                       | transcriptor-text2numbers-desc.xfst
LANG-norm.fst                       | analyser-gt-norm.xfst
LANG-num.fst                        | transcriptor-numbers2text-desc.xfst
LANG-site.fst (sme.fst without sem) | analyser-gt-desc.xfst (see LANG-site.fst!)
smi-syn.rle                         | syntax/syntax.cg3
typos.fst                           | ../test/data/typos.fst
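For scripting purposes, the table above can be treated as a set of name templates in which LANG is a placeholder for the language code. The sketch below shows one possible way to expand it; `NAME_MAP` holds only an excerpt of the table, and both `NAME_MAP` and `resolve` are names made up for this illustration.

```python
# Excerpt of the table: name in /opt/smi/LANG/bin -> name in langs/LANG/src.
# Rows marked "???" or "not in the new infra" are left out on purpose.
NAME_MAP = {
    "LANG.fst": "analyser-gt-desc.xfst",
    "LANG-norm.fst": "analyser-gt-norm.xfst",
    "iLANG.fst": "generator-gt-desc.xfst",
    "iLANG-norm.fst": "generator-gt-norm.xfst",
    "LANG-dis.rle": "syntax/disambiguation.cg3",
    "phon-LANG.fst": "phonetics/text2ipa.xfst",
    "korpustags.LANG.txt": "../test/data/korpustags.LANG.txt",
}


def resolve(lang, name_map=NAME_MAP):
    """Replace the LANG placeholder on both sides with a language code."""
    return {
        target.replace("LANG", lang): source.replace("LANG", lang)
        for target, source in name_map.items()
    }


if __name__ == "__main__":
    for target, source in sorted(resolve("sme").items()):
        print(f"{source} -> {target}")
```

Keeping the table as data rather than as a series of hard-coded copy commands means a language with fewer files only needs a shorter map.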
We have analysis and paradigm generation for all languages:
LANG.fst = src/analyser-gt-desc.xfst
iLANG.fst = src/generator-gt-desc.xfst
iLANG-norm.fst = src/generator-gt-norm.xfst
The distribution is governed by cgi-interface.xsl. Thus, one possibility is to just copy all files over.
if dependency="1", copy:
LANG/src/syntax/dependency.bin /opt/smi/LANG/bin/LANG-dep.bin
LANG/src/syntax/dependency.cg3 /opt/smi/LANG/bin/LANG-dep.rle
if nodisamb="0", copy:
LANG/src/syntax/disambiguation.bin /opt/smi/LANG/bin/LANG-dis.bin
LANG/src/syntax/disambiguation.cg3 /opt/smi/LANG/bin/LANG-dis.rle
LANG/src/... /opt/smi/LANG/bin/abbr.txt
if nohyph="0", copy:
LANG/src/hyphenation/hyphenation.xfst /opt/smi/LANG/bin/hyph-LANG.fst
if orth="1" & lang="kal", copy:
LANG/src/transcriptions/kleinschmidt2norm.xfst /opt/smi/LANG/bin/orth-LANG.fst
if translate_dan="1", copy
LANG/
if translate_nob="1", copy
LANG/
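The conditional rules above lend themselves to a small table-driven sketch. It assumes the attribute values from cgi-interface.xsl have already been read into a plain dictionary (how they are read is not covered here) and that a missing attribute means "do not copy"; both are assumptions, as are the names `RULES` and `copy_plan`. Only the rules whose source and target are fully spelled out above are included.

```python
# (attribute, value that triggers the copy, [(source under LANG/, target name)])
RULES = [
    ("dependency", "1", [
        ("src/syntax/dependency.bin", "LANG-dep.bin"),
        ("src/syntax/dependency.cg3", "LANG-dep.rle"),
    ]),
    ("nodisamb", "0", [
        ("src/syntax/disambiguation.bin", "LANG-dis.bin"),
        ("src/syntax/disambiguation.cg3", "LANG-dis.rle"),
    ]),
    ("nohyph", "0", [
        ("src/hyphenation/hyphenation.xfst", "hyph-LANG.fst"),
    ]),
]


def copy_plan(lang, flags):
    """Return (source, destination) pairs for the rules that apply.

    flags maps attribute names to their string values, e.g. {"nohyph": "0"}.
    An attribute that is absent never triggers a copy (an assumption).
    """
    plan = []
    for attribute, wanted, pairs in RULES:
        if flags.get(attribute) != wanted:
            continue
        for source, target in pairs:
            destination = f"/opt/smi/{lang}/bin/{target.replace('LANG', lang)}"
            plan.append((f"{lang}/{source}", destination))
    return plan
```

For example, `copy_plan("sme", {"dependency": "1", "nodisamb": "1", "nohyph": "0"})` yields the two dependency files and the hyphenation file, and skips disambiguation.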
#### Other transducers
if orth="1" & lang="ipk", copy:
if phon="1", copy
LANG/src/phonetics/text2ipa.xfst /opt/smi/LANG/bin/phon-LANG.fst
#### Paradigm generation
if dialpara="1"
if minpara="1"
if standardpara="1"
if fullpara="1"
if nonum="0"
LANG/src/transcriptions/transcriptor-numbers2text-raw.xfst /opt/smi/LANG/bin/LANG-num.fst
if oahpa="1"
src/transcriptions/transcriptor-clock2text-raw.xfst /opt/smi/LANG/bin/iclock-LANG.fst
src/transcriptions/transcriptor-date2text-raw.xfst /opt/smi/LANG/bin/idate-LANG.fst
Invert these transducers and copy them to clock-LANG.fst and date-LANG.fst
CHECK: It might be that Oahpa uses fsts stored in the /home/oahpa directory.
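As a reminder of what the inversion step above means: a transducer relates strings on one side to strings on the other, and inverting it swaps the two sides. The toy sketch below models a transducer as a finite set of string pairs, which is a strong simplification; the sample pairs are made up for illustration and are not taken from any GiellaLT transducer. In practice the inversion would be done with the fst toolkit itself.

```python
def invert(pairs):
    """Return the inverse relation: each (upper, lower) becomes (lower, upper)."""
    return {(lower, upper) for upper, lower in pairs}


# Made-up sample pairs, for illustration only.
clock2text = {("1:00", "one o'clock"), ("2:30", "half past two")}
text2clock = invert(clock2text)

if __name__ == "__main__":
    for upper, lower in sorted(text2clock):
        print(f"{upper} -> {lower}")
```

Inverting twice gives back the original relation, so only one direction needs to be maintained by hand.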