GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
The infrastructure has several finite-state transducers (FSTs) for transcribing one text string into another.
The folder lang-xxx/src/transcriptions/ contains the setup for converting various number and symbol representations to text. The source files in the folder are:
transcriptor-abbrevs2text.lexc # for abbreviations
transcriptor-clock-digit2text.lexc # for time expressions
transcriptor-date-digit2text.lexc # for dates
transcriptor-numbers-digit2text.lexc # for cardinals and ordinals
Each lexc file gives rise to two transducers, shown here with clock as an example:
transcriptor-clock-digit2text.lexc
[...]
transcriptor-clock-digit2text.filtered.lookup.hfstol
transcriptor-clock-text2digit.filtered.lookup.hfstol
The direction (from digit to text or vice versa) is shown in the filename.
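The naming pattern above can be sketched as a small shell helper. This is only an illustration: `transducer_names` is a hypothetical function, not part of the infrastructure, and it assumes that every `*-digit2text.lexc` file follows the same pattern as the clock example.

```shell
# Hypothetical helper: print the two transducer file names expected
# from a *-digit2text.lexc source file, following the clock example.
transducer_names() {
    base=${1%.lexc}                 # strip the .lexc suffix
    case $base in
        *-digit2text) ;;            # only this pattern is shown in the text
        *) echo "unsupported file name: $1" >&2; return 1 ;;
    esac
    stem=${base%-digit2text}        # e.g. transcriptor-clock
    printf '%s\n' "$base.filtered.lookup.hfstol"
    printf '%s\n' "$stem-text2digit.filtered.lookup.hfstol"
}

# Usage:
#   transducer_names transcriptor-clock-digit2text.lexc
```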
Here are some resources for testing the transcriptors. You can generate the text forms of the first 100 numbers as follows (replace the arguments to seq according to what you want to test):
seq 1 100 | \
hfst-lookup -q src/transcriptions/transcriptor-numbers-digit2text.filtered.lookup.hfstol
Then you may check the output against the normative analyser:
seq 1 100 | \
hfst-lookup -q src/transcriptions/transcriptor-numbers-digit2text.filtered.lookup.hfstol | \
cut -f2 | \
cut -c1- | \
grep -v '^$' | \
hfst-lookup -q src/analyser-gt-norm.hfstol
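To see only the generated forms that the normative analyser rejects, the output of the pipeline above can be filtered. The sketch below assumes the usual tab-separated hfst-lookup output (input, analysis, weight), in which an unrecognised input gets an analysis ending in `+?`. The function name `unknown_forms` is made up for this example.

```shell
# Hypothetical filter: print the inputs that hfst-lookup could not analyse.
# Assumption: unrecognised inputs are reported with an analysis ending in "+?".
unknown_forms() {
    awk -F '\t' '$2 ~ /\+\?$/ { print $1 }'
}

# Usage: append the filter to the pipeline above, e.g.
#   ... | hfst-lookup -q src/analyser-gt-norm.hfstol | unknown_forms
```

If the filter prints nothing, every generated form was accepted by the analyser.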
There are ready-made files for all numeral formats:
$GTHOME/ped/doc/common/numratesting/cardinal
$GTHOME/ped/doc/common/numratesting/clock
$GTHOME/ped/doc/common/numratesting/date
$GTHOME/ped/doc/common/numratesting/ordinal
You can thus test with these files (here with clock as an example):
cat $GTHOME/ped/doc/common/numratesting/clock | \
hfst-lookup src/transcriptions/transcriptor-clock-digit2text.filtered.lookup.hfstol
(If you don’t have GTHOME, the files are here.)
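Running all four test files in one go could look roughly like the sketch below. It rests on assumptions: the date transducer is taken to follow the same naming pattern as the clock and numbers transducers, and both the cardinal and ordinal files are taken to be handled by the numbers transducer, since the numbers lexc file covers both. `transducer_for` and `run_all` are hypothetical names.

```shell
# Hypothetical helper: map a test file name to the transducer assumed to handle it.
transducer_for() {
    case $1 in
        cardinal|ordinal) kind=numbers ;;   # numbers lexc covers cardinals and ordinals
        clock|date)       kind=$1 ;;
        *) echo "unknown test file: $1" >&2; return 1 ;;
    esac
    printf 'src/transcriptions/transcriptor-%s-digit2text.filtered.lookup.hfstol\n' "$kind"
}

# Run every ready-made test file through its transducer (requires GTHOME).
run_all() {
    for name in cardinal clock date ordinal; do
        echo "== $name =="
        hfst-lookup -q "$(transducer_for "$name")" \
            < "$GTHOME/ped/doc/common/numratesting/$name"
    done
}

# Usage, from the lang-xxx directory:
#   run_all
```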
The folder lang-xxx/src/phonetics/ contains the setup for text-to-IPA transcription.
The folder lang-xxx/src/orthography/ contains files for converting sloppy writing and non-standard encodings to standard forms.