GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

How to update monolingual content

These steps apply to korp, u_korp and f_korp, and must be repeated for each language.

Step 1 - Update to the latest FST

cd lang-<ISO>
git pull    # or: svn up

Make sure the build is configured with the following prefix:

./configure --prefix=/Users/<USERNAME>/.local

Then run:

make
make install

Step 2 - Convert original files to xml

Make sure CorpusTools is installed.

Run the following:

convert2xml $GTFREE/orig/<ISO>
convert2xml $GTBOUND/orig/<ISO>
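Before starting the conversion, it can help to confirm that the tools and corpus folders used in the following steps are available. The sketch below is only a suggestion and is not part of CorpusTools: the function names check_tools and check_dirs are hypothetical, and it assumes that $GTFREE and $GTBOUND point at existing directories.

```shell
#!/bin/sh
# Preflight check (a sketch): report anything the later steps rely on that is missing.
# check_tools and check_dirs are hypothetical helper names, not CorpusTools commands.

# check_tools NAME...: succeed only if every NAME is found on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# check_dirs VAR...: succeed only if every named variable holds an existing directory.
check_dirs() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ] || [ ! -d "$value" ]; then
            echo "unset or not a directory: \$$name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_tools convert2xml analyse_corpus korp_mono && check_dirs GTFREE GTBOUND
```

Both functions print every problem they find to standard error and return a non-zero status if at least one check failed, so they can be chained in front of the commands in steps 2 to 4.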

Step 3 - Analyse converted files

Run the following:

analyse_corpus <ISO> $GTFREE/converted/<ISO>/
analyse_corpus <ISO> $GTBOUND/converted/<ISO>/

This may take a while to run depending on the size of the converted folders.

Step 4 - Convert the analysed files into the required Korp format

Run the following:

korp_mono <ISO> $GTFREE/analysed/<ISO>
korp_mono <ISO> $GTBOUND/analysed/<ISO>

Correct errors in the conversion if they occur, and run the conversion again. Known errors:

Do not proceed until all conversion errors are fixed.
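Steps 2 to 4 can be chained so that a failure stops the run before the next step starts. The following is a minimal sketch under stated assumptions: the helper names run and process_language and the DRY_RUN switch are hypothetical, and the sketch assumes that convert2xml, analyse_corpus and korp_mono return a non-zero exit status when they fail, which should be verified before relying on it.

```shell
#!/bin/sh
# Sketch: run steps 2-4 for one language and stop at the first failure.
# Set DRY_RUN=1 to print the commands instead of running them.
# run, process_language and DRY_RUN are hypothetical, not part of CorpusTools.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# process_language ISO: convert, analyse and korp-convert both corpus roots,
# in the same order as the steps above.
process_language() {
    iso=$1
    for root in "$GTFREE" "$GTBOUND"; do
        run convert2xml "$root/orig/$iso" || return 1
    done
    for root in "$GTFREE" "$GTBOUND"; do
        run analyse_corpus "$iso" "$root/converted/$iso/" || return 1
    done
    for root in "$GTFREE" "$GTBOUND"; do
        run korp_mono "$iso" "$root/analysed/$iso" || return 1
    done
}

# Example:
#   DRY_RUN=1; process_language fit    # print the six commands
#   DRY_RUN=0; process_language fit    # run them, stopping on the first error
```

The exit status only catches errors that the tools themselves report as failures; the output of each step should still be inspected for conversion errors.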

Step 5 - Compile the converted files into one .vrt file per genre

Repeat this for each genre:

cd CorpusTools/korp_scripts/update_mono
mkdir _od_<ISO>._.<GENRE>/
rsync -av $GTFREE/korp/<ISO>/<GENRE>/  _od_<ISO>._.<GENRE>/
rsync -av $GTBOUND/korp/<ISO>/<GENRE>/  _od_<ISO>._.<GENRE>/
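Since the commands above are repeated for every genre, they can be wrapped in a loop. The sketch below assumes it is run from CorpusTools/korp_scripts/update_mono; the function names od_dir, collect_genre and collect_genres are hypothetical. It uses mkdir -p instead of plain mkdir so that a rerun does not fail on a folder that already exists.

```shell
#!/bin/sh
# Sketch: gather the korp output of several genres, one target folder per genre.
# Run from CorpusTools/korp_scripts/update_mono.
# od_dir, collect_genre and collect_genres are hypothetical helper names.

# od_dir ISO GENRE: print the target folder name used in this step.
od_dir() {
    printf '_od_%s._.%s\n' "$1" "$2"
}

# collect_genre ISO GENRE: create the target folder and sync both corpus roots into it.
collect_genre() {
    target=$(od_dir "$1" "$2")
    mkdir -p "$target" || return 1
    rsync -av "$GTFREE/korp/$1/$2/" "$target/" || return 1
    rsync -av "$GTBOUND/korp/$1/$2/" "$target/" || return 1
}

# collect_genres ISO GENRE...: repeat collect_genre for every genre given.
collect_genres() {
    iso=$1
    shift
    for genre in "$@"; do
        collect_genre "$iso" "$genre" || return 1
    done
}

# Example (pass the genres that exist for your language):
#   collect_genres fit ficti
```

Note that rsync fails if a genre folder is missing under one of the two corpus roots, in which case the loop stops at that genre.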

For the genre “ficti” only, the order of all sentences must be changed. To do this, run the following:

python3 scramble.py _od_<ISO>._.ficti

Set cDomain and cLang in compile_corpus.xsl, then run the following:

java -Xmx2048m -cp ~/main/tools/TermWikiExporter/lib/saxon9.jar -Dfile.encoding=UTF8 net.sf.saxon.Transform -it:main compile_corpus.xsl

Copy the file loc_metadata_xxx.json to a new file, replacing xxx with the ISO code of the language you are processing. Edit the copy manually, using the xxx file as an example. For the date, use the date set in compile_corpus.xsl.

Also, rename the folder vrt_<ISO>_<DATE> to the ISO code of the language you work on, e.g. vrt_fit_20210625 to fit.
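The copy and the rename can be scripted as in the sketch below. It assumes the current directory contains both loc_metadata_xxx.json and the vrt_<ISO>_<DATE> folder; the function names iso_from_vrt_dir and finish_compile are hypothetical. The copied metadata file still has to be edited by hand afterwards.

```shell
#!/bin/sh
# Sketch: copy the metadata template and rename the compiled vrt folder.
# iso_from_vrt_dir and finish_compile are hypothetical helper names.

# iso_from_vrt_dir NAME: print the ISO part of a vrt_<ISO>_<DATE> folder name.
iso_from_vrt_dir() {
    name=${1%/}                   # drop a trailing slash, if any
    name=${name#vrt_}             # drop the vrt_ prefix
    printf '%s\n' "${name%_*}"    # drop the _<DATE> suffix
}

# finish_compile ISO DATE: copy the template to loc_metadata_<ISO>.json and
# rename vrt_<ISO>_<DATE> to <ISO>. Refuses to run if <ISO> already exists,
# because mv would otherwise move the folder inside the existing one.
finish_compile() {
    iso=$1
    stamp=$2
    if [ -e "$iso" ]; then
        echo "target already exists: $iso" >&2
        return 1
    fi
    cp loc_metadata_xxx.json "loc_metadata_$iso.json" || return 1
    mv "vrt_${iso}_${stamp}" "$iso" || return 1
}

# Example:
#   finish_compile fit 20210625
```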

Step 6 - Produce data for Korp (using cwb)

Change in_dir, metaFile, date and lang_code in korp_scripts/update_mono/loc_run_gt_corpus_encoding.sh as needed.

Also change root_dir in loc_encode_gt_corpus_20181106.sh.

Make a folder in update_mono and name it after the ISO code of your language (here: fit).

Then run the following:

sh loc_run_gt_corpus_encoding.sh