GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

View GiellaLT on GitHub

Page Content

How to update parallel content

This steps are valid for korp, u_korp and f_korp and need to be repeated for each language. Note: this is the “old” way of updating parallel content. But it is suggested to try out the korp_para command from the CorpusTools instead of running the script analyse_xxx_tmx.py.

Step 1 - Update to the latest fst:

Run the following for both the majority language () and the minority language ():

cd lang-<ISO>
git pull or svn up

Make sure that you have in your configuration: ./configure --prefix=/Users/<USERNAME>/.local

Then run:

make
make install

Step 2 - Copy 2 folder and prepare data

Make sure you have the CorpusTools installed.

Run the following:

cd CorpusTools/korp_scripts/update_parallel
y|cp -r ~/freecorpus/stable/tmx/<ISO1>2<ISO2> .
cd <ISO1>2<ISO2>
find . -type f| xargs perl -i -p -e 's/xml:lang/lang/g;'
cd ../

Step 3 - Analyse files

Run the following for each genre.

Analyse the texts first for the minority language () and then for the majority language ():

python3 analyse_xxx_tmx.py <ISO2> <ISO1>2<ISO2>/<GENRE> <GENRE>
python3 analyse_xxx_tmx.py <ISO1> out_<ISO2>_<ISO1>2<ISO2>

This may take a while to run depending on the size of the original folder.

Step 4 - Convert the analysed files in the required korp format

Run the following for each genre:

time python3 extract_sent_pairs.py <ISO1> <GENRE> <ISO1>2<ISO2> <DATE> out_<ISO1>_out_<ISO2>_<ISO1>2<ISO2>
time python3 extract_sent_pairs.py <ISO2> <GENRE> <ISO1>2<ISO2> <DATE> out_<ISO1>_out_<ISO2>_<ISO1>2<ISO2>

Step 5 - Produce data for Korp (using cwb)

Change lang_code, plang_code, in_dir, metaFile, date in run_para_corpus_encoding.sh as needed. Then run the following:

sh run_para_corpus_encoding.sh