GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
These steps are valid for korp, u_korp and f_korp, and need to be repeated for each language.
Note: this is the “old” way of updating parallel content. It is suggested to try out the korp_para command from CorpusTools instead of running the script analyse_xxx_tmx.py.
Run the following for both the majority language and the minority language:
cd lang-<ISO>
git pull   # or: svn up
Make sure that the build is configured with the following prefix:
./configure --prefix=/Users/<USERNAME>/.local
Then run:
make
make install
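Because the same steps are repeated for both languages, they can be wrapped in a small loop. The following is a minimal sketch, assuming the lang-<ISO> directories are git checkouts located in the current directory and that $HOME/.local matches the prefix shown above. The names update_lang, run and DRY_RUN, and the example codes nob and sme, are illustrative assumptions and not part of the GiellaLT tooling. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: print (or run) the build steps for one language.
# update_lang, run and DRY_RUN are hypothetical names used for illustration.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# The function body is a subshell, so the cd does not leak to the caller.
update_lang() (
    iso="$1"
    run cd "lang-$iso" &&
    run git pull &&
    run ./configure "--prefix=$HOME/.local" &&
    run make &&
    run make install
)

# Example language codes; replace them with the pair you are updating.
for iso in nob sme; do
    update_lang "$iso" || exit 1
done
```

Setting DRY_RUN=0 in the environment makes the script execute the steps instead of printing them; the loop stops at the first language whose build fails.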
Make sure you have CorpusTools installed.
Run the following:
cd CorpusTools/korp_scripts/update_parallel
yes | cp -r ~/freecorpus/stable/tmx/<ISO1>2<ISO2> .
cd <ISO1>2<ISO2>
find . -type f| xargs perl -i -p -e 's/xml:lang/lang/g;'
cd ../
Run the following for each genre.
Analyse the texts first for the minority language and then for the majority language:
python3 analyse_xxx_tmx.py <ISO2> <ISO1>2<ISO2>/<GENRE> <GENRE>
python3 analyse_xxx_tmx.py <ISO1> out_<ISO2>_<ISO1>2<ISO2>
This may take a while to run depending on the size of the original folder.
Run the following for each genre:
time python3 extract_sent_pairs.py <ISO1> <GENRE> <ISO1>2<ISO2> <DATE> out_<ISO1>_out_<ISO2>_<ISO1>2<ISO2>
time python3 extract_sent_pairs.py <ISO2> <GENRE> <ISO1>2<ISO2> <DATE> out_<ISO1>_out_<ISO2>_<ISO1>2<ISO2>
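The per-genre commands differ only in the placeholders, so a small helper can spell them out for a given language pair. The sketch below only prints the commands with the placeholders filled in and does not run them. The function name print_genre_cmds, the genre names admin and bible, and the date value 20240101 (including its format) are assumptions for illustration. The analysis commands for all genres are printed first, followed by the extraction commands, mirroring the order of the steps above.

```shell
#!/bin/sh
# Sketch: print the analysis and extraction commands for a language pair.
# print_genre_cmds and all example values are hypothetical.
print_genre_cmds() {
    iso1="$1"; iso2="$2"; date="$3"
    shift 3                       # remaining arguments are genre names
    pair="${iso1}2${iso2}"
    analysed="out_${iso1}_out_${iso2}_${pair}"

    # Analysis: first the <ISO2> side, then the <ISO1> side.
    for genre in "$@"; do
        echo "python3 analyse_xxx_tmx.py $iso2 $pair/$genre $genre"
        echo "python3 analyse_xxx_tmx.py $iso1 out_${iso2}_${pair}"
    done

    # Extraction of sentence pairs for both languages.
    for genre in "$@"; do
        echo "python3 extract_sent_pairs.py $iso1 $genre $pair $date $analysed"
        echo "python3 extract_sent_pairs.py $iso2 $genre $pair $date $analysed"
    done
}

# Example call with placeholder values.
print_genre_cmds nob sme 20240101 admin bible
```

Printing the commands first makes it easy to check the directory names before starting a run that may take a long time.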
Change lang_code, plang_code, in_dir, metaFile and date in run_para_corpus_encoding.sh as needed. Then run the following:
sh run_para_corpus_encoding.sh