GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
These steps are valid for korp, u_korp and f_korp and must be repeated for each language.
Note: this is the “old” way of updating parallel content. Consider trying the korp_para command from CorpusTools instead of running the script analyse_xxx_tmx.py.
Run the following for both the majority language and the minority language:
cd lang-<ISO>
git pull    # or: svn up
Make sure the build is configured with this install prefix:
./configure --prefix=/Users/<USERNAME>/.local
Then run:
make
make install
Make sure you have CorpusTools installed.
Run the following:
cd CorpusTools/korp_scripts/update_parallel
yes | cp -r ~/freecorpus/stable/tmx/<ISO1>2<ISO2> .
cd <ISO1>2<ISO2>
find . -type f| xargs perl -i -p -e 's/xml:lang/lang/g;'
cd ../
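The perl one-liner above rewrites every occurrence of xml:lang to lang in all files of the copied directory. For readers who prefer to do this from Python, the following is a minimal sketch of the same replacement. The function name rename_lang_attribute is hypothetical and not part of CorpusTools; the replacement works on raw bytes, so file encodings and line endings are left untouched.

```python
from pathlib import Path


def rename_lang_attribute(root: Path) -> int:
    """Replace every 'xml:lang' with 'lang' in all regular files under root.

    Hypothetical helper, not part of CorpusTools. Returns the number of
    files whose content actually changed.
    """
    changed = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        updated = data.replace(b"xml:lang", b"lang")
        if updated != data:
            # Only rewrite files that contained the attribute.
            path.write_bytes(updated)
            changed += 1
    return changed
```

Calling rename_lang_attribute(Path("<ISO1>2<ISO2>")) from the update_parallel directory should have the same effect as the find and perl commands, with the placeholder replaced by the actual directory name.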
Run the following for each genre.
Analyse the texts first for the minority language, then for the majority language:
python3 analyse_xxx_tmx.py <ISO2> <ISO1>2<ISO2>/<GENRE> <GENRE>
python3 analyse_xxx_tmx.py <ISO1> out_<ISO2>_<ISO1>2<ISO2>
This may take a while to run depending on the size of the original folder.
Run the following for each genre:
time python3 extract_sent_pairs.py <ISO1> <GENRE> <ISO1>2<ISO2> <DATE> out_<ISO1>_out_<ISO2>_<ISO1>2<ISO2>
time python3 extract_sent_pairs.py <ISO2> <GENRE> <ISO1>2<ISO2> <DATE> out_<ISO1>_out_<ISO2>_<ISO1>2<ISO2>
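Because the analysis and extraction commands have to be repeated for every genre, it can help to generate them rather than type them by hand. The sketch below only builds the argument lists shown above and does not run anything. The function name genre_commands is hypothetical, and the language codes, genre name and date value in the usage note are placeholders, since the expected date format is not specified here.

```python
def genre_commands(iso1: str, iso2: str, genre: str, date: str) -> list[list[str]]:
    """Build the four per-genre commands as argument lists.

    Hypothetical helper: iso1 is the majority language, iso2 the minority
    language. Directory names follow the pattern used in the commands above.
    """
    pair = f"{iso1}2{iso2}"                   # e.g. <ISO1>2<ISO2>
    out_minority = f"out_{iso2}_{pair}"       # result of the first analysis
    out_both = f"out_{iso1}_{out_minority}"   # result of the second analysis
    return [
        ["python3", "analyse_xxx_tmx.py", iso2, f"{pair}/{genre}", genre],
        ["python3", "analyse_xxx_tmx.py", iso1, out_minority],
        ["python3", "extract_sent_pairs.py", iso1, genre, pair, date, out_both],
        ["python3", "extract_sent_pairs.py", iso2, genre, pair, date, out_both],
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) from inside the update_parallel directory, looping over the genres that exist in the copied tmx folder.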
Change lang_code, plang_code, in_dir, metaFile, date in run_para_corpus_encoding.sh as needed. Then run the following:
sh run_para_corpus_encoding.sh