GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
We look at alternatives to our tca2 aligners.
The sentence aligner used to align the Europarl parallel corpus is a Perl script based on the Gale and Church algorithm. Here is the README file.
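To give an idea of what length-based alignment does, here is a minimal Python sketch in the spirit of Gale and Church. It is not the code of the Europarl script: the function names, the prior probabilities in PRIORS, and the constants C and S2 are illustrative assumptions, and the variance is simplified to use the mean of the two lengths.

```python
import math

# Assumed prior probability for each alignment type
# (number of source sentences, number of target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0   # assumed expected target characters per source character
S2 = 6.8  # assumed variance of that ratio per character


def match_cost(len_src, len_tgt, prior):
    """Negative log probability that two text spans are translations."""
    mean = (len_src + len_tgt / C) / 2
    if mean == 0:
        return -math.log(prior)
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-tailed probability of a deviation at least this large.
    prob = max(math.erfc(abs(delta) / math.sqrt(2)), 1e-300)
    return -math.log(prior * prob)


def align(src, tgt):
    """Return a list of (source sentences, target sentences) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming over all sentence positions.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] + match_cost(len_src, len_tgt, prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end of both texts.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Calling align() with two lists of sentences returns the sentence groups that the length model considers most likely to correspond. Only character lengths are compared, so the sketch knows nothing about the words themselves.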
The script may be downloaded from the Europarl site. At Giellatekno, it is located in $GTHOME/tools/alignment-tools/europarl/.
To run it, first add abbreviation files for se (North Sámi) and no (Norwegian) to the nonbreaking_prefixes directory (this has already been done). Then place the files to be aligned in the directories se and no. The filenames in se and no must be identical. The command is then:
./sentence-align-corpus.perl se no
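Since the filenames in the two directories must be identical, it can help to check this before running the aligner. The following is a small Python sketch for that check. The helper name unmatched_files is hypothetical and is not part of the Europarl tools.

```python
from pathlib import Path


def unmatched_files(dir_a, dir_b):
    """Return filenames found only in dir_a and only in dir_b."""
    names_a = {p.name for p in Path(dir_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(dir_b).iterdir() if p.is_file()}
    # Sorted lists make the report stable and easy to read.
    return sorted(names_a - names_b), sorted(names_b - names_a)
```

Calling unmatched_files("se", "no") from the directory that holds both folders returns two lists. If both are empty, every file in se has a counterpart with the same name in no.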
This might not work. For some hints, see:
TODO:
Feel free to add.