GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

Sentence alignment

We look at alternatives to our tca2 aligners.

The Europarl aligner

The sentence aligner used to align the Europarl parallel corpus is a Perl script based on the Gale and Church algorithm. Here is the README file.

The script can be downloaded from the Europarl site. At Giellatekno, it is located in $GTHOME/tools/alignment-tools/europarl/.

To run it, first add se and no abbreviation files to the nonbreaking_prefixes directory (this has already been done). Then place the files to be aligned in the directories se and no. The filenames in se and no must be identical. The command is then:

./sentence-align-corpus.perl se no
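Since the filenames in se and no must be identical, it can help to check this before running the aligner. The following is a minimal sketch in Python, not part of the Europarl tools; the function name filename_mismatches is hypothetical, and the sketch assumes that se and no are plain directories containing one file per document.

```python
from pathlib import Path


def filename_mismatches(dir_a, dir_b):
    """Return (only_in_a, only_in_b): sorted file names found in one
    directory but not in the other. Hypothetical helper, not part of
    the Europarl aligner."""
    names_a = {p.name for p in Path(dir_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(dir_b).iterdir() if p.is_file()}
    return sorted(names_a - names_b), sorted(names_b - names_a)


# Example use, run from the directory that contains se and no:
#
#   only_se, only_no = filename_mismatches("se", "no")
#   for name in only_se:
#       print("only in se:", name)
#   for name in only_no:
#       print("only in no:", name)
```

If both returned lists are empty, every file in se has a counterpart with the same name in no.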

The sentence-align-corpus.perl command might not work. For some hints, see:

TODO:

Other aligners?

Feel free to add.