This document discusses ways of improving the sentence alignment results provided by TCA2.
The 2011 release of TCA2 is installed in two variants: a GUI version and a command-line version.
All involved parties need to get a feel for the alignment output in order to see what is going on.
We might split the anchor list into one general part and one thematic part, e.g. along the division in the corpus catalogue structure.
The existing anchor list should be both trimmed and extended.
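Combining a general part with a thematic part could be sketched as below. This is a minimal sketch under assumed conventions: the anchor entries are read in as plain lines, and the function name and calling convention are hypothetical, not part of TCA2.

```python
def build_anchor_list(general, thematic):
    """Combine the general anchor entries with those of one thematic
    part (e.g. for the 'admin' category of the corpus catalogue),
    dropping duplicates while keeping the original order."""
    seen, combined = set(), []
    for entry in list(general) + list(thematic):
        if entry and entry not in seen:
            seen.add(entry)
            combined.append(entry)
    return combined
```

In practice the two parts would be read from separate files, one per thematic category, so that trimming the general part and extending the thematic parts can be done independently.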
Is there an optimal length for the anchor list?
If so, this length should be passed to TCA2 as a parameter. It is measured in characters, not in words.
Ciprian used the pre-set parameter for the last run.
Sentences:
ccat -l sme -r converted/sme/admin/ | \
preprocess --abbr=~/gtsvn/gt/sme/bin/abbr.txt | \
wc -l   # count the number of sentences (= units given to TCA2), assuming one unit per output line
Characters:
ccat -l sme -r converted/sme/admin/ | wc -c
TODO:
The weights of TCA2 are preset (“for no scientific reason”) to the following values:
Investigate whether these values are sensible.
Quoting the documentation:
“If the first n (n is read as a parameter to the program) characters are equal for a word in an English and a Norwegian sentence, the two words are assumed to be cognates. For English/Norwegian a value of n=6 or 7 gives good results. Dice’s similarity coefficient is the number of matching bigrams in the two words divided by the mean of the number of bigrams for the two words (2a/(b+c), where a is the number of matching bigrams, and b and c are the numbers of bigrams in the two words). For English and Norwegian, a value of more than 0.7 or 0.8 gives reasonable results. For other languages, the acceptable value for the coefficient can be less.”
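The two tests in the quotation can be sketched as follows. TCA2 itself is implemented in Java; this is a minimal Python sketch, and combining the prefix test and the Dice test with a simple “or” is an assumption, not a description of TCA2's internals.

```python
from collections import Counter

def bigrams(word):
    """All character bigrams of a word, e.g. 'word' -> ['wo', 'or', 'rd']."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def dice(word_a, word_b):
    """Dice's similarity coefficient 2a/(b+c): a = number of matching
    bigrams, b and c = bigram counts of the two words."""
    bg_a, bg_b = bigrams(word_a), bigrams(word_b)
    if not bg_a or not bg_b:
        return 0.0
    # Match bigrams with multiplicity via multiset intersection.
    matches = sum((Counter(bg_a) & Counter(bg_b)).values())
    return 2 * matches / (len(bg_a) + len(bg_b))

def cognates(word_a, word_b, n=6, threshold=0.7):
    """Assume cognates if the first n characters are equal, or
    (assumed combination) if the Dice coefficient reaches the threshold."""
    if len(word_a) >= n and len(word_b) >= n and word_a[:n] == word_b[:n]:
        return True
    return dice(word_a, word_b) >= threshold
```

For example, dice("minister", "ministeren") is 2·7/(7+9) = 0.875, above the 0.7 threshold quoted for English/Norwegian.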
Now, the question is how to find the coefficient. It is probably far smaller than for eng-nob.
The two languages entering the preprocessing procedure might be preprocessed according to different principles. The difference might be subtle: one common abbreviation or initial letter classified differently in language A and language B might be enough to skew the result.
Investigate this.