Maintenance of the parallel corpus
Many documents are parallel, but with both language versions in the same file. Other documents are simply placed in the wrong catalogues. An algorithm for finding both kinds of file:
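- For each file, count the total number of words
- For each file, count the number of words marked with the language of the catalogue the file is placed in
- Compute the ratio of the second count to the first
- Pick the files with a low ratio, and investigate them. Split the files that contain parallel content, and move misplaced files to the correct catalogue.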
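
The counting and ratio steps could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes each file has already been read into a list of (word, language) pairs, and the names `language_ratio` and `suspicious_files`, the input layout, and the 0.8 threshold are all hypothetical, not part of any existing tool.

```python
def language_ratio(tagged_words, catalogue_lang):
    """Return the share of words tagged with catalogue_lang.

    tagged_words is a list of (word, lang) pairs (assumed input format).
    An empty file gets ratio 0.0 so that it is flagged for inspection.
    """
    total = len(tagged_words)
    if total == 0:
        return 0.0
    matching = sum(1 for _, lang in tagged_words if lang == catalogue_lang)
    return matching / total


def suspicious_files(files, threshold=0.8):
    """Return (path, ratio) pairs whose ratio is below threshold, worst first.

    files maps a path to (catalogue_lang, tagged_words).
    The threshold value is an assumption and would need tuning on real data.
    """
    flagged = []
    for path, (catalogue_lang, tagged_words) in files.items():
        ratio = language_ratio(tagged_words, catalogue_lang)
        if ratio < threshold:
            flagged.append((path, ratio))
    return sorted(flagged, key=lambda item: item[1])
```

A ratio near 0 suggests a file placed in the wrong catalogue, while a ratio around one half suggests a file holding both language versions. The flagged files still have to be investigated by hand before splitting or moving them.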