GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.

Maintenance of parallel corpus

Many documents are parallel texts whose language versions all sit in the same file. Other documents are simply placed in the wrong catalogues (directories). An algorithm to find and fix both cases:

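  1. For each file, count the total number of words.
  2. For each file, count the number of words marked with the language of the catalogue the file is in.
  3. Compute the ratio of the second count to the first.
  4. Pick the files with a low ratio and investigate them. Split the files that mix languages, and move misplaced files to the right catalogue.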
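
The counting and ratio steps above could be sketched roughly as follows. This is only an illustration, not the project's actual tooling: it assumes each file has already been reduced to a sequence of `(word, language)` pairs, and the names `language_ratio`, `files_to_investigate`, and the 0.8 threshold are all hypothetical choices made for the example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class LanguageRatio(NamedTuple):
    path: str
    total_words: int      # step 1: all words in the file
    matching_words: int   # step 2: words marked with the catalogue's language
    ratio: float          # step 3: matching_words / total_words


def language_ratio(
    path: str,
    expected_lang: str,
    tagged_words: Iterable[Tuple[str, str]],
) -> LanguageRatio:
    """Count words and compute the share marked with the expected language.

    `tagged_words` is an assumed input format: (word, language code) pairs.
    """
    total = 0
    matching = 0
    for _word, lang in tagged_words:
        total += 1
        if lang == expected_lang:
            matching += 1
    # An empty file gets ratio 0.0 so that it also shows up for inspection.
    ratio = matching / total if total else 0.0
    return LanguageRatio(path, total, matching, ratio)


def files_to_investigate(
    reports: Iterable[LanguageRatio],
    threshold: float = 0.8,  # hypothetical cut-off; tune per corpus
) -> List[LanguageRatio]:
    """Step 4: return files whose ratio is below the threshold, worst first."""
    flagged = [r for r in reports if r.ratio < threshold]
    return sorted(flagged, key=lambda r: r.ratio)


if __name__ == "__main__":
    reports = [
        language_ratio(
            "sme/mixed.txt", "sme",
            [("mun", "sme"), ("lean", "sme"), ("jeg", "nob"), ("er", "nob")],
        ),
        language_ratio("sme/clean.txt", "sme", [("mun", "sme"), ("lean", "sme")]),
    ]
    for report in files_to_investigate(reports):
        print(f"{report.path}: {report.ratio:.2f}")
```

The sketch only produces the list of suspicious files. Deciding whether a flagged file should be split or moved is still a manual step, as in the algorithm above.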