GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.


OCR reading

Algorithm for dealing with OCR errors

Finding these errors

TODO:

The error detection algorithm runs as follows:

For each file:

  1. Analyse the main-language text morphologically
  2. Count the words that receive no analysis
  3. Record the ratio of unanalysed words to total words, and pick the files with the highest ratio
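The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the tokenizer (`\w+`), the function names `score_text` and `rank_files`, and the `is_known` callback (standing in for a real morphological analyser) are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical tokenizer: runs of word characters. A real pipeline would
# use the language's own tokenizer and analyser instead.
TOKEN_RE = re.compile(r"\w+")


def score_text(text: str, is_known: Callable[[str], bool]) -> Tuple[int, int]:
    """Return (total words, words the analyser does not recognise)."""
    words = TOKEN_RE.findall(text)
    missing = sum(1 for word in words if not is_known(word))
    return len(words), missing


def rank_files(
    texts: Dict[str, str],
    is_known: Callable[[str], bool],
    worst: int = 10,
) -> List[Tuple[float, int, int, str]]:
    """Return up to `worst` rows of (ratio, total, missing, filename),
    highest missing/total ratio first."""
    rows = []
    for name, text in texts.items():
        total, missing = score_text(text, is_known)
        ratio = missing / total if total else 0.0  # empty files score 0
        rows.append((ratio, total, missing, name))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:worst]


# Toy stand-in for an analyser: a three-word lexicon.
toy_lexicon = {"mun", "lean", "dáppe"}
toy_texts = {
    "clean.txt": "mun lean dáppe",
    "noisy.txt": "mun 1ean dappe",  # simulated OCR damage
}
ranking = rank_files(toy_texts, lambda w: w.lower() in toy_lexicon)
```

In practice `is_known` would wrap a call to the language's morphological analyser; the toy lexicon is only there to make the sketch runnable.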

Inspect the worst files and work out how to repair them, or move them elsewhere, e.g. to an OCR gold standard.

Results of finding errors for North Sami

Errors are listed per file for each folder in the admin directory. Each entry gives the error/total ratio, the total number of words, the number of words not recognised, and the filename. The file list is sorted by error/total ratio.
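As an illustration of the report layout described above, the following sketch formats one line per file and sorts by error/total ratio. The function name `format_report`, the three-decimal ratio, the ` - ` separator, and the worst-first sort order are assumptions for the example; the source does not specify them.

```python
from typing import List, Tuple


def format_report(counts: List[Tuple[str, int, int]]) -> List[str]:
    """Format (filename, total words, unrecognised words) entries as
    'ratio - total - unrecognised - filename', worst ratio first."""

    def ratio(entry: Tuple[str, int, int]) -> float:
        _, total, missing = entry
        return missing / total if total else 0.0

    ordered = sorted(counts, key=ratio, reverse=True)  # assumed: worst first
    return [
        f"{ratio(entry):.3f} - {entry[1]} - {entry[2]} - {entry[0]}"
        for entry in ordered
    ]


# Made-up counts, purely to show the layout.
example_lines = format_report([("a.txt", 100, 5), ("b.txt", 50, 25)])
```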

A list of error analyses can be found in the corpus error analysis.

Error typology (summarising the corpus error analysis):

TODO:

Results of finding errors for South Sami

TODO:

Finding catalogue errors:

List all files in the langX catalogue that contain more non-langX content than langX content.
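One possible way to approximate this check is sketched below. It treats every word the langX analyser rejects as non-langX content, which is a crude assumption: OCR-damaged langX words are rejected too, so flagged files still need manual inspection. The names `misfiled` and `is_langx` are hypothetical, and the toy lexicon stands in for a real analyser.

```python
import re
from typing import Callable, Dict, List

# Hypothetical tokenizer: runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def misfiled(texts: Dict[str, str], is_langx: Callable[[str], bool]) -> List[str]:
    """Return the names of files with more non-langX words than langX words."""
    flagged = []
    for name, text in texts.items():
        words = TOKEN_RE.findall(text)
        in_lang = sum(1 for word in words if is_langx(word))
        if len(words) - in_lang > in_lang:
            flagged.append(name)
    return sorted(flagged)


# Toy stand-in for a langX analyser.
toy_langx = {"mun", "lean", "dáppe"}
suspects = misfiled(
    {
        "langX/a.txt": "mun lean dáppe",
        "langX/b.txt": "jeg er her mun",  # mostly another language
    },
    lambda w: w.lower() in toy_langx,
)
```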

TODO:

Correcting OCR errors

Develop algorithms for automatic correction of OCR errors. This work must be done separately for each language.