OCR reading

Algorithm for dealing with OCR errors

Finding these errors

Problem: Some conversion errors are document-specific and produce plausible letters rather than obvious garbage. Such errors can be found only by linguistic means.
Solution: Identify the problematic files by running error detection with the fst.

TODO:

The error detection algorithm runs as follows:

For each file:

Analyse the main language text morphologically
Count the words that receive no analysis (the missing ones)
Register the missing/total ratio, and pick the files with the highest ratios (the worst files)

Look at the worst files, and figure out how to mend them, or move them, e.g. to an OCR gold standard
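The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual corpus tooling: `is_recognised` is a hypothetical stand-in for a lookup against the fst, the corpus is assumed to be already tokenised into words per file, and the function names are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# (ratio, total words, missing words, filename)
Row = Tuple[float, int, int, str]


def missing_ratio(
    words: Iterable[str], is_recognised: Callable[[str], bool]
) -> Tuple[float, int, int]:
    """Return (missing/total ratio, total, missing) for one file's words."""
    total = 0
    missing = 0
    for word in words:
        total += 1
        # Assumption: is_recognised() wraps the morphological analyser and
        # returns False when the word gets no analysis.
        if not is_recognised(word):
            missing += 1
    ratio = missing / total if total else 0.0
    return ratio, total, missing


def worst_files(
    corpus: Dict[str, List[str]],
    is_recognised: Callable[[str], bool],
    n: int = 10,
) -> List[Row]:
    """Rank files by missing/total ratio, highest first, and keep the top n."""
    rows = [
        (*missing_ratio(words, is_recognised), name)
        for name, words in corpus.items()
    ]
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:n]
```

For a quick trial, a plain Python set of known word forms can play the role of the analyser, e.g. `worst_files(corpus, known_forms.__contains__)`. In real use the callable would presumably query the fst for the main language of the file.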

Results of finding errors for North Sami

Here is a list of errors per file in each folder in the admin directory. For each file we list error/total ratio - total number of words - words not recognized - filename, and we sort the file list according to error/total ratio:
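As an illustration of the listing format described above, the following sketch sorts per-file records by error/total ratio and renders one line per file. The function name and the three-decimal rounding are assumptions made for the example, not part of the original tooling.

```python
from typing import Iterable, List, Tuple

# (error/total ratio, total words, words not recognized, filename)
Record = Tuple[float, int, int, str]


def format_report(records: Iterable[Record]) -> List[str]:
    """Sort records by ratio (highest first) and format them as
    'ratio - total - not recognized - filename'."""
    ordered = sorted(records, key=lambda rec: rec[0], reverse=True)
    return [
        f"{ratio:.3f} - {total} - {missing} - {name}"
        for ratio, total, missing, name in ordered
    ]
```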

A list of error analyses can be found in the corpus error analysis.

Error typology (summarising the corpus error analysis):

Conversion errors
- ==> Improve conversion
Typing errors
- ==> Add to typos.txt, possibly move to typos gold corpus
Linguistic spelling errors
- ==> Add to typos.txt, possibly move to typos gold corpus
Scanning errors
- ==> Analyse the scanning errors and add search-replace rules to the xsl file
Language recognition errors
- ==> Check whether the xsl file lists the relevant languages
- ==> Improve the language recognition module
Numbers not recognised
- ==> Improve fst
Unknown words (bad fst)
- ==> Improve fst
Corrupted original
- ==> Consider removing it

TODO:

Improve conversion according to error type, as sketched above

Results of finding errors for South Sami

See the April 20 analysis of the sma corpus (corpus_errors_sma.txt), and the breakdown of the error types.

TODO:

Improve the sma test results above

Finding catalogue errors

List all files in the langX catalogue that contain more non-langX content than langX content.
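Since this check is still not implemented, here is one possible sketch of it. It assumes that each file already has a word count per language (for example from the language recognition step); the data layout and the function name `misfiled` are hypothetical.

```python
from typing import Dict, List


def misfiled(files: Dict[str, Dict[str, int]], lang: str) -> List[str]:
    """Return the files in the `lang` catalogue whose non-`lang` content
    outweighs their `lang` content.

    `files` maps a filename to a {language code: word count} dictionary.
    """
    result = []
    for name, counts in files.items():
        own = counts.get(lang, 0)
        other = sum(n for code, n in counts.items() if code != lang)
        # Strictly "more non-langX than langX": ties are not reported.
        if other > own:
            result.append(name)
    return sorted(result)
```

With `lang="sme"`, a file counted as `{"sme": 10, "nob": 40}` would be listed, while `{"sme": 80, "nob": 20}` would not.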

TODO:

Still not done.

Correcting OCR errors

Develop algorithms for automatic correction of OCR errors. This work must be done separately for each language.
