OCR reading
Algorithm for dealing with OCR errors
Finding these errors
- Problem: Some document-specific conversion errors produce ordinary letters rather than garbage, so they can be found only by linguistic means.
- Solution: Identify the problematic files by running error detection with the fst.
TODO:
The error detection algorithm runs as follows:
For each file:
- Analyse the main language text morphologically
- Count the words that get no analysis (the missing ones)
- Register the missing/total ratio, and pick the worst files
Look at the worst files, and figure out how to mend them, or move them, e.g. to an OCR gold standard
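The steps above can be sketched roughly as follows. This is only an illustration, not the project's actual script: the `is_known` callable is a hypothetical stand-in for a lookup against the fst, and the names `FileScore`, `score_file`, `worst_files` and `format_score` are made up for this example. The output line follows the "ratio - total - not recognized - filename" layout used in the result lists.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class FileScore(NamedTuple):
    ratio: float    # missing / total
    total: int      # total number of words in the file
    missing: int    # words that got no analysis
    filename: str


def score_file(filename: str, words: Iterable[str],
               is_known: Callable[[str], bool]) -> FileScore:
    """Count the words and the unanalysed words of one file."""
    total = 0
    missing = 0
    for word in words:
        total += 1
        # is_known is an assumed wrapper around the morphological analyser.
        if not is_known(word):
            missing += 1
    ratio = missing / total if total else 0.0
    return FileScore(ratio, total, missing, filename)


def worst_files(corpus: Dict[str, List[str]],
                is_known: Callable[[str], bool],
                top: int = 10) -> List[FileScore]:
    """Score every file and return the `top` files with the highest ratio."""
    scores = [score_file(name, words, is_known)
              for name, words in corpus.items()]
    scores.sort(key=lambda s: s.ratio, reverse=True)
    return scores[:top]


def format_score(score: FileScore) -> str:
    """Render one result line: ratio - total - missing - filename."""
    return (f"{score.ratio:.3f} - {score.total} - "
            f"{score.missing} - {score.filename}")


if __name__ == "__main__":
    # Toy data: a set of known word forms stands in for the fst.
    lexicon = {"mun", "lean", "dat"}
    corpus = {
        "a.txt": ["mun", "lean", "xqz"],
        "b.txt": ["mun", "lean", "dat"],
    }
    for score in worst_files(corpus, lexicon.__contains__):
        print(format_score(score))
```

In a real setup the tokenised text would come from the converted corpus files, and `is_known` would wrap whatever analyser is used for the main language of the file.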
Results of finding errors for North Sami
Here is a list of errors per file in each folder in the admin directory. For each file we list error/total ratio - total number of words - words not recognized - filename, and we sort the file list according to error/total ratio:
- admin/depts/others
- admin/guovda
- admin/others
- admin/regjering
- admin/sd/others
- admin/sd/samediggi
- sma corpus errors
- sma corpus error analysis
- sme corpus errors analysis
- sme corpus errors admin/
- sme corpus errors guovda/
- sme corpus errors regjering/
- smj corpus errors
- smj corpus error analysis
A list of error analyses can be found from corpus error analysis.
Error typology (summarising the corpus error analysis):
- Conversion errors
- ==> Improve conversion
- Typing errors
- ==> Add to typos.txt, possibly move to typos gold corpus
- Linguistic spelling errors
- ==> Add to typos.txt, possibly move to typos gold corpus
- Scanning errors
- ==> Analyse the scanning errors and add search-replace to xsl file
- Language recognition errors
- ==> Check whether the xsl file lists the relevant languages
- ==> Improve the language recognition module
- Numbers not recognised
- ==> Improve fst
- Unknown words (bad fst)
- ==> Improve fst
- Corrupted original
- ==> Consider removing it
TODO:
- Improve conversion according to error type, as sketched above
Results of finding errors for South Sami
- [April 20 Analysis of sma corpus corpus_errors_sma.txt], and breakdown of the error types.
TODO:
- Improve on the sma test results above
Finding catalogue errors:
List all files in the langX catalogue (directory) that have more non-langX content than langX content.
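A minimal sketch of such a check, assuming that per-file word counts per language are already available (for instance from the language recognition step). The function name `misplaced_files` and the shape of the `counts` dictionary are assumptions made for this illustration only.

```python
from typing import Dict, List


def misplaced_files(lang: str,
                    counts: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the files with more non-`lang` words than `lang` words.

    `counts` maps filename -> {language code -> number of words}.
    """
    flagged = []
    for filename, per_lang in counts.items():
        own = per_lang.get(lang, 0)
        other = sum(n for code, n in per_lang.items() if code != lang)
        if other > own:
            flagged.append(filename)
    return sorted(flagged)


if __name__ == "__main__":
    # Toy counts; the filenames and numbers are invented.
    counts = {
        "x.xml": {"sme": 10, "nob": 50},
        "y.xml": {"sme": 100, "nob": 5},
    }
    print(misplaced_files("sme", counts))
```

Files flagged this way are candidates for moving to another language catalogue, or for a closer look at the language recognition result.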
TODO:
- Still not done.
Correcting OCR errors
Develop algorithms for automatic correction of OCR errors. This work must be done separately for each language.