GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
TODO:
The Error Detection Algorithm runs as follows:
For each file:
Look at the worst files and work out how to mend them or where to move them, e.g. to an OCR gold standard.
Here is a list of errors per file in each folder in the admin directory. For each file we list four fields: error/total ratio - total number of words - words not recognized - filename. The file list is sorted by error/total ratio:
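The per-file listing described above could be produced along the following lines. This is only a minimal sketch, not the actual GiellaLT tooling: the names `error_stats`, `rank_files` and `format_line` are hypothetical, the `is_recognized` callback stands in for whatever analyser decides whether a word is known, and sorting with the worst files first is an assumption (the text only says the list is sorted by ratio).

```python
from typing import Callable, Dict, List, NamedTuple


class FileErrorStats(NamedTuple):
    ratio: float        # words not recognized / total number of words
    total: int          # total number of words in the file
    unrecognized: int   # words not recognized
    filename: str


def error_stats(filename: str, words: List[str],
                is_recognized: Callable[[str], bool]) -> FileErrorStats:
    """Count unrecognized words in one file and compute the error/total ratio."""
    total = len(words)
    unrecognized = sum(1 for w in words if not is_recognized(w))
    # An empty file gets ratio 0.0 to avoid division by zero (an assumption).
    ratio = unrecognized / total if total else 0.0
    return FileErrorStats(ratio, total, unrecognized, filename)


def rank_files(files: Dict[str, List[str]],
               is_recognized: Callable[[str], bool]) -> List[FileErrorStats]:
    """Return per-file statistics, highest error ratio first (assumed order)."""
    stats = [error_stats(name, words, is_recognized)
             for name, words in files.items()]
    return sorted(stats, key=lambda s: s.ratio, reverse=True)


def format_line(s: FileErrorStats) -> str:
    """Render one line as: ratio - total - unrecognized - filename."""
    return f"{s.ratio:.2f} - {s.total} - {s.unrecognized} - {s.filename}"
```

A call such as `rank_files(files, lexicon.__contains__)`, where `files` maps filenames to tokenised words and `lexicon` is a set of known word forms, would give the list to print with `format_line`. Both inputs are placeholders for the real corpus and analyser.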
A list of error analyses can be found in the corpus error analysis.
Error typology (summarising the corpus error analysis):
TODO:
April 20 analysis of the sma corpus (corpus_errors_sma.txt), and a breakdown of the error types.
TODO:
List all files in the langX catalogue that contain more non-langX content than langX content.
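Once per-file word counts are available, the selection step of this task is a simple filter. The sketch below assumes that some upstream step (not shown, and not specified here) has already classified each word as langX or non-langX; the function name `mostly_foreign_files` and the shape of the `counts` mapping are hypothetical.

```python
from typing import Dict, List, Tuple


def mostly_foreign_files(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """List files with more non-langX words than langX words.

    counts maps filename -> (langX_words, non_langX_words); how those
    numbers are obtained is left open (an assumption of this sketch).
    """
    return sorted(
        name
        for name, (in_lang, other) in counts.items()
        if other > in_lang  # strictly more non-langX than langX content
    )
```

Files with equal amounts of both are not listed, following the "more ... than" wording of the task.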
TODO:
Develop algorithms for automatic correction of OCR errors. This work must be done separately for each language.