GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
TODO:
The Error Detection Algorithm runs as follows:
For each file:
Look at the worst files and figure out how to mend them or move them, e.g. to an OCR gold standard.
Here is a list of errors per file in each folder in the admin directory. For each file we list four fields in this order: error/total ratio - total number of words - number of words not recognized - filename. The file list is sorted by error/total ratio:
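The per-file listing described above can be sketched as follows. This is a minimal illustration, not the actual tooling: the names `FileErrorRow`, `error_rows`, `format_row` and the `is_known` callback are hypothetical, and the callback stands in for whatever analyser decides whether a word is recognized. Sorting with the highest ratio first is also an assumption, chosen so that the worst files come at the top.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class FileErrorRow(NamedTuple):
    ratio: float        # unrecognized / total (0.0 for an empty file)
    total: int          # total number of words in the file
    unrecognized: int   # words the analyser did not recognize
    filename: str


def error_rows(
    files: Dict[str, Iterable[str]],
    is_known: Callable[[str], bool],
) -> List[FileErrorRow]:
    """Build one row per file, sorted with the worst ratio first."""
    rows = []
    for filename, words in files.items():
        words = list(words)
        total = len(words)
        unrecognized = sum(1 for w in words if not is_known(w))
        ratio = unrecognized / total if total else 0.0
        rows.append(FileErrorRow(ratio, total, unrecognized, filename))
    # Assumption: descending ratio, ties broken by filename.
    rows.sort(key=lambda r: (-r.ratio, r.filename))
    return rows


def format_row(row: FileErrorRow) -> str:
    """Render a row as: ratio - total - unrecognized - filename."""
    return f"{row.ratio:.3f} - {row.total} - {row.unrecognized} - {row.filename}"
```

In practice `is_known` would wrap the language's morphological analyser or speller; here any function from a word to a boolean will do, for example membership in a word list.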
A list of error analyses can be found in the corpus error analysis.
Error typology (summarising the corpus error analysis):
TODO:
[April 20 Analysis of sma corpus | corpus_errors_sma.txt], and breakdown of the error types.
TODO:
List all files in langX-catalogue with more non-langX content than langX-content.
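One possible way to approach this item is sketched below. It assumes that each file has already been split into paragraphs tagged with a language code by some separate language identification step; that step, the data layout, and the function name `mostly_foreign_files` are all hypothetical. Content is measured in whitespace-separated words, which is also an assumption.

```python
from typing import Dict, List, Tuple


def mostly_foreign_files(
    catalogue: Dict[str, List[Tuple[str, str]]],
    target_lang: str,
) -> List[str]:
    """Return files with more non-target than target-language words.

    `catalogue` maps a filename to (language code, paragraph text)
    pairs; the language tags are assumed to come from elsewhere.
    """
    flagged = []
    for filename, paragraphs in catalogue.items():
        target_words = 0
        other_words = 0
        for lang, text in paragraphs:
            count = len(text.split())
            if lang == target_lang:
                target_words += count
            else:
                other_words += count
        # Strictly more foreign content than target content.
        if other_words > target_words:
            flagged.append(filename)
    return sorted(flagged)
```

Files where the two counts are equal are not flagged, since the item asks for files with more non-langX content than langX content.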
TODO:
Develop algorithms for automatic correction of OCR errors. This work must be done separately for each language.