GiellaLT provides rule-based language technology aimed at minority and indigenous languages.
This page will at some point document our OCR work.
We experimented with OCR in 2016 and earlier (cf. the meeting memo from 2016 below). Given recent advances in OCR techniques, we need to restart this work with new programs. This page sketches how.
The open source program Tesseract can be fetched from GitHub:
git clone git@github.com:tesseract-ocr/tesseract.git
git clone git@github.com:tesseract-ocr/tessdata.git
...
Tesseract comes with a set of languages (see tessdata). Most GiellaLT languages are not included, though. TODO: Document how to add them.
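Before adding a language, it can help to check whether it is already present in tessdata. The following is a minimal sketch, not a documented GiellaLT procedure: the helper name has_lang is hypothetical, and it assumes the tesseract binary is on PATH and that its --list-langs option prints one language code per line.

```shell
# Succeeds (exit status 0) if language code $2 is listed in tessdata directory $1.
# has_lang is a hypothetical helper name, not part of Tesseract.
has_lang() {
    # 2>&1: some Tesseract versions print the language list to stderr.
    # grep -x matches whole lines only, -F treats the code as a literal string.
    tesseract --tessdata-dir "$1" --list-langs 2>&1 | grep -qxF -- "$2"
}
```

Usage, standing in tesseract-ocr/tesseract: has_lang ../tessdata/ fin && echo "fin is available"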
A PDF document stored as a picture should be split into single pages and converted to PNG before it is run through Tesseract.
In Preview, switch the document to Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line.
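One possible answer to the TODO above is the pdfseparate tool from Poppler. This is a sketch under assumptions: Poppler is not part of the setup described on this page and would have to be installed separately, and the helper name split_pdf is hypothetical. pdfseparate replaces %d in the output pattern with the page number, which gives the 1.pdf, 2.pdf, … naming used in the next step.

```shell
# Split the PDF given as $1 into one-page files 1.pdf, 2.pdf, ... in the
# current directory.
# Assumption: pdfseparate (from Poppler) is installed and on PATH.
# split_pdf is a hypothetical helper name.
split_pdf() {
    pdfseparate "$1" "%d.pdf"
}
```

Usage: split_pdf document.pdf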
Let us say the document contained 8 pages, named 1.pdf, 2.pdf, … after the split. Then convert each page to PNG:
for i in 1 2 3 4 5 6 7 8 ; do sips -s format png $i.pdf --out $i.png ; done
Let us say the document contains Norwegian and Finnish. Standing in tesseract-ocr/tesseract, run the 8 PNG files through tesseract. Tesseract appends .txt to the output name itself, so the output base is given as $i, which produces 1.txt, 2.txt, …:
for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ $i.png $i -l fin+nor ; done
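When the page count varies from document to document, the loop can run over whatever PNG files are present instead of a fixed list of numbers. A minimal sketch, assuming the PNG files lie in the current directory; the helper name ocr_pages is hypothetical.

```shell
# Run Tesseract on every PNG file in the current directory.
# $1: tessdata directory, $2: language specification such as fin+nor.
# ocr_pages is a hypothetical helper name.
ocr_pages() {
    for f in *.png; do
        # With no PNG files the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        # The output base is the file name without .png;
        # Tesseract adds the .txt extension itself.
        tesseract --tessdata-dir "$1" "$f" "${f%.png}" -l "$2"
    done
}
```

Usage: ocr_pages ../tessdata/ fin+nor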
The resulting files may then be collected into one text file.
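Collecting the pages can be sketched as follows, assuming the text files are named 1.txt, 2.txt, … as the page files above. The helper name collect_pages and the output name all.txt are hypothetical.

```shell
# Concatenate 1.txt, 2.txt, ... N.txt in page order into one file.
# $1: number of pages, $2: name of the collected output file.
# collect_pages is a hypothetical helper name.
collect_pages() {
    : > "$2"                      # create or empty the output file
    i=1
    while [ "$i" -le "$1" ]; do
        cat "$i.txt" >> "$2"
        i=$((i + 1))
    done
}
```

Usage: collect_pages 8 all.txt. Counting explicitly rather than using cat *.txt keeps the pages in numeric order, since a shell glob sorts 10.txt before 2.txt.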