GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
Note: This page documents experiments done in 2022 (?).
The open-source program Tesseract can be fetched from GitHub:
git clone git@github.com:tesseract-ocr/tesseract.git
git clone git@github.com:tesseract-ocr/tessdata.git
...
Tesseract comes with a set of languages (see tessdata). Most GiellaLT languages are not included, though. TODO: Document how to add them.
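A PDF document stored as a picture should be split into single pages, and each page converted to an image, before it is given to Tesseract.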
In Preview (macOS), switch the document to Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line.
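One possible command-line route is sketched below. It assumes that `pdfseparate` from the Poppler utilities is installed; that tool is not part of Tesseract or of macOS, and it was not used in the experiment described here. The helper name `split_pdf` is made up for illustration. It writes one file per page, named 1.pdf, 2.pdf, …, into the current directory.

```shell
# Hypothetical helper: split a PDF into one file per page (1.pdf, 2.pdf, ...).
# Assumes pdfseparate (Poppler utilities) is on the PATH.
split_pdf() {
    input_pdf=$1
    if [ ! -f "$input_pdf" ]; then
        echo "split_pdf: no such file: $input_pdf" >&2
        return 1
    fi
    if ! command -v pdfseparate >/dev/null 2>&1; then
        echo "split_pdf: pdfseparate not found (install Poppler)" >&2
        return 127
    fi
    # %d in the output pattern is replaced by the page number.
    pdfseparate "$input_pdf" "%d.pdf"
}
```

Usage would be `split_pdf document.pdf`, where `document.pdf` stands for the scanned file.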
Let us say the document contained 8 pages, named 1.pdf, 2.pdf, … after the split. Then convert each page to PNG:
for i in 1 2 3 4 5 6 7 8 ; do sips -s format png $i.pdf --out $i.png ; done
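Let us say the document contains Norwegian and Finnish. Standing in tesseract-ocr/tesseract, run the 8 PNG files through tesseract. The second file argument is the output base name; tesseract appends .txt to it itself, so the results end up in 1.txt, 2.txt, …:
for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ $i.png $i -l fin+nor ; done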
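For documents with more pages, typing out the page list by hand gets tedious. The following is a minimal sketch of the same loop wrapped in a function that counts up to a given last page. The name `ocr_pages` and the `TESSERACT` override variable are assumptions introduced here, not part of Tesseract. The output base is passed without an extension, since Tesseract adds `.txt` on its own.

```shell
# Hypothetical helper: OCR pages 1.png .. N.png into 1.txt .. N.txt.
# TESSERACT may name a different binary; the default is "tesseract".
ocr_pages() {
    last_page=$1
    langs=$2            # language codes joined with "+", e.g. fin+nor
    tessdata_dir=$3     # e.g. ../tessdata/
    i=1
    while [ "$i" -le "$last_page" ]; do
        # Second file argument is the output base; Tesseract appends ".txt".
        "${TESSERACT:-tesseract}" --tessdata-dir "$tessdata_dir" \
            "$i.png" "$i" -l "$langs" || return 1
        i=$((i + 1))
    done
}
```

For the 8-page example this would be called as `ocr_pages 8 fin+nor ../tessdata/`.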
The resulting files may then be collected into one text file.
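A sketch of that collection step follows. A plain `cat *.txt` sorts file names as text, which puts 10.txt before 2.txt once a document has ten or more pages, so the loop below counts through the page numbers instead. The helper name `collect_pages` and the output file name are made up for illustration.

```shell
# Hypothetical helper: concatenate 1.txt .. N.txt in page order.
collect_pages() {
    last_page=$1
    out_file=$2
    : > "$out_file"     # create the output file, or empty it if it exists
    i=1
    while [ "$i" -le "$last_page" ]; do
        cat "$i.txt" >> "$out_file" || return 1
        i=$((i + 1))
    done
}
```

For the 8-page example: `collect_pages 8 all.txt`.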