Experimenting with Tesseract

Note! This documents experiments done around 2022 (the exact date is uncertain).

Fetching the program

The open-source program Tesseract can be fetched from GitHub:

git clone git@github.com:tesseract-ocr/tesseract.git
git clone git@github.com:tesseract-ocr/tessdata.git
...

Tesseract comes with a set of languages (see tessdata). Most GiellaLT languages are not included, though. TODO: Document how to add them.
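
Before running OCR it can help to check which of the wanted languages actually have a model file. The sketch below assumes that each language in tessdata is stored as a file named after its code with the suffix .traineddata (for example fin.traineddata). check_langs is a hypothetical helper name, not something shipped with Tesseract.

```shell
# check_langs is a hypothetical helper, not part of Tesseract.
# $1 is the tessdata directory, the remaining arguments are language codes.
check_langs() {
  dir=$1 ; shift
  for lang in "$@" ; do
    # Assumption: one model file per language, named <code>.traineddata.
    if [ -f "$dir/$lang.traineddata" ] ; then
      echo "$lang: found"
    else
      echo "$lang: missing"
    fi
  done
}
```

Standing in tesseract-ocr/tesseract, it could be called as: check_langs ../tessdata fin nor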

Converting a PDF document to pictures

In Preview (macOS), show the document in Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line.
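
One possible command-line route for the TODO above, not tried as part of these experiments, is the pdfseparate tool from Poppler, which writes each page of a PDF to its own file. The sketch assumes Poppler is installed (it is separate from Tesseract). split_pdf and document.pdf are hypothetical names.

```shell
# split_pdf is a hypothetical helper. $1 is the input PDF.
# It writes 1.pdf, 2.pdf, ... to the current directory.
split_pdf() {
  # Assumption: pdfseparate (from Poppler) is on the PATH.
  if ! command -v pdfseparate >/dev/null 2>&1 ; then
    echo "pdfseparate not found (assumed to come from Poppler)" >&2
    return 1
  fi
  if [ ! -f "$1" ] ; then
    echo "no such file: $1" >&2
    return 1
  fi
  # %d in the output pattern is replaced by the page number.
  pdfseparate "$1" %d.pdf
}
```

It could be called as: split_pdf document.pdf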

Let us say the document contained 8 pages, named 1.pdf, 2.pdf, … after the split. Then convert each page to PNG:

for i in 1 2 3 4 5 6 7 8 ; do sips -s format png $i.pdf --out $i.png ; done

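Let us say the document contains Norwegian and Finnish. Standing in tesseract-ocr/tesseract, run the 8 png files through tesseract. The second argument is the output base name, to which tesseract appends .txt itself, so the results end up in 1.txt, 2.txt, …:

for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ $i.png $i -l fin+nor ; done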

The resulting files may then be collected into one text file.
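
A minimal sketch of that last step, assuming the OCR output files are named 1.txt, 2.txt, … in the current directory. collect_pages and all.txt are hypothetical names. Looping over the page numbers, instead of using a *.txt wildcard, keeps the pages in numeric order (a wildcard would sort 10.txt before 2.txt).

```shell
# collect_pages is a hypothetical helper.
# $1 is the number of pages, $2 is the output file.
collect_pages() {
  n=$1
  out=$2
  : > "$out"                      # start with an empty output file
  i=1
  while [ "$i" -le "$n" ] ; do
    if [ -f "$i.txt" ] ; then     # skip pages with no OCR result
      cat "$i.txt" >> "$out"
    fi
    i=$((i + 1))
  done
}
```

For the 8-page example it could be called as: collect_pages 8 all.txt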