GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

Tesseract development

The Sámi languages (and, more generally, the GiellaLT languages) are missing from Tesseract.

Train on data

To train on the document filename.pdf, do the following:

Convert it to HTML:

pdftohtml filename.pdf

Open the resulting filename.html and find a suitable page to train on. In an image viewer such as Preview, cut out the lines one by one: mark the line with the mouse, then press Cmd-C (copy) and Cmd-N (new file from clipboard). Save each file as filename_page_line.png.

In filename.html, find the corresponding line. Copy it to a file filename_page_line.gt.txt, correct it if needed, and save.

The files filename_page_line.png and filename_page_line.gt.txt should be placed in divvungiellatekno/tesstrain/training-data/sme-ground-truth/.
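Since every line image needs a transcription with the same base name, it can help to check the directory before training. The following is a minimal sketch, not part of tesstrain: the function name check_pairs is hypothetical, and it only assumes the .png / .gt.txt naming described above.

```shell
#!/bin/sh
# check_pairs DIR (hypothetical helper, not part of tesstrain):
# print one line per line image without a non-empty .gt.txt file,
# and one line per .gt.txt file without an image.
# Returns 1 if any problem was found, 0 otherwise.
check_pairs() {
    dir=$1
    status=0
    for png in "$dir"/*.png; do
        [ -e "$png" ] || continue      # the glob matched nothing
        gt="${png%.png}.gt.txt"
        if [ ! -s "$gt" ]; then
            echo "missing or empty transcription: $gt"
            status=1
        fi
    done
    for gt in "$dir"/*.gt.txt; do
        [ -e "$gt" ] || continue       # the glob matched nothing
        png="${gt%.gt.txt}.png"
        if [ ! -e "$png" ]; then
            echo "missing image: $png"
            status=1
        fi
    done
    return $status
}
```

After loading the function into the shell, it could be run as check_pairs divvungiellatekno/tesstrain/training-data/sme-ground-truth before starting the training.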

Then train the model, as follows:

gmake training MODEL_NAME=sme

Check in the resulting sme.traineddata

Then copy the resulting sme.traineddata to where Tesseract can find it:

On an Intel Mac:

cp divvungiellatekno/tesstrain/training-data/sme.traineddata /usr/local/share/tessdata/

On other processors and machines (for example Apple Silicon Macs, where Homebrew installs under /opt/homebrew):

cp divvungiellatekno/tesstrain/training-data/sme.traineddata /opt/homebrew/share/tessdata/sme.traineddata
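The choice between the two destinations can also be left to the shell. This is a minimal sketch under the assumption that exactly the two tessdata directories named above are the candidates; the function name first_existing_dir is hypothetical.

```shell
#!/bin/sh
# first_existing_dir DIR... (hypothetical helper):
# print the first argument that is an existing directory.
# Returns 1 if none of the arguments exists.
first_existing_dir() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Usage, commented out so that loading this file copies nothing:
# dest=$(first_existing_dir /opt/homebrew/share/tessdata /usr/local/share/tessdata) &&
#     cp divvungiellatekno/tesstrain/training-data/sme.traineddata "$dest"/ &&
#     tesseract --list-langs | grep -x sme
```

The last line of the usage example asks Tesseract to list the languages it can see; if sme appears in that list, the model was copied to a directory Tesseract actually searches.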