GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
The Sámi languages (and, more generally, the GiellaLT languages) are missing from Tesseract.
To train on the document `filename.pdf`, do the following:
Convert to HTML:
pdftohtml filename.pdf
Open the resulting `filename.html` and find a suitable page to train on. In e.g. Preview, cut out the lines one by one: mark a line with the mouse, then press Cmd-C and Cmd-N. Save each line image as `filename_page_line.png`.
In `filename.html`, find the corresponding line. Copy it to a file `filename_page_line.gt.txt`, correct it if needed, and save.
The files `filename_page_line.png` and `filename_page_line.gt.txt` should be placed in `divvungiellatekno/tesstrain/training-data/sme-ground-truth/`.
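Since every line image needs a transcription with the same base name, it may help to check the pairs before training. The following is a minimal sketch, not part of tesstrain: the function name `check_ground_truth` is hypothetical, and it assumes only the `.png` / `.gt.txt` naming described above.

```shell
# Hypothetical helper (not part of tesstrain): report line images that lack
# a matching transcription, and transcriptions that lack an image.
# Usage: check_ground_truth DIR
check_ground_truth() {
    dir=$1
    missing=0
    for png in "$dir"/*.png; do
        [ -e "$png" ] || continue            # the glob matched nothing
        gt="${png%.png}.gt.txt"
        if [ ! -f "$gt" ]; then
            echo "missing transcription: $gt"
            missing=$((missing + 1))
        fi
    done
    for gt in "$dir"/*.gt.txt; do
        [ -e "$gt" ] || continue             # the glob matched nothing
        png="${gt%.gt.txt}.png"
        if [ ! -f "$png" ]; then
            echo "missing image: $png"
            missing=$((missing + 1))
        fi
    done
    # Exit status: success only when every pair is complete.
    [ "$missing" -eq 0 ]
}
```

After defining the function, it could be run as `check_ground_truth divvungiellatekno/tesstrain/training-data/sme-ground-truth` before starting the training step.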
Then train the model, as follows:
gmake training MODEL_NAME=sme
Check in the resulting `sme.traineddata`.
Then copy `sme.traineddata` to where Tesseract can find it:
On Intel Macs:
cp divvungiellatekno/tesstrain/training-data/sme.traineddata /usr/local/share/tessdata/
On Apple Silicon Macs (where Homebrew installs under /opt/homebrew):
cp divvungiellatekno/tesstrain/training-data/sme.traineddata /opt/homebrew/share/tessdata/sme.traineddata
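The two copy commands differ only in the target directory, so the choice can be scripted. This is a sketch under assumptions: the function name `tessdata_dir_for_arch` is hypothetical, and the mapping assumes macOS with Homebrew in its default location (`arm64` for Apple Silicon, `x86_64` for Intel, as printed by `uname -m`).

```shell
# Hypothetical helper: map a machine architecture (as printed by `uname -m`)
# to the tessdata directory used in the copy commands above.
# Assumption: macOS with Homebrew installed in its default prefix.
tessdata_dir_for_arch() {
    case "$1" in
        arm64)  echo "/opt/homebrew/share/tessdata" ;;   # Apple Silicon Mac
        x86_64) echo "/usr/local/share/tessdata" ;;      # Intel Mac
        *)      echo "unknown architecture: $1" >&2
                return 1 ;;
    esac
}

# Usage (not run here):
#   cp divvungiellatekno/tesstrain/training-data/sme.traineddata \
#      "$(tessdata_dir_for_arch "$(uname -m)")/"
#   tesseract --list-langs    # sme should appear in the list if the copy worked
```

On other systems the tessdata directory will differ, so the function deliberately fails rather than guessing.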