Note: This document is from 2011.
freecorpus/converted/sme/admin/others/
** files STM200420050011000SE_PDFS.pdf.xml and
STM200420050044000SE_PDFS.pdf.xml
have encoding errors: đ is misrepresented and the documents are full of
’s; these files should therefore be deleted
** file OTP200620070025000SE_PDFS.pdf.xml
has paragraphs whose only content is ‘——–’, so it should be deleted.
** file STM200320040010000SE_PDFA.pdf.xml
has so many errors that it should be rescanned
** file uito-ohpenplana.txt.xml
the original file is corrupted, so the conversion cannot be trusted
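
Files with dash-only paragraphs like the one noted above could in principle be found automatically instead of by hand. The following is a minimal sketch, not part of the GiellaLT tooling: the function names are hypothetical, and it assumes that paragraphs in the converted files are stored in `<p>` elements, which should be checked against the actual corpus format.

```python
import xml.etree.ElementTree as ET

# Hyphen-minus plus the common Unicode hyphen and dash characters.
DASHES = set("-\u2010\u2011\u2012\u2013\u2014\u2015")


def is_dash_only(text):
    """Return True if text is non-empty and holds only dashes and whitespace."""
    stripped = "".join(text.split())
    return bool(stripped) and all(ch in DASHES for ch in stripped)


def dash_only_paragraphs(xml_string, tag="p"):
    """Return the elements named `tag` whose text consists only of dashes.

    The default tag name "p" is an assumption about the converted XML.
    """
    root = ET.fromstring(xml_string)
    return [el for el in root.iter(tag) if is_dash_only("".join(el.itertext()))]


if __name__ == "__main__":
    sample = "<document><body><p>——–</p><p>Sámegiella</p></body></document>"
    print(len(dash_only_paragraphs(sample)))
```

A file in which most paragraphs are flagged this way would be a candidate for deletion; a file with only a few flagged paragraphs may just contain separator lines.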