Note: This document is from 2011.
freecorpus/converted/sme/admin/others/
** files STM200420050011000SE_PDFS.pdf.xml
and STM200420050044000SE_PDFS.pdf.xml
have encoding errors: đ is not represented correctly and the documents are full of
’s, so these files should be deleted.
** file OTP200620070025000SE_PDFS.pdf.xml
has paragraphs with content ‘——–’ so it should be deleted.
** file STM200320040010000SE_PDFA.pdf.xml
has so many errors that it should be rescanned
** uito-ohpenplana.txt.xml
the original file is corrupted
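
A check like the one that flagged the dash-only paragraphs can be sketched in a few lines. This is only an illustration, not the tooling used for the corpus: it assumes that paragraphs in the converted files are stored as `<p>` elements, and the function name `dash_only_paragraphs` and the set of characters counted as dashes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Characters treated as dashes: hyphen-minus, en dash, em dash.
# This set is an assumption made for the sketch.
DASHES = set("-\u2013\u2014")


def dash_only_paragraphs(xml_text: str) -> int:
    """Count <p> elements whose text consists solely of dash characters."""
    root = ET.fromstring(xml_text)
    count = 0
    for p in root.iter("p"):
        text = "".join(p.itertext()).strip()
        # Empty paragraphs are skipped; only non-empty dash runs are counted.
        if text and set(text) <= DASHES:
            count += 1
    return count


# Made-up sample document, not taken from the corpus.
sample = (
    "<document><body>"
    "<p>\u2014\u2014\u2013</p>"
    "<p>Real text - here</p>"
    "<p>---</p>"
    "</body></document>"
)
print(dash_only_paragraphs(sample))
```

A file where this count is high relative to the total number of paragraphs would be a candidate for manual inspection before deciding whether to delete it.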