GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
This document explains how to convert files in the corpus repositories to XML, and how to extract text from the converted documents.
To be able to convert files in our repository, you have to check out our tools and do a basic setup.
The corpus is stored using git.
For each language, the corpus files are divided into a free part and a restricted part. Each part has two repositories, one for the original files (docx, pdf, html, …) and one for the converted text files (XML). For a language with ISO code xxx, the four repositories are:
corpus-xxx
corpus-xxx-orig
corpus-xxx-x-private
corpus-xxx-orig-x-private
Access to the private repositories is restricted for copyright reasons, and is given only to project employees working on the corpora. The open ones are available to all at github.com/giellalt/ (type corpus-xxx in the search field, xxx being the ISO code of your language).
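The naming scheme above can be sketched as a small shell helper. This is only an illustration: corpus_repos and open_clone_commands are hypothetical names, and the clone URL pattern https://github.com/giellalt/<repo>.git is an assumption based on the GitHub address given above. The second function only prints the clone commands for the two open repositories; it does not run them.

```shell
#!/bin/sh
# corpus_repos is a hypothetical helper, not part of the GiellaLT tools.
# It prints the four repository names for a given ISO language code.
corpus_repos() {
    iso=$1
    printf '%s\n' \
        "corpus-$iso" \
        "corpus-$iso-orig" \
        "corpus-$iso-x-private" \
        "corpus-$iso-orig-x-private"
}

# Print (without running) git clone commands for the two open repositories.
# The URL pattern below is an assumption, not taken from this document.
open_clone_commands() {
    corpus_repos "$1" | grep -v -- '-x-private$' | while read -r repo; do
        printf 'git clone https://github.com/giellalt/%s.git\n' "$repo"
    done
}

# Example: open_clone_commands sme
```

The private repositories are filtered out on purpose, since access to them is restricted.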
The converted XML files are found in the corpus-xxx/converted/ directory. To get all North Saami text, issue the command ccat -a -r -l sme corpus-sme/converted.
The options available for ccat are listed with the command ccat -h.
If you do not have ccat installed (it is part of our corpustools), you can use cat instead, which prints the raw XML files.
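The choice between ccat and cat can be sketched as a shell function. This is a minimal sketch under stated assumptions: dump_corpus is a hypothetical helper name, the ccat flags are copied verbatim from the command given above, and the fallback assumes the converted files carry an .xml extension. The fallback output includes the XML markup, not only the text.

```shell
#!/bin/sh
# dump_corpus is a hypothetical helper, not part of corpustools.
# Usage: dump_corpus DIRECTORY LANGUAGE_CODE
dump_corpus() {
    dir=$1
    lang=$2
    if command -v ccat >/dev/null 2>&1; then
        # ccat is installed: use the flags from the command shown in this document.
        ccat -a -r -l "$lang" "$dir"
    else
        # Fallback: print every .xml file under the directory, markup included.
        # The .xml extension is an assumption about the converted files.
        find "$dir" -type f -name '*.xml' -exec cat {} +
    fi
}

# Example: dump_corpus corpus-sme/converted sme
```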
(this documentation may be obsolete)