For some of our languages we use Wikipedia as a corpus. This can be done in several ways; this page documents how to make a text-only version of a Wikipedia edition.
We use Moksha Mordvin (ISO code mdf) as an example; replace mdf with the language of your choice.
Download the Wikipedia dump from http://dumps.wikimedia.org/mdfwiki/latest/ (see Moksha Mordvin for an example). The version relevant for a plain-text corpus is called $LANGwiki-latest-pages-articles.xml.bz2.
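As a minimal sketch, assuming a Unix-like shell with wget installed, the download step can be scripted like this:
wget http://dumps.wikimedia.org/mdfwiki/latest/mdfwiki-latest-pages-articles.xml.bz2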
Unzip the .bz2 file and save the resulting .xml file somewhere.
The text will be split into smaller files, so make a directory, say mdfwiki, for the output.
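These two steps might look as follows, assuming bunzip2 is available (note that bunzip2 replaces the .bz2 file with the unpacked .xml file):
bunzip2 mdfwiki-latest-pages-articles.xml.bz2
mkdir mdfwiki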
Extract the text from the XML with the WikiExtractor.py script:
cat mdfwiki-latest-pages-articles.xml | WikiExtractor.py -o mdfwiki
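The extracted files wrap each article in lightweight <doc> tags, roughly along these lines (the exact attributes depend on the WikiExtractor version, so treat this as an illustration):
<doc id="..." url="..." title="...">
Article title

Plain-text paragraphs of the article ...
</doc>
It is these residual tags that the sed command below strips away.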
The result is stored in directories AA, AB, etc., with 100 files in each directory. You may bundle the whole Wikipedia into one file with a command like the following:
cat mdfwiki/*/wiki* | sed 's/<[^>]*>//g;' > mdfwikicorp.txt
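As a quick sanity check of the resulting corpus, you can count lines, words and characters with wc:
wc -lwm mdfwikicorp.txt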
Other dumps on the download site include the full edit history of every article (just go for the biggest file). This makes Wikipedia an interesting source for spellchecker development and other types of research on text production.
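Assuming the naming convention above carries over, the full-history dump for the example language would be something like mdfwiki-latest-pages-meta-history.xml.bz2 in the same directory; the exact name and compression format may vary between dump runs, so when in doubt, pick the biggest file as noted above.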
The following paper describes this process in some detail: