GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
This page explains how to fetch whole Wikipedia editions as raw text.
You now want to convert the XML files to text. Use e.g. the WikiExtractor script, https://pypi.org/project/wikiextractor/. If you have downloaded the svn giellalt file tree from Tromsø, you already have this script, in $GTHOME/gt/script/corpus/. If not, look at the documentation on the script's homepage. The script has a --help option explaining usage. Let us say you call the output folder outf.
Here are two ways of stripping XML tags. First, just with sed:
cat outf/* | sed 's/<[^>]*>//g;' | ...
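To see what the sed expression does, here is a minimal sketch run on a single made-up line. The function name strip_tags and the sample line are assumptions for illustration only; they are not part of the giellalt tree or of the extractor's actual output.

```shell
# Hypothetical helper wrapping the sed expression from the text:
# delete every "<...>" sequence that opens and closes on the same line.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Made-up sample line, only meant to resemble tagged extractor output.
sample='<doc id="1" title="Example">Some <b>bold</b> text</doc>'

# Run the sample through the helper and print what is left.
stripped=$(printf '%s\n' "$sample" | strip_tags)
printf '%s\n' "$stripped"
```

Note that sed works line by line, so a tag that is split across two lines is not removed by this expression, and character entities such as &amp; are left untouched.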
Second, for Tromsø users we have made a script that refines this command somewhat. It is also in $GTHOME/gt/script/corpus/ and is called rydd_i_wikipedia.sh:
cat outf/* | sh $GTHOME/gt/script/corpus/rydd_i_wikipedia.sh | ...