This page explains how to fetch whole Wikipedia editions as raw text.
You now want to convert the XML files to text. Use e.g. the wikiextractor script, https://pypi.org/project/wikiextractor/. If you have downloaded the giellalt svn file tree from Tromsø, you already have this script, in
$GTHOME/gt/script/corpus/. If not, look at the documentation on the script's homepage. The script has a --help option explaining
usage. Let us say you call the folder for output
Here are two ways of stripping XML tags. First, with sed alone:
cat outf/* | sed 's/<[^>]*>//g;' | ...
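To see what the sed expression does, it can be tried on a single line first. The sketch below is only an illustration: the sample line is made up (it is not real extractor output), and the variable names are arbitrary.

```shell
# Hypothetical sample line, made up for illustration (not real extractor output)
sample='<doc id="1" title="Example">Some text.</doc>'

# Same substitution as in the command above:
# delete everything from a "<" up to and including the next ">"
stripped=$(printf '%s\n' "$sample" | sed 's/<[^>]*>//g')

printf '%s\n' "$stripped"
```

Note that sed works line by line, so a tag that is split across two lines is not matched by this expression, and character entities such as &amp; are left as they are.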
For Tromsø users we have made a script that refines this command somewhat. It is also found in $GTHOME/gt/script/corpus/. It is called
cat outf/* | sh $GTHOME/gt/script/corpus/rydd_i_wikipedia.sh | ...