GiellaLT provides rule-based language technology aimed at minority and indigenous languages
This page explains how to fetch whole Wikipedia editions as raw text
You now want to convert the XML files to text. Use e.g. the wikiextractor script, https://pypi.org/project/wikiextractor/. If you have downloaded the svn giellalt file tree from Tromsø, you already have this script, in
$GTHOME/gt/script/corpus/. If not, look at the documentation on the script's homepage. The script has a --help option explaining
usage. Let us say you call the output folder outf.
Here are two ways of stripping XML tags. First, just with sed:
cat outf/* | sed 's/<[^>]*>//g;' | ...
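To see what the sed command above does, here is a small self-contained sketch. The sample file content and the names demo_dir and stripped are made up for illustration; real extractor output will be larger and may look different.

```shell
# Build a tiny stand-in for extracted output (hypothetical sample data).
demo_dir=$(mktemp -d)
mkdir "$demo_dir/outf"
printf '%s\n' \
  '<doc id="1" title="Sample">' \
  'Sample' \
  'This is <b>bold</b> text.' \
  '</doc>' > "$demo_dir/outf/wiki_00"

# Delete everything from a "<" to the next ">" on each line.
stripped=$(cat "$demo_dir"/outf/* | sed 's/<[^>]*>//g')
printf '%s\n' "$stripped"

# Clean up the temporary folder.
rm -r "$demo_dir"
```

Note that sed works line by line, so a tag that spans several lines is not removed, and lines that contained only a tag are left behind as empty lines.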
Second, for Tromsø users we have made a script that refines this command somewhat. It is also found in $GTHOME/gt/script/corpus/ and is run like this:
cat outf/* | sh $GTHOME/gt/script/corpus/rydd_i_wikipedia.sh | ...
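The contents of the Tromsø script are not reproduced here. For readers without access to it, the following is a rough sketch of the kind of refinement one might add on top of plain tag stripping. The function name clean_wiki_text is hypothetical, and the chosen steps (decoding a few character entities, dropping empty lines) are assumptions, not a description of what the Tromsø script does.

```shell
# Hypothetical helper, NOT the Tromsø script: strip tags, decode a few
# character entities, and drop lines that end up empty or whitespace-only.
# &amp; is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
clean_wiki_text() {
  sed -e 's/<[^>]*>//g' \
      -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e 's/&amp;/\&/g' \
      -e '/^[[:space:]]*$/d'
}

# Made-up sample input for illustration.
sample='<doc id="1" title="Sample">
Fish &amp; chips
</doc>'
cleaned=$(printf '%s\n' "$sample" | clean_wiki_text)
printf '%s\n' "$cleaned"
```

It can be used in the same position in the pipeline, e.g. cat outf/* | clean_wiki_text | ...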