This document will teach the user how to convert files in the corpus repositories to xml, and how to extract text from these documents.
To be able to convert files in our repository, you have to check out our tools and do a basic setup.
To be able to easily use the content of the corpus, we need to convert the original files stored in various formats to a uhttps://giellalt.uit.no/infra/anonymous-svn.html#Preparationniform format. This is achieved using the script convert2xml.
To convert all the original files inside a repository to xml, open a terminal and go to the working copy, and issue the command:
If you want to convert a specific file, issue the command:
The graphical show that follows indicate lines looking like this: “…|……”. Here, each symbol represents a file. The dots indicate “(formally) successful conversion”, and the pipe symbols indicate flawed conversion. Note that the conversion will be flawed if you are offline while converting, as the conversion scripts takes the dtd from the net (we should consider making this more robust).
The converted xml files are found in the
converted/ catalogue. To get
all North Saami text, issue the command
ccat -a -r -l sme converted.
To get other languages, exchange
sme with e.g.
sma (South Sami) or
smj (Lule Sami). The options available for
ccat are listed with the