The corpus is divided into a free part, which holds texts that we can redistribute (essentially, texts in the public domain), and a bound part, which holds texts that we can’t redistribute.
The free part is kept in a world-readable svn repository at https://gtsvn.uit.no/freecorpus/.
Check out a working copy either by adding the above address to your svn
client, or by using the command-line program like this:
svn checkout https://gtsvn.uit.no/freecorpus.
The bound part is kept in an svn repository that is accessible only from the machine gtsvn.uit.no. This repository is readable by people who have signed our time-limited non-disclosure agreement, and writable by the corpus maintainers. The address of this repository is svn://gtsvn.uit.no/boundcorpus.
To check out a working copy, first log in to victorio. Then issue the following command:
svn co svn://gtsvn.uit.no/boundcorpus .
If you would like to have access to the bound corpus, contact us at firstname.lastname@example.org
The main directory in the corpus repository is orig. It contains the original files in their original format and under their original names, except that spaces are replaced with _. Metadata about each original file is stored in an xsl file in the same directory; its name is the name of the original file plus an .xsl extension.
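The naming rules above can be sketched in a few lines of Python. This is only an illustration: the function names corpus_file_name and metadata_path are hypothetical and are not part of the corpus tools, and the example file name is made up.

```python
from pathlib import PurePosixPath


def corpus_file_name(original_name: str) -> str:
    """Return the name used in orig: the original name with spaces replaced by _."""
    return original_name.replace(" ", "_")


def metadata_path(original_path: str) -> PurePosixPath:
    """Return the path of the xsl metadata file that sits next to the original."""
    path = PurePosixPath(original_path)
    # The metadata file keeps the full original name and appends .xsl to it.
    return path.with_name(path.name + ".xsl")


if __name__ == "__main__":
    name = corpus_file_name("annual report 2008.doc")
    print(name)
    print(metadata_path("orig/sme/admin/sd/" + name))
```

Note that the extension is appended rather than substituted, so the original extension stays visible in the metadata file name.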
Inside orig are directories for the various languages, which also hold parallel translations of the Sami documents. Below is an outline of the directory structure.
orig/
    eng/
    fin/
    nob/
    sma/
    sme/
    smj/
    swe/
Inside each of the language directories the structure outlined below is used.
admin/
    depts/
    guovda/
    karas/
    others/
    sd/
bible/
    ot/
    nt/
facta/
ficti/
laws/
news/
    Assu
    MinAigi
    NRK
    YLE
    other
science/
If you run the conversion process, the directory converted is created. It has the same structure as orig, but contains the files converted from their original format to our internal xml format. The converted directory thus contains copies of all the files in the corpus database.
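Because converted mirrors orig, the directory for converted files can be derived from the directory of the originals. The following is a minimal sketch of that mapping; the function name converted_dir is hypothetical, and the sketch deliberately maps directories only, since the names of the converted files are not described here.

```python
from pathlib import PurePosixPath


def converted_dir(orig_dir: str) -> PurePosixPath:
    """Map a directory below orig to the matching directory below converted."""
    parts = PurePosixPath(orig_dir).parts
    if "orig" not in parts:
        raise ValueError(f"not a directory below orig: {orig_dir}")
    # Swap the first orig component for converted and keep everything else.
    index = parts.index("orig")
    return PurePosixPath(*parts[:index], "converted", *parts[index + 1:])


if __name__ == "__main__":
    print(converted_dir("orig/sme/news/MinAigi"))
```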
These 7 overarching genres should be understood as follows:
Todo: Write more about this.
Some parts of our corpus are used as test data for different purposes, presently mainly for proofing tools. These files have additional markup to add info about linguistic errors of different types and their corrections. These files are located within a directory named goldstandard, which has the same internal structure as shown above:
orig/                # same orig/ as above
goldstandard/
    orig/
        sma/
        sme/
        smj/
It is essential that goldstandard files do NOT enter the regular corpus, as that would destroy the reliability of reported coverage test results (by way of lexicalising missing words). So decide whether a document is to be used as a regular or a goldstandard document before you add it to the repository, and make sure it is added in only one location. A goldstandard document can always be demoted to a regular corpus document, but the other way around is not possible.
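The "only one location" rule can be checked mechanically. The sketch below is an illustration, not an existing corpus tool: the function names are hypothetical, and it assumes that a document added twice would sit under the same language, genre and file name in both orig trees.

```python
from pathlib import PurePosixPath


def relative_to_orig(path: str) -> str:
    """Return the part of the path below the first orig component."""
    parts = PurePosixPath(path).parts
    index = parts.index("orig")  # raises ValueError if there is no orig
    return "/".join(parts[index + 1:])


def in_both_locations(paths):
    """Return documents found in both the regular and the goldstandard tree."""
    regular, gold = set(), set()
    for path in paths:
        key = relative_to_orig(path)
        if "goldstandard" in PurePosixPath(path).parts:
            gold.add(key)
        else:
            regular.add(key)
    return sorted(regular & gold)


if __name__ == "__main__":
    files = [
        "orig/sme/facta/a.doc",
        "goldstandard/orig/sme/facta/a.doc",
        "orig/sma/news/b.html",
    ]
    print(in_both_locations(files))
```

An empty result means that no document was found in both trees under the stated assumption; a renamed or moved duplicate would not be detected.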
To add all files found in a directory to a working copy of a corpus, you can use the add_files_to_corpus program.
If you have a file that you want to add to the corpus repository, you need a working copy of either the free or the bound part of our corpus.
Place the file according to its language, one of eng, fin, nno, nob, sma, sme, smj, swe, and then its genre,
one of admin, bible, facta, ficti, laws, news.
Run convert2xml <filename>. This command converts the original document to xml and creates a default metadata document with empty values.
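The choice of language and genre listed above determines where a new file belongs. A minimal sketch of that lookup follows, assuming the file goes into orig/<language>/<genre>; the function name target_directory is hypothetical, and the two tuples simply repeat the lists given above.

```python
from pathlib import PurePosixPath

# Language and genre codes exactly as listed in the text above.
LANGUAGES = ("eng", "fin", "nno", "nob", "sma", "sme", "smj", "swe")
GENRES = ("admin", "bible", "facta", "ficti", "laws", "news")


def target_directory(language: str, genre: str) -> PurePosixPath:
    """Return the assumed directory below orig for a new file."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code: {language}")
    if genre not in GENRES:
        raise ValueError(f"unknown genre: {genre}")
    return PurePosixPath("orig", language, genre)


if __name__ == "__main__":
    print(target_directory("sme", "news"))
```

Rejecting unknown codes early avoids creating a stray directory through a typo in the language or genre name.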