One of the goals of the giellatekno project is to provide easy access to the text materials for non-commercial purposes such as research. The text materials will be available through a query processing tool: a tool with which a user can fetch different types of data from the Sámi corpora. The raw corpus material is collected in co-operation with the owners of the documents. The documents are preprocessed so that the texts can be used in research. The process of text collection is described in the documents corpus_conversion.html and corpus_conversion_tech.html. This document describes the process by which a document is transferred to the graphical corpus interface. The graphical corpus interface is developed and maintained by Textlaboratoriet at the University of Oslo.
Files that are ready to be parallelised exist in
$GTFREE/prestable/converted. The steps to parallelize between sme and
make GTLANG=sme abbr
generate-anchor-list.pl --lang1=sme --lang2=nob --outdir=$GTFREE $GTHOME/gt/common/src/anchor.txt $GTHOME/gt/common/src/anchor_admin.txt
The files may be parallelised in command-line mode.
$GTFREE/prestable/converted/nob using this command:
for file in `find $GTFREE/prestable/converted/sme -name \*.xml | grep -v .svn`; do corpus-parallel.pl --lang1=sme --lang2=nob $file ; done
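The shell loop above can also be sketched in Python, which avoids trouble with file names that contain spaces. This is an illustration only: the function names find_xml_files and parallel_commands are hypothetical, and the only flags used are the ones shown in the command above. Unlike grep -v .svn, the sketch prunes directories named exactly .svn rather than filtering on the path string.

```python
import os


def find_xml_files(root):
    """Yield .xml files under root, skipping .svn directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control directories in place so os.walk skips them.
        dirnames[:] = sorted(d for d in dirnames if d != ".svn")
        for name in sorted(filenames):
            if name.endswith(".xml"):
                yield os.path.join(dirpath, name)


def parallel_commands(root, lang1="sme", lang2="nob"):
    """Build one corpus-parallel.pl argument list per file found."""
    return [
        ["corpus-parallel.pl", f"--lang1={lang1}", f"--lang2={lang2}", path]
        for path in find_xml_files(root)
    ]


if __name__ == "__main__":
    # GTFREE is read from the environment, as in the shell command.
    gtfree = os.environ.get("GTFREE", ".")
    root = os.path.join(gtfree, "prestable", "converted", "sme")
    for cmd in parallel_commands(root):
        # Only print here; hand cmd to subprocess.run(cmd, check=True) to execute.
        print(" ".join(cmd))
```

Swapping the lang1 and lang2 arguments and pointing root at the other language directory gives the commands for the opposite direction.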
The files may also be parallelised in graphical mode.
To parallelize the other way, exchange the values of lang1 and lang2 in
steps 2 and 4, and change the find command in step 4 to
The project-internal corpus format contains the basic elements, such as paragraphs, lists and tables, that can be extracted from the original document format. The xml-format of the Saami corpus resources is documented in corpus_conversion.html.
The original name of the document is preserved in the process with the
suffix indicating the document type, e.g.
file.doc. When the text is
extracted from the original document and moved to xml-structure the file
gets the extension
.xml. So the resulting file is
file is used as a basis for analysis. The analyzed corpus text is by
default stored in the file
file.doc.analyzed.xml. There is one
intermediate format which is used for alignment of the parallel texts;
those files are indicated with the suffix
The xml-files reside in either
The XML format of the analyzed text is basically the following:
<p>
  <s>
    <w form="The">
      <reading lemma="the" POS="DET" />
    </w>
    <w form="flies">
      <reading lemma="fly" POS="N" />
    </w>
  </s>
</p>
See the description of the dtd.
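As a rough sketch of how the analyzed format can be read programmatically, the following Python snippet walks the <s>, <w> and <reading> elements of the example above. The function name readings_by_sentence and the SAMPLE constant are made up for this illustration; real analyzed files may carry further attributes that are not handled here.

```python
import xml.etree.ElementTree as ET

# The example from the text above, used as test input.
SAMPLE = """<p>
  <s>
    <w form="The"><reading lemma="the" POS="DET" /></w>
    <w form="flies"><reading lemma="fly" POS="N" /></w>
  </s>
</p>"""


def readings_by_sentence(xml_text):
    """Return, for each <s>, a list of (form, [(lemma, POS), ...]) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        words = []
        for w in s.findall("w"):
            # A word may have several readings; keep them all.
            readings = [(r.get("lemma"), r.get("POS")) for r in w.findall("reading")]
            words.append((w.get("form"), readings))
        sentences.append(words)
    return sentences


if __name__ == "__main__":
    for sentence in readings_by_sentence(SAMPLE):
        for form, readings in sentence:
            print(form, readings)
```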
corpus-analyze.pl [OPTIONS] XML-FILE
The document is stored in the corpus database in an xml-format that consists of sections, paragraphs <p>, lists and tables. Table and list elements, which often contain numeric data, are excluded from the document when it is prepared for analysis. Paragraphs that are marked as being in a language other than the main language of the document can be excluded as well.
The following options regulate the exclusion and inclusion of elements:
--tables, -T    Take also <table> elements, which are excluded by default.
--lists, -L     Take also <list> elements, which are excluded by default.
--all, -a       Take all elements.
The other options:
--help, -h                    Print the help text and exit.
--lang <lang>, -l <lang>      The main language of the document. The language defines the path to the tools.
--tags=<file>, -t <file>      Location of the file korpustags.txt.
--output=<file>, -o <file>    The file for output.
--add_sentences, -s           Add <s> tags to the document during analysis. Use with files which are not aligned.
--only_add_sentences, -n      Add <s> tags using preprocessor and abbr.txt. Do not analyze.
For example, to analyze the file
use the following command:
corpus-analyze.pl --add_sentences --lang=sme --output=file.doc.analyzed.xml $GTFREE/sme/admin/sd/file.doc.xml
The files in the corpus hierarchy do not contain sentence elements (<s>). Sentence elements are the basic units of analysis and have to be added with --add_sentences or -s. If this option is not given, the <s> tags are assumed to have been added already. The <s> elements may also be added without analysis, with the command:
corpus-analyze.pl --only_add_sentences --lang=sme --output=file.doc.sent.xml $GTFREE/sme/admin/sd/file.doc.xml
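When many files are processed, it can help to assemble these command lines in a small helper. The sketch below is an assumption-laden illustration: the function names analyze_command and default_output_name are hypothetical, only flags listed above are used, and the output names simply follow the pattern of the two example commands (file.doc.analyzed.xml and file.doc.sent.xml).

```python
import os


def default_output_name(xml_file, only_add_sentences=False):
    """Derive file.doc.analyzed.xml or file.doc.sent.xml from file.doc.xml."""
    name = os.path.basename(xml_file)
    if name.endswith(".xml"):
        name = name[: -len(".xml")]
    suffix = ".sent.xml" if only_add_sentences else ".analyzed.xml"
    return name + suffix


def analyze_command(xml_file, lang, output=None, add_sentences=False,
                    only_add_sentences=False, tables=False, lists=False):
    """Return the argument list for one corpus-analyze.pl call."""
    cmd = ["corpus-analyze.pl"]
    if only_add_sentences:
        cmd.append("--only_add_sentences")
    elif add_sentences:
        cmd.append("--add_sentences")
    if tables:
        cmd.append("--tables")
    if lists:
        cmd.append("--lists")
    cmd.append(f"--lang={lang}")
    # Fall back to the naming pattern of the examples when no output is given.
    cmd.append(f"--output={output or default_output_name(xml_file, only_add_sentences)}")
    cmd.append(xml_file)
    return cmd


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run to execute it.
    print(" ".join(analyze_command("file.doc.xml", "sme", add_sentences=True)))
```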
When the <s> tags are added, the sentences are also numbered and given ids. These ids are used in the alignment process.
Each xml-document in the corpus database contains a field for parallel
documents, which exist in the corpus hierarchy. For example, the header
section of the North Saami document
file.doc.xml may contain the
<parallel_text location="file_n.doc" xml:lang="nob"/>
<parallel_text location="file_s.doc" xml:lang="smj"/>
This means that there are two parallel documents for this document in
the corpus hierarchy. The “location” attribute contains the name of the
parallel file, which is assumed to be found in the same subdirectory as
file.doc.xml. The information about the parallel files can be
updated in the xml-document by editing the file-specific xsl-file, see
instructions. These fields conduct the search and processing of the
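A minimal sketch of reading the parallel_text fields with Python is given below. The wrapper element <header> in the sample and the function name parallel_locations are placeholders chosen for this illustration; the code only relies on the parallel_text element and its location and xml:lang attributes as shown above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Placeholder header; only the parallel_text elements come from the text above.
SAMPLE_HEADER = """<header>
  <parallel_text location="file_n.doc" xml:lang="nob"/>
  <parallel_text location="file_s.doc" xml:lang="smj"/>
</header>"""


def parallel_locations(xml_text):
    """Map each parallel language code to the value of its location attribute."""
    root = ET.fromstring(xml_text)
    return {
        element.get(XML_LANG): element.get("location")
        for element in root.iter("parallel_text")
    }


if __name__ == "__main__":
    for lang, location in parallel_locations(SAMPLE_HEADER).items():
        print(lang, location)
```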
Parallel files are processed by the script
corpus-parallel.pl [OPTIONS] [XML-FILE]

--help, -h                   Print this help text and exit.
--dir=<dir>, -d              The directory where the files are searched.
--lang=<lang>, -l <lang>     The main language.
--plang=<lang>, -p <lang>    The language of the parallel document(s).
--list, -L                   List the parallel files, use with option --dir.
--outdir=<dir>, -o           The directory where the output files are stored. Default is /usr/local/share/corp/tmp
The parallel documents in a directory may be listed with the command:
corpus-parallel.pl --list --lang=sme --dir=$GTFREE/sme/admin > file-list.txt
The parallel files are preprocessed for alignment by detecting sentence boundaries, numbering each sentence and placing it inside an <s> element. The command to use is:
corpus-parallel.pl --lang=sme --plang=nob $GTFREE/sme/admin/file.doc.xml
The script calls the Perl script corpus-analyze.pl for adding the
sentence-elements. The tools that are used for sentence boundary
detection have to be changed in that script. The resulting files are
stored in $GTFREE/tmp; the resulting file names are:
The documents are aligned using Translation Corpus Aligner (TCA) 2 by Knut Hofland and Øystein Reigem, slightly modified by us so that it can easily be used from the command line.
The program gets as input the files that contain numbered
The program outputs three files:
file.doc.sent.xml_file_n.doc.sent.xml.xml which indicates the order of the paired id lists.
The file korpustags.txt contains the list of tags and their internal distribution. The list below is not up to date; please see the file korpustags.txt in cvs.
There are the following tag categories: