This document describes the more technical side of the corpus infrastructure. The document corpus_conversion.html contains the user documentation as well as the description of the basic infrastructure. The usage of the script convert2xml.pl is documented there; this document describes the contents of the script.
All the binaries and scripts used in the corpus conversion process are
stored under $GTHOME/gt/script/corpus, except the tools that are already
installed to some common bin. The xsl-files which transfer the
structural information of different source files (docbook2xml.xsl and
xhtml2xml.xsl) and the template for file specific xsl-files
(XSL-template.xsl) are also stored there. The directory
$GTFREE|$GTBOUND/tmp/ is reserved for temporary files and log files
that are created during the conversion. The log files contain the system
commands executed during the conversion as well as the warnings and
error messages. The log file is named after the file that is converted.
The next sections describe the conversion process for each document type and the tools used. The conversion includes hyphen detection and language recognition as well as decoding of wrongly utf-8-encoded characters.
Microsoft Word documents are converted with the program antiword to
produce docbook xml, which is piped to the xslt program xsltproc, which
converts it to our XML-format. The xsl-document docbook2corpus.xsl
is used in converting the document.
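The pipe described above can be sketched as follows. This is only an illustration, not the code of convert2xml.pl: the function names are hypothetical, and the sketch assumes that antiword is called with `-x db` (its DocBook output mode) and that xsltproc reads the piped XML from standard input via `-`. The options actually used by the conversion script may differ.

```python
import subprocess


def build_word_pipeline(doc_path, stylesheet="docbook2corpus.xsl"):
    """Return the two commands of the antiword | xsltproc pipe (hypothetical helper)."""
    antiword_cmd = ["antiword", "-x", "db", doc_path]  # DocBook XML on stdout
    xsltproc_cmd = ["xsltproc", stylesheet, "-"]       # "-" reads the XML from stdin
    return antiword_cmd, xsltproc_cmd


def convert_word(doc_path, stylesheet="docbook2corpus.xsl"):
    """Run the pipe and return the transformed XML as text."""
    antiword_cmd, xsltproc_cmd = build_word_pipeline(doc_path, stylesheet)
    producer = subprocess.Popen(antiword_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.run(
        xsltproc_cmd, stdin=producer.stdout, stdout=subprocess.PIPE, check=True
    )
    producer.stdout.close()
    if producer.wait() != 0:
        raise RuntimeError("antiword failed on " + doc_path)
    return consumer.stdout.decode("utf-8")
```

Calling `convert_word("report.doc")` requires both programs and the stylesheet to be available; `build_word_pipeline` only shows which commands would be run.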
Avvir documents are converted to our xml format using avvir2corpus.xsl.
SVG documents are converted to our xml format using svg2corpus.xsl.
Plain text documents that are stored to the database should have the
suffix .txt. The encoding of a text document is resolved using the
iconv tool. If there are special markings for headings and paragraphs,
like in some newspaper texts, they are used in creating the document
structure. Otherwise, the empty lines mark paragraph breaks and short
lines beginning with numbers are treated as titles.
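The default rule for unmarked text (empty lines separate paragraphs, short lines beginning with a number become titles) can be sketched as below. The function name and the length limit of 60 characters are assumptions made for the illustration; the conversion script may use a different threshold and additional rules.

```python
import re


def split_plain_text(text, max_title_len=60):
    """Split text into ("title" | "p", content) pairs (hypothetical helper)."""
    result = []
    # One or more empty lines mark a paragraph break.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        joined = " ".join(lines)
        # A single short line that begins with a digit is treated as a title.
        is_title = (
            len(lines) == 1
            and len(joined) <= max_title_len
            and re.match(r"\d", joined) is not None
        )
        result.append(("title" if is_title else "p", joined))
    return result
```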
Web documents are first cleaned using the program HTML Tidy
with several command line
options. The output is converted to xml using the xsl-file
Paratext is a file format for publishing and interchanging basic Scripture texts in multiple languages. It is intended to be used for all aspects of Bible layout and publishing. The paratext format is based on backslash codes, a format called USFM; see https://paratext.org/usfm/. The paratext files are converted using the Perl script paratext2xml.pl, which forms a basic xml-structure that roughly corresponds to our corpus.dtd. The files that are added to the corpus repository should have the suffix .ptx.
Usage: add-hyph-tags.pl [OPTIONS] FILES
Tag the hyphenation marks.
Options
    --all             search the whole text for hyphenation points.
                      The default is to search only the end of the lines.
    --infile=<file>   the input file.
    --outfile=<file>  output file.
    --help            prints the help text and exit.
The script replaces hyphens in the text with the tag <hyph/>. By default, the hyphens are searched only at the end of the line. The option --all can be used for replacing the hyphens all over the text. The aim of the script is to replace only “real” hyphens, i.e. the ones that mark real hyphenation points. The hyphens in e.g. numeric expressions are not replaced. Words that follow a hyphen without continuing the hyphenated word, such as “ja” in the expression “davvi-, julev- ja oarjelsámegillii”, are taken into account.
The reason why the hyphens are tagged by default only at the line ends is the existence of correct hyphens, such as the following, which do not mark a hyphenation point. Some of these occur at the line ends and cause errors.
teknihkalaš-luonddudieđalaš Norplus-prográmma norgga-ruoŧa dánska-norgalaš
The script takes into account the xml-notation of the file. If the last word of a paragraph marked with </p> ends with a hyphen, the next paragraph beginning with <p> is searched for the rest of the word. Some text extraction tools such as antiword may create these kinds of structures. The script also reformats the text by removing white space, moving <p>-tags and changing the place of the line break.
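The line-end rule can be illustrated with a small sketch. It is not a port of add-hyph-tags.pl: the function name, the conjunction list and the exact regular expression are assumptions, and the handling of <p>-tags is left out. The sketch tags a hyphen only when letters stand on both sides of the line break, skips the case where the next word is a conjunction such as “ja”, and moves the line break to after the joined word.

```python
import re

# Assumed list of words that follow a hyphen without continuing the word.
CONJUNCTIONS = {"ja", "dahje"}

# letters, hyphen, line break, letters, optional punctuation, optional blank
LINE_END_HYPHEN = re.compile(
    r"(?P<head>[^\W\d_]+)-\n(?P<tail>[^\W\d_]+)(?P<punct>[.,;:!?]*)(?P<gap>[ \t]+)?"
)


def tag_line_end_hyphens(text):
    """Replace line-end hyphens with <hyph/> (hypothetical helper)."""

    def replace(match):
        tail = match.group("tail")
        if tail.lower() in CONJUNCTIONS:
            return match.group(0)  # e.g. "julev-\nja": leave untouched
        joined = match.group("head") + "<hyph/>" + tail + match.group("punct")
        # Move the line break to after the completed word.
        return joined + ("\n" if match.group("gap") else "")

    return LINE_END_HYPHEN.sub(replace, text)
```

Numeric expressions such as a year range split over two lines are not matched, because the pattern requires a letter directly before the hyphen. Compounds like Norplus-prográmma would still be tagged wrongly when they break at the hyphen, which is the kind of error mentioned above.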
The newly created xml-document is parsed and the language of each
paragraph is recognized using the tool
pytextcat. The tool is
described in the document Language recognition using
pytextcat. The language recognition tool is not perfect,
but mostly it gets it right.
The document always has a main language, and only the differing languages are marked in the xml-structure. By default, all the languages in the language model (there are many) can occur in the document and they are taken into account in the categorization process. However, since e.g. the different Sámi languages are easily confused with each other and with Finnish, the language recognition can be restricted to some subset of these languages. The document can be explicitly marked as monolingual, or as multilingual, containing text fragments in some of the above-mentioned languages. You should set these variables in the file-specific xsl-file.
The structural information, such as titles and paragraphs, that is contained in an MS Word or pdf document is preserved in the xml-document. The antiword program that is used in converting the Word documents produces xml docbook format. That format is further transformed to our xml-format, using the xsl-script docbook2xml.xsl. The similar script xhtml2xml.xsl is used in transforming the structural information in an html document to our xml-format. Pdf-files are first converted to html and the same xsl-script is used.
Word documents may contain metainformation, such as the name of the owner of the file, which is preserved as well. The other metainformation is added to the xml-file through the file-specific xsl-script. The process is explained in the usage documentation. The file-specific xsl-file is copied from the file XSL-template.xsl, located in the corp/bin directory. It contains variables for adding the metainformation. These variables always override the metainformation coming from the original document. The metainformation received from the web upload script is stored straight to the file-specific xsl-file, so the information can be altered manually.
common.xsl contains instructions for building the final
xml-structure of the corpus file. The structure is validated against the
document type definition, http://giellatekno.uit.no/dtd/corpus.dtd.
common.xsl is included in every file-specific xsl-script.
There is a special script
empty.xsl to be used instead of common.xsl
when the document cannot be converted to xml-structure. This can happen
for several reasons, but the most common reason is that the character
encoding in the original document is somehow broken; the Saami
characters may be missing or several character encodings may be used,
so that the result of the conversion is not satisfactory. The document
could be removed from the database as well, but e.g. some newspaper
documents are considered to be part of the distribution.
There are several small scripts for corpus database maintenance and
cleaning. They reside in
the gt/script directory. The most important ones
are listed here:
gt/doc/lang/corp. The script calls the Perl script corpus_summary.pl, which generates the summaries. The file corpus-content.xml contains a list of all the files in the corpus database and some relevant information like the license and size. The files corpus-summary.xml and corpus-summary-YYYY-MM-DD.xml contain the total count of the documents as well as the number of the documents in different subdirectories. The xml-files are further transformed to forrest documentation.
change_xsl.xsl for the transformation. The xsl-script should be modified for the different uses and the variable containing the path to the script changed. The version control of the xsl-files is handled automatically, although sometimes stealing a lock is necessary and requires some typing.
make LANGUAGE=sme GENRE=facta or
corpus.dtd contains the document type definition for the xml-structure. It is stored in http://giellatekno.uit.no/dtd/corpus.dtd. The fields are briefly described in the following:
The document is divided into two elements: header, which contains the metainformation, and body, which contains the document content. The header contains the following fields:
The document body contains sections and text-entities (
p with type “listitem”.
p with type “tablecell”.
Some of the Sámi characters are wrongly utf-8-encoded by conversion tools such as pdftotext. There is a Perl module samiChar::Decode.pm for decoding the Sámi characters.
use samiChar::Decode;

my $file    = "file.txt";
my $outfile = "file.txt";
my $encoding;
my $lang = "sme";

# A file that was converted to utf-8 without knowing the original encoding:
$encoding = &guess_encoding($file, $lang);
&decode_file($file, $encoding, $outfile);

# A file that is still in its original encoding:
$encoding = &guess_text_encoding($file, $outfile, $lang);
&decode_text_file($file, $encoding, $outfile);
samiChar::Decode.pm decodes characters to utf-8 byte-wise, using code tables. It is intended for decoding the Saami characters in situations where the document has been converted to utf-8 without knowing the original encoding. The decoding is implemented by using code table files, so the module can be used for other conversions as well. The output is however always utf-8.
guess_encoding is used for guessing the original
encoding. It takes into account only the most common Sámi characters and
their frequency in the text. The file is further decoded using the
function decode_file. The functions guess_text_encoding and
decode_text_file can be used for guessing the encoding and decoding a
file which is not utf-8 encoded but in its original encoding. This is
the case with many text files that are not converted by any tool. These
functions are implemented using the iconv conversion tool.
Code tables are text files with three space-separated columns:
Column #1 is the input char (in hex as 0xXX or 0xXXXX)
Column #2 is the Unicode char (in hex as 0xXXXX)
Column #3 is the Unicode name
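Reading a table in this format and using it for byte-wise decoding can be sketched as follows. This is an illustration and not the code of samiChar::Decode.pm: the function names are hypothetical, only single-byte input codes (0xXX) are used for decoding, and bytes missing from the table are passed through as the code point of the same value, which is only correct for ASCII.

```python
def parse_code_table(lines):
    """Map input code (int) to Unicode character from code table lines."""
    table = {}
    for line in lines:
        # In the Unicode Consortium mapping files, '#' starts the name or a comment.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment line, empty line or undefined code
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


def decode_bytes(data, table):
    """Decode a byte string one byte at a time using the table."""
    return "".join(table.get(byte, chr(byte)) for byte in data)
```

With a table that maps 0x8A to U+0161, for example, `decode_bytes(b"\x8aaddat", table)` would return "šaddat". That mapping is made up for the example and is not taken from any of the real code tables.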
Most of the code tables are available at the Unicode Consortium: ftp://ftp.unicode.org/Public/MAPPINGS/. Some of the code tables, like samimac_roman and levi_winsam, are composed from two code tables: the one that is used as the input encoding and another that was used when the file was converted to utf-8.
samimac_roman: codetables samimac.txt and ROMAN.txt
levi_winsam: codetables levi.txt and CP1258.txt
levi.txt and samimac.txt are available under Trond’s home page at: smi-kodetabell.html. The codetables are composed using the function “combine_two_codings($coding1, $coding2, $outfile)” which is available in this package.
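One plausible reading of how two code tables are combined is sketched below; the actual behaviour of combine_two_codings may differ, and the function names here are hypothetical. The assumption is that a byte was meant to be read with one table (for example samimac) but was converted to utf-8 with another (for example ROMAN), so the combined table maps each wrongly produced character to the intended one.

```python
def combine_tables(intended, assumed):
    """Build a repair map from wrongly converted characters to intended ones.

    intended: byte -> character in the real input encoding
    assumed:  byte -> character in the encoding used during conversion
    """
    repair = {}
    for byte, wrong_char in assumed.items():
        right_char = intended.get(byte)
        if right_char is not None and right_char != wrong_char:
            repair[wrong_char] = right_char
    return repair


def repair_text(text, repair):
    """Apply the repair map to text that is already utf-8 decoded."""
    return text.translate(str.maketrans(repair))
```

Bytes that both tables map to the same character, such as the ASCII range, produce no entry, so correctly converted text is left alone.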
These encodings are available:
latin6 => iso8859-10-1.txt
plainroman => ROMAN.txt
CP1258 => CP1258.txt
iso_ir_197 => iso_ir_197.txt
samimac_roman => samimac_roman.txt
levi_winsam => levi_CP1258.txt
8859-4 => 8859-4.txt
winsam => winsam.txt
The original input encoding is guessed by examining the text and searching for the most common characters. The unicode characters in hex are listed in the hash %Char_Tables, for North Saami for example. The uncommented characters are the ones that take part in guessing the encoding.
The encodings are listed in the hash %Charfiles, and they are tested one at a time. The occurrences of the selected characters in that encoding are counted and the encoding with the most occurrences is returned. There is room for more statistical analysis, but this simple test worked for me.
If fewer than a certain number of characters are found, the test returns -1, which means either that the characters are already correctly utf-8 encoded or that the encoding was not found in the code tables.
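The counting test can be sketched as follows. This is an illustration, not the module's code: the function name mirrors guess_encoding only for readability, the candidate characters are passed in directly instead of being read from %Char_Tables and %Charfiles, and the threshold of three hits is an assumption, since the real limit is not stated here.

```python
def guess_encoding(text, candidates, min_hits=3):
    """Return the name of the best matching encoding, or -1.

    candidates: encoding name -> characters that the selected Sámi letters
    turn into when a text in that encoding is wrongly converted to utf-8.
    """
    best_name, best_hits = -1, 0
    for name, marker_chars in candidates.items():
        hits = sum(text.count(char) for char in marker_chars)
        if hits > best_hits:
            best_name, best_hits = name, hits
    # Too few hits: the text is probably fine, or the encoding is unknown.
    return best_name if best_hits >= min_hits else -1
```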
To your own computer: copy the directory
/System/Library/Perl/5.8.6/. The module is now installed to victorio.
The main page of the web upload interface is https://divvun.no/upload/upload_corpus_file.html
The source of the page can be found in the directory xtdoc/sd/src/documentation/content/xdocs/upload/. There are two cgi-scripts involved in the upload, upload.cgi and xsl-process.cgi. They are both in the gt/script/cgi-scripts/ directory. upload.cgi uploads the file the user has selected, converts it to xml and prints out the metainformation form. xsl-process.cgi handles the metainformation form and stores the contents of the fields to the file-specific xsl-file. The xml-file is then converted once more with the updated metainformation.
All the files remain in the corp/tmp directory. Every successful upload triggers an email message to the maintainer, who has to move the files manually to their place. An email notification is sent as well if there is an error during the upload.
The file names are changed to secure ones, and the orig hierarchy is checked for a file with the same content.
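The two checks can be illustrated with a short sketch. It does not show what upload.cgi actually does: the function names, the set of allowed characters and the use of a SHA-256 digest for comparing content are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata
from pathlib import Path


def secure_name(filename):
    """Reduce an uploaded file name to a safe ASCII name (hypothetical helper)."""
    name = Path(filename).name  # drop any directory part
    # Strip accents where Unicode can decompose them; other letters are dropped.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_name).strip("._")
    return cleaned or "upload"


def find_duplicate(data, orig_root):
    """Return the path of a file under orig_root with the same content, or None."""
    digest = hashlib.sha256(data).hexdigest()
    for path in sorted(Path(orig_root).rglob("*")):
        if path.is_file() and hashlib.sha256(path.read_bytes()).hexdigest() == digest:
            return path
    return None
```

Hashing every file on each upload is slow for a large hierarchy; a stored index of digests would be the usual refinement, but it is left out to keep the sketch short.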