GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started, and our Privacy document.
#Intro
Plan for aligner and analysis work
Goal: Send analysed texts to Oslo on order
1. find parallel texts
2. align
several alignments of the same text
3. analyse the sme part
4. send to Oslo
Solution:
1. A document in CVS where we check in what is done
2. Have a directory on victorio (G5?) where we keep track of what is done
Store the aligned files, so that we can see how old they are
Processing phases:
corpus conversion file.sme.xml, file.nob.xml
numbered sentence phase file.sme.xml.sent, file.nob.xml.sent
aligned sentence phase (alignment.jar => xml_cor.sent files
analysed phase (corpus-analyze.pl =>
To be sent:
file.nob.xml.sent (analysed in oslo, for the time being, at least)
file.sme.xml.sent.analyzed
file.sme.xml_file.nob.xml.xml
cor = result of the alignment
Principled names:
file.sme_new.txt numbered <s id="n"> nodes only. This is
the file which tca2 actually reads.
file.sme.xml_file.nob.xml.xml <s> number correspondences only
file.sme.xml_cor.sent original xml file with sme and nob <s> numbers
Example:
dc_1_01.doc.xml_sp_1_01.doc.xml.xml
dc_1_01.doc.xml_cor.sent
dc_1_01.doc.xml_new.txt
parallel/
namm-namn/
namn.sent
namm.sent
namm.sent.analyzed
namm.xml_namn.xml.xml
...
1. Trond: Find texts, mark the xsl files (cf. dc_ files for model), tell Saara
2. Saara: make .sent and .sent.analyzed
Børre: make namm_namn.xml
3. Trond: Send files to Lars