File alignment plan
#Intro
Plan for aligner and analysis work
Goal: Send analysed texts to Oslo on order
1. find parallel texts
2. align
several alignments of the same text
3. analyse the sme part
4. send to Oslo
Solution:
1. A document in CVS where we check in what is done
2. Have a directory on victorio (G5?) where we keep track of what is done
Store the aligned files, so that we can see how old they are
Processing phases:
corpus conversion file.sme.xml, file.nob.xml
numbered sentence phase file.sme.xml.sent, file.nob.xml.sent
aligned sentence phase (alignment.jar => xml_cor.sent files
analysed phase (corpus-analyze.pl =>
To be sent:
file.nob.xml.sent (analysed in oslo, for the time being, at least)
file.sme.xml.sent.analyzed
file.sme.xml_file.nob.xml.xml
cor = result of the alignment
Principled names:
file.sme_new.txt numbered <s id="n"> nodes only. This is
the file which tca2 actually reads.
file.sme.xml_file.nob.xml.xml <s> number correspondences only
file.sme.xml_cor.sent original xml file with sme and nob <s> numbers
Example:
dc_1_01.doc.xml_sp_1_01.doc.xml.xml
dc_1_01.doc.xml_cor.sent
dc_1_01.doc.xml_new.txt
parallel/
namm-namn/
namn.sent
namm.sent
namm.sent.analyzed
namm.xml_namn.xml.xml
...
1. Trond: Find texts, mark the xsl files (cf. dc_ files for model), tell Saara
2. Saara: make .sent and .sent.analyzed
Børre: make namm_namn.xml
3. Trond: Send files to Lars