

File alignment plan

#Intro

Plan for aligner and analysis work

Goal: Send analysed texts to Oslo on request



1. find parallel texts
2. align
    (there may be several alignments of the same text)
3. analyse the sme part
4. send to Oslo


Solution:

1. A document in CVS where we check in what is done
2. Have a directory on victorio (G5?) where we keep track of what is done
   Store the aligned files there, so that we can see how old they are
Processing phases:
  corpus conversion:        file.sme.xml, file.nob.xml
  numbered sentence phase:  file.sme.xml.sent, file.nob.xml.sent
  aligned sentence phase:   alignment.jar => xml_cor.sent files
  analysed phase:           corpus-analyze.pl =>

To be sent:
file.nob.xml.sent (analysed in Oslo, for the time being, at least)
file.sme.xml.sent.analyzed
file.sme.xml_file.nob.xml.xml

cor = result of the alignment

Principled names:
file.sme_new.txt               numbered <s id="n"> nodes only. This is
                               the file which tca2 actually reads.
file.sme.xml_file.nob.xml.xml  <s> number correspondences only
file.sme.xml_cor.sent          original xml file with sme and nob <s> numbers

Example:
dc_1_01.doc.xml_sp_1_01.doc.xml.xml
dc_1_01.doc.xml_cor.sent
dc_1_01.doc.xml_new.txt
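
The naming scheme above can be sketched as a small helper. This is only an illustration: the function and field names are hypothetical, and it assumes that the first file name is the sme side and that the _new.txt suffix is appended to the full xml file name, as in the Example lines.

```python
from typing import NamedTuple


class AlignmentNames(NamedTuple):
    """File names derived from one pair of converted xml files."""
    tca2_input: str       # numbered <s id="n"> nodes only, read by tca2
    correspondences: str  # <s> number correspondences only
    cor_sent: str         # original xml with sme and nob <s> numbers


def alignment_names(sme_xml: str, nob_xml: str) -> AlignmentNames:
    # Assumption: suffixes are appended to the sme file name,
    # following the pattern of the example files.
    return AlignmentNames(
        tca2_input=f"{sme_xml}_new.txt",
        correspondences=f"{sme_xml}_{nob_xml}.xml",
        cor_sent=f"{sme_xml}_cor.sent",
    )


if __name__ == "__main__":
    for name in alignment_names("dc_1_01.doc.xml", "sp_1_01.doc.xml"):
        print(name)
```

Run on the pair dc_1_01.doc.xml and sp_1_01.doc.xml, this yields the three names listed in the example.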

parallel/
 namm-namn/
  namn.sent
  namm.sent
  namm.sent.analyzed
  namm.xml_namn.xml.xml
 ...
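
To keep track of what is done and how old the stored files are, a pair directory laid out as above could be checked with a script along these lines. It is a sketch, not an existing tool: the function names are hypothetical, and it assumes that each directory is named '<sme stem>-<nob stem>' and holds the four files shown for namm-namn.

```python
import time
from pathlib import Path


def expected_files(pair_dir_name):
    """Expected file names for a pair directory such as 'namm-namn'."""
    # Assumption: the directory name is '<sme stem>-<nob stem>'.
    sme, nob = pair_dir_name.split("-", 1)
    return [
        f"{nob}.sent",
        f"{sme}.sent",
        f"{sme}.sent.analyzed",
        f"{sme}.xml_{nob}.xml.xml",
    ]


def pair_status(pair_dir, now=None):
    """Map each expected file name to its age in days, or None if missing."""
    pair_dir = Path(pair_dir)
    now = time.time() if now is None else now
    status = {}
    for name in expected_files(pair_dir.name):
        path = pair_dir / name
        if path.is_file():
            # Age is measured from the last modification time.
            status[name] = (now - path.stat().st_mtime) / 86400
        else:
            status[name] = None
    return status


if __name__ == "__main__":
    for name, age in pair_status("parallel/namm-namn").items():
        print(name, "missing" if age is None else f"{age:.1f} days old")
```

A missing file shows up as "missing", so the same listing also tells which processing phases still remain for a given text pair.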

1. Trond: Find texts, mark the xsl files (cf. dc_ files for model), tell Saara
2. Saara: make .sent and .sent.analyzed
   Børre: make namm_namn.xml
3. Trond: Send files to Lars
