GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
Example: Sentences as paragraphs:
Finnish (fin) | Northern Sami (sme) |
Hallitus esitteli maaliskuussa 2009 uuden | Ráđđehus ovdanbuvttii njukčamánus 2009 ođđa |
vähemmistöpoliittisen strategian esityksessään | unnitlogopolitihkalaš strategiija proposišuvnnas Från |
Tunnustamisesta omaan valtaan – Hallituksen kansallisia | strategiija nášuvnnalaš unnitloguid váras) (prop. |
vähemmistöjä koskeva strategia (esitys 2008/2009:158). | 2008/2009:158). |
Strategia merkitsee useita muutoksia, jotka koskevat | Strategiija sisdoallá máŋga rievdadusa |
kansallisten vähemmistöjen oikeuksien vahvistamista ja | nannen dihte daid nášuvnnalaš unnitloguid vuoigatvuođaid ja |
vähemmistöpolitiikan toteuttamisambitioiden parantamista. | alidit gudneáŋgirvuođa got unnitlogopolitihkka galgá |
Hallitus panostaa 70 miljoonaa kruunua uudistukseen, jota | čađahuvvot. Ráđđehus bidjá 70 miljovnna ruvnnu reforbmii mii |
aletaan toteuttaa vuodesta 2010. | galgá čađahuvvot 2010 rájis. |
The example above shows a sentence alignment of two pdf documents. Sentences have been cut off in the middle, because these pdf documents were converted as if each sentence were a paragraph.
The name of the file with this content is
prestable/tmx/fin2sme/admin/lansstyrelsen.se/faktablad_finska.pdf.tmx.html
To fix this, one has to:
1. Run realign --files prestable/tmx/fin2sme/admin/lansstyrelsen.se/faktablad_finska.pdf.tmx.html to find the name of the metadata file.
2. In the metadata file, change
<xsl:variable name="linespacing" select="''"/>
to become
<xsl:variable name="linespacing" select="'all=2'"/>
3. Run realign prestable/tmx/fin2sme/admin/lansstyrelsen.se/faktablad_finska.pdf.tmx.html to realign the sentences.
4. Open prestable/tmx/fin2sme/admin/lansstyrelsen.se/faktablad_finska.pdf.tmx.html in the web browser to see if the sentence alignment has been improved.
The above steps improve the situation somewhat, but the sentence alignment has not become perfect.
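The metadata edit can also be scripted. The following is a minimal sketch, assuming the metadata file is plain XSL text that contains the linespacing line exactly as written above. The helper name set_linespacing and the sample text are illustrative only and are not part of the GiellaLT tools.

```python
# Hypothetical helper, not part of the GiellaLT tools.
# The template reproduces the linespacing line shown in the steps above.
TEMPLATE = '<xsl:variable name="linespacing" select="\'{}\'"/>'


def set_linespacing(xsl_text: str, new_value: str, old_value: str = "") -> str:
    """Replace the single linespacing variable line in the metadata text."""
    old_line = TEMPLATE.format(old_value)
    new_line = TEMPLATE.format(new_value)
    # Refuse to guess if the expected line is missing or occurs more than once.
    if xsl_text.count(old_line) != 1:
        raise ValueError("expected exactly one matching linespacing line")
    return xsl_text.replace(old_line, new_line)


# Stand-in for the metadata file content (assumption: the real file has more lines).
sample = "before\n" + TEMPLATE.format("") + "\nafter\n"
print(set_linespacing(sample, "all=2"))
```

For a real file, the text would be read from and written back to the metadata file whose name is reported by realign --files.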
To improve the sentence alignment, one must improve the quality of the converted xml file. The general steps are:
1. Run realign --files <path-to-tmx-html-file> to find the names of the metadata and converted files.
2. Run realign --convert <path-to-tmx-html-file> to reconvert the files.
3. Run realign <path-to-tmx-html-file> to realign the sentences.
When the sentence alignment between two pdf files is bad, it is possible to improve it a lot by editing the metadata file and setting various variables that are specific to pdf files.
The following variables in the metadata file affect the content of the converted file:
More thorough documentation of these variables is found in the metadata template file.
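The general realign steps described earlier (find the files, reconvert, realign) can be collected in a small script. This is a sketch only: it uses the three realign invocations named in this document, and the function name realign_steps is made up for illustration. It prints the commands instead of running them, since the metadata file is normally edited between the first and second command.

```python
import shlex


def realign_steps(tmx_html: str) -> list[list[str]]:
    """Return the three realign invocations, in order, as argument lists."""
    return [
        ["realign", "--files", tmx_html],    # find metadata and converted files
        ["realign", "--convert", tmx_html],  # reconvert the files
        ["realign", tmx_html],               # realign the sentences
    ]


path = "prestable/tmx/fin2sme/admin/lansstyrelsen.se/faktablad_finska.pdf.tmx.html"
for argv in realign_steps(path):
    # Print each command as a copy-pasteable shell line.
    print(shlex.join(argv))
```

To actually execute a step, each argument list could be passed to subprocess.run(argv, check=True), assuming realign is installed and on the PATH.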