This is a short collection of examples serving as a starting point for using xmlsh. It is a shell-friendly interface to XML files, and allows fast and easy access to structured data, as long as you know your XPath! :D
First run the parallel info XSL script using Saxon (Saxon must be on your CLASSPATH; the saxonXSL alias assumes that it is found in ~/lib/saxon9.jar):
$ saxonXSL -it main $GTHOME/gt/script/corpus/parallel_corpus_info.xsl lang1=nob lang2=sme inDir=$GTFREE/converted
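The summary reports written to corpus_report/ are assumed to look roughly like the sketch below. The element and attribute names are inferred from the XPath expressions used on this page, not taken from the stylesheet itself, so the real files may differ in detail:

```xml
<!-- Hypothetical sketch of a parallel-corpus summary report;
     names inferred from the queries used on this page. -->
<corpus_summary>
  <file parallelity="true">
    <location>
      <h_loc>converted/nob/admin/example_doc.html.xml</h_loc>
      <t_loc>converted/sme/admin/example_doc.html.xml</t_loc>
    </location>
  </file>
  <file parallelity="false">
    <location>
      <h_loc>converted/nob/admin/orphan_doc.html.xml</h_loc>
    </location>
  </file>
</corpus_summary>
```

Here h_loc and t_loc hold the paths of the two sides of a parallel pair; files without a found parallel get parallelity="false".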
Then start xmlsh and extract some statistics from the xml files produced above:
$ xmlsh
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/nob2sme_parallel-corpus_summary.xml
2307
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/sme2nob_parallel-corpus_summary.xml
2288
Then off to some slightly more advanced XQuery: get all file elements for which we have found a parallel file (as per above), extract the path to that parallel file, and print it. We do this with both of the created report files, and run sort -u on the combined list later:
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/t_loc/text()' \
< corpus_report/nob2sme_parallel-corpus_summary.xml > sme-files.txt
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/h_loc/text()' \
< corpus_report/sme2nob_parallel-corpus_summary.xml >> sme-files.txt
xmlsh$ exit
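If xmlsh is not available, the same extraction can be sketched with python3's standard library. The report snippet below is a made-up minimal file matching the structure the queries above assume, not real corpus data:

```shell
# Build a tiny hypothetical summary report to query against.
cat > demo_summary.xml <<'EOF'
<corpus_summary>
  <file parallelity="true">
    <location><t_loc>converted/sme/doc_a.xml</t_loc></location>
  </file>
  <file parallelity="false">
    <location><t_loc>converted/sme/doc_b.xml</t_loc></location>
  </file>
</corpus_summary>
EOF

# Same selection as: //file[@parallelity="true"]/location/t_loc/text()
python3 - <<'EOF' > sme-files-demo.txt
import xml.etree.ElementTree as ET
root = ET.parse("demo_summary.xml").getroot()
for f in root.findall(".//file[@parallelity='true']"):
    print(f.findtext("location/t_loc"))
EOF

cat sme-files-demo.txt
```

ElementTree only supports a limited XPath subset, but attribute predicates like [@parallelity='true'] are enough for this particular query.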
Finally some traditional processing to extract the words and count them. The most conservative (and probably most reliable) method is to just count the words using wc:
$ sort -u sme-files.txt > sme-files.sorted.txt
$ cat sme-files.sorted.txt | xargs ccat -l sme | wc -w
849855
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -l
964529
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -w
977348
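The three numbers differ because preprocess is assumed here to emit one token per line (so wc -l counts tokens) and to split punctuation off from the words it follows, which pushes both token counts above the plain wc -w on the raw text. A toy illustration of the same effect, using sed as a stand-in tokenizer rather than preprocess itself:

```shell
# Raw word count: whitespace-separated words only.
printf 'Dat lea buorre.\n' | wc -w

# Stand-in tokenization: split punctuation onto its own line,
# then one token per line -- the count grows from 3 to 4.
printf 'Dat lea buorre.\n' \
  | sed -e 's/[.,!?]/\n&/g' -e 's/ /\n/g' \
  | grep -c .
```

(The sed \n-in-replacement syntax is GNU sed; the real preprocess tool also handles abbreviations, multiword expressions and similar cases that this sketch ignores.)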