This is a short collection of examples serving as a starting point for using xmlsh. It is a shell-friendly interface to XML files, and allows fast and easy access to structured data, as long as you know your XPath! :D
First run the parallel info XSL script using Saxon (Saxon must be on your CLASSPATH; the saxonXSL alias assumes that it is found in ~/lib/saxon9.jar):
$ saxonXSL -it main $GTHOME/gt/script/corpus/parallel_corpus_info.xsl lang1=nob lang2=sme inDir=$GTFREE/converted
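The summary reports written to corpus_report/ are assumed to look roughly like the sketch below. The element and attribute names are inferred from the XPath expressions used on this page, not taken from the stylesheet itself, so the real files may differ in detail:

```xml
<!-- Hypothetical sketch of a parallel-corpus summary report;
     names inferred from the queries used on this page. -->
<corpus_summary>
  <file parallelity="true">
    <location>
      <h_loc>converted/nob/admin/example_doc.html.xml</h_loc>
      <t_loc>converted/sme/admin/example_doc.html.xml</t_loc>
    </location>
  </file>
  <file parallelity="false">
    <location>
      <h_loc>converted/nob/admin/orphan_doc.html.xml</h_loc>
    </location>
  </file>
</corpus_summary>
```

Here h_loc and t_loc hold the paths of the two sides of a parallel pair; files without a found parallel get parallelity="false".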
Then start xmlsh and extract some statistics from the xml files produced above:
$ xmlsh
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/nob2sme_parallel-corpus_summary.xml
2307
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/sme2nob_parallel-corpus_summary.xml
2288
Then off to some slightly more advanced XQuery: get all file elements for which we have found a parallel file (as per above), extract the path to that parallel file, and print it. We do this with both of the created report files, and run sort -u on the combined list later:
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/t_loc/text()' \
< corpus_report/nob2sme_parallel-corpus_summary.xml > sme-files.txt
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/h_loc/text()' \
< corpus_report/sme2nob_parallel-corpus_summary.xml >> sme-files.txt
xmlsh$ exit
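If xmlsh is not available, the same extraction can be sketched with python3's standard library. The report snippet below is a made-up minimal file matching the structure the queries above assume, not real corpus data:

```shell
# Build a tiny hypothetical summary report to query against.
cat > demo_summary.xml <<'EOF'
<corpus_summary>
  <file parallelity="true">
    <location><t_loc>converted/sme/doc_a.xml</t_loc></location>
  </file>
  <file parallelity="false">
    <location><t_loc>converted/sme/doc_b.xml</t_loc></location>
  </file>
</corpus_summary>
EOF

# Same selection as: //file[@parallelity="true"]/location/t_loc/text()
python3 - <<'EOF' > sme-files-demo.txt
import xml.etree.ElementTree as ET
root = ET.parse("demo_summary.xml").getroot()
for f in root.findall(".//file[@parallelity='true']"):
    print(f.findtext("location/t_loc"))
EOF

cat sme-files-demo.txt
```

ElementTree only supports a limited XPath subset, but attribute predicates like [@parallelity='true'] are enough for this particular query.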
Finally some traditional processing to extract the words and count them. The most conservative (and probably most reliable) method is to just count the words using wc:
$ sort -u sme-files.txt > sme-files.sorted.txt
$ cat sme-files.sorted.txt | xargs ccat -l sme | wc -w
849855
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -l
964529
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -w
977348
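The three numbers differ because preprocess is assumed here to emit one token per line (so wc -l counts tokens) and to split punctuation off from the words it follows, which pushes both token counts above the plain wc -w on the raw text. A toy illustration of the same effect, using sed as a stand-in tokenizer rather than preprocess itself:

```shell
# Raw word count: whitespace-separated words only.
printf 'Dat lea buorre.\n' | wc -w

# Stand-in tokenization: split punctuation onto its own line,
# then one token per line -- the count grows from 3 to 4.
printf 'Dat lea buorre.\n' \
  | sed -e 's/[.,!?]/\n&/g' -e 's/ /\n/g' \
  | grep -c .
```

(The sed \n-in-replacement syntax is GNU sed; the real preprocess tool also handles abbreviations, multiword expressions and similar cases that this sketch ignores.)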