A collection of examples
This is a short collection of examples that serves as a starting point for using XMLSH. XMLSH is a shell-friendly interface to XML files that gives fast and easy access to structured data, as long as you know your XPath! :D
Count the number of sme words in parallel files
First run the parallel info XSL script using Saxon (Saxon must be on your CLASSPATH; the saxonXSL alias assumes that it is found in ~/lib/saxon9.jar):
$ saxonXSL -it main $GTHOME/gt/script/corpus/parallel_corpus_info.xsl lang1=nob lang2=sme inDir=$GTFREE/converted
Then start xmlsh and extract some statistics from the xml files produced above:
$ xmlsh
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/nob2sme_parallel-corpus_summary.xml
2307
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/sme2nob_parallel-corpus_summary.xml
2288
Then off to some slightly more advanced XQuery: get all file elements for which a
parallel file was found (as above), extract the path to that file, and print it.
We do this for both of the generated report files, appending the second result to
the same output file, and run sort -u on it later:
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/t_loc/text()' \
< corpus_report/nob2sme_parallel-corpus_summary.xml > sme-files.txt
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/h_loc/text()' \
< corpus_report/sme2nob_parallel-corpus_summary.xml >> sme-files.txt
xmlsh$ exit
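Because the second query appends to the file written by the first, a path that is listed in both reports ends up in sme-files.txt twice. The following is a minimal sketch of why the later sort -u step matters; the paths are hypothetical stand-ins for the query output, and the temporary directory is only there to keep the example self-contained.

```shell
# Hypothetical paths standing in for the output of the two queries.
tmpdir=$(mktemp -d)
printf '%s\n' 'converted/sme/a.xml' 'converted/sme/b.xml' > "$tmpdir/sme-files.txt"
printf '%s\n' 'converted/sme/b.xml' 'converted/sme/c.xml' >> "$tmpdir/sme-files.txt"

# Raw list: every line, including the path that occurs in both reports.
raw_count=$(wc -l < "$tmpdir/sme-files.txt" | tr -d ' ')

# Deduplicated list: each distinct path counted once.
unique_count=$(sort -u "$tmpdir/sme-files.txt" | wc -l | tr -d ' ')

echo "raw: $raw_count, unique: $unique_count"
rm -r "$tmpdir"
```

Without the deduplication, the words of any file that appears in both reports would be counted twice in the totals.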
Finally some traditional processing to extract the words and count them. The most
conservative (and probably most reliable) method is to just count the words using wc:
$ sort -u sme-files.txt > sme-files.sorted.txt
$ cat sme-files.sorted.txt | xargs ccat -l sme | wc -w
849855
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -l
964529
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -w
977348
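The three figures differ because wc -w counts whitespace-separated strings while wc -l counts lines. One possible reading of the gap between the last two numbers, assuming that preprocess writes one token per line and keeps some multiword expressions together on a single line, is sketched below. The token list is made up for illustration and is not actual preprocess output.

```shell
# Hypothetical tokenizer output: one token per line, where the
# multiword expression "New York" occupies a single line.
tokens=$(printf '%s\n' 'She' 'lives' 'in' 'New York' '.')

# wc -l counts lines, i.e. tokens under the one-token-per-line assumption.
line_count=$(printf '%s\n' "$tokens" | wc -l | tr -d ' ')

# wc -w counts whitespace-separated strings, so "New York" counts twice.
word_count=$(printf '%s\n' "$tokens" | wc -w | tr -d ' ')

echo "lines: $line_count, words: $word_count"
```

Under that assumption the line count is the number of tokens, and the word count exceeds it by one for every extra word inside a multiword token.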