CorpusTools documentation
Overview
CorpusTools is a set of tools to administrate Giellatekno's corpora.
A few examples:
- add_files_to_corpus - Add raw source material files to a corpus
- convert2xml - Converts original files to the Giellatekno-internal xml format.
- analyse_corpus - Orchestrates the hfst (etc) tools to run analysis on a corpus.
- ccat - Output text sections from an analysed or non-analysed corpus.
- korp_mono - Convert analysed files to korp-input
Installation from Apertium Nightly
CorpusTools is available as a package in Apertium Nightly. Depending on your
system, the package may be named slightly differently. For example, in Debian,
the package is called divvun-corpustools. Search for corpustools, and you
will find it.
Installation using pipx
pipx lets you easily install Python packages that have runnable scripts onto your system.
- Install pipx
- Run:
  pipx install --force git+https://github.com/giellalt/CorpusTools.git
Editable install (alternate pipx installation method)
An editable install lets you make changes in the source script files, and still use the same global command on the command line to run the (modified) scripts. Recommended if you intend to do development on the scripts.
- Clone the CorpusTools repository:
  git clone https://github.com/giellalt/CorpusTools.git CorpusTools
- Install with the editable flag (-e):
  pipx install -e --force /path/to/CorpusTools
Requirements
- python3
- wvHtml (only needed for convert2xml)
- pdftohtml (only needed for convert2xml)
- latex2html (only needed for convert2xml)
- Java (only needed for parallelize)
- pandoc (maybe only needed for convert2xml?)
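To see which of these requirements are already present, a small shell check can help. This is a minimal sketch and not part of CorpusTools: the function name check_tools is hypothetical, the command names are taken from the list above, and "java" is assumed to be the executable name for the Java requirement.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CorpusTools): report which of the
# given commands can be found on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of missing tools (0 means all were found).
    return "$missing"
}

# Command names from the requirements list; "java" is an assumption.
check_tools python3 wvHtml pdftohtml latex2html java pandoc \
    || echo "Some requirements are missing; see the installation commands."
```

The same function can be reused after installation to confirm that the CorpusTools scripts themselves, for example ccat or convert2xml, are reachable from the command line.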
Installation commands
sudo port install wv latex2html poppler pandoc
sudo apt-get install wv poppler-utils pandoc
sudo pacman -S wv