Scripts for evaluation and statistics
These have been used to generate figures for articles and may be used to generate some statistics for some websites.
Dependencies
- coreutils, etc. i.e. grep, sed, wc, and such unix tools
- CorpusTools/ccat
- libdivvun
- python + libhfst for python
Environment
Some scripts depend on proper environment variable setup:
$GTHOME= path togiella-core$GTLANG= parent path of directories likelang-xxxandcorpus-xxx- You need to have
lang-xxxandcorpus-xxxas a subdirectory of one common directory to use these scripts, this is a best common practice in GiellaLT infra atm
- You need to have
Scripts
Corpus coverages, frequencies, missing lists
Script: scripts/corpus-stats.bash
- uses ccat to create corpus
- tokenises corpus to
.tokensfile - makes a frequency sorted list of tokens
- analyse frequency sorted list to count naïve coverage
- Naïve coverage means the proportion of tokens that get any analysis over all tokens in %
Example:
$ bash scripts/corpus-stats.bash sme
...
Tokens Covered OOV
1124309 1074616 49693
100.0 95.58012966186342 4.41987033813658
Types Covered OOV
118163 90872 27291
100.0 76.90393778086204 23.09606221913797
$ head sme.missinglist
560 of
371 and
355 og
182 det
176 to
169
162 for
135 is
131 av
118 til
Copy-pasta this into the article e.g. like so:
\hline
\bf Language & \bf Words & \bf Coverage % & \bf Types % \bf Coverage \\
\hline
Northern Sámi & 1,124,309 & 95.6~\% & 118,163~\% \\
Lexicon, ruleset sizes
Script: scripts/language-stats.bash
- uses some greps and seds to estimate root morphs in lexc files
- assumes fairly standard structure of lexc files in
src/fst/stems/*.lexcandsrc/fst/generated_files/*.lexc
bash scripts/language-stats.bash sme
sme
154861 root morphs or similar
56179 shared root morphs (proper nouns, symbols, etc.)
Copy-pasta this into the article e.g. like so:
\hline
\bf Language & \bf Lemmas \\
\hline
Northern Sámi & 154,861$^1$ \
...
\caption{$^1$ excluding proper nouns shared between languages}