Build tools and build support files as well as developer support tools for the GiellaLT repositories.
These have been used to generate figures for articles and may be used to generate some statistics for some websites.
Some scripts depend on proper environment variable setup:
$GTHOME = path to giella-core$GTLANG = parent path of directories like lang-xxx and corpus-xxx
lang-xxx and corpus-xxx as a subdirectory of one
common directory to use these scripts, this is a best common practice in
GiellaLT infra atmScript: scripts/corpus-stats.bash
.tokens fileExample:
$ bash scripts/corpus-stats.bash sme
...
Tokens Covered OOV
1124309 1074616 49693
100.0 95.58012966186342 4.41987033813658
Types Covered OOV
118163 90872 27291
100.0 76.90393778086204 23.09606221913797
$ head sme.missinglist
560 of
371 and
355 og
182 det
176 to
169
162 for
135 is
131 av
118 til
Copy-pasta this into the article e.g. like so:
\hline
\bf Language & \bf Words & \bf Coverage % & \bf Types % \bf Coverage \\
\hline
Northern Sámi & 1,124,309 & 95.6~\% & 118,163~\% \\
Script: scripts/language-stats.bash
src/fst/stems/*.lexc and
src/fst/generated_files/*.lexcbash scripts/language-stats.bash sme
sme
154861 root morphs or similar
56179 shared root morphs (proper nouns, symbols, etc.)
Copy-pasta this into the article e.g. like so:
\hline
\bf Language & \bf Lemmas \\
\hline
Northern Sámi & 154,861$^1$ \
...
\caption{$^1$ excluding proper nouns shared between languages}