GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.
In order to make a smesmn bidix we need a better smefin.
Many t nodes contain parantheses. They shall be moved into
<e>
<lg>
<l pos="N">beassanvuohta</l>
</lg>
<mg>
<tg xml:lang="fin">
<t pos="N">pääsy (rel.)</t>
</tg>
</mg>
</e>
The Finnish words should be proofread. The following command gives all words, the next one only the ones we do not recognize.
cat n_smefin.xml|grep '<l '|tr '<' '>'|cut -d">" -f3|see
cat n_smefin.xml|grep '<l '|tr '<' '>'|cut -d">" -f3|ufin|grep '?'|cut -f1|see
There are some Saami words not recognized:
cat *_smefin.xml|grep '<l '|tr '<' '>'|cut -d">" -f3|usme|grep '?'|wc -l
2921 out of 13131 translations contain a space. This number will get smaller as the parentheses are removed, but some will remain. Todo: