GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

smefin improvement

To build a smesmn bidix (bilingual dictionary), we first need a better smefin dictionary.

Many t nodes contain parentheses. The parenthesized material should be moved out of the t text and into elements, as in this entry:

   <e>
      <lg>
         <l pos="N">beassanvuohta</l>
      </lg>
      <mg>
         <tg xml:lang="fin">
            <t pos="N">pääsy (rel.)</t>
         </tg>
      </mg>
   </e>
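
The cleanup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the page does not say which element should receive the parenthesized text, so the tag name `re` in `NOTE_TAG` is an assumption, and the helper `move_parentheses` is hypothetical. The sketch handles only a single trailing parenthesis in a t node; anything else is left untouched for manual review.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the page does not name the target element, so "re" is a guess.
NOTE_TAG = "re"

# Some text, optional whitespace, then exactly one trailing "(...)" group.
TRAILING_PAREN = re.compile(r"^(.*?)\s*\(([^()]*)\)\s*$")


def move_parentheses(entry):
    """Move a trailing "(...)" from each <t> into a following sibling element.

    Returns the number of <t> nodes that were changed.
    """
    changed = 0
    for tg in entry.iter("tg"):
        # Iterate over a copy because elements are inserted during the loop.
        for t in list(tg):
            if t.tag != "t" or not t.text:
                continue
            match = TRAILING_PAREN.match(t.text)
            # Skip nodes without a parenthesis or with nothing in front of it.
            if match is None or not match.group(1):
                continue
            t.text = match.group(1)
            note = ET.Element(NOTE_TAG)
            note.text = match.group(2)
            note.tail = t.tail  # keep the whitespace that followed <t>
            tg.insert(list(tg).index(t) + 1, note)
            changed += 1
    return changed


# The entry shown above, used as sample input.
SAMPLE = """<e>
   <lg>
      <l pos="N">beassanvuohta</l>
   </lg>
   <mg>
      <tg xml:lang="fin">
         <t pos="N">pääsy (rel.)</t>
      </tg>
   </mg>
</e>"""

entry = ET.fromstring(SAMPLE)
move_parentheses(entry)
print(ET.tostring(entry, encoding="unicode"))
```

Leaving nested or multiple parentheses alone is a deliberate choice here: those cases are probably rare enough to fix by hand, and a regular expression is likely to get them wrong.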

The Finnish words should be proofread. The first command below lists all of them; the second lists only the ones that are not recognized.

cat n_smefin.xml|grep '<t '|tr '<' '>'|cut -d">" -f3|see
cat n_smefin.xml|grep '<t '|tr '<' '>'|cut -d">" -f3|ufin|grep '?'|cut -f1|see

Some Saami words are not recognized either. This command counts them:

cat *_smefin.xml|grep '<l '|tr '<' '>'|cut -d">" -f3|usme|grep '?'|wc -l

2921 out of 13131 translations contain a space. This number will get smaller as the parentheses are removed, but some will remain. Todo:

  1. First remove the parentheses (see above),
  2. Then take out the entries where all translations contain spaces,
  3. Then look at them separately, and try to add one-word translations where possible
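
Step 2 of the list needs to look at whole entries rather than single lines, so a line-based pipeline is awkward for it. The following is a minimal sketch of how the selection might be done. The sample data is invented for illustration (placeholder lemmas, a hypothetical `<r>` root element), and the helper names `translations` and `only_multiword` are not part of any GiellaLT tool.

```python
import xml.etree.ElementTree as ET


def translations(entry):
    """Return the stripped text of every <t> node in one <e> entry."""
    return [(t.text or "").strip() for t in entry.iter("t")]


def only_multiword(entries):
    """Keep the entries in which every translation contains a space."""
    selected = []
    for entry in entries:
        texts = translations(entry)
        if texts and all(" " in text for text in texts):
            selected.append(entry)
    return selected


# Invented sample: placeholder lemmas under a hypothetical <r> root element.
SAMPLE = """<r>
  <e><lg><l pos="N">lemma1</l></lg>
     <mg><tg xml:lang="fin"><t pos="N">sana</t></tg></mg></e>
  <e><lg><l pos="N">lemma2</l></lg>
     <mg><tg xml:lang="fin"><t pos="N">kaksi sanaa</t></tg></mg></e>
  <e><lg><l pos="N">lemma3</l></lg>
     <mg><tg xml:lang="fin"><t pos="N">sana</t><t pos="N">kaksi sanaa</t></tg></mg></e>
</r>"""

root = ET.fromstring(SAMPLE)
entries = root.findall("e")

# Overall count, comparable to the "contain a space" figure above.
all_texts = [text for e in entries for text in translations(e)]
with_space = sum(" " in text for text in all_texts)
print(f"{with_space} out of {len(all_texts)} translations contain a space")

# Entries that still lack any one-word translation.
for entry in only_multiword(entries):
    print(entry.findtext("lg/l"))
```

An entry with at least one one-word translation is deliberately not selected, since only entries without any one-word translation need a new one.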