This script lists all bidix pairs in which the second (rightmost) word of the pair (the smj word in the sme-smj bidix, etc.) cannot be generated.
To run the script, go to the dev directory in apertium-sme-smX:
cd dev/
Then, in the dev directory, run:
sh bidix-sanity.sh > sanityoutput
The output lists the smX entries that cannot be generated from the information given in the bidix. Each line has the following format, and the examples after it show typical cases:
sme-lemma<PoS>:smj-lemma<PoS>:^input to analyser/analysis
ahccát<vblex>:ahttsát<vblex>:^ahttsát/*ahttsát$
- “ahttsát” is not in the FST (the asterisk marks an unknown word).
aggregáhta<n>:aggregáhta<n>:^aggregáhta/aggregáhtta<n><sem_dummytag><pl><nom>/aggregáhtta<n><sem_dummytag><sg><gen>$
- The lemma in the bidix, “aggregáhta”, should be the same as the lemma in the FST, “aggregáhtta”.
ahte<cnjcoo>:jut<cnjcoo>:^jut/jut<cnjsub>$
- The lemma in the bidix, “jut”, is marked as cnjcoo, but the FST analysis gives cnjsub. The PoS should be changed either in the bidix or in the FST.
ahkitvuohta<n>:ahketvuohta<n>:^ahketvuohta/ahket<adj><sem_dummytag><der_vuohta><n><sg><nom>$
- “ahketvuohta” is not lexicalised in the FST. It can be lexicalised there. Alternatively, because the sme and smj words have the same derivation, the word pair can be removed from the bidix: if the word pair “ahkit”-“ahket” is in the bidix, the transfer rules should make it possible to generate “ahketvuohta”.
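The four cases above can be told apart mechanically. The following is a minimal Python sketch, not part of bidix-sanity.sh. The function names and category labels are hypothetical, and it assumes that lemmas contain no ":" or "/" characters and that derivation tags start with "der_", as in the example above.

```python
import re

TAG_RE = re.compile(r"<([^>]+)>")


def split_reading(reading):
    """Split 'lemma<tag1><tag2>' into ('lemma', ['tag1', 'tag2'])."""
    lemma = reading.split("<", 1)[0]
    return lemma, TAG_RE.findall(reading)


def parse_line(line):
    """Split one output line into source, target, surface form and analyses."""
    source, target, lookup = line.strip().split(":", 2)
    # lookup looks like ^surface/analysis1/analysis2$
    surface, *analyses = lookup.lstrip("^").rstrip("$").split("/")
    return source, target, surface, analyses


def classify(line):
    """Return a rough label for why the target word cannot be generated."""
    _, target, _, analyses = parse_line(line)
    lemma, tags = split_reading(target)
    pos = tags[0] if tags else None
    # An analysis starting with '*' means the analyser does not know the word.
    if not analyses or all(a.startswith("*") for a in analyses):
        return "not in FST"
    readings = [split_reading(a) for a in analyses if not a.startswith("*")]
    # Same lemma, but the first tag (the PoS) differs from the bidix PoS.
    if any(l == lemma and t[:1] != [pos] for l, t in readings):
        return "PoS mismatch"
    # The FST only reaches the word through a derivation (assumed 'der_' tag).
    if any(l != lemma and any(tag.startswith("der_") for tag in t)
           for l, t in readings):
        return "derived, not lexicalised"
    if all(l != lemma for l, _ in readings):
        return "lemma mismatch"
    return "other"
```

Calling classify on each line of sanityoutput gives a first grouping of the entries; the labels are only heuristics, and every entry still needs a manual check.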
It is possible to process sanityoutput so that you get a list that contains no names and in which the smX words are sorted by word ending. The words are then easier to evaluate (the same words come one after another), and easier to copy into the FST.
When you are in the dev directory:
sh sorting_sanityoutput.sh
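The kind of post-processing described above can be illustrated with a small Python sketch. It does not reproduce sorting_sanityoutput.sh, whose contents are not shown here. The function name is hypothetical, and the sketch assumes that names carry the tag <np> on the smX side of the pair, which may differ from what the actual script checks.

```python
def sort_by_ending(lines, name_tag="<np>"):
    """Drop name entries and sort the rest by the ending of the smX lemma."""

    def target_field(line):
        # Second colon-separated field, e.g. 'ahttsát<vblex>'
        return line.split(":", 2)[1]

    def reversed_lemma(line):
        # Sorting on the reversed lemma groups words with the same ending.
        # Note: this is plain code-point order, not Sámi alphabetical order.
        return target_field(line).split("<", 1)[0][::-1]

    kept = [ln.strip() for ln in lines
            if ln.strip() and name_tag not in target_field(ln)]
    return sorted(kept, key=reversed_lemma)
```

One possible use is to read sanityoutput line by line, pass the lines to sort_by_ending, and write the result to a new file for review.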