GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
Make sure you have CorpusTools installed.
Follow the instructions here.
A lemgram is Språkbanken's way of modeling what linguists and lexicographers refer to as lexemes and lemmas. Ordbild uses lemgrams.
Generation of lemgrams from lexc (note: this may be obsolete, read with care):
Use generator-dict-gt-norm.hfstol. We remove the tags v1, v2, ... from the FST, because it is better for the user that all variants of the same paradigm end up in the same lemgram. Many FST lemmas have more than one entry in lexc, so the lemma list should be deduplicated before generating forms. I suggest that we start with these files:
For nouns, we pick three different lists: the ordinary nouns, the actors (NomAg), and the G3-marked nouns. For the other parts of speech, one command is enough. Commands to filter (ir)relevant forms:
* Ordinary words:
egrep -v "(G3|ACTOR|CmpN/Only|ShCmp|RCmpnd|\+V\+|^\!)"
* Actors (NomAg):
grep N+NomAg
* G3-marked nouns:
grep N+G3
egrep -v "(ENDLEX|\+V|^\!)"
egrep -v "(LEXICON|Der| Rreal | R |^\!)"
egrep -v "(LEXICON| K |^\!)"
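The noun filters and the deduplication step can be tried out on a small made-up input before running them on real lexc files. The following is a minimal sketch, not the project's actual pipeline: the sample entries, the lemmas `alfa`, `beta`, `gamma`, the continuation lexicon names, and the helper function names are all hypothetical, and the lemma extraction assumes that each entry starts with the lemma followed by `+` or `:`. It uses `grep -E`, which behaves like `egrep`, and writes `^!` without the backslash because the pattern is single-quoted.

```shell
#!/bin/sh
# Hypothetical lexc-style entries, invented for illustration only.
sample_lexc() {
  cat <<'EOF'
! a comment line
alfa+N:alfa N_ODD ;
alfa+N:alf N_EVEN ;
beta+N+NomAg:beta ACTOR ;
gamma+N+G3:gamma N_ODD ;
EOF
}

# Ordinary nouns: drop G3 nouns, actors, compound-only entries, verbs and comments.
ordinary_nouns() {
  grep -E -v '(G3|ACTOR|CmpN/Only|ShCmp|RCmpnd|\+V\+|^!)'
}

# Actor nouns: keep only lines containing the literal string N+NomAg.
actor_nouns() {
  grep -F 'N+NomAg'
}

# G3-marked nouns: keep only lines containing the literal string N+G3.
g3_nouns() {
  grep -F 'N+G3'
}

# Assumption: the lemma is everything before the first '+' or ':'.
# sort -u removes lemmas that have more than one lexc entry.
lemmas() {
  sed -E 's/[+:].*//' | LC_ALL=C sort -u
}

sample_lexc | ordinary_nouns | lemmas
sample_lexc | actor_nouns | lemmas
sample_lexc | g3_nouns | lemmas
```

Each of the three resulting lemma lists could then be combined with the relevant tag strings and sent to the generator. That step is left out here, since it depends on the tag set and the files chosen.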