(or: how to fix decomposed Sami letters)
In Unicode, many letters can be represented either by a single character or by a sequence of several. The letter á, for example, may be either the one character á (U+00E1) or the two characters a and combining ´ (U+0301). Normalisation forms are used to standardise the representation.
Two normalisation forms are relevant here: NFKD decomposes the characters (á as two characters), whereas NFKC composes them (á as one character).
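The difference between the two representations is visible at the byte level. The following is a minimal sketch, assuming a POSIX shell and UTF-8 encoding. The variable names are illustrative only and not part of any GiellaLT tool.

```shell
# Composed á: U+00E1, encoded in UTF-8 as the two bytes C3 A1.
composed=$(printf '\303\241')
# Decomposed á: "a" followed by U+0301 COMBINING ACUTE ACCENT (bytes CC 81).
decomposed=$(printf 'a\314\201')

# Count the bytes in each representation (tr strips padding some wc versions add).
composed_bytes=$(printf '%s' "$composed" | wc -c | tr -d ' ')
decomposed_bytes=$(printf '%s' "$decomposed" | wc -c | tr -d ' ')
echo "composed: $composed_bytes bytes, decomposed: $decomposed_bytes bytes"

# The two strings look the same on screen but do not compare equal.
if [ "$composed" = "$decomposed" ]; then same=yes; else same=no; fi
echo "identical: $same"
```

Because the strings differ byte for byte, a tool that expects one representation will not match text written in the other.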
Our North Sami analysers use the composed representation.
If you get text with decomposed letters (UnicodeChecker, for example, will report that č is two characters), you must compose them with the following command:
cat infile.txt \
| uconv -f utf8 -t utf8 -x Any-NFKC > outfile.txt
See also man uconv.
The uconv program should be installed on your machine as part of the ICU installation.
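To find out whether a file needs converting at all, one option is to compare it with its own NFKC form. The following is a sketch only, assuming uconv is on your PATH. The function name needs_nfkc is hypothetical and not part of ICU or GiellaLT.

```shell
# Hypothetical helper: exit status 0 if the file differs from its NFKC form,
# 1 if it is already normalised, 2 if uconv is not available.
needs_nfkc() {
    command -v uconv >/dev/null 2>&1 || { echo "uconv not found" >&2; return 2; }
    # uconv writes the normalised text to stdout; cmp -s compares it
    # silently with the original file, and "!" inverts the result.
    ! uconv -f utf8 -t utf8 -x Any-NFKC "$1" | cmp -s - "$1"
}
```

It could then be used as `if needs_nfkc infile.txt; then echo "infile.txt contains non-NFKC text"; fi` before running the conversion command above.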