GiellaLT provides rule-based language technology aimed at minority and indigenous languages.
The environment must be set up before you start. If it is not, see the Getting Started page under the Overview section on the front page.
Open the terminal. We assume that you are in the lang-XXX folder, where XXX is the three-letter code of your language (sme for North Saami, etc.).
To analyse word forms interactively:
- Type hfst-lookup -q src/analyser-gt-desc.hfstol (followed by ENTER)
- Then write the words that shall be analysed, one word at a time, each followed by ENTER.
To generate word forms interactively:
- Type hfst-lookup -q src/generator-gt-desc.hfstol (followed by ENTER)
- Then write lemma + tags for the word forms that shall be generated, one at a time, each followed by ENTER.
- The tag format and the tags themselves are the same as in the output of analysis mode.
To analyse a whole file with one word per line and page through the result:
cat testfile.txt | hfst-lookup -q src/analyser-gt-desc.hfstol | less
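The interactive session and the file pipeline above can also be wrapped in a small shell function. The following is a minimal sketch, not part of the GiellaLT tooling: the function name analyse_words is hypothetical, and it assumes that you are in the lang-XXX folder and that hfst-lookup is on your PATH. It checks that the analyser file exists before calling it, then feeds each argument to the analyser on its own line.

```shell
# Hypothetical helper (not shipped with GiellaLT): analyse the words
# given as arguments, one word per line, without an interactive session.
analyse_words() {
    analyser="src/analyser-gt-desc.hfstol"
    # Fail early with a readable message if the analyser is not built
    # or if we are not in the lang-XXX folder.
    if [ ! -f "$analyser" ]; then
        echo "Cannot find $analyser (are you in the lang-XXX folder?)" >&2
        return 1
    fi
    # printf repeats the format for every argument: one word per line.
    printf '%s\n' "$@" | hfst-lookup -q "$analyser"
}
```

With the function defined in your shell, a call such as analyse_words boahtán lean would print the analyses for both words.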
For HFST, there is an alternative procedure for preprocessing text that uses transducers instead of Perl. The command to tokenise, analyse and print the output in a CG-compatible format is:
cat testfile.txt | hfst-tokenise --giella-cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
If the transducer contains weights, the constraint grammar may make use of them, as follows:
cat text | hfst-tokenise --giella-cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | ...
Please note that the file tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst is not built by default. To enable building it, configure as follows:
A shorter version of hfst-lookup -q src/analyser-gt-desc.hfstol is husme (given that your language is sme). See the documentation SOMEWHERE to ensure you have the aliases set up.
You may have a family of aliases set up on your machine. Find out whether you have by writing alias smedis. If the answer is sent-proc.sh -s dis, they are set up. If the answer is -bash: alias: smedis: not found, they are not.
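The same check can be written as a small shell test. This is only a sketch: the function name has_alias is hypothetical, and it relies on the standard alias builtin returning a non-zero status for an undefined name. Since aliases are normally defined in the startup files of interactive shells, paste it into the terminal where you work rather than running it as a separate script.

```shell
# Hypothetical helper: succeed if the named alias is defined in this shell.
has_alias() {
    alias "$1" >/dev/null 2>&1
}

# smedis is the alias name used in this document; replace "sme" with
# your own language code.
if has_alias smedis; then
    echo "smedis is set up"
else
    echo "smedis is not set up"
fi
```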
The aliases contain a pipeline combining Perl pre- and postprocessing with xfst transducers and constraint grammar. These aliases may be written anywhere (replace “sme” with your own language code). Note that they need the
These aliases may be used in two ways. Either write the alias followed by a sentence in quotes:
smedis "Mun lean boahtán."
Or, alternatively, pipe a file through it:
cat testfile.txt | smedis
Instead of just showing the result on the screen as running text (as above), you can process it further. Here are some examples. Each text string should be added after smedis (or the corresponding command) above.
| grep '+N+Pl' > plnouns
(to get all plural nouns and save them to the file plnouns)
| grep -v '\?' | cut -f2 | sort | uniq -c | sort -nr | less
(to get a frequency list of the lexemes that the parser recognizes)
| grep '\?' | sort | uniq -c | sort -nr | less
(to get a frequency list of the words that the parser does not recognize)
| grep '\+\?' | sort | uniq -c | sort -nr | less
(to get a frequency list of the word forms that the parser does not recognize)
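To try the counting steps without an analyser at hand, the sketch below runs them on a few sample lines. Several things in it are assumptions made for illustration: the sample lines follow the tab-separated layout of hfst-lookup output (input, analysis, weight) but are written by hand, not produced by a real analyser, and the function names are hypothetical. It also uses grep -F (fixed-string matching) on the string +? instead of the escaped patterns above, which is a substitution made here to avoid any regular-expression interpretation of + and ?.

```shell
# Hand-written sample in the tab-separated layout input, analysis, weight.
# The analyses are invented for the demonstration.
sample_output() {
    printf 'viesut\tviessu+N+Pl+Nom\t0.0\n'
    printf 'viesut\tviessu+N+Pl+Nom\t0.0\n'
    printf 'beatnagat\tbeana+N+Pl+Nom\t0.0\n'
    printf 'xyzzy\txyzzy+?\tinf\n'
}

# Frequency list of recognised analyses (column 2), most frequent first.
known_freq() {
    grep -v -F '+?' | cut -f2 | sort | uniq -c | sort -nr
}

# Frequency list of unrecognised input forms (column 1).
unknown_freq() {
    grep -F '+?' | cut -f1 | sort | uniq -c | sort -nr
}

sample_output | known_freq
sample_output | unknown_freq
```

On real data, replace sample_output with the analysis command, for example cat testfile.txt | hfst-lookup -q src/analyser-gt-desc.hfstol | known_freq.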