GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.


The list of foreign words

Incoming text contains many foreign words. Used in isolation, as spontaneous loans, they should be dealt with by a POS guesser. Text chunks in foreign languages represent noise, though, and a good corpus should mark such chunks with XML tags (<foreign></foreign>, etc.). While waiting for that, and while developing our parser, we have a stoplist of foreign words. The list was made in the following way:

  1. Large lists of Norwegian, Swedish, Danish, Finnish and English words were sorted into one list, called gt/script/old-foreign.txt. The list was then doubled by adding an identical list with capital initial letters (using case.regex gave too long a compilation time).
  2. The list was run through sme.fst, and the overlapping words (abonnere, adagio, Adam, addere, etc.) were removed.
  3. In addition, a file gt/script/new-foreign.txt was added to CVS, containing non-Saami words from our corpus files.
  4. Each of these files was turned into an fst file. Then the union of the two files was made into one binary file, foreign.fst.
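The steps above can be sketched with plain Python sets. This is only an illustration under stated assumptions, not the actual Makefile logic: the real build compiles the lists into transducers, whereas here the union is a set union, and `is_saami_word` is a hypothetical predicate standing in for a lookup against sme.fst. The sample words and the tiny vocabulary are made up.

```python
def with_capitalised_variants(words):
    """Return the words plus a copy of each with an upper-case initial letter."""
    variants = set()
    for word in words:
        if word:  # skip empty entries
            variants.add(word)
            variants.add(word[0].upper() + word[1:])
    return variants


def build_stoplist(language_lists, corpus_words, is_saami_word):
    """Set-based stand-in for steps 1-4.

    language_lists: iterable of word lists, one per language (step 1)
    corpus_words:   non-Saami words collected from the corpus (step 3)
    is_saami_word:  hypothetical predicate replacing a lookup in sme.fst (step 2)
    """
    merged = set()
    for words in language_lists:
        merged.update(w.strip() for w in words)
    # Steps 1-2: add capitalised copies, then drop words the Saami analyser accepts.
    old_foreign = {w for w in with_capitalised_variants(merged) if not is_saami_word(w)}
    # Step 3: the separately maintained list of corpus words.
    new_foreign = {w.strip() for w in corpus_words if w.strip()}
    # Step 4: the union of the two lists.
    return old_foreign | new_foreign


# Made-up sample data; the vocabulary only imitates what sme.fst might accept.
saami_vocab = {"abonnere", "adagio"}
stoplist = build_stoplist(
    [["abonnere", "hund"], ["dog", "adagio"]],
    ["website"],
    lambda w: w.lower() in saami_vocab,
)
print(sorted(stoplist))
```

In this sketch only the merged language lists are filtered against the Saami analyser, mirroring the order of the steps; the corpus-derived list is taken as already non-Saami.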

The compilation is included in the Makefile. The source files are in the gt/script directory, whereas the binary files are in the gt/sme/bin directory. Only foreign.fst should be used; the two other ones are intermediate files.

foreign.fst should be used as follows: When investigating Saami words that the parser cannot cope with, foreign words are just noise. They can be removed with this command line:

cat text | preprocess ... | lookup -flags mbTT sme.fst | grep '\?' | cut -f1 | lookup -flags mbTT foreign.fst | grep '\?' | cut -f1 | ...

Now, only the words which are not recognised by the parser, and not part of the stop list, are included.
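The two-stage filtering can be illustrated in Python as follows. This is a rough sketch, not a replacement for the pipeline: `fake_lookup` is a hypothetical stand-in for the `lookup` tool, and its output format (tab-separated fields, with a `+?` marker on unrecognised words) is an assumption inferred from the `grep` and `cut` steps in the command. The vocabularies are made up.

```python
def unrecognised(lookup_lines):
    # Mimics the grep/cut pair in the command line: keep lines that contain
    # a '?' marker and return their first tab-separated field.
    return [line.split("\t")[0] for line in lookup_lines if "?" in line]


def fake_lookup(vocabulary):
    """Hypothetical stand-in for `lookup` with a given transducer.

    Recognised words get a dummy analysis, others get a '+?' marker.
    The output format is an assumption made for this sketch.
    """
    def analyse(word):
        if word in vocabulary:
            return f"{word}\t{word}+Known"
        return f"{word}\t{word}\t+?"
    return analyse


def remaining_unknowns(words, sme, foreign):
    # Stage 1: words the Saami analyser does not recognise.
    not_saami = unrecognised(sme(w) for w in words)
    # Stage 2: of those, words that are not in the foreign stoplist either.
    return unrecognised(foreign(w) for w in not_saami)


# Made-up vocabularies standing in for sme.fst and foreign.fst.
sme = fake_lookup({"giella"})
foreign = fake_lookup({"website", "hund"})
leftover = remaining_unknowns(["giella", "website", "hund", "xyzzy"], sme, foreign)
print(leftover)
```

Only the last word survives both stages here, which is the kind of word worth inspecting when improving the parser.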

The list of foreign words was cut in two because compilation time for the whole list is very long. The intention with the split is that old-foreign.txt should be left alone. All additional words should be added to the shorter new-foreign.txt file. If this file becomes too long, it may be transferred over to old-foreign.txt.
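When adding words under this convention, it may help to check candidates against both lists first. The helper below is a hypothetical sketch (the function name and the in-memory lists are assumptions, not part of the repository): it returns only the candidates found in neither list, which could then be appended to new-foreign.txt.

```python
def words_to_append(candidates, old_words, new_words):
    """Return candidates found in neither list, de-duplicated, in first-seen order."""
    known = set(old_words) | set(new_words)
    fresh = []
    for word in candidates:
        word = word.strip()
        if word and word not in known:
            known.add(word)  # also guards against duplicates among the candidates
            fresh.append(word)
    return fresh


# Made-up sample data standing in for the contents of the two files.
fresh = words_to_append(["blog", "hund", "blog", " email "], ["hund"], ["website"])
print(fresh)
```

The old list is only read, never modified, which matches the intention that old-foreign.txt be left alone.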
