GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.
This text is taken from Johnson, Antonsen and Trosterud (2013).
Thanks to the open-source community, there are numerous resources available which make it easy to produce designs with good cross-browser compatibility. Previously, troubleshooting these issues for each individual browser took time that one would rather spend on implementation and, ultimately, on producing usable software.
In this case, we used Twitter Bootstrap, which has resulted in an easy-to-use, minimal layout with little effort. The layout works consistently across all the major browsers for desktop operating systems, as well as the most popular mobile browsers. Thus, there is no real need to produce code specific to Apple’s iOS or the Android operating system, or to pay for the licensing setup involved in iOS development; we get all of this for free.
Not having to worry about the design meant that there was more time left for developing functionality. Our dictionary is based on Flask, a light and flexible web framework for Python.
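As a rough illustration of what such a Flask-based lookup service might look like (the route, parameters and the lookup_lemma helper below are purely hypothetical, not the project's actual API):

```python
# Minimal sketch of a Flask lookup view; the route, query parameter and
# lookup_lemma() are illustrative assumptions, not the real dictionary API.
from flask import Flask, jsonify, request

app = Flask(__name__)

def lookup_lemma(lang_pair, word):
    # Placeholder: in the real application this would consult the XML
    # lexicon and the morphological analysers.
    return [{"lemma": word, "translations": []}]

@app.route("/lookup/<lang_pair>/")
def lookup(lang_pair):
    word = request.args.get("q", "")
    return jsonify(lookup_lemma(lang_pair, word))

if __name__ == "__main__":
    app.run()
```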
As mentioned above, the lexical data used in this application is stored in an XML format, with one file per language pair, per direction (thus making separate files for language 1 to language 2, and language 2 to language 1). These files range in size from 2 MB to 5 MB, and are used on the live site directly, without the need for a relational database to store the data. On our server, queries end up being quite fast, but to ensure that this continues to be true for larger dictionaries, we also use one of the fastest XML libraries currently available for Python, lxml. This allows us to simply update the files and restart the service; any new lexical entries are then immediately available to users.
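As a small sketch of how such a lexicon file could be loaded and queried with lxml (the file name and the e/lg/l element structure here are assumptions made for illustration, not the actual XML schema):

```python
# Sketch of loading one lexicon file with lxml and querying it by lemma.
# The file name and element names are illustrative assumptions only.
from lxml import etree

tree = etree.parse("sme-nob.xml")   # one file per language pair and direction

def entries_for(lemma):
    # XPath lookups keep queries fast even without a relational database.
    return tree.xpath('//e[lg/l/text()=$lemma]', lemma=lemma)
```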
Because the application relies on external tools for lemmatisation and tagging, the results of communication with these processes are stored in an in-memory cache. All analyses are cached by lemma, and all generated forms are cached by lemma and tag. Thus, when a future query includes a compound word containing one of these lemmas, we can simply retrieve the base forms from the cache instead of sending them to the external tools again.
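The caching idea can be sketched roughly as follows; the analyse and generate callables stand in for whatever interface wraps the external FST tools, so the names are hypothetical:

```python
# Sketch of the in-memory caching scheme: analyses keyed by lemma,
# generated forms keyed by (lemma, tag). analyse()/generate() are
# placeholders for calls to the external morphological tools.
analysis_cache = {}
generation_cache = {}

def cached_analyse(lemma, analyse):
    if lemma not in analysis_cache:
        analysis_cache[lemma] = analyse(lemma)
    return analysis_cache[lemma]

def cached_generate(lemma, tag, generate):
    key = (lemma, tag)
    if key not in generation_cache:
        generation_cache[key] = generate(lemma, tag)
    return generation_cache[key]
```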
The need for a cache arose in response to the start-up time of some of the larger FSTs; once running, however, actual lookups are extremely fast. For now caching is sufficient, but if usage and load indicate that further optimisation is necessary, one possible solution is to keep the external tools running in separate processes and simply communicate with them via sockets. There are a variety of existing solutions for this, so the work required would be minimal.
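One way such socket-based communication could look, as a rough sketch (the host, port and line-based protocol are assumptions for illustration, not an existing setup):

```python
# Sketch of querying a long-running analyser process over a socket
# instead of starting the tool per request. Host, port and the
# newline-terminated protocol are illustrative assumptions.
import socket

def analyse_via_socket(wordform, host="localhost", port=9000):
    with socket.create_connection((host, port)) as conn:
        conn.sendall((wordform + "\n").encode("utf-8"))
        return conn.recv(4096).decode("utf-8")
```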
Our previous wordform dictionaries required installation in two steps: installing a separate dictionary program (StarDict for Windows and Linux, and the preinstalled Dictionary.app for Mac OS X), and then downloading and installing the linguistic files in that program. New and updated dictionary versions required downloading and installing again. Our new, web-based approach naturally avoids all of this, as users only need the URL. The web dictionary may also be updated by the providers at any time, without users having to be aware of or perform the updates themselves.
Compared to our web dictionary, the wordform dictionaries had one major advantage: they could be used to click on words in any application running within the operating system in order to get an analysis and definition, whereas the similar functionality provided by the web dictionary only works on web pages within the browser. However, newer versions of Mac OS X have lost a user-friendly means of installing additional dictionaries into the preinstalled dictionary application, so even this has become a point in favor of web-based solutions.
There is also an advantage for the providers of the dictionary, programmers and linguists alike. With the previous wordform dictionaries, new versions of the software (such as StarDict) required adjustments to the format of the dictionary files, and we often found ourselves torn between adding more linguistic content and keeping file sizes small. As such, running the dictionary on a server with the already existing lexicons is a big step forward.
Having a server-based system also allows us to pay attention to actual usage of the system. We log all incoming queries along with their results in order to detect areas where the dictionary needs expansion, and any resulting updates become available to users as soon as they are made.
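A minimal sketch of how such query logging might be wired into a Flask application; the field names and the app object are illustrative only, not the project's real log format:

```python
# Illustrative sketch: log each incoming query and the response status
# so that unanswered lookups can reveal gaps in the lexicon.
import logging
from flask import Flask, request

app = Flask(__name__)              # stands in for the real application object
logger = logging.getLogger("lookups")

@app.after_request
def log_query(response):
    # Record the query string and the outcome for later analysis.
    logger.info("q=%s status=%s", request.args.get("q"), response.status_code)
    return response
```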
In addition to the search form in the web interface, we provide detailed lexical entries in an easily linkable HTML format, and in a more bare-bones format, JSON (JavaScript Object Notation). JSON is a widely adopted, open standard for communication between applications, with a particular focus on web applications. The intent is that the data is provided not just for our own web-based dictionary interface, but that it may also be used within external applications, on other websites, and even potentially in mobile services.
The data is exposed via a couple of public-facing API endpoints, or URL paths, more or less following the REST (Representational State Transfer) architecture. One of the endpoints, which provides detailed word entries with inflectional paradigms, has already been included in MultiDict’s Wordlink, a reading comprehension tool that covers many other languages and dictionaries. Wordlink is quite useful, but naturally we had some designs of our own for how to use this API.
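As a sketch of how an external application might consume such a JSON endpoint (the base URL, path and response shape below are assumptions for illustration, not the documented API):

```python
# Illustrative client for a hypothetical JSON lookup endpoint.
import json
from urllib.request import urlopen

def lookup(lang_pair, word):
    # URL and response structure are assumptions, not the real endpoint.
    url = f"https://example.org/lookup/{lang_pair}/?q={word}"
    with urlopen(url) as resp:
        return json.load(resp)   # e.g. a list of entries with translations

# print(lookup("sme-nob", "giella"))
```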
Two of the learning tools already constructed for North Saami are Kursa and Oahpa. Kursa is a free, multimedia-rich set of online course materials in North Saami, containing lessons with text and audio recordings, implemented in WordPress, a free and open-source blogging tool.
To go with these learning materials, we have created a plugin for WordPress, written in JavaScript, jQuery, and Twitter Bootstrap, which provides access to lemmatisation, compound analysis and lexicon lookup. Users simply Alt/Opt+Double Click a word, and it is highlighted, with a text bubble appearing below it that contains word translations and wordform analyses. Users can quickly and easily look up as many words as they need to comprehend a text, which removes one of the barriers to reading in a new language, namely the need to frequently look up words in a dictionary while being unacquainted with their potential “dictionary” word forms.
The modular nature of the core library within the plugin allows it to be reused in several other contexts with ease. For example, it could be included on a specific page or website, or inserted into any page via a browser plugin. We have ensured that the library works in the most commonly used current web browsers; this functionality is thus available on Windows, Mac OS X and Linux, in Internet Explorer, Firefox, Chrome, Opera, and Safari.
In addition to the plugin for Kursa, we have produced a cross-browser solution which is similar to a browser extension, but is instead accessible via a bookmarklet: a bookmark that provides functionality rather than a link to a website. This option has proved much preferable to developing (and convincing users to install) browser-specific plugins, and “installation” is simply a drag-and-drop affair. When on a page they wish to read, users simply click the bookmarklet, which downloads the plugin source and includes it in the HTML document the user is viewing. The world of news, blogs, or even Facebook thus becomes accessible in all of the language pairs that we support.