GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
The language technology tools are available as a server that can be used by a variety of client applications. One of these applications is the cgi-bin interface. There is also a command line client for the server, which offers a user interface to the language technology tools. The tools include Xerox tools such as lookup for analysis, generation and hyphenation, and vislcg for disambiguation. In addition, there is a paradigm generator and a preprocessor, among other smaller scripts.
The disambiguator is not yet fully implemented.
The communication with the server is implemented over a TCP socket. The server listens for and receives the incoming client requests. Each client is forked into its own process, so that several clients can be served in parallel. The communication scheme does not follow any pre-existing protocol; it is explained below.
Whenever a client connects to the server, the first communication after the welcome message is the request for the tools. The request is given in xml and includes the names of the tools and other information, such as the input language and whether the input and/or output will be in xml. The tools are started only at the client's request. Once the tools are initialized, they keep running until the client closes the connection. More than one tool may be running simultaneously; the tool to use can be selected in the xml input.
The communication continues so that the client sends some input data followed by a newline. The server processes the input and sends output followed by “END_REPLY”. When the client sends “END_REQUEST” or otherwise stops, the server closes the connection.
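The request and reply framing described above can be sketched from the client side in Python. This is a minimal sketch, not the project's own client: the helper names `send_line` and `read_reply` are hypothetical, UTF-8 encoding is assumed, and it is assumed that the END_REPLY marker never occurs inside the tool output itself.

```python
import socket

END_REPLY = "END_REPLY"      # marker the server appends to every reply
END_REQUEST = "END_REQUEST"  # marker the client sends to end the session


def send_line(sock: socket.socket, text: str) -> None:
    """Send one request: the input data followed by a newline."""
    sock.sendall((text + "\n").encode("utf-8"))


def read_reply(sock: socket.socket) -> str:
    """Read until END_REPLY arrives and return the text that precedes it."""
    marker = END_REPLY.encode("utf-8")
    buffer = b""
    while marker not in buffer:
        chunk = sock.recv(4096)
        if not chunk:  # the server closed the connection mid-reply
            raise ConnectionError("connection closed before END_REPLY")
        buffer += chunk
    # Anything after the marker (assumed to be whitespace only) is dropped.
    return buffer.split(marker, 1)[0].decode("utf-8").strip()
```

A session built on these helpers would connect, read the welcome message, send the parameters, alternate `send_line` and `read_reply` for each input, and finish with `send_line(sock, END_REQUEST)`. The host, the port and the exact framing of the welcome message are not specified here, so they are left out of the sketch.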
The parameters sent with the client request comprise the instructions to the server. A sample xml containing the parameters for the analyzer and the hyphenator:
<parameters>
<action tool="anl" fst="" args=""/>
<action tool="hyph" filter="yes"/>
<lang>sme</lang>
<xml_in/>
<xml_out/>
</parameters>
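A parameters block like the sample can also be generated programmatically, which guarantees well-formed xml. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `build_parameters` is hypothetical, and it is an assumption that the mere presence of the `xml_in` and `xml_out` elements is what requests xml input and output.

```python
import xml.etree.ElementTree as ET


def build_parameters(lang, tools, xml_in=True, xml_out=True):
    """Build a <parameters> request; `tools` is a list of attribute dicts."""
    params = ET.Element("parameters")
    for attrs in tools:
        ET.SubElement(params, "action", attrs)  # one <action> per tool
    ET.SubElement(params, "lang").text = lang
    if xml_in:   # assumption: presence of the element turns xml input on
        ET.SubElement(params, "xml_in")
    if xml_out:  # assumption: presence of the element turns xml output on
        ET.SubElement(params, "xml_out")
    return ET.tostring(params, encoding="unicode")


request = build_parameters(
    "sme",
    [{"tool": "anl", "fst": "", "args": ""}, {"tool": "hyph", "filter": "yes"}],
)
print(request)
```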
The structure is explained in detail below:
action: Contains the name of the tool (attribute tool) and its command line arguments (-flags mbTT -utf8). Generation and paradigm generation also have the argument -d. When filter="yes" is set, the location of the filter script can optionally be given in the attribute filter_script.
language: The input language, given in the lang element (sme in the sample above).
xml_in, xml_out: Whether the input and/or the output will be in xml.
Input and output to the server are given as they would be if the application were started from the command line. For example, the input and output of the analyzer are:
Oslo
Oslo Oslo+N+Prop+Plc+Sg+Acc
Oslo Oslo+N+Prop+Plc+Sg+Gen
Oslo Oslo+N+Prop+Plc+Sg+Nom
A special case is the paradigm generator, which receives the lemma and the POS tag, separated by a space, as input:
Oslo N
...
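On the client side, the plain-text analyzer output shown above can be split into its parts. The sketch below is only an illustration with hypothetical function names; it assumes that each output line has the shape "form analysis", that the form contains no whitespace, and that the lemma is everything before the first "+" in the analysis.

```python
def parse_analysis_line(line):
    """Split one analyzer output line into (form, lemma, tags)."""
    form, analysis = line.split(None, 1)         # any whitespace separates the fields
    lemma, _, tags = analysis.strip().partition("+")
    return form, lemma, tags


def paradigm_input(lemma, pos):
    """Format paradigm generator input: lemma and POS tag separated by a space."""
    return f"{lemma} {pos}"


for line in ["Oslo Oslo+N+Prop+Plc+Sg+Acc", "Oslo Oslo+N+Prop+Plc+Sg+Gen"]:
    print(parse_analysis_line(line))
print(paradigm_input("Oslo", "N"))
```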
When the input is given as plain text, only one tool may be running. With the xml structure, a client can start several tools (but only one of each kind) and select the tool in the input.
The xml structures of input and output are interconnected, since the output from the preprocessor has to be valid input for the analyzer, and the output of the analyzer has to be valid input for the disambiguator. The following is a sample input and output of the analyzer:
<root tool="anl"><w form="Oslo"/></root>
<root>
<w form="Oslo">
<reading analysis="N+Prop+Plc+Sg+Acc" lemma="Oslo"/>
<reading analysis="N+Prop+Plc+Sg+Gen" lemma="Oslo"/>
<reading analysis="N+Prop+Plc+Sg+Nom" lemma="Oslo"/>
</w>
<w form="oslolaččat">
<reading analysis="N+Prop+Plc+Der/laš+A+Adv" lemma="Oslo"/>
<reading analysis="N+Prop+Plc+Der/laš+A+Pl+Nom" lemma="Oslo"/>
<reading analysis="N+Prop+Plc+Der/laš+A+Sg+Acc+PxSg2" lemma="Oslo"/>
<reading analysis="N+Prop+Plc+Der/laš+A+Sg+Gen+PxSg2" lemma="Oslo"/>
</w>
</root>
As a matter of fact, the element <root> is not named anywhere in the program, so in principle any name can be used. However, the DTD is stricter.
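The xml input and output of the analyzer shown above can be produced and consumed with a standard xml parser. The following Python sketch uses `xml.etree.ElementTree`; the function names are hypothetical, and the code relies only on the element and attribute names visible in the sample (`w`, `reading`, `form`, `lemma`, `analysis`, and the `tool` attribute on the top-level element).

```python
import xml.etree.ElementTree as ET


def build_input(tool, forms):
    """Build xml input that selects `tool` and lists the word forms."""
    root = ET.Element("root", {"tool": tool})
    for form in forms:
        ET.SubElement(root, "w", {"form": form})
    return ET.tostring(root, encoding="unicode")


def parse_readings(xml_text):
    """Map each word form to its list of (lemma, analysis) readings."""
    result = {}
    # The name of the top-level element is not checked, only its <w> children.
    for w in ET.fromstring(xml_text).iter("w"):
        result[w.get("form")] = [
            (r.get("lemma"), r.get("analysis")) for r in w.iter("reading")
        ]
    return result
```

Because `parse_readings` never looks at the name of the top-level element, it accepts output wrapped in `<root>` as well as in any other element name.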
Some further examples of input and output are presented below. Each example contains only one w-node, but there is no limit on the number of input words.
<root tool="hyph"><w form="oslolaččat"/></root>
<output>
<w form="oslolaččat">
<reading hyph="os^lolač^čat"/>
</w>
</output>
<root tool="gen"><w form="Oslo+N+Prop+Sg+Loc"/></root>
<root>
<w analysis="N+Prop+Sg+Loc" lemma="Oslo">
<surface form="Oslos"/>
<surface form="Oslon"/>
</w>
</root>
<root tool="para"><w form="Oslo N"/></root>
<root>
<w analysis="N+Prop+Pl+Gen+Qst" lemma="Oslo">
<surface analysis="N+Prop+Sg+Loc" form="Oslos"/>
<surface analysis="N+Prop+Sg+Loc" form="Oslon"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosbe"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosba"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosbat"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosge"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosges"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosgen"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosgis"/>
<surface analysis="N+Prop+Sg+Loc+Foc" form="Oslosgoson"/>
...
</root>
The POS tag should perhaps be moved to an attribute as well.
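A client that receives paradigm output like the above may want to collect the surface forms per analysis. The following is a minimal Python sketch with a hypothetical function name; it assumes only that each `surface` element carries the `analysis` and `form` attributes seen in the sample.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def group_surface_forms(xml_text):
    """Collect the surface forms of a paradigm, keyed by their analysis."""
    groups = defaultdict(list)
    for surface in ET.fromstring(xml_text).iter("surface"):
        groups[surface.get("analysis")].append(surface.get("form"))
    return dict(groups)
```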
<root tool="prep">Mun in leat.</root>
<root>
<w form="Mun"/>
<w form="in"/>
<w form="leat"/>
<w form="."/>
</root>
The preprocessor output is thus analyzer or hyphenator input.
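Chaining the tools in this way can be sketched as follows. This Python example is an illustration with a hypothetical function name; it assumes that setting the `tool` attribute on the top-level element is all that is needed to turn preprocessor output into input for another tool.

```python
import xml.etree.ElementTree as ET


def retarget(xml_text, tool):
    """Return the same document with the top-level tool attribute set to `tool`."""
    root = ET.fromstring(xml_text)
    root.set("tool", tool)
    return ET.tostring(root, encoding="unicode")


prep_output = '<root><w form="Mun"/><w form="in"/><w form="leat"/><w form="."/></root>'
analyzer_input = retarget(prep_output, "anl")
print(analyzer_input)
```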