GiellaLT provides rule-based language technology aimed at minority and indigenous languages
Dokumentasjon av transferreglar.
“Apertium is a more complicated and less user friendly version of sed.”
Documentation:
Task:
Num Sg (Nom|Acc) + N Sg Gen => Num Sg Nom + N Par
Input:
$ echo "Leat guokte guoli" | apertium -d . sme-smn-biltrans
^Leat<vblex><iv><indic><pres><pl3><@+FMAINV>/Leđe<vblex><indic><pres><pl3><@+FMAINV>$ ^guokte<num><sg><nom><@←SUBJ>/kyehti<num><sg><nom><@←SUBJ>$
^guolli<n><sem_ani><sg><gen><@Num←>/kyeli<n><sem_ani><sg><gen><@Num←>$
Output:
$ echo "Láá kyehti kyellid" | hfst-proc smn-sme.automorf.hfst
^Láá/Leđe<vblex><iv><indic><pres><pl3>$
^kyehti/kyehti<num><sg><nom>$
^kyellid/kyeli<n><par>$
Rule 1: Ignore disambiguation errors!
$ echo "Leat guokte guoli" | apertium -d . sme-smn
#Leđe #kyehti kyele
So what output do we current have ?
$ echo "Leat guokte guoli" | apertium -d . sme-smn-postchunk
^Leđe<vblex><indic><pres><pl3>$ ^kyehti<num><sg><nom>$ ^kyeli<n><sg><gen>$^.<sent>$
So, what do we actually want to do ?
<n><sg><gen> --> <n><par> | [ <num><sg><nom> | <num><sg><acc>
In Apertium we call the first part (before the two pipes) the “action”, and the second part (… the context after the two pipes) the “pattern”.
So:
Pattern = [ <num><sg><nom> ]( <num><sg><acc> ) <n><sg><gen>
Action = <n><sg><gen> --> <n><par>
<sg> --> ""
<gen> --> <par>
Patterns are defined by “def-cat” entries. The “cat” stands for category.
<def-cat n="num-nomacc">
<cat-item tags="num.sg.nom.*"/>
<cat-item tags="num.sg.acc.*"/>
</def-cat>
This is a set of two items, one containing nom and one containing acc.
You can change the order of the “cat-items” (they are more or less a set). The tags are not sets, they are sequences with wildcards.
To do “or” in the category entries, you just add more cat-item lines.
<def-cat n="n-sg-gen">
<cat-item tags="n.sg.gen.*"/>
</def-cat>
So, to match the pattern “numeral singular in nominative or accusative followed by noun singular in genitive” we would do:
<pattern>
<pattern-item n="num-nomacc"/>
<pattern-item n="n-sg-gen"/>
</pattern>
Here the order is important, this is a sequence.
The “.” is not a regular expression “.” it is ><
so:
n.sg.gen.* = <n><sg><gen>(<*>)+
n.*.gen.* = <n><*><gen>(<*>)+
Let´s start to define our rule file:
-------------------------------------------------------
<transfer>
<section-def-cats>
<def-cat n="num-nomacc">
<cat-item tags="num.sg.nom.*"/>
<cat-item tags="num.sg.acc.*"/>
</def-cat>
<def-cat n="n-sg-gen">
<cat-item tags="n.sg.gen.*"/>
</def-cat>
</section-def-cats>
<section-rules>
<rule>
<pattern>
<pattern-item n="num-nomacc"/>
<pattern-item n="n-sg-gen"/>
</pattern>
</rule>
</section-rules>
</transfer>
-------------------------------------------------------
Input:
^guokte<num><sg><nom><@←SUBJ>/kyehti<num><sg><nom><@←SUBJ>$ ^guolli<n><sem_ani><sg><gen><@Num←>/kyeli<n><sem_ani><sg><gen><@Num←>$
| ________________________**| |**________________________|
Source language (SL) Target language (TL)
| _________________________________________________________|
Lexical unit (LU)
Now we look at the action. Actions are defined within the <rule>.
The action may contain different instructions, and most importantly
determine the output string. The instructions can work on both the
source and target side of the input lexical unit.
<action>
<out>
</out>
</action>
Output:
^num-noun<SN>{^kyehti<num><sg><nom>$ ^kyeli<n><par>$}$
| ______|
name
| ____________________________________________________|
Chunk
We define this with:
<chunk name="num-noun">
</chunk>
This is essentially like writing
^num-noun{}$.
Each chunk has a name, some tags and some contents, for example
to get the
<out>
<chunk name="num-noun">
<tags>
<tag><lit-tag v="SN"/></tags>
</tags>
</chunk>
</out>
This is essentially like writing
^num-noun<SN>{}$.
Looking at this in the file context:
-------------------------------------------------------
<transfer>
<section-def-cats>
<def-cat n="num-nomacc">
<cat-item tags="num.sg.nom.*"/>
<cat-item tags="num.sg.acc.*"/>
</def-cat>
<def-cat n="n-sg-gen">
<cat-item tags="n.sg.gen.*"/>
</def-cat>
</section-def-cats>
<section-rules>
<rule>
<pattern>
<pattern-item n="num-nomacc"/>
<pattern-item n="n-sg-gen"/>
</pattern>
<action>
<out>
<chunk name="num-noun">
<tags>
<tag><lit-tag v="SN"/></tags>
</tags>
</chunk>
</out>
</action>
</rule>
</section-rules>
</transfer>
-------------------------------------------------------
This matches the input pattern
And outputs:
^num-noun<SN>{}$
What is missing here is the chunk contents (e.g. the lexical units that were matched by the pattern).
<chunk name="num-noun">
<tags>
<tag><lit-tag v="SN"/></tags>
</tags>
<lu>
<clip pos="2" side="tl" part="whole"/>
</lu>
</chunk>
**side="sl" part="whole"**_
| |
| _lem_ |
| | |
^guokte<num><sg><nom><@←SUBJ>/kyehti<num><sg><nom><@←SUBJ>$
| ________________________**| |**________________________|
Source language (sl) Target language (tl)
For “part” we can define our own patterns of substrings, but there are also some built in:
So, for the rule above, it will currently output:
^num-noun<SN>{^kyeli<n><sg><gen><@Num←>$}$
So, now that we have some output, we can start with the interesting part, that is changing the output so that it will generate properly.
We´ll start with the easy way, which is just specifying directly what we want to output:
input is the output from sme-smn-biltrans Then comes this:
<out>
<chunk name="num-nomacc"> <!-- Output: ^num-noun -->
<tags>
<tag><lit-tag v="SN"/></tags> <!-- Output: <SN> -->
</tags> <!-- Output: { -->
<lu> <!-- Output: ^ -->
<clip pos="2" side="tl" part="lem"/> <!-- Output: kyeli -->
<lit-tag v="n.par"/> <!-- Output: <n><par> -->
</lu> <!-- Output: $ -->
</chunk> <!-- Output: }$ -->
</out>
The lit-tag instruction outputs strings encased in < and >.
the the output is what we get by calling sme-smn-chunker1
^num-noun<SN>{^kyeli<n><par>$}
Now, how would we output both lexical units ? The output we are looking for is:
^num-noun<SN>{^kyehti<num><sg><nom>$ ^kyeli<n><par>$}$
The rule:
<out>
<chunk name="num-noun"> <!-- Output: ^num-noun -->
<tags>
<tag><lit-tag v="SN"/></tags> <!-- Output: <SN> -->
</tags> <!-- Output: { -->
<lu> <!-- Output: ^ -->
<clip pos="1" side="tl" part="lem"/> <!-- Output: kyehti -->
<clip pos="1" side="tl" part="tags"/> <!-- Output: <num><sg><nom><@←SUBJ> -->
</lu> <!-- Output: $ -->
<lu> <!-- Output: ^ -->
<clip pos="2" side="tl" part="lem"/> <!-- Output: kyeli -->
<lit-tag v="n.par"/> <!-- Output: <n><par> -->
</lu> <!-- Output: $ -->
</chunk> <!-- Output: }$ -->
</out>
This will give:
^num-noun<SN>{^kyehti<num><sg><nom><@←SUBJ>$ ^kyeli<n><par>$}$
This is good, but we don´t want the syntax tag… <@←SUBJ>
How can we change the tags? We first need to define patterns that we want to change. For example, we could define a pattern that matches all of the possible syntax tags.
These patterns are defined in a separate section:
<section-def-attrs>
<def-attr n="function">
<attr-item tags="@←SUBJ"/>
<attr-item tags="@←OBJ"/>
<attr-item tags="@←ADVL"/>
</def-attr>
</section-def-attrs>
The “def-attr” stands for define attribute.
The procedure for changing something goes something like:
<let>
<clip pos="2" side="tl" part="function"/>
<lit v=""/>
</let>
This replaces anything substring that matches one of the patterns in def-attr n=”function” with the empty string.
Here “lit” means “literal” and the attribute “v” is the value. e.g.
<lit v=”foo” is just “foo”, while e.g. <lit-tag v=foo”/> is
(@←SUBJ|@←OBJ|@←ADVL) --> 0
So now if we have
<lu> <!-- Output: ^ -->
<clip pos="1" side="tl" part="lem"/> <!-- Output: kyehti -->
<clip pos="1" side="tl" part="tags"/> <!-- Output: <num><sg><nom> -->
</lu> <!-- Output: $ -->
We will get:
^kyehti
Note that all
What will the whole rule file look like?
<action>
<let><clip pos="1" side="tl" part="tense"/><lit-tag v="past"/></let>
<out>
<let> A B </let>
-------------------------------------------------------
-------------------------------------------------------
<transfer>
<section-def-cats>
<def-cat n="num-nomacc">
<cat-item tags="num.sg.nom.*"/>
<cat-item tags="num.sg.acc.*"/>
</def-cat>
<def-cat n="n-sg-gen">
<cat-item tags="n.sg.gen.*"/>
</def-cat>
</section-def-cats>
<section-def-attrs>
<def-attr n="function">
<attr-item tags="@←SUBJ"/>
<attr-item tags="@←OBJ"/>
</def-attr>
</section-def-attrs>
<section-rules>
<rule>
<pattern>
<pattern-item n="num-nomacc"/>
<pattern-item n="n-sg-gen"/>
</pattern>
<action>
<let>
<clip pos="2" side="tl" part="tags"/>
<lit-tag v="n.par"/>
</let>
<let>
<clip pos="1" side="tl" part="function"/>
<lit-tag v=""/>
</let>
<out>
<chunk name="num-noun">
<tags>
<tag><lit-tag v="SN"/></tag>
</tags>
<lu>
<clip pos="1" side="tl" part="lem"/>
<clip pos="1" side="tl" part="tags"/>
</lu>
<b/>
<lu>
<clip pos="2" side="tl" part="whole"/>
</lu>
</chunk>
</out>
</action>
</rule>
</section-rules>
</transfer>
-------------------------------------------------------
Homework: The data in:
A-3lex_ordinals_uptoten_gt-norm.gen.yaml
Command:
hfst-proc sme-smn.automorf.hfst
More phrases: