GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
Proper nouns and hyphens: when a name takes a derivational suffix such as -laš, we should get a hyphenation point in front of the suffix, but we do not. Example:
Oslo > os^lo*laš
* = missing hyphenation point
sme $ lookup -flags mbTT -utf8 bin/hyph-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
oslolaš
oslolaš Os^lolaš
sme $ lookup -flags mbTT -utf8 bin/hyphrules-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 100%
oslolaš
oslolaš os^lo^laš
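The `*` notation above can be produced mechanically by merging the two lookup results. The following is only a minimal sketch: the function name `mark_missing` and its parameters are hypothetical and not part of the GiellaLT tools, and it assumes that both strings spell the same word apart from case and hyphenation points.

```python
def mark_missing(actual: str, target: str, hyphen: str = "^", missing: str = "*") -> str:
    """Merge two hyphenations of one word; points found only in `target` become `missing`."""
    if actual.replace(hyphen, "").lower() != target.replace(hyphen, "").lower():
        raise ValueError("actual and target spell different words")
    out = []
    i = j = 0
    while i < len(actual) or j < len(target):
        a_hyph = i < len(actual) and actual[i] == hyphen
        t_hyph = j < len(target) and target[j] == hyphen
        if a_hyph and t_hyph:      # both outputs agree on this point
            out.append(hyphen)
            i += 1
            j += 1
        elif t_hyph:               # point present only in the target
            out.append(missing)
            j += 1
        elif a_hyph:               # extra point in the actual output: keep it
            out.append(hyphen)
            i += 1
        else:                      # ordinary letter, copied from the target
            out.append(target[j])
            i += 1
            j += 1
    return "".join(out)


# hyph-sme.fst output versus hyphrules-sme.fst output for the same word
print(mark_missing("Os^lolaš", "os^lo^laš"))
```

Feeding it the two outputs for `oslolaš` gives `os^lo*laš`, the form used in the example at the top.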
Marker for derivational suffix (and, by extension to other languages, also prefix):
- suffix
What about inflectional suffixes? > is used in sma and nob. This speaks against using it for derivation.
root*1*der*2*infl
%> %>
Der1 Der2 Der3
C+ VC* CVC
l u tags, stem
l l diacr here
=
g u diacr here
g l wordform (w or w/o hyphen)
:0 - no hyphenation
If visible hyphenation:
:- IFF _ C V
:0 IFF _ C #
HYPH = read regex ( @\"$(TARGET)/bin/hyphrules-$(TARGET).fst\" .o. \ 3
@\"$(TARGET)/bin/hyph-i$(TARGET).save\" .o. \ 2
@\"$(TARGET)/bin/$(TARGET)-norm.fst\" ) ; \n 1
hyphrules-sme.fst joh^to^lahkii <============ 1+2+3
johtolahkii
hyph-isme.save johtolahkii <============ 1+2
johtolat+N+Sg+Ill
sme-norm.fst johtolat+N+Sg+Ill <======== 1
johtolahkii
hyphrules-sme.fst joh^to^lah^kii
johtolahkii
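The three stages numbered 1, 2 and 3 can be pictured as a chain of lookups, each feeding its results to the next. The sketch below is a toy illustration only: plain dictionaries stand in for the real transducers, the names `NORM_UP`, `HYPH_I_UP`, `RULES_UP`, `apply_up` and `hyphenate` are hypothetical, and the entries are copied from the example above. It does not model flag diacritics or true composition.

```python
# Toy stand-ins for the three networks; the real ones are finite-state transducers.
NORM_UP = {"johtolahkii": ["johtolat+N+Sg+Ill"]}      # stage 1: sme-norm.fst
HYPH_I_UP = {"johtolat+N+Sg+Ill": ["johtolahkii"]}    # stage 2: hyph-isme.save
RULES_UP = {"johtolahkii": ["joh^to^lah^kii"]}        # stage 3: hyphrules-sme.fst


def apply_up(table, forms):
    """Look every form up in one stage and collect all results."""
    results = []
    for form in forms:
        results.extend(table.get(form, []))
    return results


def hyphenate(word):
    """Run stages 1, 2 and 3 in sequence, as in the step-by-step xfst session."""
    forms = [word]
    for table in (NORM_UP, HYPH_I_UP, RULES_UP):
        forms = apply_up(table, forms)
    return forms


print(hyphenate("johtolahkii"))
```

Run stage by stage like this, the toy chain ends in `joh^to^lah^kii`, which is what the rules give on their own. It says nothing about why the composed network returns `joh^to^lahkii` instead. A word missing from any stage yields an empty list.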
xfst[0]: load sme-norm.fst
Opening 'sme-norm.fst'
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
November 05, 2007 12:44:54 GMT
Closing 'sme-norm.fst'
apply up> johtolahkii <==================== 1
johtolat+N+Sg+Ill
xfst[1]: load hyph-isme.save
Opening 'hyph-isme.save'
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
November 05, 2007 12:43:40 GMT
Closing 'hyph-isme.save'
apply up> johtolat+N+Sg+Ill <==================== 2
johtolahkii
xfst[2]: load hyphrules-sme.fst
Opening 'hyphrules-sme.fst'
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 100%
November 05, 2007 12:31:20 GMT
Closing 'hyphrules-sme.fst'
xfst[3]: up johtolahkii <======================== 3
joh^to^lah^kii
xfst[3]: compose net
31.5 Mb. 344233 states, 2169182 arcs, Circular. Label Map: Default.
xfst[1]: up johtolahkii <=========================== 123
joh^to^lahkii
xfst[2]: up johtolahkii
joh^to^lahk@C.NeedNoun@ii@D.NeedNoun.ON@
xfst[1]: set show-flags off
variable show-flags = OFF
xfst[1]: set flag-is-epsilon on
variable flag-is-epsilon = ON
xfst[1]: up johtolahkii
joh^to^lahkii
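When output has been captured with show-flags on, the flag diacritics can be removed afterwards for comparison. The helper below is a minimal sketch under the assumption that flags have the shape `@X.Feature@` or `@X.Feature.Value@` with X one of P, N, R, D, C, U. The name `strip_flags` is hypothetical. It only mimics the display effect of `set show-flags off` and does not change how flags behave during lookup or composition.

```python
import re

# Flag diacritics such as @C.NeedNoun@ or @D.NeedNoun.ON@ (assumed shape, see above).
FLAG = re.compile(r"@[PNRDCU]\.[^@.]+(?:\.[^@.]+)?@")


def strip_flags(form: str) -> str:
    """Remove flag diacritic symbols from an output string."""
    return FLAG.sub("", form)


print(strip_flags("joh^to^lahk@C.NeedNoun@ii@D.NeedNoun.ON@"))
```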
SPLRNONREC = read regex [ [@\"$(TARGET)/bin/derivation-filter.fst\" ] .o. \
[@\"$(TARGET)/bin/spellernonrec-$(TARGET).save\" ] .o. \
[@\"common/bin/downcase.fst\" ] .o. \
[@\"common/bin/remove-hyphen.fst\" ] .o. \
[@\"$(TARGET)/bin/hyphrules-$(TARGET).fst\".i ] \
] ; \n
$(TARGET)/bin/spellernonrec-$(TARGET).fst: \
common/bin/tag-not-save.fst \
common/bin/downcase.fst \
$(TARGET)/bin/spellernonrec-$(TARGET).save \
$(TARGET)/bin/derivation-filter.fst \
common/bin/remove-hyphen.fst \
$(TARGET)/bin/hyphrules-$(TARGET).fst
os^lolaš
os^lolaš Oslo+N+Prop+Plc+Der1+Der/laš+A+Sg+Nom
os^lolaš Oslo+N+Prop+Plc+Der1+Der/laš+A+SgGenCmp+Cmpnd
os^lolaš Oslo+N+Prop+Plc+Der1+Der/laš+A+SgNomCmp+Cmpnd
os^lolaš Oslo+N+Prop+Plc+Der1+Der/laš+A+Attr
os^lolaš Oslo+N+Prop+Plc+Der1+Der/laš+A+Attr+Cmpnd
os^lo^laš
os^lo^laš os^lo^laš +?
In smj, inflections also fall outside our hyphenation rules:
Basudissaj
Basudissaj Ba^su^dissaj
Bájddárin
Bájddárin Bájd^dárin
Bájddárin Bájd^dárin
Heandarahkii
Heandarahkii Hean^da^rahkii
I have also found bad hyphenation (that is, missing hyphenation points) in some nouns:
Hyph transducer/speller:
johtolahkii
johtolahkii joh^to^lahkii
Rules only:
johtolahkii joh^to^lah^kii
-bash-3.00$ echo "johto^lahkii" | lookup -flags mbTT -utf8 bin/hyphrules-sme.fst
johto^lahkii joh^to^lah^kii
Lines 52+53 above:
xfst[1]: up johtolahkii
johtolahkii
xfst[1]: down johtolahkii
johtolahkii
Target:
joh^to^lah^kii
gonagasažis
gonagasažis go^na^gasa^žis
gonagasažis go^na^gasa^žis
gonagassii
gonagassii go^na^gassii
johtolagas
johtolagas joh^to^lagas
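Cases like the ones listed here can be found in bulk with a rough heuristic: a segment between two hyphenation points that contains more than one vowel group probably lacks a point. This is only a sketch. The vowel inventory in `VOWELS` and the name `suspicious_segments` are assumptions made for illustration, and adjacent vowels are treated as one nucleus, so hiatus inside a segment goes unnoticed.

```python
import re

# Assumed vowel inventory; extend it as needed for the language at hand.
VOWELS = "aáeiouyæøåäö"
NUCLEUS = re.compile(f"[{VOWELS}]+", re.IGNORECASE)


def suspicious_segments(hyphenated: str, hyphen: str = "^") -> list:
    """Return the segments that contain more than one vowel group."""
    return [
        segment
        for segment in hyphenated.split(hyphen)
        if len(NUCLEUS.findall(segment)) > 1
    ]


for form in ("go^na^gasa^žis", "joh^to^lagas", "joh^to^lah^kii"):
    print(form, suspicious_segments(form))
```

Forms that come back with an empty list, such as `joh^to^lah^kii`, have at most one vowel group per segment.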
xfst[2]: set flag-is-epsilon ON
variable flag-is-epsilon = ON
<. @"sme/bin/hyph-isme.save" .o. @"sme/bin/sme-norm.fst" ;
...
*** Warning: It is unsafe to treat flag diacritics as special in
composition when both networks contain flags. Please set the
variable compose-flag-as-special to OFF.
*** Warning: label '@U.Cap.Obl@:@U.Cap.Opt@' is illegal: flag diacritics on both sides of the symbol pair.
*** Warning: label '@U.Cap.Opt@:@U.Cap.Obl@' is illegal: flag diacritics on both sides of the symbol pair.
19.5 Mb. 212213 states, 1342959 arcs, Circular. Label Map: Default.
xfst[3]: up johtolahkii
joh^to^lah^k@C.NeedNoun@@C.NeedNoun@ii@D.NeedNoun.ON@@D.NeedNoun.ON@
xfst[3]: set show-flags OFF
variable show-flags = OFF
xfst[3]: up johtolahkii
joh^to^lah^kii
a88-114-120-101:gt sjur$ echo "johtolahkii" | lookup -flags mbTT -utf8 sme/bin/hyph-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
johtolahkii joh^to^lah^kii