GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. Read more about Why. See also How to get started and our Privacy document.

View GiellaLT on GitHub

Page Content

Hyphenation problem with derivations

Proper nouns and hyphens: when the name derives, like in -laš, we should get a hyphenation point in front of the derivation, but we don’t. Example:

Oslo > os^lo*laš
* = missing hyphenation point


sme $ lookup -flags mbTT -utf8 bin/hyph-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
oslolaš
oslolaš Os^lolaš


sme$lookup -flags mbTT -utf8 bin/hyphrules-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 100%
oslolaš
oslolaš os^lo^laš

Marker for derivational suffix (and, by extension to other languages, also prefix):

What about inflectional suffixes? > is used in sma and nob. This speaks against using it for derivation.

root*1*der*2*infl
    %>     %>
Der1 Der2 Der3
C+   VC*  CVC


l u  tags, stem
l l  diacr  here
=
g u  diacr  here
g l  wordform (w or w/o hyphen)

:0 - no hyphenation

If visisble hyphenation:

:- IFF _ C V :0 IFF _ C #

HYPH = read regex ( @\"$(TARGET)/bin/hyphrules-$(TARGET).fst\"  .o. \     3
					@\"$(TARGET)/bin/hyph-i$(TARGET).save\"     .o. \     2
					@\"$(TARGET)/bin/$(TARGET)-norm.fst\" ) ; \n          1
hyphrules-sme.fst   joh^to^lahkii  <============ 1+2+3
                    johtolahkii
hyph-isme.save      johtolahkii    <============ 1+2
                    johtolat+N+Sg+Ill
sme-norm.fst        johtolat+N+Sg+Ill  <======== 1
                    johtolahkii




hyphrules-sme.fst   joh^to^lah^kii
                    johtolahkii




xfst[0]: load sme-norm.fst
Opening 'sme-norm.fst'
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
November 05, 2007 12:44:54 GMT
Closing 'sme-norm.fst'
apply up> johtolahkii       <==================== 1
johtolat+N+Sg+Ill
xfst[1]: load hyph-isme.save
Opening 'hyph-isme.save'
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
November 05, 2007 12:43:40 GMT
Closing 'hyph-isme.save'
apply up> johtolat+N+Sg+Ill  <==================== 2
johtolahkii
xfst[2]: load hyphrules-sme.fst
Opening 'hyphrules-sme.fst'
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 100%
November 05, 2007 12:31:20 GMT
Closing 'hyphrules-sme.fst'
xfst[3]: up johtolahkii    <======================== 3
joh^to^lah^kii
xfst[3]: compose net
31.5 Mb. 344233 states, 2169182 arcs, Circular. Label Map: Default.
xfst[1]: up johtolahkii  <=========================== 123
joh^to^lahkii


xfst[2]: up johtolahkii
joh^to^lahk@C.NeedNoun@ii@D.NeedNoun.ON@




xfst[1]: set show-flags off
variable show-flags = OFF
xfst[1]: set flag-is-epsilon on
variable flag-is-epsilon = ON
xfst[1]: up johtolahkii
joh^to^lahkii
SPLRNONREC = read regex [ [@\"$(TARGET)/bin/derivation-filter.fst\"     ] .o. \
						  [@\"$(TARGET)/bin/spellernonrec-$(TARGET).save\"  ] .o. \
						  [@\"common/bin/downcase.fst\"                 ] .o. \
						  [@\"common/bin/remove-hyphen.fst\"            ] .o. \
						  [@\"$(TARGET)/bin/hyphrules-$(TARGET).fst\".i ]     \
						] ; \n


$(TARGET)/bin/spellernonrec-$(TARGET).fst: \
							common/bin/tag-not-save.fst \
							common/bin/downcase.fst \
							$(TARGET)/bin/spellernonrec-$(TARGET).save \
							$(TARGET)/bin/derivation-filter.fst \
							common/bin/remove-hyphen.fst \
							$(TARGET)/bin/hyphrules-$(TARGET).fst


os^lolaš
os^lolaš        Oslo+N+Prop+Plc+Der1+Der/laš+A+Sg+Nom
os^lolaš        Oslo+N+Prop+Plc+Der1+Der/laš+A+SgGenCmp+Cmpnd
os^lolaš        Oslo+N+Prop+Plc+Der1+Der/laš+A+SgNomCmp+Cmpnd
os^lolaš        Oslo+N+Prop+Plc+Der1+Der/laš+A+Attr
os^lolaš        Oslo+N+Prop+Plc+Der1+Der/laš+A+Attr+Cmpnd


os^lo^laš
os^lo^laš       os^lo^laš       +?

In smj, also inflections fall outside our hyphenation rules:

Basudissaj
Basudissaj      Ba^su^dissaj
Bájddárin
Bájddárin       Bájd^dárin
Bájddárin       Bájd^dárin
Heandarahkii
Heandarahkii    Hean^da^rahkii

I have found bad hyphenation=no hyphenation in some nouns as well

Hyph transducer/speller:

johtolahkii
johtolahkii joh^to^lahkii

Rules only: johtolahkii joh^to^lah^kii

-bash-3.00$ echo “johto^lahkii” | lookup -flags mbTT -utf8 bin/hyphrules-sme.fst johto^lahkii joh^to^lah^kii

Lines 52+53 above:

xfst[1]: up johtolahkii
johtolahkii
xfst[1]: down johtolahkii
johtolahkii

Target:

joh^to^lah^kii


gonagasažis
gonagasažis go^na^gasa^žis
gonagasažis go^na^gasa^žis


gonagassii
gonagassii go^na^gassii


johtolagas
johtolagas joh^to^lagas


xfst[2]: set flag-is-epsilon ON
variable flag-is-epsilon = ON
<. @"sme/bin/hyph-isme.save" .o. @"sme/bin/sme-norm.fst" ;
...
*** Warning: It is unsafe to treat flag diacritics as special in
    composition when both networks contain flags. Please set the
    variable compose-flag-as-special to OFF.
*** Warning: label '@U.Cap.Obl@:@U.Cap.Opt@' is illegal: flag diacritics  on both sides of the symbol pair.
*** Warning: label '@U.Cap.Opt@:@U.Cap.Obl@' is illegal: flag diacritics  on both sides of the symbol pair.
19.5 Mb. 212213 states, 1342959 arcs, Circular. Label Map: Default.
xfst[3]: up johtolahkii
joh^to^lah^k@C.NeedNoun@@C.NeedNoun@ii@D.NeedNoun.ON@@D.NeedNoun.ON@
xfst[3]: set show-flags OFF
variable show-flags = OFF
xfst[3]: up johtolahkii
joh^to^lah^kii




a88-114-120-101:gt sjur$ echo "johtolahkii" | lookup -flags mbTT -utf8 sme/bin/hyph-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
johtolahkii	joh^to^lah^kii