Text processing

(AKA preprocessing in TTS parlance)

Overall goal:

use our own text tokenisation and analysis tools to reduce ambiguity and thus increase precision in the produced speech

Basic architecture:

rule based and lexicon based (fst and vislcg3)
fall back on general rules only for unknown items

Processing steps:

tokenisation and morphological analysis
whitespace markup
add valency info (helps in later disambiguation)
disambiguation of MWE’s - one or several tokens?
morphological disambiguation, syntactic & dependency parsing
normalisation
conversion to IPA
exctraction of IPA strings from full analysis output ⇒ synthesis

Working setup

Requirements:

nightly hfst package from Tino (at least from June 10 or thereabout, June 21 is known to work), see this for details
newest version of:
- giella-core
- giella-shared
- lang-smj

Configuration & building:

Run the following commands:

cd lang-smj
./autogen.sh
./configure --enable-tokenisers --enable-phonetic --enable-tts --enable-custom-fsts
cd tools/tts/
make dev # creates all modes/ files
cd ../../
make -j6 # takes a while, ~12 minutes on a fast MacBook Pro

Converting text:

echo 'Skåvlån hæhttuji juohkka akta sierra skåvllåbiktasijt adnet. \
Eskilin li tjáhppis båvså, tjáhppis jali bieddjis skirtto ja alek slippsa. \
Næjtsojn li sæmmi skåvllåbiktasa, valla sij máhtti aj vuolppuj tjágŋat.' \
| tools/tts/modes/smj-tts-txt2ipa.mode

This will output text of the form:

"<Skåvlån>"
	"skåvllå" N Sem/Edu_Org Sg Ine "skåvllå>Q1n"MIDTAPE <W:0.0> @ADVL> #1->7 "skɔvlɔːn"phon
: 
"<hæhttuji>"
	"hæhttuji" ? @X #2->0 "heætːuji"phon
: 
"<juohkka>"
	"juohkka" Pron Indef Attr "juohkka>"MIDTAPE <W:0.0> @>Num #3->4 "juokːaː"phon
: 
"<akta>"
	"akta" Num Sg Nom "akta>"MIDTAPE <W:0.0> @SUBJ> #4->7 "ɑktaː"phon
: 

To extract the phonetic transciption only, extend the command above as follows:

... \
| grep 'phon' | rev | cut -d'"' -f2 | rev | uniq

The output is then:

skɔvlɔːn
heætːuji
juokːaː
ɑktaː
siɛrːaː
skɔvlːɔːpiktaːsijht
ɑtnɛht
.

To mostly restore the text as it was (but now normalised/transkribed), instead do like this:

... \
| sed '/phon$/{$!N;//!P;D;}'       | # print the last of several consecutive ""phon lines
  sed -e 's/\":\"/\"xxcolonxx\"/g' | # protect actual colon
  egrep '(^:|phon)'                | # grep phon lines or lines starting with colon
  rev | cut -d'"' -f2 | rev        | # get only the ""phon string content
  uniq                             | # uniq just in case
  sed -e 's/^://'                  | # delete colons in first position
  tr -d '\n'                       | # delete newlines, they are artefacts of earlier steps
  sed -e 's/\\n/\n/g'              | # convert original newlines to actual newlines
  sed -e 's/xxcolonxx/:/'          | # restore actual colons
  ...

The output then becomes:

skɔvlɔːn heætːuji juokːaː ɑktaː siɛrːaː skɔvlːɔːpiktaːsijht ɑtnɛht.

Uncommented version for easy copy and paste:

... \
| sed '/phon$/{$!N;//!P;D;}' | sed -e 's/\":\"/\"xxcolonxx\"/g' |\
  egrep '(^:|phon)' | rev | cut -d'"' -f2 | rev | uniq |\
  sed -e 's/^://' | tr -d '\n' | sed -e 's/\\n/\n/g' | sed -e 's/xxcolonxx/:/'

Note:

The sed command to print the last of many consecutive phon lines (as in multiple possible conversions to text, e.g. of digits), should really be printing the first one. It also discards the main reading in case of cohorts with subreadings - only the last/deepest subreading survives. Feel free to improve, although this is just a shell pipeline workaround until similar functionality is added to our CLI tools.

Lule Sami Text-to-Speech

Page Content

Text processing

Working setup