# Finite state and Constraint Grammar based Text-to-Speech processing
We now have a working setup. Below are, first, instructions for getting started, followed by the original specifications and a list of open issues.
Input is a VISLCG3 cohort, usually containing a `Phon` element (a string with `Phon` attached at the end):

```
"<Dr.>"
	"dåktår" Area/NO N Sem/Hum Sg Acc "dåktår>av"Phon <W:0.0>
	"dr" Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc <W:0.0>
```
The conversion is done using one or more rule transducers. That is, the output should be something like this (with huge reservations for the actual IPA!):
"<Dr.>"
"dåktår" Area/NO N Sem/Hum Sg Acc "dɔkʰtɔrav"Phon <W:0.0>
"dr" Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc <W:0.0>
"<álggosijddo>"
"sijddo" N Sem/Obj-surfc Sg Nom "álggo#sijddo"Phon <W:0.0> @SUBJ> #7->16
"álggo" N Sem/Time Cmp/SgNom Cmp "álggo#sijddo"Phon <W:0.0> #7->16
There is now support for building TTS-specific analysers that do the analysis in two steps:

```
echo åludallabehtit | hfst-lookup -q src/analyser-tts-gt-input.hfstol
åludallabehtit	oallo»X7dalla>behti>t	0,000000

echo 'oallo»X7dalla>behti>t' | hfst-lookup -q src/analyser-tts-gt-output.hfstol
oallo»X7dalla>behti>t	oallot+V+TV+Der/dalla+V+Ind+Prs+Pl2+Err/Orth	0,000000
```
And `hfst-pmatch` and `hfst-tokenise` have been extended such that, given the above input, the output should be:
```
echo åludallabehtit | hfst-tokenise -g tools/tokenisers/tokeniser-tts-cggt-desc.pmhfst
"<åludallabehtit>"
	"oallot" V TV Der/dalla V Ind Prs Pl2 Err/Orth "oallo»X7dalla>behti>t"MIDTAPE <W:0.0>
:\n
```
Using this, we have access to underlying abstractions over phonemes, and to more detailed consonant gradation information, which should help us produce better and more accurate speech synthesis.
To use this setup, please configure as follows:
MISSING! MIDTAPE is always printed, also in cases where the MIDTAPE is identical to the surface string. It should not be, since that will interfere with `Phon` data from other steps.
The processing goes as follows:

1. `hfst-tokenise` gives a MIDTAPE representation if different from the surface form
2. … `Phon` strings
3. … `Phon` strings
4. if no `Phon` string is present, use the word form as starting point (also handles cases of no analysis)
5. … `Phon` strings, and possibly also for different `OLang` tags

Given that we retain the VISLCG3 cohort stream even after the IPA conversion, it is possible to do further disambiguation afterwards, if needed or wanted.
We need a very simple tool that will take a VISLCG3 stream, and for each cohort only output the `Phon` strings (possibly with the plain-text wordform as comment, for debugging).
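A minimal sketch of such a tool, as a hypothetical `extract-phon.sh` helper (not an existing GiellaLT tool), assuming plain `awk`:

```sh
#!/bin/sh
# extract-phon.sh -- hypothetical helper, not an existing GiellaLT tool.
# Reads a VISLCG3 stream and prints one Phon string per cohort, with the
# plain-text wordform as a trailing comment for debugging.
awk '
  function flush() {
      if (wf == "") return
      if (n == 0)
          print wf " # " wf       # no Phon string: fall back to the wordform
      else {
          if (n > 1)              # ambiguity left in the stream: warn, proceed
              print "WARNING: " n " Phon strings for " wf > "/dev/stderr"
          print phon[1] " # " wf
      }
  }
  /^"</ {                         # cohort line: "<wordform>"
      flush()
      wf = substr($0, 3, length($0) - 4)
      n = 0
      split("", seen)
  }
  match($0, /"[^"]*"Phon/) {      # reading line carrying a Phon string
      s = substr($0, RSTART + 1, RLENGTH - 6)
      if (!(s in seen)) { seen[s]; phon[++n] = s }
  }
  END { flush() }
' "$@"
```

The sketch mirrors the processing steps above: distinct `Phon` strings are collected per cohort, remaining ambiguity is reported on `stderr` before proceeding with the first candidate, and the word form is used as fallback when no `Phon` string is present.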
Given this input:

```
"<Dr.>"
	"dåktår" Area/NO N Sem/Hum Sg Acc "dɔkʰtɔrav"Phon <W:0.0>
	"dr" Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc <W:0.0>
```
return this (or something similar; it must be compatible with the synthesiser engines we want to use):

```
dɔkʰtɔrav # Dr.
```
This string is then fed to the synthesiser.
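Put together, the pipeline could then look roughly as follows. Only the tokeniser invocation is documented above; the conversion grammar, the extraction helper, and the synthesiser command are all assumptions:

```sh
# Hypothetical end-to-end pipeline; everything after the tokeniser is assumed.
echo 'Dr.' |
  hfst-tokenise -g tools/tokenisers/tokeniser-tts-cggt-desc.pmhfst |
  vislcg3 -g src/cg3/phon-rules.cg3 |  # hypothetical Phon conversion grammar
  ./extract-phon.sh |                  # the sketch tool above
  synthesiser-engine                   # placeholder for the actual synthesiser
```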
In cases where there is still ambiguity left in the CG stream, proceed as follows:

- … `stderr`, and proceed
- … `stderr`, and stop
- … `stderr`, and proceed
- … `stderr`, and proceed

The conversion rules can be written in `src/phonetics/txt2ipa.xfscript` or in the lexc files.
Using the `txt2ipa` file will work right out of the box; lexc requires some hard thinking.
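For the `txt2ipa` route, a rough sketch of compiling and testing the script with standard HFST tools; the output file name, and the assumption that the script ends by saving its stack, are not from the source:

```sh
# Compile the xfst rewrite script; assumes it ends with a "save stack" command
# writing txt2ipa.hfst (the file name is an assumption).
hfst-xfst -F src/phonetics/txt2ipa.xfscript
# Apply the compiled transducer to a Phon string:
echo 'dåktår>av' | hfst-lookup -q txt2ipa.hfst
```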