Lule Sami Text-to-Speech

Finite state and Constraint Grammar based Text-to-Speech processing

Conversion to IPA

We now have a working setup. Below are instructions for getting started, followed by the original specification and a list of open issues.

Specification

Input is a VISLCG3 cohort, usually containing a Phon element (a quoted string with the tag Phon attached at the end):

"<Dr.>"
	"dåktår" Area/NO N Sem/Hum Sg Acc "dåktår>av"Phon <W:0.0>
		"dr" Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc <W:0.0>

The conversion is done using one or more rule transducers, so the output should be something like this (with huge reservations for the actual IPA!):

"<Dr.>"
	"dåktår" Area/NO N Sem/Hum Sg Acc "dɔkʰtɔrav"Phon <W:0.0>
		"dr" Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc <W:0.0>

Compounds

"<álggosijddo>"
	"sijddo" N Sem/Obj-surfc Sg Nom "álggo#sijddo"Phon <W:0.0> @SUBJ> #7->16
		"álggo" N Sem/Time Cmp/SgNom Cmp "álggo#sijddo"Phon <W:0.0> #7->16

Fake three-tape FST: abstract phonemic representation

There is now support for building TTS-specific analysers that do the analysis in two steps:

  Step 1:
    echo åludallabehtit | hfst-lookup -q src/analyser-tts-gt-input.hfstol 
    åludallabehtit	oallo»X7dalla>behti>t	0,000000
    
  Step 2:
    echo 'oallo»X7dalla>behti>t' | hfst-lookup -q src/analyser-tts-gt-output.hfstol 
    oallo»X7dalla>behti>t	oallot+V+TV+Der/dalla+V+Ind+Prs+Pl2+Err/Orth	0,000000
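
The two steps can also be chained programmatically. A Python sketch, assuming both .hfstol files have been built and relying on hfst-lookup's tab-separated output as above:

    import subprocess

    def lookup(fst, s):
        """One hfst-lookup call; return the output field of the first result."""
        out = subprocess.run(["hfst-lookup", "-q", fst],
                             input=s + "\n",
                             capture_output=True, text=True, check=True).stdout
        return out.strip().splitlines()[0].split("\t")[1]

    surface = "åludallabehtit"
    midtape = lookup("src/analyser-tts-gt-input.hfstol", surface)    # step 1
    analysis = lookup("src/analyser-tts-gt-output.hfstol", midtape)  # step 2
    print(midtape)   # oallo»X7dalla>behti>t
    print(analysis)  # oallot+V+TV+Der/dalla+V+Ind+Prs+Pl2+Err/Orth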
    

And hfst-pmatch and hfst-tokenise have been extended so that, given the above input, the output should be:

echo åludallabehtit | hfst-tokenise -g tools/tokenisers/tokeniser-tts-cggt-desc.pmhfst 
"<åludallabehtit>"
	"oallot" V TV Der/dalla V Ind Prs Pl2 Err/Orth "oallo»X7dalla>behti>t"MIDTAPE <W:0.0>
:\n

Using this, we have access to underlying abstractions over phonemes and to more detailed consonant gradation information, which should help us produce more accurate speech synthesis.

To use this setup, please configure as follows:

MISSING! MIDTAPE is always printed, even in cases where it is identical to the surface string. It should not be, since that will interfere with Phon data from other steps.
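
Until that is fixed, one workaround is to post-process the stream and drop MIDTAPE elements that merely repeat the wordform. A sketch, assuming a MIDTAPE element is redundant exactly when it equals the surface string:

    import re
    import sys

    WORDFORM_RE = re.compile(r'^"<(.*)>"$')
    MIDTAPE_RE = re.compile(r' ?"([^"]*)"MIDTAPE\b')

    def strip_redundant_midtape(stream):
        """Copy a CG stream, dropping MIDTAPE strings equal to the wordform."""
        wordform = None
        for line in stream:
            line = line.rstrip("\n")
            m = WORDFORM_RE.match(line)
            if m:
                wordform = m.group(1)
            elif wordform is not None:
                mt = MIDTAPE_RE.search(line)
                if mt and mt.group(1) == wordform:
                    line = MIDTAPE_RE.sub("", line, count=1)
            print(line)

    if __name__ == "__main__":
        strip_redundant_midtape(sys.stdin)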

Input data priority and IPA conversion

The processing goes as follows:

Given that we retain the VISLCG3 cohort stream even after the IPA conversion, it is possible to do further disambiguation afterwards, if needed or wanted.

Outputting IPA

We need a very simple tool that takes a VISLCG3 stream and, for each cohort, outputs only the Phon strings (possibly with the plain-text wordform as a comment, for debugging).

Given this input:

"<Dr.>"
	"dåktår" Area/NO N Sem/Hum Sg Acc "dɔkʰtɔrav"Phon <W:0.0>
		"dr" Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc <W:0.0>

the tool should return this (or something similar; the exact format must be compatible with the synthesiser engines we want to use):

dɔkʰtɔrav # Dr.

This string is then fed to the synthesiser.
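
A minimal Python sketch of such a tool, naively taking the first Phon string of each cohort (see the ambiguity section below) and printing the wordform as a trailing comment:

    #!/usr/bin/env python3
    """Print only the Phon strings of a VISLCG3 stream, one per cohort."""
    import re
    import sys

    WORDFORM_RE = re.compile(r'^"<(.*)>"$')
    PHON_RE = re.compile(r'"([^"]*)"Phon\b')

    wordform = None
    printed = False
    for line in sys.stdin:
        m = WORDFORM_RE.match(line)
        if m:
            wordform, printed = m.group(1), False
        elif not printed:
            p = PHON_RE.search(line)
            if p:
                print("%s # %s" % (p.group(1), wordform))
                printed = True

For the example input above, this prints exactly "dɔkʰtɔrav # Dr.".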

Ambiguity handling

In cases where there is still ambiguity left in the CG stream, proceed as follows:

Open issues