TTS synthesis notes by KHA
General pipeline / workflow for training data for synthesis:
Text data -> cleaning, correcting -> tokenization, normalization -> POS tagging, morphological analysis etc -> G2P -> Prosody modeling -> Baseline for training the synthesis.
Commented version:
Text data
-> cleaning, correcting # (manual) correction in the text
-> tokenization, normalization # Normalization: digits, abbreviations, etc. - what about parentheses?
-> POS tagging, morphological analysis etc # GiellaLT stuff
-> G2P # This is the real meat (see the sketches after this pipeline)
-> Prosody modeling # intonation, stress, extract features from Praat, perhaps using explicit symbols from the analysis step?
-> Baseline for training the synthesis.
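
A rough sketch of the text-side steps (digit/abbreviation normalization + G2P) in Python, assuming the phonemizer (espeak backend) and num2words packages; the abbreviation table, the parenthesis policy and the example sentence are illustrative placeholders, not GiellaLT tooling.

    # Sketch: cleaning -> normalization -> G2P. All rules below are toy examples.
    import re
    from num2words import num2words       # digit expansion (assumed dependency)
    from phonemizer import phonemize      # G2P via espeak (assumed dependency)

    ABBREVIATIONS = {"etc.": "et cetera", "e.g.": "for example"}   # illustrative only

    def normalize(text: str) -> str:
        for abbr, full in ABBREVIATIONS.items():          # expand a few abbreviations
            text = text.replace(abbr, full)
        text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)  # digits -> words
        text = re.sub(r"[()]", "", text)                  # one possible parenthesis policy: drop them
        return re.sub(r"\s+", " ", text).strip()

    def g2p(text: str, language: str = "en-us") -> str:
        # grapheme-to-phoneme; for the GiellaLT languages this step would presumably
        # come from the GiellaLT analysis side instead of espeak
        return phonemize(text, language=language, backend="espeak", strip=True)

    print(g2p(normalize("Recorded 24 hours in the studio (roughly) etc.")))

For the prosody step, F0 and intensity contours can be pulled from the recordings with Praat via the parselmouth bindings (an assumption; any Praat scripting route gives the same numbers):

    # Sketch: extract pitch (F0) and intensity contours as prosody features.
    import parselmouth

    snd = parselmouth.Sound("utterance.wav")              # path is a placeholder
    pitch = snd.to_pitch(time_step=0.01)                  # 10 ms frames
    f0 = pitch.selected_array["frequency"]                # 0.0 where unvoiced
    intensity = snd.to_intensity(time_step=0.01).values[0]
    print(f0[:10], intensity[:10])
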
Candidate technologies/engines
Wavenet
- Fully probabilistic and autoregressive -> each audio sample conditioned on all previous ones (see the sketch after this list)
- Trained on data with tens of thousands of samples per second of audio -> Most accurate?
- Audio needs to be annotated
- Conditioning on speaker identity
- Adding speakers -> better validation set performance
- Requirements for the training data: in the paper it was 24-34 hrs from professional speakers
- Mimics acoustics and recording quality
- Locally conditioned on linguistic features
- Needs to be force-aligned
- Phone duration/segmentation
- Sentence & word segmentation, text normalization, POS tagging, G2P mapping
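
A toy PyTorch sketch of the causal/autoregressive idea: left-padded dilated convolutions so no output looks at future samples. This is just the core mechanism, not the full gated WaveNet with skip connections and speaker/linguistic conditioning.

    # Toy dilated causal convolution stack. Left padding keeps every convolution causal
    # (no output looks at future samples); training with targets shifted by one step
    # then models p(x_t | x_1..x_{t-1}).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalBlock(nn.Module):
        def __init__(self, channels: int, dilation: int):
            super().__init__()
            self.pad = dilation                  # (kernel_size - 1) * dilation with kernel_size = 2
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):
            return torch.tanh(self.conv(F.pad(x, (self.pad, 0))))   # pad the past side only

    class ToyWaveNet(nn.Module):
        def __init__(self, channels: int = 32, layers: int = 6):
            super().__init__()
            self.inp = nn.Conv1d(1, channels, kernel_size=1)
            self.blocks = nn.ModuleList(CausalBlock(channels, 2 ** i) for i in range(layers))
            self.out = nn.Conv1d(channels, 256, kernel_size=1)       # 256-way softmax over mu-law levels

        def forward(self, x):                    # x: (batch, 1, time), raw waveform
            h = self.inp(x)
            for block in self.blocks:
                h = h + block(h)                 # residual connection
            return self.out(h)                   # per-timestep logits

    print(ToyWaveNet()(torch.randn(1, 1, 1600)).shape)   # torch.Size([1, 256, 1600])
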
Pros and cons:
- Best quality of these three
- Sounds natural, models also the “non-speech” parts, like breathing patterns, resulting in good prosody
- Not language-dependent, but training requires a lot of data, preferably studio quality
- Slow to train (1 sentence ≈ 15 min); at the very least it needs very powerful hardware
- Requires expert annotation of the input data
Facebook VoiceLoop
- Shifting buffer working memory (a toy illustration follows this list)
- Voice samples “in the wild”, no need for linguistic features
- Robust; mimics speakers based on noisy and limited training data
- Does not replicate background noise?
- Doesn’t require any alignment between phonemes / acoustic / linguistic features as input
- BUT requires the Phonemizer so it needs a G2P mapping for each language
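
A toy numpy illustration of the shifting-buffer working memory (the slot count, vector size and the stand-in vectors are placeholders, not VoiceLoop's actual dimensions):

    # Toy shifting-buffer working memory: every step shifts the buffer by one slot,
    # writes the newest context vector into slot 0 and drops the oldest one.
    import numpy as np

    buffer_len, dim = 10, 64                    # placeholder sizes
    S = np.zeros((buffer_len, dim))             # the working-memory buffer

    def step(S, new_repr):
        S = np.roll(S, shift=1, axis=0)         # shift all slots down by one
        S[0] = new_repr                         # newest representation enters at slot 0
        return S

    for t in range(5):
        context = np.random.randn(dim)          # stand-in for attention over phoneme encodings
        S = step(S, context)
        frame = S.mean(axis=0)                  # stand-in for the output projection of the buffer
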
Pros and cons:
- Computationally effective
- Does not need expert annotation / linguistic features
- Sampled in the wild, so not very sensitive to noise in the data
- Acceptable but not great overall quality at least for English
- Sound quality not so good in the sample audio but depends on the training data
- Prosody not very natural
Mozilla TTS
- Google Tacotron 2 - the best synthesizer there is atm
- Tacotron synthesizes spectrogram directly from characters
- Learns phonological rules from the input text (ngrams) BUT performs better if text is already “phonemized”
- seq2seq framework -> has it been tested on other languages?
- Frame level is faster than sample level (see the sketch after this list)
- End-to-end TTS (learns the G2P alignment as well)
- but must cope with large variations at the signal level -> requires a lot of data to cover variation
- Minimal human annotation -> what needs to be annotated?
- Convolutional filters
- MOS better than “normal” parametric but worse than concatenative
- Performs even better if WaveNet used as a vocoder (neural vocoder)
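
Since Tacotron predicts spectrogram frames rather than raw samples, the training targets are frame-level features; a hedged librosa sketch of extracting 80-band log-mel frames (parameter values are common defaults, not the exact Tacotron 2 recipe):

    # Sketch: frame-level training targets (80-band log-mel spectrogram) instead of raw samples.
    import numpy as np
    import librosa

    y, sr = librosa.load("utterance.wav", sr=22050)        # path is a placeholder
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None))
    print(log_mel.shape)    # (80, n_frames): one frame per ~11.6 ms, far fewer steps than raw samples
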
Pros and cons:
- Requires minimal human annotation (on English at least)
- Conditioning on various attributes
- Directly from characters?
- End-to-end -> suitable for commercial use
- Voice samples: prosody is acceptable but not great, although there was 24 hrs of pro speaker data
- Audible artefacts in the output, which make it the most “machine-sounding” voice
- Pronunciation mistakes in the samples, e.g. ‘does’ pronounced as [dous].
(Mozilla) Tacotron
- Different implementations (dc_tts, espnet, Nvidia Tacotron 2). The exact data pre-processing depends on the implementation, but generally: force-align and use IPA.
- Phonetic transcription (IPA, ARPAbet, SAMPA) + speech corpus is enough to train it
- IPA is the most generally used notation, but it can be converted to the other schemes with a script (a toy sketch follows).
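
A toy sketch of such a conversion script (ARPAbet to IPA here); the mapping table below is deliberately incomplete and only illustrates the shape of the script:

    # Toy ARPAbet -> IPA converter; the table covers only a handful of phones.
    ARPABET_TO_IPA = {
        "AA": "ɑ", "AE": "æ", "IY": "i", "UW": "u",
        "DH": "ð", "TH": "θ", "SH": "ʃ", "NG": "ŋ",
    }

    def arpabet_to_ipa(phones: str) -> str:
        out = []
        for phone in phones.split():
            base = phone.rstrip("012")                   # strip ARPAbet stress digits
            out.append(ARPABET_TO_IPA.get(base, base))   # unmapped phones are left as-is
        return "".join(out)

    print(arpabet_to_ipa("DH AH0 K AE1 T"))              # "the cat": mapped phones in IPA, rest kept as-is
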
Other systems
- Acapela - unit selection (concatenative)
- Merlin - Needs very well aligned data
- Mbrola
- Espeak - Pronunciation rules in C