TTS synthesis notes by KHA
General pipeline / workflow for training data for synthesis:
Text data -> cleaning, correcting -> tokenization, normalization -> POS tagging, morphological analysis etc -> G2P -> Prosody modeling -> Baseline for training the synthesis.
Commented version:
Text data
-> cleaning, correcting # (manual) correction in the text
-> tokenization, normalization # Normalization: digits, abbreviations, etc. - what about parentheses?
-> POS tagging, morphological analysis etc # GiellaLT stuff
-> G2P # This is the real meat (see the sketches after this pipeline)
-> Prosody modeling # intonation, stress, extract features from Praat, perhaps using explicit symbols from the analysis step?
-> Baseline for training the synthesis.
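
A rough sketch of the text-side steps (digit/abbreviation normalization + G2P) in Python, assuming the phonemizer (espeak backend) and num2words packages; the abbreviation table, the parenthesis policy and the example sentence are illustrative placeholders, not GiellaLT tooling.

    # Sketch: cleaning -> normalization -> G2P. All rules below are toy examples.
    import re
    from num2words import num2words       # digit expansion (assumed dependency)
    from phonemizer import phonemize      # G2P via espeak (assumed dependency)

    ABBREVIATIONS = {"etc.": "et cetera", "e.g.": "for example"}   # illustrative only

    def normalize(text: str) -> str:
        for abbr, full in ABBREVIATIONS.items():          # expand a few abbreviations
            text = text.replace(abbr, full)
        text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)  # digits -> words
        text = re.sub(r"[()]", "", text)                  # one possible parenthesis policy: drop them
        return re.sub(r"\s+", " ", text).strip()

    def g2p(text: str, language: str = "en-us") -> str:
        # grapheme-to-phoneme; for the GiellaLT languages this step would presumably
        # come from the GiellaLT analysis side instead of espeak
        return phonemize(text, language=language, backend="espeak", strip=True)

    print(g2p(normalize("Recorded 24 hours in the studio (roughly) etc.")))

For the prosody step, F0 and intensity contours can be pulled from the recordings with Praat via the parselmouth bindings (an assumption; any Praat scripting route gives the same numbers):

    # Sketch: extract pitch (F0) and intensity contours as prosody features.
    import parselmouth

    snd = parselmouth.Sound("utterance.wav")              # path is a placeholder
    pitch = snd.to_pitch(time_step=0.01)                  # 10 ms frames
    f0 = pitch.selected_array["frequency"]                # 0.0 where unvoiced
    intensity = snd.to_intensity(time_step=0.01).values[0]
    print(f0[:10], intensity[:10])
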
Candidate technologies/engines
Wavenet
- Fully probabilistic and autoregressive -> each audio sample conditioned on all previous ones (see the sketch after this list)
- Trained on data with tens of thousands of samples per second of audio -> Most accurate?
- Audio needs to be annotated
- Conditioning on speaker identity
- Adding speakers -> better validation set performance
- Requirements for the training data: in the paper it was 24-34 hrs from professional speakers
- Mimics acoustics and recording quality
- Locally conditioned on linguistic features
- Needs to be force-aligned
- Phone duration/segmentation
- Sentence & word segmentation, text normalization, POS tagging, G2P mapping
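
A toy PyTorch sketch of the causal/autoregressive idea: left-padded dilated convolutions so no output looks at future samples. This is just the core mechanism, not the full gated WaveNet with skip connections and speaker/linguistic conditioning.

    # Toy dilated causal convolution stack. Left padding keeps every convolution causal
    # (no output looks at future samples); training with targets shifted by one step
    # then models p(x_t | x_1..x_{t-1}).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalBlock(nn.Module):
        def __init__(self, channels: int, dilation: int):
            super().__init__()
            self.pad = dilation                  # (kernel_size - 1) * dilation with kernel_size = 2
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):
            return torch.tanh(self.conv(F.pad(x, (self.pad, 0))))   # pad the past side only

    class ToyWaveNet(nn.Module):
        def __init__(self, channels: int = 32, layers: int = 6):
            super().__init__()
            self.inp = nn.Conv1d(1, channels, kernel_size=1)
            self.blocks = nn.ModuleList(CausalBlock(channels, 2 ** i) for i in range(layers))
            self.out = nn.Conv1d(channels, 256, kernel_size=1)       # 256-way softmax over mu-law levels

        def forward(self, x):                    # x: (batch, 1, time), raw waveform
            h = self.inp(x)
            for block in self.blocks:
                h = h + block(h)                 # residual connection
            return self.out(h)                   # per-timestep logits

    print(ToyWaveNet()(torch.randn(1, 1, 1600)).shape)   # torch.Size([1, 256, 1600])
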
Pros and cons:
- Best quality of these three
- Sounds natural, models also the “non-speech” parts, like breathing patterns, resulting in good prosody
- Not language-dependent, but training requires a lot of data, preferably studio quality
- Slow to train (1 sentence ≈ 15 min); at the very least it needs very powerful hardware
- Requires expert annotation of the input data
Facebook VoiceLoop
- Shifting buffer working memory (a toy illustration follows this list)
- Voice samples “in the wild”, no need for linguistic features
- Robust; mimics speakers based on noisy and limited training data
- Does not replicate background noise?
- Doesn’t require any alignment between phonemes / acoustic / linguistic features as input
- BUT requires the Phonemizer so it needs a G2P mapping for each language
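
A toy numpy illustration of the shifting-buffer working memory (the slot count, vector size and the stand-in vectors are placeholders, not VoiceLoop's actual dimensions):

    # Toy shifting-buffer working memory: every step shifts the buffer by one slot,
    # writes the newest context vector into slot 0 and drops the oldest one.
    import numpy as np

    buffer_len, dim = 10, 64                    # placeholder sizes
    S = np.zeros((buffer_len, dim))             # the working-memory buffer

    def step(S, new_repr):
        S = np.roll(S, shift=1, axis=0)         # shift all slots down by one
        S[0] = new_repr                         # newest representation enters at slot 0
        return S

    for t in range(5):
        context = np.random.randn(dim)          # stand-in for attention over phoneme encodings
        S = step(S, context)
        frame = S.mean(axis=0)                  # stand-in for the output projection of the buffer
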
Pros and cons:
- Computationally effective
- Does not need expert annotation / linguistic features
- Sampled in the wild, so not very sensitive to noise in the data
- Acceptable but not great overall quality at least for English
- Sound quality not so good in the sample audio but depends on the training data
- Prosody not very natural
Mozilla TTS
- Google Tacotron 2 - the best synthesizer there is atm
- Tacotron synthesizes spectrogram directly from characters
- Learns phonological rules from the input text (ngrams) BUT performs better if text is already “phonemized”
- seq2seq framework -> has it been tested on other languages?
- Frame level is faster than sample level (see the sketch after this list)
- End-to-end TTS (learns the G2P alignment as well)
- but must cope with large variations at the signal level -> requires a lot of data to cover variation
- Minimal human annotation -> what needs to be annotated?
- Convolutional filters
- MOS better than “normal” parametric but worse than concatenative
- Performs even better if WaveNet used as a vocoder (neural vocoder)
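
Since Tacotron predicts spectrogram frames rather than raw samples, the training targets are frame-level features; a hedged librosa sketch of extracting 80-band log-mel frames (parameter values are common defaults, not the exact Tacotron 2 recipe):

    # Sketch: frame-level training targets (80-band log-mel spectrogram) instead of raw samples.
    import numpy as np
    import librosa

    y, sr = librosa.load("utterance.wav", sr=22050)        # path is a placeholder
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None))
    print(log_mel.shape)    # (80, n_frames): one frame per ~11.6 ms, far fewer steps than raw samples
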
Pros and cons:
- Requires minimal human annotation (on English at least)
- Conditioning on various attributes
- Directly from characters?
- End-to-end -> suitable for commercial use
- Voice samples: prosody is acceptable but not great, although there was 24 hrs of pro speaker data
- Audible artefacts in the output, which make it the most “machine-sounding” voice
- Pronunciation mistakes in the samples, e.g. ‘does’ pronounced as [dous].
(Mozilla) Tacotron
- Different implementations (dc_tts, espnet, Nvidia Tacotron 2). The exact data pre-processing depends on the implementation, but generally: force-align and use IPA.
- Phonetic transcription (IPA, ARPAbet, SAMPA) + speech corpus is enough to train it
- IPA is the most generally used notation, but it can be converted to the other schemes with a script (a toy sketch follows).
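
A toy sketch of such a conversion script (ARPAbet to IPA here); the mapping table below is deliberately incomplete and only illustrates the shape of the script:

    # Toy ARPAbet -> IPA converter; the table covers only a handful of phones.
    ARPABET_TO_IPA = {
        "AA": "ɑ", "AE": "æ", "IY": "i", "UW": "u",
        "DH": "ð", "TH": "θ", "SH": "ʃ", "NG": "ŋ",
    }

    def arpabet_to_ipa(phones: str) -> str:
        out = []
        for phone in phones.split():
            base = phone.rstrip("012")                   # strip ARPAbet stress digits
            out.append(ARPABET_TO_IPA.get(base, base))   # unmapped phones are left as-is
        return "".join(out)

    print(arpabet_to_ipa("DH AH0 K AE1 T"))              # "the cat": mapped phones in IPA, rest kept as-is
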
Other systems
- Acapela - unit selection (concatenative)
- Merlin - Needs very well aligned data
- Mbrola
- Espeak - Pronunciation rules in C