Finite state and Constraint Grammar based Text-to-Speech processing
Participants:
Tentative agenda:
Financed by unused money from the first Divvun project. Financed by the Sámi Parliament. The work is done in cooperation with UiT and HU. Targeted release date is sometime next year.
Voices: Do both a male and a female voice, or at least make the recordings for both. Using the same text for both will simplify some of the preparation and postprocessing, but not by much.
The text should be something between rich and balanced. Aim for 2 hours of recordings, rather more than less. 120,000 to 150,000 letters is a rough estimate, at least for Finnish. This would correspond to perhaps 2000 sentences?
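As a quick sanity check on these figures, the sketch below derives the averages they imply: 60 to 75 letters per sentence, and roughly 17 to 21 letters per second if the full 2 hours were continuous speech. The variable and function names are illustrative, and the "continuous speech" reading is an assumption, since real recordings contain pauses.

```python
# Figures taken from the notes; the derived averages are only a sanity check.
LETTERS_LOW, LETTERS_HIGH = 120_000, 150_000
SENTENCES = 2000          # rough guess from the notes
RECORDING_HOURS = 2       # target amount of recorded speech

SECONDS = RECORDING_HOURS * 3600


def summarize(letters: int) -> dict:
    """Derive per-sentence and per-second averages for a given corpus size."""
    return {
        "letters_per_sentence": letters / SENTENCES,
        # Assumes the whole recording is speech, with no pauses.
        "letters_per_second": letters / SECONDS,
        "seconds_per_sentence": SECONDS / SENTENCES,
    }


low = summarize(LETTERS_LOW)
high = summarize(LETTERS_HIGH)
print(low)
print(high)
```

If the averages look implausible for the chosen speaker or text type, the letter count or the sentence count is the first thing to revisit.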
Whole paragraphs with intersentence dependencies. Even though it isn’t possible to model that yet, it will come in the future, and it is good to be prepared.
The text needs to reflect, as much as possible, the text types we expect in final use. Things to consider:
The microphone should be a high-quality, phase-linear condenser microphone. CD-quality sampling frequency (44.1 kHz) is good enough, with at least 16-bit resolution. The recordings should be as noise-free as possible.
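The format requirements can be checked automatically for each delivered file. The following is a minimal sketch using Python's standard `wave` module; it assumes the recordings are stored as uncompressed WAV files and reads "CD quality" as 44.1 kHz. The function name and thresholds are illustrative, and noise level is not checked here.

```python
import wave

# Assumed thresholds: "CD quality" is read as 44.1 kHz, and
# "at least 16 bit" as a sample width of 2 bytes or more.
MIN_SAMPLE_RATE_HZ = 44_100
MIN_SAMPLE_WIDTH_BYTES = 2


def check_recording(path: str) -> list[str]:
    """Return a list of format problems found in a WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        if rate < MIN_SAMPLE_RATE_HZ:
            problems.append(
                f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz"
            )
        if width < MIN_SAMPLE_WIDTH_BYTES:
            problems.append(
                f"resolution {8 * width} bit is below "
                f"{8 * MIN_SAMPLE_WIDTH_BYTES} bit"
            )
    return problems
```

Calling `check_recording("session01.wav")` (a hypothetical file name) returns an empty list when both requirements are met.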
easier to generate
start to plan a larger project with more independence for the UiT group, including a new software engineer position.
Three areas come up as most relevant:
There are other future uses that are hard to envisage now, so the synthetic voice has to be as general as possible, to cope with as many different text types and usage scenarios as possible.
Tools for testing prosody prediction
Until now:
Soon:
synthesizer made in Hki :-)
Now: Antti in a (too?) key position. Knowledge transfer is thus high on the priority list. It is not about money as long as we do not extend to a new person.
Intermediate solution:
A person working both for Hki and for Tromsø. This person would be in Hki for training, but could later move to Tromsø.
UiT - speech corpus:
UiT - Building the textual input for training the synthesizer:
orthography
Mon lean okta sápmelaš, guhte lean bargan visot sámi bargguid ->
-> lts rules (automatic)
mun leæn okː.htɑ sɑːp.me.lɑʃ , kuh.te leæn pɑrə.kɑn vi.soh sɑː.miː pɑrk.kujht
-> manual labeling of prominence and breaks (for training the model)
"mun leæn 'okː.htɑ "sɑːp.me.lɑʃ , | kuh.te leæn 'pɑrə.kɑn "vi.soh 'sɑː.miː 'pɑrk.kujht
UiT: enhance the lts system to automatically add full prosody labeling as input to the synthesis. The output should match the training data as closely as possible, but a perfect match is not required to get a good synthesis.
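One way to see how closely automatic labeling matches the manual labels is to parse the labeled line and compare prominence levels word by word. The sketch below is only an illustration: the notes do not define the label symbols, so it assumes that a leading `"` means strong prominence (2), a leading `'` means weaker prominence (1), an unmarked word has none (0), and `|` is a phrase break. The names `Word`, `parse_labeled` and `prominence_agreement` are hypothetical, and the predicted labels are invented.

```python
from dataclasses import dataclass

# Assumed reading of the label symbols (not confirmed by the notes).
PROMINENCE_MARKS = {'"': 2, "'": 1}
BREAK_MARK = "|"


@dataclass
class Word:
    syllables: list[str]   # syllables as split on "."
    prominence: int        # 0, 1 or 2 under the assumed mapping
    break_after: bool      # True if a break mark follows this word


def parse_labeled(line: str) -> list[Word]:
    """Turn one manually labeled transcription line into a list of words."""
    words: list[Word] = []
    for token in line.split():
        if token == BREAK_MARK:
            if words:
                words[-1].break_after = True
            continue
        if token == ",":
            continue  # punctuation is skipped in this sketch
        prominence = PROMINENCE_MARKS.get(token[0], 0)
        if prominence:
            token = token[1:]  # drop the prominence mark
        words.append(Word(token.split("."), prominence, False))
    return words


def prominence_agreement(predicted: list[int], manual: list[int]) -> dict:
    """Share of words with identical labels, and with labels at most 1 apart."""
    if len(predicted) != len(manual) or not manual:
        raise ValueError("label sequences must be non-empty and equally long")
    pairs = list(zip(predicted, manual))
    exact = sum(p == m for p, m in pairs)
    near = sum(abs(p - m) <= 1 for p, m in pairs)
    return {"exact": exact / len(manual), "within_one": near / len(manual)}


# The manually labeled example line from the notes.
manual_line = ('"mun leæn \'okː.htɑ "sɑːp.me.lɑʃ , | kuh.te leæn '
               '\'pɑrə.kɑn "vi.soh \'sɑː.miː \'pɑrk.kujht')
words = parse_labeled(manual_line)
manual = [w.prominence for w in words]

# Invented predictor output, for illustration only (not real data).
predicted = [2, 0, 1, 1, 0, 0, 1, 2, 0, 1]
scores = prominence_agreement(predicted, manual)
print(scores)
```

Reporting a "within one level" score alongside exact agreement reflects the point that a perfect match is not required; which tolerance is actually acceptable would have to be decided by listening tests.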
Needed: A theory of Sámi prosody
2. = 3-level word prominence, 3-4 levels of phrase breaks
The Hki functional approach: Primitives of the tts f-approach: Prominence, the 0123 theory (annotated by numbers). The algorithm for predicting 0123 + breaks (≈ syntactic analysis (?!!)) for Finnish would be relevant
Prominence:
Antti has a proof of concept ruleset for Finnish
Machine learning:
Build a spec based upon what has been done for Finnish
Then work proceeds as agreed.