North Sami Text-to-Speech

Finite state and Constraint Grammar based Text-to-Speech processing

View the project on GitHub giellalt/speech-sme

Page Content

Formalities

Participants:

Tentative agenda:

Project overview

Financed by unused money from the first Divvun project. Financed by the Sámi Parliament. The work is done in cooperation with UiT and HU. Targeted release date is sometime next year.

Recording

Voices: Do both a male and female voice - at least the recordings. Using the same text will simplify some of the perparations / postrprocessing, but not that much

The text should be something between rich and balanced. 2 hours of recordings, rather more than less. 120 - 150 000 letters is a rough esitmate, at least for Finnish. This would approximate perhaps 2000 sentences?

Whole paragraphs with intersentence dependencies. Even though it isn’t possible to model that yet, it will come in the future, and it is good to be preprared.

The text needs to as much as possible reflect the text types we are expecting in final use. Things to consider:

The microphone should be a high quality, phase linear condenser microphone. Sampling frequency = basic CD quality is good enough. At least 16 bit resolution. As noise free as possible.

Software / tools needed

easier to generate

start to plan a larger project with more independence for the UiT group, including a new software engineer position.

Output products

Three areas come up as most relevant:

There are other future usages that is hard to envisage now, the synthetic voice has to be as general as possible, to be able to cope with as many different text types and usage scenarios as possible.

Tools for testing ling input

Tools for testing prosody prediction

Testing procedure:

Until now:

  1. Ling analysis
  2. Feed to B-Á
  3. Look at output
  4. Rewrite ling analysis
  5. Etc.

Soon:

  1. Ling analysis
  2. Feed to syntesizer
  3. Listen to output <==== (use the prototype for this? No)
  4. Rewrite ling analysis
  5. Etc.

Work capacity at the HU

synthesizer made in Hki :-)

Now: Antti in a (too?) key position. Knowledge transfer is thus high on the priority list. It is not about money as long as we do not extend to a new person.

Intermediate solution:

A person working both for Hki and for Tromsø. This person would be in Hki for training, but could later move to Tromsø.

Work plan

UiT - speech corpus:

  1. Collect text
  2. make recordings
  3. cut recordings into sentenes (like for the prototype)
  4. transcribe recordings

UiT - Building the textual input for training the synthesizer:

ortography 
Mon lean okta sápmelaš, guhte lean bargan visot sámi bargguid  ->
 -> lts rules (automatic)
  mun leæn okː.htɑ sɑːp.me.lɑʃ ,  kuh.te leæn pɑrə.kɑn vi.soh sɑː.miː pɑrk.kujht
-> manual labeling of prominence and breaks (for training the model)
"mun leæn 'okː.htɑ "sɑːp.me.lɑʃ , | kuh.te leæn 'pɑrə.kɑn "vi.soh 'sɑː.miː 'pɑrk.kujht

UiT: enhance the lts system to automatically add full prosody labeling as input to the synthesis. Match as close as possible the trained data, but a perfect match is not required to get a good synthesis.

Needed: A theory of Sámi prosody

  1. a perfect theory of Sámi prosody (from ToBi and onwards)
  2. a theory of tts-relevant Sámi prosody

2. = 3-level word prominence, 3-4 levels of phrase breaks

The Hki functional approach: Primitives of the tts f-approach: Prominence, the 0123 theory (annotated by numbers). The algorithm for predicting 0123 + breaks (≈ syntactic analysis (?!!)) for Finnish would be relevant

Prominence:

Antti has a proof of concept ruleset for Finnish

Machine learning:

Build a spec based upon what has been done for Finnish

Todolist

Then work proceeds as agreed.