Sámi Speech Synthesis / TTS planning meeting
Topics:
Financing:
- Sjur meeting project board (i.e. money) next Thursday
- Trond meeting Sámi Political Secretary next Thursday
Political view
Meetings
The Divvun project board meeting
Based upon the 2003 report.
Less than a year, and NOK ≈ 3.8 / 4.2 m
Arbeids- og inkluderingsdepartementet (Ministry of Labour and Social Inclusion), Sámi department
Forthcoming meeting.
Political setting:
- Universal design (universell utforming): persons with disabilities should have access, e.g. blind people
- Sámi language policy (for Sámi as well, not only for Norwegian) => the “me too! principle”
These political premises lead to the conclusion in 20.5.3 [1]
20.5.3 Talesyntese (speech synthesis)
“More and more services are becoming electronic, and a great deal of information is being made available in digital form, among other places in libraries. To give everyone access, several parties have pointed to the need for further effort from the public sector, and many stakeholders have pointed in particular to access to speech synthesis tools.”
“Speech has proved to be an effective way of presenting information, whether it is used alone, together with the text, or synchronised with the text so that the cursor shows which word is being read. Speech software can give people with reading and writing difficulties access to various kinds of text, from textbooks to newspapers. People who are learning Norwegian will also be able to benefit greatly from such functionality. Sametinget (the Sámi Parliament) has investigated whether it is possible to develop speech synthesis for the Sámi languages. The speech synthesis is intended to be used as a supplementary tool in combination with ordinary proofing programs. Experience shows that the use of speech synthesis supports both the reading and the writing process. In addition, speech synthesis could be used as a basis for developing a modern range of services in many fields.”
Two views:
1 - we’re done
2 - we have just started
- sentence prosody, intonation studies
- better empirical basis for the HMM model we have today
(the first 186 sentences sound better than the rest)
Form:
- Research collaboration
- Company to make the Windows-based boring stuff
The company will be reluctant to open up the system. Bringing the two together might be harder.
For speech synthesis we will need a person who is capable of developing a TTS system.
Our view
- Linguistically interesting
- Working
- Developing our research milieus
- Companies must be chosen via public tender
Strategic goal
Establish a production line that is as independent as possible
Be able to scale up the number of languages
Where will we stop?
Maximal version:
Within our scope:
Production line for open platforms
The linguistic part, and the basic (to be specified) part of the
production line for proprietary platforms
Embedding applications into proprietary platforms is outside our scope
- do the linguistics
- see what the company does for Microsoft
- learn from that
- do “the same” for Unix platforms
Prerequisites
Parsers:
- morphology, disambiguation, dependency: sme
- morphology, disambiguation: smj
- morphology (partly): sma
Languages:
- Lule Sámi: Like North Sámi, but writes the consonants according to Swedish orthographic rules, and “has a more Scandinavian sound” (tt subjective)
- South Sámi: Different from sme and smj: no consonant gradation, extensive umlaut, more diphthongs, more vowels, and a “Scandinavian-based orthography”
Work
Difference before and after 186
Needing:
- a professional voice
- Better quality on the recording itself
- We would need at least 1000 sentences and more varied material
- The material should be controlled for the diphone repertoire, so that all (or most) of the diphones are included
- The actual coverage does not matter in the statistical analysis; we will have to rely on luck and on having a big enough sample
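One way to control the recording script for the diphone repertoire is greedy selection: repeatedly pick the candidate sentence that adds the most diphones not yet covered. The sketch below is only an illustration under stated assumptions. The function names, the "#" silence marker and the input format (sentence id mapped to a list of phone symbols) are hypothetical and not part of any existing Divvun tool.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs; '#' marks silence at both edges."""
    padded = ["#"] + list(phones) + ["#"]
    return set(zip(padded, padded[1:]))


def select_sentences(candidates, max_sentences):
    """Greedy selection of sentences by new diphone coverage.

    candidates: dict mapping a sentence id to its list of phone symbols
    (hypothetical input format). Returns (chosen ids in order, covered diphones).
    """
    remaining = {sid: diphones(phones) for sid, phones in candidates.items()}
    covered = set()
    chosen = []
    while remaining and len(chosen) < max_sentences:
        # Pick the sentence that contributes the most uncovered diphones.
        best = max(remaining, key=lambda sid: len(remaining[sid] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining sentence adds anything new
        covered |= gain
        chosen.append(best)
    return chosen, covered


if __name__ == "__main__":
    toy = {"s1": ["a", "b"], "s2": ["a", "b", "a"], "s3": ["c"]}
    ids, cov = select_sentences(toy, max_sentences=2)
    print(ids, len(cov))
```

Greedy selection does not guarantee the smallest possible script, but it is simple and usually gets close. The phone strings would have to come from the text-to-IPA step.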
For Finnish: 1 h of speech, approx. 500-1000 sentences.
Recording is cheap, the cost is in the annotation:
Labeling accentuation and breaks.
1 day = 100 sentences
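The annotation rate above turns directly into an effort estimate. The small sketch below uses the rate of 100 sentences per day from these notes; the function name is hypothetical, and rounding up to whole days is an assumption.

```python
import math

SENTENCES_PER_DAY = 100  # annotation rate taken from these notes


def annotation_days(n_sentences, per_day=SENTENCES_PER_DAY):
    """Working days needed to label accentuation and breaks (rounded up)."""
    return math.ceil(n_sentences / per_day)


if __name__ == "__main__":
    for n in (1000, 2000):
        print(n, "sentences:", annotation_days(n), "days")
```

With 1000 sentences this gives 10 working days, and with 2000 sentences 20 working days, which is roughly one month if a month is assumed to have about 20 working days.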
Work on the prediction side:
Make a grammar for predicting the prosodic unit
The synthesis generates the contours as soon as we have the prosodic units.
Putting in the breaks was important, but after that the HMM worked.
Autosegmental break
Predicting abstract prominence:
- word order
- information structure
- POS
- focus particles
- semantic issues
- Syntactic analysis ==>
- information and intonation structure, accent prediction =>
- F0
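As a rough illustration of how cues such as POS and focus particles could feed an abstract prominence level, here is a toy rule-based sketch. Everything in it is an assumption made for illustration: the tag names, the three-level scale and the rule that a focus-marked word gets the highest level are hypothetical, and a real predictor would be based on the syntactic analysis and on annotated data.

```python
# Hypothetical POS tags treated as content words; a real tagset may differ.
CONTENT_POS = {"N", "V", "A", "Adv"}


def predict_prominence(tokens):
    """Assign an abstract prominence level (0, 1 or 2) to each token.

    tokens: list of (word, pos, has_focus_particle) tuples (assumed format).
    Toy rules: function words get 0, content words 1, focus-marked words 2.
    """
    levels = []
    for _word, pos, has_focus_particle in tokens:
        if has_focus_particle:
            levels.append(2)
        elif pos in CONTENT_POS:
            levels.append(1)
        else:
            levels.append(0)
    return levels


if __name__ == "__main__":
    sentence = [("w1", "N", False), ("w2", "CC", False), ("w3", "V", True)]
    print(predict_prominence(sentence))
```

The predicted levels would then be handed to the synthesis, which generates the F0 contour from them.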
Product from our proof-of-concept demo.
Important development goals:
- modularity
- reusability
- language independence
Tasks for making the voice:
- Training people
- Developing the text-to-IPA conversion (grapheme-to-phone(me))
- text corpora
- selecting text, preprocessing text
- 1 week / language?
- speech corpora
- hiring voice talent + studio
- recording in studio
- preprocessing (splicing into chunks, files, etc)
- 2000 sentences / language (Finnish * 2)
- 1 month of annotation (per speaker (voice))
- Note that the cost of adding one more voice is relatively small,
compared to other frameworks
- Make the HMM model <==================
- [with a unit selection model, it is not 1 month/language but 6-12 months/language]
- Syntax (external project)
- Find the grammatical cues needed
- Intonation (syntactic and information structure to prosodic phrasing)
- Predict F0 on the basis of syntax, semantics and information structure
- Patrik’s project = 4 man-years
- Patrik’s recordings could feed into the project
- Mapping syntax onto F0
- Open question whether Sámi intonation is more complicated than Finnish
- Sum:
Labeling
syn-analysis
=> accent and phrasing
Voice-building:
- company, or ourselves based on a tool licensing model?
Tasks for making TTS out of the voice:
- number processing (already in place)
- abbreviation processing
- foreign names (including majority language names)
- dates (partly in place)
- integrating this processing with the synthesis system
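The processing steps listed above can be seen as a small front end that classifies each token and hands it to the matching expander before synthesis. The sketch below only shows that dispatch structure. The patterns, the function names and the expander interface are assumptions for illustration; the actual number and date processing already in place is not reproduced here.

```python
import re

# Assumed surface patterns: dates like 12.03.2008 or 12.3., numbers like 186 or 3,8.
DATE_RE = re.compile(r"^\d{1,2}\.\d{1,2}\.(\d{4}|\d{2})?$")
NUMBER_RE = re.compile(r"^\d+([.,]\d+)?$")


def classify(token, abbreviations):
    """Return 'date', 'number', 'abbreviation' or 'word' for one token."""
    if DATE_RE.match(token):
        return "date"
    if NUMBER_RE.match(token):
        return "number"
    if token.lower() in abbreviations:
        return "abbreviation"
    return "word"


def normalize(text, abbreviations, expanders):
    """Expand every token for which an expander function is registered.

    expanders: dict mapping a category name to a function token -> spoken form
    (hypothetical interface). Tokens without an expander pass through unchanged.
    """
    out = []
    for token in text.split():
        expand = expanders.get(classify(token, abbreviations))
        out.append(expand(token) if expand else token)
    return " ".join(out)


if __name__ == "__main__":
    demo = {"number": lambda t: "<number:" + t + ">"}
    print(normalize("sentences 1 to 186", set(), demo))
```

Keeping the classification separate from the expanders means the language-specific parts (number words, abbreviation lexicon, date reading) can be swapped per language without changing the dispatch code.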
Tasks for making a product:
- integrating with OSes and end user applications <=== major task, company
- text encoding / Unicode and Sámi
- parser integration / replacement - depends on the selected company
- one-time job
- installation packages
- user documentation
Try to build in an option for a future open-source release. Much of the HMM work is open source.
Most HMM projects:
HTS toolkit, Festival, etc.
The future of speech synthesis is in HMM synthesis.
2003: 2/3 unit selection, 1/3 diphone, small fraction doing HMM
2008: 1/3 => 2/3 HMM, 1/3 unit selection, nobody doing diphone
- Tromsø/Divvun
- (Syntax and intonation, outside the project proper): research + parser work, 1 month / lang
- Text and analysis: 5 months / lang
- recording: 1 week / lang
- transcription: 1 month / lang + 1 month overhead for the first
- Total: x man-months
- Helsinki
- Consultation work: 1 month for the first voice, and then less
- …?
- 2-3 man-months
- prototypes for 2 more langs: 2 months
- Company
- Build voice
- Embed into programs
- x man-months
- y EUR
Persons:
- voice expert (Helsinki)
- Sámi linguist
- phonologist
Project-internal training:
- kick-off seminar
- training, education of the project members
24 man-months vs 48 man-months??
Presentations on Thursday:
Sjur to integrate his earlier report with the notes from this meeting
Demo:
- The Sámi sentences:
- sentences from 1-186,
- the original Biret Ánne for the same sentences
- sentences from the 187ff realm (… and hear the difference)
HMM examples for Finnish: [http://homepages.inf.ed.ac.uk/jyamagis/]
[1] [http://www.regjeringen.no/nb/dep/aid/dok/regpubl/stmeld/2007-2008/stmeld-nr-28-2007-2008-/20/5.html?id=513086]