Finite state and Constraint Grammar based Text-to-Speech processing
View the project on GitHub giellalt/speech-sme
This document is an overview of the work of assembling and editing texts for reading, i.e. the texts used in recording the voices.
Considering that we want our end product to be able to read “everything”, the texts must range from formal language to colloquial language. The different styles show different preferences for long words, possessive suffixes and particles, which in turn have different implications for prosody. We need a good mixture of these styles.
Formal language: translated, ‘bureaucratic’ texts. These usually have much longer sentences than texts originally written in Sámi; a whole paragraph can be one sentence. Other characteristics: many subordinate clauses, passive sentences, mostly 3rd person singular and 1st person plural verb forms (short suffixes), participle constructions preferred to relative clauses (sentences are heavy at the beginning), possessive suffixes are common, particles are uncommon. Bureaucratic, political vocabulary with long words. Compounds in which the first element is trisyllabic are common. Many abbreviations, acronyms and parentheses. Some numbers, mostly years.
Semi-formal language: Original language is Sámi, scholarly/textbook style. Slightly shorter sentences, relative clauses preferred to participial constructions, passive sentences, a wider range of verb forms, such as dual verbs (disyllabic suffixes). Possessive suffixes are common, particles occur. Everyday vocabulary with elements of technical and traditional terminology, a mixture of long and short words. Compounds in which the first element is trisyllabic are common. Many numbers, such as dates, years and amounts. Numbers also come in inflected forms, written with colons. Many abbreviations and parentheses. Listings, such as nouns separated by commas.
Neutral language: Original language is Sámi, children’s textbook style. Short sentences, core word order, use of particles, possessive suffixes occur but are not common, everyday vocabulary, mostly short words. Numbers occur, mostly amounts and some years. Many listings, such as nouns separated by commas.
Semi-colloquial language: Original language is Sámi, conversation style. Long sentences, with relative clauses or non-finite small clauses. Non-finite small clauses with gerunds and agentive forms are preferred to relative clauses. Subjectless sentences. Sentences tend to have more material towards the end. Extensive use of particles, possessive suffixes occur. Traditional vocabulary with many derived verb forms, a mixture of long and short words. Numbers occur, mostly years. Uncommon consonants and consonant centres are most likely to occur in these texts.
List of single words, letters, numbers and dates.
Fairytales: Fairytales are probably not good for text-to-speech purposes, because of the exaggerated prosody. However, some consonants and consonant clusters are rather uncommon in the other types of text, such as the voiceless sonorants hl, hr, hj, hn. These seem to be more frequent in traditional texts, such as fairytales. I have divided the fairytales between neutral and semi-colloquial language. The voice talents must be instructed to read the fairytales in an ordinary conversational style, and not stay true to fairytale style.
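As a rough way to check whether a candidate set of reading texts actually contains the uncommon sequences mentioned above (hl, hr, hj, hn), one could count them per text. The following is only a minimal sketch: the function names `count_clusters` and `missing_clusters` are hypothetical and not part of the project, and plain substring matching on the orthography is an assumption that will also match accidental letter sequences (for example across compound boundaries), so the counts are approximate.

```python
import re
from collections import Counter

# Letter sequences for the voiceless sonorants named in the text.
CLUSTERS = ("hl", "hr", "hj", "hn")


def count_clusters(text, clusters=CLUSTERS):
    """Count non-overlapping occurrences of each cluster in the text.

    Matching is case-insensitive and purely orthographic (an assumption):
    it does not know about word or compound boundaries.
    """
    lowered = text.lower()
    counts = Counter()
    for cluster in clusters:
        counts[cluster] = len(re.findall(re.escape(cluster), lowered))
    return counts


def missing_clusters(texts, clusters=CLUSTERS):
    """Return the clusters that occur in none of the given texts."""
    total = Counter()
    for text in texts:
        total.update(count_clusters(text, clusters))
    return [cluster for cluster in clusters if total[cluster] == 0]


if __name__ == "__main__":
    # Placeholder letter strings for illustration only, not real Sámi words.
    samples = ["ahla ohra", "ehja ala"]
    for sample in samples:
        print(sample, dict(count_clusters(sample)))
    print("not covered:", missing_clusters(samples))
```

Running such a count over each style category separately would show which categories contribute the rare sequences and whether the fairytale material is needed to cover them.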
The chosen texts are not entirely authentic. They have been altered to accommodate reading fluency. Some texts have not been proofread properly before publishing: