North Sami Text-to-Speech

Finite state and Constraint Grammar based Text-to-Speech processing

View the project on GitHub giellalt/speech-sme


This document looks at different ways of expanding numbers to words. There are several ways to do this:

(English also allows reading numbers as pairs of digits, as in years or street addresses, but I have not found any cases where that would be natural in North Sámi.)

In Sproat et al. 2001 there is a taxonomy of non-standard words and how they should be expanded/converted to words. This is based on English. Using their taxonomy, this is my suggestion for North Sámi:

Number to integer

The integer solution should be our default for short numbers (fewer than four digits) when we cannot identify the type of number.

| Type | Description |
|------|-------------|
| NUM | number (cardinal), for amounts, room numbers, book chapters, reindeer herding districts, university courses, law paragraphs |
| NORD | ordinal numbers |
| NADDR | street address: Davviluohkká 27: Davviluohkká guoktelogičieža; post box: Poastaboksa 208: Poastaboksa guoktečuođigávcci |
| PRCT | percentage: 32%: golbmalogiguokte proseantta |
| NYER | years. The date converter already covers some of this, but we need more for expressions like 1980-logut, 80-logut, 1900-logus, and for case marking. In Sweden and Norway, the years 1100 to 1999 are read as in the Scandinavian languages, in hundreds (eleven hundred etc.). |
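The integer expansion can be sketched as follows. This is a minimal illustration for 0 to 999 in the nominative only, not the project's finite-state implementation. The function name `expand_integer` is hypothetical, and the teen pattern (unit + `nuppelohkái`) is an assumption that is not shown in its nominative form in this document.

```python
# North Sámi names for 1-9 (nominative), as they occur in the examples in this document.
UNITS = ["", "okta", "guokte", "golbma", "njeallje", "vihtta",
         "guhtta", "čieža", "gávcci", "ovcci"]


def expand_integer(n: int) -> str:
    """Return the nominative cardinal for 0 <= n <= 999 as a single word."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0-999")
    if n == 0:
        return "nulla"
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    parts = []
    if hundreds:
        # 100 is plain "čuođi"; 200 is "guoktečuođi", and so on.
        parts.append((UNITS[hundreds] if hundreds > 1 else "") + "čuođi")
    if tens == 1 and units:
        # Assumed teen pattern (11-19): unit + "nuppelohkái".
        parts.append(UNITS[units] + "nuppelohkái")
    else:
        if tens:
            # 10 is plain "logi"; 20 is "guoktelogi", and so on.
            parts.append((UNITS[tens] if tens > 1 else "") + "logi")
        if units:
            parts.append(UNITS[units])
    return "".join(parts)


print(expand_integer(27))   # guoktelogičieža
print(expand_integer(208))  # guoktečuođigávcci
```

Case-marked forms (genitive, locative, and so on) are not handled here; they would need the inflected numeral forms from the lexicon.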

Number to string of digits

The string of digits should probably be our default for longer numbers (more than three digits) when we cannot identify which type of number we have:

| Type | Description |
|------|-------------|
| NDIG | expands to string of digits |
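The two default rules (integer for short numbers, string of digits for longer ones) could be combined into one fallback decision. The sketch below only shows the length check; `default_expansion` and its return labels are hypothetical names.

```python
def default_expansion(token: str) -> str:
    """Choose the fallback reading for a number whose type is unknown.

    Returns "integer" for fewer than four digits and "digits" otherwise.
    """
    digits = token.strip()
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a plain digit sequence: {token!r}")
    return "integer" if len(digits) < 4 else "digits"


print(default_expansion("208"))   # integer
print(default_expansion("9520"))  # digits
```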

Phone numbers

| Type | Description |
|------|-------------|
| NTEL | phone numbers; pause at white spaces |

If there is no white space, we will have to insert default pauses. Special phone numbers such as 110, 112 and 113 should be converted to integers.
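A possible way to plan the reading of a phone number is sketched below. The name `plan_phone_reading` is hypothetical, and the default grouping into pairs when there is no white space is an assumption; the text above only says that default pauses are needed. The sample numbers are made up.

```python
# Special numbers named in the text, which are read as integers.
SPECIAL_NUMBERS = {"110", "112", "113"}


def plan_phone_reading(number: str, default_group_size: int = 2):
    """Return (mode, groups); a pause goes between consecutive groups.

    mode is "integer" for the special numbers and "digits" otherwise.
    """
    groups = number.split()
    if not groups or not all(g.isascii() and g.isdigit() for g in groups):
        raise ValueError(f"not a phone number: {number!r}")
    if len(groups) == 1:
        compact = groups[0]
        if compact in SPECIAL_NUMBERS:
            return ("integer", [compact])
        # No white space in the input: fall back to fixed-size groups (assumed size).
        groups = [compact[i:i + default_group_size]
                  for i in range(0, len(compact), default_group_size)]
    return ("digits", groups)


print(plan_phone_reading("12 34 56 78"))  # ('digits', ['12', '34', '56', '78'])
print(plan_phone_reading("12345678"))     # ('digits', ['12', '34', '56', '78'])
print(plan_phone_reading("112"))          # ('integer', ['112'])
```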

Zip codes

| Type | Description |
|------|-------------|
| NZIP | zip code: 9520 Guovdageaidnu: ovcci vihtta guokte nulla |
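Reading a number digit by digit is a simple lookup. The sketch below reproduces the zip code example; `expand_digit_string` is a hypothetical name, and the digit names are the nominative forms that occur in the examples in this document.

```python
DIGIT_NAMES = {
    "0": "nulla", "1": "okta", "2": "guokte", "3": "golbma", "4": "njeallje",
    "5": "vihtta", "6": "guhtta", "7": "čieža", "8": "gávcci", "9": "ovcci",
}


def expand_digit_string(token: str) -> str:
    """Read each digit separately, ignoring white space inside the token."""
    words = []
    for ch in token:
        if ch.isspace():
            continue
        if ch not in DIGIT_NAMES:
            raise ValueError(f"unexpected character {ch!r} in {token!r}")
        words.append(DIGIT_NAMES[ch])
    return " ".join(words)


print(expand_digit_string("9520"))  # ovcci vihtta guokte nulla
```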

Some zip codes look like “clear” integers. We could also have them expand to integers, such as

My experience is that people get addresses and phone numbers more easily when you read each digit separately.

Numeral to compound

TODO:

Money

Money is more complicated than this in North Sámi because, as with months, we don't have a shorthand way of saying things using only numerals:

What do we do with other currencies? Do we specify for each currency, like

Or do we just use čuokkis or rihkku:

There is also another class, BMONEY, where you have millions and trillions:

Year

Sometimes the noun lohku follows the year:

Concord rules are not the same in these structures as they are otherwise. The numeral is the first part of a compound, and should not change. Compare with golbmaoktalaš (Pekka Sammallahti, personal communication).

1600-lohku:      guhttanuppelotčuohte-lohku (both are nominative)
1600-logu rájes: guhttanuppelotčuohte-logu rájes
1600-logus:      guhttanuppelotčuohte-logus
1600-lohkui:     guhttanuppelotčuohtelohkui
1600-loguin:     guhttanuppelotčuohteloguin
1600-loguid:     guhttanuppelotčuohteloguid
1600-loguide:    guhttanuppelotčuohteloguide


1950-lohku:      ovccinuppelotčuođivihttalot-lohku
1950-logu rájes: ovccinuppelotčuođivihttalot-logu rájes
1950-logus:      ovccinuppelotčuođivihttalot-logus
1950-lohkui:     ovccinuppelotčuođivihttalot-lohkui
1950-loguin:     ovccinuppelotčuođivihttalot-loguin
1950-loguid:     ovccinuppelotčuođivihttalot-loguid
1950-loguide:    ovccinuppelotčuođivihttalot-loguide
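The rule that the numeral stays unchanged while only lohku inflects can be sketched as a split on the hyphen. The year readings are taken from the examples above. The function name is hypothetical, and the sketch always keeps the hyphen in the output, although the examples above are written both with and without it.

```python
# Year readings taken from the examples in this section; a real system would generate them.
YEAR_READINGS = {
    "1600": "guhttanuppelotčuohte",
    "1950": "ovccinuppelotčuođivihttalot",
}

# Inflected forms of "lohku" that occur in the examples.
LOHKU_FORMS = {"lohku", "logu", "logus", "lohkui", "loguin", "loguid", "loguide"}


def expand_year_lohku(token: str) -> str:
    """Expand e.g. "1600-logus": the numeral is kept as is, the noun carries the case."""
    year, sep, noun = token.partition("-")
    if not sep or noun not in LOHKU_FORMS:
        raise ValueError(f"not a year + lohku expression: {token!r}")
    if year not in YEAR_READINGS:
        raise ValueError(f"no reading available for year {year!r} in this sketch")
    return YEAR_READINGS[year] + "-" + noun


print(expand_year_lohku("1600-logus"))   # guhttanuppelotčuohte-logus
print(expand_year_lohku("1950-lohkui"))  # ovccinuppelotčuođivihttalot-lohkui
```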

Number + jahkásaš etc:

We have this and more combinations in the lexicon.

To-From-expressions

Numbers separated by a hyphen

(This only includes expressions which are not separated by some word in the text, i.e., not things like Vázzen mánáidskuvlla Mázes 1960 čavčča rájes gitta 1966 geassái.)

Sometimes from-to expressions should be separated by some word, sometimes not. When the numbers refer to years, a word like gitta seems to be required. When the numbers refer to amounts, it is not always necessary to have a word intervene between them:

ja borrá 20 000 – 30 000 divresuosa ovtta geasis: guokte-golbmalotduháha, guoktelot-golbmalotduháha

Koloniseren (1200-1700): guoktenuppelotčuođi gitta čiežanuppelotčuođi
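Detecting a hyphenated range and deciding whether to insert gitta could look roughly like this. The heuristic for years (both ends written as exactly four digits, with values from 1000 to 2100) is an assumption for illustration, and `parse_range` is a hypothetical name.

```python
import re

# Two numbers (optionally with spaces as thousands separators) joined by "-" or an en dash.
RANGE_RE = re.compile(r"^\s*(\d[\d\s]*?)\s*[-\u2013]\s*(\d[\d\s]*?)\s*$")


def parse_range(text: str):
    """Return (start, end, connective) or None if text is not a number range.

    connective is "gitta" when both ends look like years, otherwise None.
    """
    match = RANGE_RE.match(text)
    if match is None:
        return None
    raw_start, raw_end = match.group(1), match.group(2)
    # Assumed heuristic: a year is written as exactly four digits, without spaces.
    looks_like_years = all(
        len(raw) == 4 and raw.isdigit() and 1000 <= int(raw) <= 2100
        for raw in (raw_start, raw_end)
    )
    start = re.sub(r"\s", "", raw_start)
    end = re.sub(r"\s", "", raw_end)
    return (start, end, "gitta" if looks_like_years else None)


print(parse_range("1200-1700"))          # ('1200', '1700', 'gitta')
print(parse_range("20 000 \u2013 30 000"))  # ('20000', '30000', None)
```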

Other things

Fractions:

| Fraction | Expansion |
|----------|-----------|
| 1/4 | njealjádas, njealját-oassi |
| 3/4 | golbma njealjádasa, golbma njealját-oassi |
| 1/2 | bealli |
| 1½ | beannot, okta ja bealli |
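For the few fractions listed, a lookup table with the alternative readings may be enough as a first step. The sketch below only contains the fractions given above; `expand_fraction` and its `variant` parameter are hypothetical.

```python
# Readings copied from the list of fractions above; the first one is used by default.
FRACTION_READINGS = {
    "1/4": ["njealjádas", "njealját-oassi"],
    "3/4": ["golbma njealjádasa", "golbma njealját-oassi"],
    "1/2": ["bealli"],
    "1½": ["beannot", "okta ja bealli"],
}


def expand_fraction(token: str, variant: int = 0) -> str:
    """Return one of the listed readings for a fraction token."""
    readings = FRACTION_READINGS.get(token.strip())
    if readings is None:
        raise ValueError(f"no reading listed for {token!r}")
    if not 0 <= variant < len(readings):
        raise ValueError(f"{token!r} has only {len(readings)} reading(s)")
    return readings[variant]


print(expand_fraction("3/4"))            # golbma njealjádasa
print(expand_fraction("1½", variant=1))  # okta ja bealli
```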

Types with case marking:

NUM:    SAM-3014:s lea rievdan lohkanmearri.
        SAM golbmaduhátnjealljenuppelogis   
            LOC

        
NYER:   Lean bargan dáppe 1991 rájes.       
        ovccinuppelotčuođiovccilogiovtta             
        GEN
PRCT:   Jávkan lea lassánan 32%:s 40%:i.
        golbmalogiguovtti proseanttas njealljelot prosentii
        GEN               LOC         ATTR        ILL
NDATE:  Mii váldit vuostá studeantačállosiid 03.12 rájes 
        juovlamánu goalmmát beaivvi rájes
                            GEN
NTIME:  Mii fertet geargat 15:00 rádjái
        golmma rádjái
        GEN
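The case that the expansion has to carry can often be read off the token itself or off the following word. The sketch below covers only the patterns in the examples above: the suffixes `:s` (LOC) and `:i` (ILL), and the postpositions rájes and rádjái, which take the genitive. The names `required_case`, `SUFFIX_CASE` and `POSTPOSITION_CASE` are hypothetical.

```python
import re

SUFFIX_CASE = {"s": "LOC", "i": "ILL"}                     # 32%:s, 40%:i
POSTPOSITION_CASE = {"rájes": "GEN", "rádjái": "GEN"}      # 1991 rájes, 15:00 rádjái

# Body, optionally followed by ":" and a suffix made of letters only,
# so that the colon in "15:00" is not taken for a case suffix.
TOKEN_RE = re.compile(r"^(?P<body>.+?)(?::(?P<suffix>[^\W\d_]+))?$")


def required_case(token: str, next_word: str = ""):
    """Return (body, case); case is None when nothing in the sketch applies."""
    match = TOKEN_RE.match(token.rstrip(".,"))
    if match is None:
        raise ValueError("empty token")
    body, suffix = match.group("body"), match.group("suffix")
    if suffix is not None:
        return (body, SUFFIX_CASE.get(suffix))
    return (body, POSTPOSITION_CASE.get(next_word.rstrip(".,")))


print(required_case("32%:s"))            # ('32%', 'LOC')
print(required_case("40%:i."))           # ('40%', 'ILL')
print(required_case("1991", "rájes."))   # ('1991', 'GEN')
print(required_case("15:00", "rádjái"))  # ('15:00', 'GEN')
```

The detected case would then select the inflected form of the expanded numeral, which this sketch does not attempt.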