Finite state and Constraint Grammar based Text-to-Speech processing
View the project on GitHub giellalt/speech-sme
This document looks at different ways of expanding numbers to words. There are several ways to do this:
(In English, digits can also be read in pairs, as in years or street addresses, but I have not found any examples where that would be natural in North Sámi.)
In Sproat et al. 2001 there is a taxonomy of non-standard words and how they should be expanded/converted to words. This is based on English. Using their taxonomy, this is my suggestion for North Sámi:
The integer solution should be our default for short numbers (fewer than four digits) when we cannot identify the type of number.
Type | Description |
---|---|
NUM | number (cardinal), for amounts, room numbers, book chapters, reindeer herding districts, university courses, law paragraphs |
NORD | ordinal numbers |
NADDR | street address: Davviluohkká 27: Davviluohkká guoktelogičieža |
NADDR | post box: Poastaboksa 208: Poastaboksa guoktečuođigávcci |
PRCT | percentage: 32%: golbmalogiguokte proseantta |
NYER | years. The date converter already handles some of these, but we need more for expressions like 1980-logut, 80-logut, 1900-logus, and also for case marking. In Sweden and Norway, years between 1100 and 1999 are read as in the Scandinavian languages, in hundreds (eleven hundred, etc.). |
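The integer readings in the table (27, 208, 32) can be reproduced with a small cardinal expander. What follows is a minimal Python sketch, not the project's finite-state implementation. The function name `expand_cardinal` is hypothetical, the sketch only covers 0-999 in the nominative, and the forms logi, čuođi and -nuppelohkái are assumed standard forms that are not all spelled out in this document.

```python
# Nominative cardinals for 0-9, as they occur in the examples in this document.
UNITS = ["nulla", "okta", "guokte", "golbma", "njeallje",
         "vihtta", "guhtta", "čieža", "gávcci", "ovcci"]


def expand_cardinal(n: int) -> str:
    """Expand 0..999 to a nominative North Sámi cardinal (sketch only)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0-999")
    if n < 10:
        return UNITS[n]
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    parts = []
    if hundreds:
        # Assumption: 100 is plain "čuođi", 200 is "guoktečuođi", and so on.
        parts.append(("" if hundreds == 1 else UNITS[hundreds]) + "čuođi")
    if tens == 1 and units:
        # Assumption: 11-19 are unit + "nuppelohkái".
        parts.append(UNITS[units] + "nuppelohkái")
    else:
        if tens:
            # Assumption: 10 is plain "logi", 20 is "guoktelogi", and so on.
            parts.append(("" if tens == 1 else UNITS[tens]) + "logi")
        if units:
            parts.append(UNITS[units])
    return "".join(parts)
```

With these assumptions, `expand_cardinal(27)` gives guoktelogičieža and `expand_cardinal(208)` gives guoktečuođigávcci, matching the table. Case-marked forms and numbers above 999 are outside the scope of the sketch.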
The string of digits should probably be our default solution for longer numbers (more than three digits) when we cannot identify the type of number:
Type | Description | Notes |
---|---|---|
NDIG | expands to a string of digits | |
NTEL | phone numbers | Pause at white spaces. If there is no white space, we will have to insert default pauses. Special phone numbers such as 110, 112, 113 should be converted to integers. |
NZIP | zip code | 9520 Guovdageaidnu: ovcci vihtta guokte nulla |
Some zip codes look like “clear” integers. We could also have them expand to integers, such as
My experience is that people get addresses and phone numbers more easily when you read each digit separately.
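The fallback rules above (integers for short numbers, digit strings for longer ones, pauses at white space in phone numbers) could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: all function names are hypothetical, and the default group size of two digits for phone numbers without white space is an arbitrary choice that the document does not specify.

```python
from typing import List

DIGIT_WORDS = {
    "0": "nulla", "1": "okta", "2": "guokte", "3": "golbma", "4": "njeallje",
    "5": "vihtta", "6": "guhtta", "7": "čieža", "8": "gávcci", "9": "ovcci",
}
SPECIAL_PHONE_NUMBERS = {"110", "112", "113"}


def default_strategy(digits: str) -> str:
    """Fallback for an unknown number type: fewer than four digits -> integer."""
    return "integer" if len(digits) < 4 else "digits"


def expand_digit_string(digits: str) -> str:
    """NDIG: read each digit separately; non-digit characters are skipped."""
    return " ".join(DIGIT_WORDS[ch] for ch in digits if ch in DIGIT_WORDS)


def phone_groups(number: str, default_size: int = 2) -> List[str]:
    """NTEL: return the groups between which a pause should be inserted."""
    groups = number.split()
    if len(groups) != 1:
        return groups  # white space in the input decides the pauses
    digits = groups[0]
    if digits in SPECIAL_PHONE_NUMBERS:
        return [digits]  # to be read as one integer, not digit by digit
    # No white space: fall back to fixed-size groups (the size is an assumption).
    return [digits[i:i + default_size] for i in range(0, len(digits), default_size)]
```

For the zip code example, `expand_digit_string("9520")` gives ovcci vihtta guokte nulla. Each group returned by `phone_groups` would then be expanded digit by digit, with a pause between groups.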
Money is more complicated than this in North Sámi, because, as with months, we don't have a "shorthand" way of saying things using only numerals:
What do we do with other currencies? Do we specify for each currency, like
Or do we just use čuokkis or rihkku:
There is also another class, BMONEY, where you have millions and trillions:
Sometimes the noun lohku follows the year:
Concord rules are not the same in these structures as they are otherwise. The numeral is the first part of a compound, and should not change. Compare with golbmaoktalaš (Pekka Sammallahti, personal communication).
1600-lohku: guhttanuppelotčuohte-lohku (both are nominative)
1600-logu rájes: guhttanuppelotčuohte-logu rájes
1600-logus: guhttanuppelotčuohte-logus
1600-lohkui: guhttanuppelotčuohtelohkui
1600-loguin: guhttanuppelotčuohteloguin
1600-loguid: guhttanuppelotčuohteloguid
1600-loguide: guhttanuppelotčuohteloguide
1950-lohku: ovccinuppelotčuođivihttalot-lohku
1950-logu rájes: ovccinuppelotčuođivihttalot-logu rájes
1950-logus: ovccinuppelotčuođivihttalot-logus
1950-lohkui: ovccinuppelotčuođivihttalot-lohkui
1950-loguin: ovccinuppelotčuođivihttalot-loguin
1950-loguid: ovccinuppelotčuođivihttalot-loguid
1950-loguide: ovccinuppelotčuođivihttalot-loguide
We have these and more combinations in the lexicon.
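The concord rule can be illustrated with a short Python sketch: the numeral stem stays fixed and only the form of lohku is copied over from the written token. The names `YEAR_STEMS`, `LOHKU_FORMS` and `expand_decade` are hypothetical. The stem table contains only the two stems listed above (a real system would generate them), and joining stem and noun with a hyphen is an assumption, since the lists above use both hyphenated and unhyphenated spellings.

```python
# Compound stems copied from the two examples above.
YEAR_STEMS = {
    "1600": "guhttanuppelotčuohte",
    "1950": "ovccinuppelotčuođivihttalot",
}
# Inflected forms of lohku that occur in the lists above.
LOHKU_FORMS = {"lohku", "logu", "logus", "lohkui", "loguin", "loguid", "loguide"}


def expand_decade(token: str) -> str:
    """Expand e.g. '1600-logus'; the numeral stem is never inflected."""
    year, sep, noun = token.partition("-")
    if not sep or noun not in LOHKU_FORMS:
        raise ValueError(f"not a year + lohku expression: {token!r}")
    if year not in YEAR_STEMS:
        raise KeyError(f"no compound stem listed for {year}")
    # Only the written form of lohku varies; it is appended unchanged.
    return YEAR_STEMS[year] + "-" + noun
```

For example, `expand_decade("1600-logus")` gives guhttanuppelotčuohte-logus. A following postposition such as rájes would be passed through unchanged by the surrounding text processing.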
Numbers separated by a hyphen
(This only includes expressions which are not separated by some word in the text, i.e., not things like Vázzen mánáidskuvlla Mázes 1960 čavčča rájes gitta 1966 geassái.)
Sometimes from-to expressions should be separated by some word, sometimes not. When the numbers refer to years, a word like gitta seems to be required. When the numbers refer to amounts, it is not always necessary to have a word intervene between them:
ja borrá 20 000 – 30 000 divresuosa ovtta geasis: guokte-golbmalotduháha, guoktelot-golbmalotduháha
Koloniseren (1200-1700): guoktenuppelotčuođi gitta čiežanuppelotčuođi
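The difference between year ranges and amount ranges could be handled before number expansion, along the following lines. This is a hedged Python sketch: `split_range` is a hypothetical helper, the caller is assumed to already know whether the numbers are years (`"NYER"`) or amounts, and the sketch only decides whether gitta is inserted; it does not expand the numbers or handle the shortened amount readings shown above.

```python
import re
from typing import List, Optional

# Two numbers (digits with optional thousands spaces) joined by a hyphen or en dash.
RANGE_RE = re.compile(r"^\s*(\d[\d ]*?)\s*[-–]\s*(\d[\d ]*?)\s*$")


def split_range(text: str, kind: str) -> Optional[List[str]]:
    """Split 'N-M' into tokens; year ranges get 'gitta' between the numbers."""
    match = RANGE_RE.match(text)
    if match is None:
        return None
    start, end = (group.replace(" ", "") for group in match.groups())
    if kind == "NYER":
        return [start, "gitta", end]
    # Amounts: no intervening word is required.
    return [start, end]
```

For instance, `split_range("1200-1700", "NYER")` returns the two years with gitta between them, while `split_range("20 000 – 30 000", "NUM")` returns only the two numbers.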
Fractions:

Fraction | Expansion |
---|---|
1/4 | njealjádas, njealját-oassi |
3/4 | golbma njealjádasa, golbma njealját-oassi |
1/2 | bealli |
1½ | beannot, okta ja bealli |
Types with case marking:
NUM: SAM-3014:s lea rievdan lohkanmearri.
SAM golbmaduhátnjealljenuppelogis
LOC
NYER: Lean bargan dáppe 1991 rájes.
ovccinuppelotčuođiovccilogiovtta
GEN
PRCT: Jávkan lea lassánan 32%:s 40%:i.
golbmalogiguovtti proseanttas njealljelot prosentii
GEN LOC ATTR ILL
NDATE: Mii váldit vuostá studeantačállosiid 03.12 rájes
juovlamánu goalmmát beaivvi rájes
GEN
NTIME: Mii fertet geargat 15:00 rádjái
golmma rádjái
GEN
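Before expansion, a case-marked token such as SAM-3014:s or 32%:s has to be split into its parts. The following Python sketch shows one way to do that. The name `parse_case_marked` and the returned dictionary layout are hypothetical, and the suffix table contains only the two suffixes that occur in the examples above (:s for locative, :i for illative). Case that is governed by a following word, such as the genitive before rájes or rádjái, is not marked on the token and is not handled here.

```python
import re
from typing import Dict, Optional

# Only the suffixes attested in the examples above.
CASE_SUFFIXES = {"s": "LOC", "i": "ILL"}

TOKEN_RE = re.compile(
    r"^(?P<prefix>[^\W\d]+-)?"   # optional letter prefix with hyphen, e.g. "SAM-"
    r"(?P<number>\d+)"           # the digits to expand
    r"(?P<percent>%)?"           # optional percent sign
    r"(?::(?P<suffix>\w+))?$"    # optional case suffix after a colon
)


def parse_case_marked(token: str) -> Optional[Dict[str, object]]:
    """Split a token into prefix, number, percent flag and case tag."""
    match = TOKEN_RE.match(token)
    if match is None:
        return None
    suffix = match.group("suffix")
    return {
        "prefix": (match.group("prefix") or "").rstrip("-"),
        "number": match.group("number"),
        "percent": match.group("percent") is not None,
        # Unknown suffixes map to None rather than to a guessed case.
        "case": CASE_SUFFIXES.get(suffix) if suffix else None,
    }
```

The case tag applies to the last word of the expansion: in 32%:s the locative is realised on proseanttas, while the numeral itself is in the genitive, as in the PRCT example above.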