Hyphenating texts using the API server
For the time being, only North Sámi can be hyphenated as shown below. For other languages than North Sámi, see this document. We hope to add support for more languages soon.
Basic command
curl -s -X POST -H 'Content-Type: application/json' \
'https://api-giellalt.uit.no/hyphenation/se' \
--data '{"text": "mun hálan davvisámegiela"}' | jq
Comments:
- we use
curlto access the REST API, with the-sparameter to silence it. --datacontains the actual text to be hyphenated. It can be long, but should preferably be restricted to single paragraphs for execution time reasons.jqto pretty print the output
Output:
{
"text": "mun hálan davvisámegiela",
"results": [
{
"word": "mun",
"hyphenations": [
{
"value": "mun",
"weight": 0.0
},
{
"value": "mun",
"weight": 5000.0
}
]
},
{
"word": "hálan",
"hyphenations": [
{
"value": "há^lan",
"weight": 0.0
},
{
"value": "há^lan",
"weight": 5000.0
}
]
},
{
"word": "davvisámegiela",
"hyphenations": [
{
"value": "dav^vi#sá^me#gie^la",
"weight": 0.0
},
{
"value": "dav^vi^sá^me^gie^la",
"weight": 5000.0
}
]
}
]
}
This is the raw output from the API server. Comments on the output:
- both input text and output data is listed
- hyphenation points are indicated with two symbols:
#: primary hyphenation point (usually a word boundary)^: secondary hyphenation point
- for each input word, all hyphenation patterns are listed, from best to worst
- the weight is a very rough indication of priority, with
0.0being the best - there will most often be at least two hyphenation patterns, one from the lexical lookup (those with weight
0.0), and one from the pattern-based fallback (weight5000.0or higher). For unrecognised misspellings or unknown words, only the pattern-based fallback is provided.
Filtered examples
curl -s -X POST -H 'Content-Type: application/json' \
'https://api-giellalt.uit.no/hyphenation/se' \
--data '{"text": "mun hálan davvisámegiela"}' |\
jq '.results[].hyphenations | map(select(.value)) | first'
Comment:
- we use
jqfiltering to only retain the most likely hyphenation pattern, with weights
Output:
{
"value": "mun",
"weight": 0.0
}
{
"value": "há^lan",
"weight": 0.0
}
{
"value": "dav^vi#sá^me#gie^la",
"weight": 0.0
}
The same example, but now with a misspelling; notice the change in weight for the last word:
curl -s -X POST -H 'Content-Type: application/json' \
'https://api-giellalt.uit.no/hyphenation/se' \
--data '{"text": "mun hálan davvisámegiellla"}' |\
jq '.results[].hyphenations | map(select(.value)) | first'
Output:
{
"value": "mun",
"weight": 0.0
}
{
"value": "há^lan",
"weight": 0.0
}
{
"value": "dav^vi^sá^me^giell^la",
"weight": 5000.0
}
If you only want the hyphenated input text, and not the json stuff, use the following jq filtering:
curl -s -X POST -H 'Content-Type: application/json' \
'https://api-giellalt.uit.no/hyphenation/se' \
--data '{"text": "mun hálan davvisámegiela"}' |\
jq '.results[].hyphenations | map(select(.value).value) | first'
Output:
"mun"
"há^lan"
"dav^vi#sá^me#gie^la"
Add -r/--raw-output to jq if you want to get rid of the quotes:
curl -s -X POST -H 'Content-Type: application/json' \
'https://api-giellalt.uit.no/hyphenation/se' \
--data '{"text": "mun hálan davvisámegiela"}' |\
jq -r '.results[].hyphenations | map(select(.value).value) | first'
Output:
mun
há^lan
dav^vi#sá^me#gie^la
If you have a text file that you would like to have hyphenated, do as follows:
cat textfile.txt |\
(printf '{"text": "' && cat && printf '"}') |\
curl -s -X POST -H 'Content-Type: application/json' \
'https://api-giellalt.uit.no/hyphenation/se' \
--data @- |\
jq '.results[].hyphenations | map(select(.value).value) | first'
Comments:
- the
printfstuff after the initialcatis there to wrap the file content in a simplejsonstructure, as that is what is expected on the other end. - add
-r/--raw-outputtojqif you want to get rid of the quotes (cf above)
Output (assuming the textfile.txt file has the same content as the example sentence used above):
"mun"
"há^lan"
"dav^vi#sá^me#gie^la"