GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
For parallel text between North, Lule and South Saami, and possibly other languages. In practice this will mainly concern texts between Norwegian and the three Saami languages.
Tasks:
Two person-months + overhead to UiT
The result of a-c will be a descriptive database of the ministry's texts, and an interface that translators can use to compare their translations with earlier translations.
After that, many person-months are needed to process the material further into an administrative dictionary:
A rough estimate would be about 6 person-months per language.
The nob-sme files are in the folder $BIGGIES/gt/sme/corp/forvaltningsordbok/
freecorpus/converted/sme/admin/depts/regjeringen.no/
- 1384 documents, 615852 words
Saami parliament files: freecorpus/converted/sme/admin/sd/
- 929 documents, 220377 words
freecorpus/converted/smj/admin/depts/
freecorpus/converted/smj/admin/depts/regjeringen.no/
freecorpus/converted/smj/admin/depts/
freecorpus/converted/sma/admin/depts/regjeringen.no/
Inside $GTFREE:
find orig -type f | grep -v '\.svn' | grep -v '\.xsl' | grep -v '\.DS_Store' | xargs convert2xml2.pl
The script first prints «you gave me $numArgs files to process», and then prints either . or | for
each file that is processed. . means the file was converted successfully, | means the conversion failed.
For much more verbose output to the terminal, use the --debug option.
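With many files, the row of . and | markers is hard to count by eye. The following is a minimal sketch of how the markers could be tallied, assuming the progress markers have been captured on their own line(s), separate from the «you gave me ... files to process» message. The function name count_progress and the sample input are hypothetical and not part of convert2xml2.pl.

```shell
#!/bin/sh
# Count successes (.) and failures (|) in a progress string read from stdin.
# Assumption: the input contains only the progress markers, not the
# "you gave me ... files to process" message or any --debug output.
count_progress() {
    progress=$(cat)
    # tr -cd keeps only the named character; wc -c then counts what is left.
    ok=$(printf '%s' "$progress" | tr -cd '.' | wc -c | tr -d ' ')
    failed=$(printf '%s' "$progress" | tr -cd '|' | wc -c | tr -d ' ')
    echo "converted=$ok failed=$failed"
}

# Hypothetical sample: five files, one of which failed to convert.
printf '%s\n' '..|..' | count_progress
```

For the sample input this prints converted=4 failed=1. In real use, the marker line would have to be separated from the rest of the script's output first, and how best to do that depends on how the script writes to the terminal.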
After the conversion, get a summary of the converted files this way:
java -Xmx2048m net.sf.saxon.Transform -it main $GTHOME/gt/script/corpus/ym_corpus_info.xsl inDir=$GTFREE/converted
This results in a file corpus_report/corpus_summary.xml
To find out which and how many files have no content, use this command:
java -Xmx2048m net.sf.saxon.Transform -it main ../corpus/get-empty-docs.xsl inFile=`pwd`/corpus_report/corpus_summary.xml
This results in a file out_emptyFiles/correp_emptyFiles.xml
The second line of that file states how many empty files there are.
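To see that count without opening the whole report, the second line can be printed directly. This is a small sketch that only relies on the file name given above; the helper name second_line is made up for illustration, and the command is assumed to run in the directory where out_emptyFiles/ was written.

```shell
#!/bin/sh
# Print the second line of a file, or complain if the file does not exist.
second_line() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    sed -n '2p' "$1"
}

# Report path as named in the text; nothing is printed if it is not there.
report=out_emptyFiles/correp_emptyFiles.xml
if [ -f "$report" ]; then
    second_line "$report"
fi
```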