GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
For parallel text between North, Lule and South Saami, and possibly other languages. In practice this will primarily concern texts between Norwegian and the three Saami languages.
Tasks:
Two person-months + overhead to UiT
The result of a-c will be a descriptive database of the ministry's texts, and an interface that the translators can use to compare their translations with earlier translations.
After that, many person-months are needed to process the material further into an administrative dictionary (forvaltningsordbok):
A rough estimate would be about 6 person-months per language.
freecorpus/converted/sme/admin/depts/regjeringen.no/
freecorpus/converted/sme/admin/sd/
freecorpus/converted/smj/admin/depts/
freecorpus/converted/smj/admin/depts/regjeringen.no/
freecorpus/converted/smj/admin/depts/
freecorpus/converted/sma/admin/depts/regjeringen.no/
The nob-sme files are in the folder $BIGGIES/gt/sme/corp/forvaltningsordbok/.
freecorpus/converted/sme/admin/depts/regjeringen.no/ - 1384 documents, 615852 words
freecorpus/converted/sme/admin/sd/ - 929 documents, 220377 words
freecorpus/converted/smj/admin/depts/
freecorpus/converted/smj/admin/depts/regjeringen.no/
freecorpus/converted/smj/admin/depts/
freecorpus/converted/sma/admin/depts/regjeringen.no/
Inside $GTFREE:
find orig -type f | grep -v .svn | grep -v .xsl | grep -v .DS_Store | xargs convert2xml2.pl
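The pipeline above splits file names on whitespace when handing them to xargs, and the unescaped dots in the grep patterns match any character. If that ever becomes a problem, the filtering can be done by find itself. The following is only a sketch, not the project's own method: the function name list_orig_files is hypothetical, and it assumes a find and xargs that support -print0 and -0 (GNU and BSD versions do).

```shell
# List original corpus files, skipping Subversion metadata directories,
# .xsl files and .DS_Store files. Output is NUL-delimited so that file
# names containing whitespace survive the trip through xargs.
list_orig_files() {
    find "$1" -type d -name .svn -prune -o \
        -type f ! -name '*.xsl' ! -name .DS_Store -print0
}

# Hypothetical usage inside $GTFREE (not run here):
# list_orig_files orig | xargs -0 convert2xml2.pl
```

Unlike the grep filters, this version excludes only directories named exactly .svn and files whose names end in .xsl or equal .DS_Store.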
The output of convert2xml2.pl starts with a message of thanks, «you gave me $numArgs files to process», followed by a "." or a "|" for each file that is processed. A "." means success; a "|" means that the file failed to convert.
For much more verbose output to the terminal, use the --debug option.
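With many files, the run of "." and "|" characters is hard to tally by eye. A minimal sketch of counting them, assuming the progress characters have been captured separately from the initial message and from any --debug output (the function name count_results is hypothetical):

```shell
# Count successes (.) and failures (|) in convert2xml2.pl progress output.
# Assumption: stdin contains only the progress characters, not the
# initial message line or any --debug output.
count_results() {
    progress=$(cat)
    # Delete everything except the character of interest, then count bytes.
    ok=$(printf '%s' "$progress" | tr -cd '.' | wc -c | tr -d ' ')
    failed=$(printf '%s' "$progress" | tr -cd '|' | wc -c | tr -d ' ')
    echo "converted: $ok, failed: $failed"
}

# Example with a made-up progress line
printf '%s\n' '..|...|.' | count_results
```

For the made-up line in the example this prints "converted: 6, failed: 2".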
After the conversion, get a summary of the converted files this way:
java -Xmx2048m net.sf.saxon.Transform -it main $GTHOME/gt/script/corpus/ym_corpus_info.xsl inDir=$GTFREE/converted
This results in a file corpus_report/corpus_summary.xml
To find out which and how many files have no content, use this command:
java -Xmx2048m net.sf.saxon.Transform -it main ../corpus/get-empty-docs.xsl inFile=`pwd`/corpus_report/corpus_summary.xml
This results in a file out_emptyFiles/correp_emptyFiles.xml
The second line of that file gives the number of empty files.
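To look at just that line without opening the whole report, something like the following can be used. This is a small sketch; the function name second_line is hypothetical, and it relies only on the count being on line 2 as stated above.

```shell
# Print the second line of a report file.
# Assumption: the number of empty files is reported on line 2.
second_line() {
    sed -n '2p' "$1"
}

# Hypothetical usage from the directory where the report was written:
# second_line out_emptyFiles/correp_emptyFiles.xml
```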