GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.


Metadata

This page documents metadata categories and subcategories as well as labels we use for these metadata in the Freiburg-Tromsø Speech Corpora.

Project-internally we collect different kinds of metadata. Not all of them can be made public, for ethical and legal reasons. Here we document only the metadata categories relevant for the corpora published through Korp. The main metadata categories describe:

*Actors (e.g. a recorded speaker, author, translator or annotator)
*Sessions (e.g. an annotated recording or an annotated written text)
*Texts (e.g. modality or genre)

All publicly available metadata is stored in IMDI format on the Session node in the TLA, in files separate from the [ELAN ELAN.html] annotations. A script (which does not yet exist) will convert IMDI into a structure that can be read into the Korp interface.

Actors

*Speakers (e.g. informants/consultants recorded and transcribed or authors/translators of written text included in the corpora)
*Annotators (e.g. PIs or assistants transcribing, translating or otherwise annotating recordings or written text included in the corpora)

Sessions

*Actors
*Date
*Equipment
*Media
*Place
*Project
*Languages

Texts

Actors

Date

Language(s)

Modality

We label this category Modality, by which we mean the way in which signs are transmitted by a sender. This category has two values:

*oral (e.g. speech which we have recorded on audio or audio+video and transcribed, or speech which is transcribed but for which no audio is available because it is lost or the speech was transcribed without being recorded)
*written (e.g. handwritten or printed texts, texts published online)

Other potential values (not relevant for our projects) are:

*gestured
*signed

Note that the kind of perception by a receiver is not relevant for our metadata categories (a written text can be received orally if we use text-to-speech, etc.). Neither does Modality in our sense refer to the actual medium (paper, video, etc.).

Language

Three-letter code in accordance with ISO 639-3

Genre

*poetry
*fiction
*ritual
*advertisement
*biography
*fairy tale
*facta
*idiom
*narrative
*teaching
*story

Register

*formal
*informal
*neutral
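The value lists for Modality, Genre and Register above work as controlled vocabularies, so labels can be checked mechanically. The following is only a minimal sketch in Python: the names `ALLOWED_VALUES` and `invalid_labels` and the dictionary layout are hypothetical and are not part of any existing project script.

```python
from typing import Dict, Set

# Controlled vocabularies copied from the lists on this page.
# The dictionary layout itself is an assumption made for illustration.
ALLOWED_VALUES: Dict[str, Set[str]] = {
    "Modality": {"oral", "written"},
    "Genre": {
        "poetry", "fiction", "ritual", "advertisement", "biography",
        "fairy tale", "facta", "idiom", "narrative", "teaching", "story",
    },
    "Register": {"formal", "informal", "neutral"},
}


def invalid_labels(text_metadata: Dict[str, str]) -> Dict[str, str]:
    """Return the category/value pairs whose value is not in the vocabulary.

    Categories without a controlled vocabulary (e.g. Date) are ignored.
    """
    return {
        category: value
        for category, value in text_metadata.items()
        if category in ALLOWED_VALUES and value not in ALLOWED_VALUES[category]
    }


# "poem" is not one of the Genre values listed above, so it is reported.
print(invalid_labels({"Modality": "oral", "Genre": "poem", "Register": "formal"}))
```

An empty result means that every checked label comes from the lists above.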

Medium

Other conventions

Note that the file names we use already include some metadata. For instance:

*sms19610000lagercrantz318
*sjd20150609aaa-sport

The first three letters (sms or sjd), in accordance with ISO 639-3, always mark the language (or main language) of a given session. The following eight digits (19610000 or 20150609) always mark the date of the session in the format YYYYMMDD. If the exact date is unknown or cannot be specified (e.g. in a book publication where only the year is given), we use the digit 0 in place of the unknown digits.
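The file name convention can be read back programmatically. The sketch below, in Python, is an illustration only: `SessionName` and `parse_session_name` are hypothetical names, and it assumes that everything after the eight date digits is free-form and that a run of zeros in the year, month or day position means "unknown".

```python
import re
from typing import NamedTuple, Optional


class SessionName(NamedTuple):
    """Metadata recovered from a session file name (hypothetical structure)."""
    language: str          # three-letter ISO 639-3 code, e.g. "sms"
    year: Optional[int]    # None if given as 0000
    month: Optional[int]   # None if given as 00
    day: Optional[int]     # None if given as 00
    rest: str              # remainder of the name, left uninterpreted


# Three lowercase letters, eight digits (YYYYMMDD), then anything.
_NAME_PATTERN = re.compile(r"^([a-z]{3})(\d{4})(\d{2})(\d{2})(.*)$")


def _known(digits: str) -> Optional[int]:
    """Return the number, or None when all digits are 0 (unknown)."""
    value = int(digits)
    return value if value != 0 else None


def parse_session_name(name: str) -> SessionName:
    """Split a session file name into language code, date parts and remainder."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised session name: {name!r}")
    language, year, month, day, rest = match.groups()
    return SessionName(language, _known(year), _known(month), _known(day), rest)


print(parse_session_name("sms19610000lagercrantz318"))
print(parse_session_name("sjd20150609aaa-sport"))
```

For the first example the month and day come back as `None`, since only the year 1961 is known. The sketch does not check that the three letters are a registered ISO 639-3 code or that the date is a valid calendar date.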

See also

XXX - ???