Skolt Sami NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-sms

Page Content

Meeting 
Date: 21.11.2009
11:30 - 18:00 (with an interruption for lunch)

Revision date: 24.11.2009 (Micha)


Micha, Lena, Josh, Ciprian

Topics: data for SJD (Micha), SMS (Lena+Micha), SJE (Josh)  

Some basic commands for svn:
Checkout: svn co PATH-TO-SVN-server LOCAL-DIR - once a lifetime (in theory)
Update command: svn up
Checkin command: svn ci -m "...write some log notes here..."
Status command: svn stat (compare local data with data in the svn repository)

1. Admin:

working with svn:
 out: Josh, Lena, Micha
 in: Josh, Lena, Micha
 
 TODO: checkin permissions for Lena (__CIP__)
 DONE

conventions for files to be saved in /inc:
filenames: asci, no kapitals, no spaces
readme file: 00_readme_filename.txt for comments/explanations of raw files

location of original data (this, possibly raw files are untouchable and have to be kept in its original state!!):
 gt/sjd/inc
 gt/sje/inc
 gt/sms/inc
 
location of working files (in xml format):
 gt/sjd/src
 gt/sje/src
 gt/sms/src

2. Data inventory: copyright, description

SJD:
 - originally XML generated from FileMaker (?)
 - then partially structured by Ciprian (this version is in the
 dictionary directory (sjd:rus))
 - then checked and edited by Micha (this is the version checked in by
 Micha in inc today)

 TODO:
  - move the file(s) from sjd:rus-dir to sjd/inc (__Ciprian__)
  - compare the last version of Cip's transformations to Micha's
    (__Ciprian__)
  - work further at the transformation towards the final version
    (__Ciprian__)


SMS:
 Skolt Sami in FileMake format in inc

DESCRIPTION:
 ID =unique index
 ID_Audio =index for matching of audiofiles
 Lemma group: Skolt (=Lemma in Skolt orthography), Skolt_Cyr (=Lemma in Cyrillic transliteration), GramInfo_Skolt (=grammatical info [not very systematic] in Latin), GramInfo_SkoltCyr (=grammatical info [not very systematic] in Cyrillic transliteration), PoS_Skolt (=Part-of-speech in Latin), PoS_SkoltCyr (=Part-of-speech in Russian translation [this could actually be updated with help of a script which translates the PoS_Skolt entries)
 Meaning group Kildin: Kildin (=Kildin Saami translation of Lemma), Kildin_Lat (=Latin transliteration of Kildin Saami translation)
Norwegian (=Norwegian translation of Lemma)
Finnish (=Finnish translation of Lemma)
English (=English translation of Lemma)
German (=German translation of Lemma)
Russian (=Russian translation of Lemma)
North Saami (=North Saami translation of Lemma)
Inari Saami (=Inari Saami translation of Lemma)
Note group [different notes of collaborators; those are needed during further editing of the dictionary entries; these notes are related to different meaning groups!)
Espen group [indices and notes related to Espen's audiofiles; these elements belong under "ID_Audio"]
Sampling group [indices which Micha needs in order to compile word lists, these elements belong under "ID_Audio"]

Copyright: free

 TODO: 
  - to be exported to XML (__Ciprian__)
  - to be transformed into a giellatekno-kompatile XML format (__Ciprian__)
  - ask for permission to use the Skolt data from Kimberly Maranainen
     (__Ciprian__)
  - merge data from Kimberly with the data from Micha (__Ciprian__)

SJE:

DESCRIPTION:
preliminary wordlist/dictionary xml file created from original filemaker database using only a poor sample of words NOT originating from Pite wordlist project (insamling av pitesamiska ord).
A list of field names and their intentions:
record number JW = unique record number from original FileMaker database
Pite Saami = sje lemma
English translation* = English translation of Pite Saami lemma 
Swedish translation* = Swedish translation of Pite Saami lemma 
part of speech = part of speech (individual values in Swedish!)
consonant gradation pattern = abstracted form of consonant gradation pattern (ex: bm-m)
umlaut pattern = abstracted form of umlaut pattern
stem extension = full form of possible stem extension (ex.: "bednaga" the stem extention form of lemma "bena" dog)
purely grammatical gloss = if relevant, this shows any grammatical information in individual lemma which may be relevant and deviates from elicitation form (ex: if noun is not in NOM.SG)
notes = notes from/for wordlist project
notes JW = notes specifically for joshua wilbur
original wordlist comments = original entry from original wordlist form wordlist project, if relevant
source = person(s) who entered lemma originally
*unfortunately no convention is followed concerning differentiating between synonyms or meaning groups etc.
!!Lehtiranta 

Copyright: free (unfortunately, the big collection of data is not
                 (yet) free)

TODO:
 - to transform sje-data into gt-format (__Ciprian__)
 - to collect a balanced (wrt both pos and initial letter) sample of
 50-100 words (possibly with audio files and (static) morphology) in
 order to compile demo dictionaries (Mac and StarDict) (collecting:
 __Josh__; transforming: __Ciprian__)

--------------------------------------------------------------

DONE:
 - Micha, Lena, Josh (re-)learned how to use svn
 - Micha, Lena, Josh learned how to compile dictionary on their own on
    the mac

TODO:
 - check the mac make-dict pipeline with all possible special
    characters in sjd, sms, sje (perhaps, here there is no problem):
    keyword "dog ps" in Kildin (__Ciprian__)