Mansi NLP Grammar

Finite state and Constraint Grammar based analysers, proofing tools and other resources

View the project on GitHub giellalt/lang-mns

mns meeting June 2nd, 2023

Present: Csilla, Jack, Trond.

Agenda

Status

fst

The incoming folder

New files in src/fst/incoming/. Jack and Csilla have gone through the list. Status:

a file

Ӣльям    N
ва̄ӈкве    V

b file

А̄лям    N    А̄ля
нама̄ныл    N    нам
Дуня̄гум-Света̄гум    N    Дуня Света
ащирма̄ныл    N    ащирма
костера̄нылт    N    костер
ла̄тӈынувл    N    ла̄тыӈ
е̄мтнэ̄ныл    V    е̄мтэ̄ныл
йӣквыт    V    йӣквуӈкве
лаквылтамыт    V    лаквылтаӈкве
пантхатамыт    V    пантхатуӈкве

Problem: Lative case has not been working for one of the lexica.

Verbs

Csilla has been looking at verb morphology. Our morphology does not have all morphophonological patterns.

See and edit the document docs/participles.md in Macdown.app or in Subethaedit. If in see, do cmd-R to see what it looks like.

The duplication forms

Verbs are duplicated with hyphen: sleeps-dreams where the two verbs form a compound wordform with the same morphosyntactic description. Jack has looked at that.

Testing

text coverage

We recognise 82.2 % of the test/data/Olvasmanyok.txt.

yaml files

make check

Missing lemmas: prop 0, nouns 0, verbs 3 (moӯнлуӈкве, ӯнттуӈкве, ва̄нталаӈкве)

Corpus

Csilla has worked on her LS corpus.

corpus-mns <=== missing converted open files corpus-mns-orig = open original files:
corpus-mns-orig-x-closed = closed files ls_yyy.pdf (original), ls_yyy.pdf.xsl (meta) corpus-mns-x-closed = converted closed files ls_yyy.pdf.xml

In git:

corpus-mns-orig:
wikipedia (for some reason not converted)

corpus-mns-orig-x-closed:
news = 1101989 words (including p r o b l e m a t i c words)

It contains 180 files. The files may be read with the command
`ccat -l mns corpus-mns-x-closed/converted/`


## Priorities onwards

### fst

#### Verbs

**Csilla** to check in content to `docs/participles.md`

Problematic forms:
    

йӣквыт V йӣквуӈкве лаквылтамыт V лаквылтаӈкве пантхатамыт V пантхатуӈкве


This stems from European views on participles taking persons as suffixes. It seems the Gerund can also take personal suffixes. The Russian tradition called person-inflected gerunds participles.

**Csilla** to  add this to the participle document.

The yaml file now says:

Active, present tense, indicative, subjective conjugation йиӈкве+V+Act+Ind+Prs+Sg1: ювум йиӈкве+V+Act+Ind+Prs+Sg2: ювын йиӈкве+V+Act+Ind+Prs+Sg3: юв йиӈкве+V+Act+Ind+Prs+Du1: ювме̄н йиӈкве+V+Act+Ind+Prs+Du2: [ювын, ювы̄н] йиӈкве+V+Act+Ind+Prs+Du3: ювыг йиӈкве+V+Act+Ind+Prs+Pl1: [ювув, ювӯв] йиӈкве+V+Act+Ind+Prs+Pl2: [ювын, ювы̄н] йиӈкве+V+Act+Ind+Prs+Pl3: ювыт

Active, past tense, indicative, subjective conjugation йиӈкве+V+Act+Ind+Prt+Sg1: йисум йиӈкве+V+Act+Ind+Prt+Sg2: йисын йиӈкве+V+Act+Ind+Prt+Sg3: йис йиӈкве+V+Act+Ind+Prt+Du1: йисуме̄н йиӈкве+V+Act+Ind+Prt+Du2: [йисын, йисы̄н] йиӈкве+V+Act+Ind+Prt+Du3: йисыг йиӈкве+V+Act+Ind+Prt+Pl1: [йисув, йисӯв] йиӈкве+V+Act+Ind+Prt+Pl2: [йисын, йисы̄н] йиӈкве+V+Act+Ind+Prt+Pl3: йисыт

Active, conditional, subjective conjugation йиӈкве+V+Act+Cond+Sg1: [йинувум, йинӯвум] йиӈкве+V+Act+Cond+Sg2: [йинувын, йинӯвын] йиӈкве+V+Act+Cond+Sg3: йинув йиӈкве+V+Act+Cond+Du1: йинуваме̄н йиӈкве+V+Act+Cond+Du2: [йинувын, йинувы̄н] йиӈкве+V+Act+Cond+Du3: [йинувыг, йинувы̄г] йиӈкве+V+Act+Cond+Pl1: [йинувув, йинӯвув] йиӈкве+V+Act+Cond+Pl2: [йинувын, йинувы̄н] йиӈкве+V+Act+Cond+Pl3: [йинувыт, йинувы̄т]

Active, imeperative, subjective conjugation йиӈкве+V+Act+Imp+Sg2: яен йиӈкве+V+Act+Imp+Du2: [яен, яе̄н] йиӈкве+V+Act+Imp+Pl2: [яен, яе̄н]

Passive, present tense, indicative йиӈкве+V+Pass+Ind+Prs+Sg1: йивем йиӈкве+V+Pass+Ind+Prs+Sg2: йивен йиӈкве+V+Pass+Ind+Prs+Sg3: йиве йиӈкве+V+Pass+Ind+Prs+Du1: йивеме̄н йиӈкве+V+Pass+Ind+Prs+Du2: [йивен, йиве̄н] йиӈкве+V+Pass+Ind+Prs+Du3: йиве̄г йиӈкве+V+Pass+Ind+Prs+Pl1: йиве̄в йиӈкве+V+Pass+Ind+Prs+Pl2: [йивен, йиве̄н] йиӈкве+V+Pass+Ind+Prs+Pl3: йивет

Passive, past tense, indicative йиӈкве+V+Pass+Ind+Prt+Sg1: яйве̄сум йиӈкве+V+Pass+Ind+Prt+Sg2: яйве̄сын йиӈкве+V+Pass+Ind+Prt+Sg3: [яйвес, яйве̄с] йиӈкве+V+Pass+Ind+Prt+Du1: яйвесеме̄н йиӈкве+V+Pass+Ind+Prt+Du2: [яйвесын, яйвесы̄н] йиӈкве+V+Pass+Ind+Prt+Du3: яйвесы̄г йиӈкве+V+Pass+Ind+Prt+Pl1: [яйвесув, яйвесӯв] йиӈкве+V+Pass+Ind+Prt+Pl2: [яйвесын, яйвесы̄н] йиӈкве+V+Pass+Ind+Prt+Pl3: яйве̄сыт

Passive, conditional йиӈкве+V+Pass+Cond+Sg1: [йинувем, йинӯвем] йиӈкве+V+Pass+Cond+Sg2: [йинувен, йинӯвен] йиӈкве+V+Pass+Cond+Sg3: [йинуве, йинӯве] йиӈкве+V+Pass+Cond+Du1: йинувеме̄н йиӈкве+V+Pass+Cond+Du2: [йинувен, йинуве̄н] йиӈкве+V+Pass+Cond+Du3: [йинувег, йинуве̄г] йиӈкве+V+Pass+Cond+Pl1: йинуве̄в йиӈкве+V+Pass+Cond+Pl2: [йинувен, йинуве̄н] йиӈкве+V+Pass+Cond+Pl3: [йинувет, йинӯвет]

Passive, imeperative conjugation йиӈкве+V+Act+Imp+Sg2: йивен, йиен йиӈкве+V+Act+Imp+Du2: [йивен, йиве̄н, йиен, йие̄н] йиӈкве+V+Act+Imp+Pl2: [йивен, йиве̄н, йиен, йие̄н]

#Present participle йиӈкве+V+PrsPrc: йинэ йиӈкве+V+PrsPrc+PxSg1: йинэм йиӈкве+V+PrsPrc+PxSg2: йинэн йиӈкве+V+PrsPrc+PxSg3: йинэ̄тэ йиӈкве+V+PrsPrc+PxDu1: йинэме̄н йиӈкве+V+PrsPrc+PxDu2: йинэнэ̄н йиӈкве+V+PrsPrc+PxDu3: [йинэг, йинэ̄г] йиӈкве+V+PrsPrc+PxPl1: йинэ̄в йиӈкве+V+PrsPrc+PxPl2: [йинэн, йинэ̄н] йиӈкве+V+PrsPrc+PxPl3: [йинэныл, йинэ̄ныл]

#Past participle йиӈкве+V+PrfPrc: йим йиӈкве+V+PrfPrc+PxSg1: йимум йиӈкве+V+PrfPrc+PxSg2: йимын йиӈкве+V+PrfPrc+PxSg3: йиме йиӈкве+V+PrfPrc+PxDu1: йимме̄н йиӈкве+V+PrfPrc+PxDu2: йимын, йимы̄н йиӈкве+V+PrfPrc+PxDu3: [йимег, йиме̄г] йиӈкве+V+PrfPrc+PxPl1: [йимув, йимӯв] йиӈкве+V+PrfPrc+PxPl2: [йимын, йимы̄н] йиӈкве+V+PrfPrc+PxPl3: [йимыт]

#Gerund йиӈкве+V+Ger: йийим йиӈкве+V+Ger+PxSg1: йийимам йиӈкве+V+Ger+PxSg2: йийиман йиӈкве+V+Ger+PxSg3: [йийима, йийиматэ] йиӈкве+V+Ger+PxDu1: йийимаме̄н йиӈкве+V+Ger+PxDu2: йийиман йиӈкве+V+Ger+PxDu3: [йийимаг, йийимаг] йиӈкве+V+Ger+PxPl1: йийимав йиӈкве+V+Ger+PxPl2: йийиман йиӈкве+V+Ger+PxPl3: йийимат

#Privative йиӈкве+V+Ger: йимтал


Make sure the *sleeps-dreams* issue works (**Jack** perhaps asking Sjur)

#### Nouns

**Jack** to look at lative (with Csilla?)

#### Lexicon

New `Olvasmanyok.txt` missing list: `test/data/Olvasmanyok.missing6.freq`

Progress since last time:
    
 627 test/data/Olvasmanyok.missing5.freq
 434 test/data/Olvasmanyok.missing6.freq ```

A dominating category among the 434 is typos (vowel length) or inflectional errors. Some are missing from the lexicon, some are old orthography.

Procedure:

Problematic words in src/fst/incoming/from-oppikirja.230504b.lexc:

нама̄ныл N       нам
Дуня̄гум-Света̄гум        N       Дуня Света
ащирма̄ныл       N       ащирма
костера̄нылт     N       костер

Problem: in Luima Seripos these words do not have long vowels.

The length is in the morphology. We do not want them.

TODO: Csilla to find out whether the error has a Mansi source.

corpus

Working corpus

The corpus- folders

We need to evaluate the conversion and look at ways to improve it. The last option would be to take Csilla’s manually corrected (0.5m) corpus as original files.

Planning

Milestones

We return to that in the next meeting.

mns in Korp

The goal is to have msn in Korp. Let us make an analyser first.

Administrative issues

Work routines

Csilla needs more practice checking in information, e.g., yaml-test (this, of course, means writing on Mac to avoid DOS finger print),

Weekly meetings

Default time for (not too long) weekly meetings: Friday at 11 Finnish time

Next meeting

Friday, June 9th at 11 Finnish time.