Corpus maintenance
This document keeps track of measures to improve the corpus collection and conversion process. See also the sentence alignment page, which covers that specific part of corpus maintenance.
Corpus improvement work
Folder structure etc.
- news: is it possible to move files from bound to free?
- science: we have sme/science files in both free and bound, with no clear division between them
- sma: a separate option for classical texts
Tasks
Where do we find texts
Parallel texts
- Suggestions for detecting (flaws in) parallel texts
- How to adjust the XSL conversion of different file formats to get the correct language
Meetings in the corpus improvement project
- 2019: 7.6.
- 2017: 3.3. // 25.4. // 6.9. // 5.10.
- 2016: 26.10. // 02.11. // 16.11. // 25.11.
- 2014: 12.3.
- 2012: 12.1. // 19.1. // 25.1. // 1.2. // 7.2. // 13.2. // 17.2. // 29.2. // 12.3. // 22.3. // 31.8.
- 2011: 7.4. // 11.4. // 3.5. // 27.6. // 12.9. // 21.9. // 12.10. // 7.11. // 11.11. // 25.11. // 28.11. // 8.12. // 14.12. // 20.12.
OCR and conversion errors left over from spring 2011
- OCR error overview, May 2011 (still open issues here)
- Conversion errors (open issues here?)