GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

Wordform generation and analysis details (Paradigms)

Obsolete: The contents of this document may be obsolete (Feb 2024).

The Paradigms section lists, for each part of speech (in all caps), the tag strings (without the part of speech itself) for which forms will be generated. Each list should be the maximal set of forms generated for that part of speech. If the set must be altered or reduced for some lexical class (e.g., singular only for proper nouns, or third person only for certain weather verbs), the rules for doing so must be defined elsewhere.

If forms are displayed but pregenerated by some other rule, the part of speech must still have at least one entry here. The tags in that entry are passed to the pregenerating functions and ignored, but if the part of speech is not listed here at all, that hand-off will not happen.

Paradigms:
  olo:
    PRON:
      - "Pregenerate"
    N:
      - "Sg+Par"
      - "Sg+Apr"
      - "Sg+Gen"
      - "Pl+Par"
    V:
      - "Ind+Prs+ScSg1"
      - "Ind+Prs+ScSg3"
      - "Ind+Prs+ScPl3"
      - "Ind+Prt+ScSg1"
  liv:
    PRON:
      - "Pregenerate"
    N:
      - "Sg+Nom"
      - "Sg+Gen"
      - "Sg+Dat"
      - ... etc.

In the above example, “Pregenerate” is completely arbitrary and serves no programmatic function; what matters is that “PRON” is set.
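
As a rough illustration of how such a paradigm list might be consumed, the following Python sketch expands a lemma and a part of speech into one generation string per configured tag form. The `generation_strings` helper, the `+` separator between lemma, part of speech, and tags, the in-memory `PARADIGMS` dictionary, and the lemma `kala` are all assumptions made for this example; none of them is taken from the actual implementation.

```python
# Hypothetical in-memory copy of part of the Paradigms section above.
PARADIGMS = {
    "olo": {
        "PRON": ["Pregenerate"],
        "N": ["Sg+Par", "Sg+Apr", "Sg+Gen", "Pl+Par"],
    },
}


def generation_strings(lang, lemma, pos, paradigms=PARADIGMS):
    """Return one 'lemma+POS+tags' string per configured tag form.

    An unlisted part of speech yields an empty list, mirroring the rule
    that nothing is generated unless the part of speech is set.
    The '+' joining format is an assumption of this sketch.
    """
    tag_forms = paradigms.get(lang, {}).get(pos.upper(), [])
    return [f"{lemma}+{pos}+{tags}" for tags in tag_forms]


print(generation_strings("olo", "kala", "N"))
# ['kala+N+Sg+Par', 'kala+N+Sg+Apr', 'kala+N+Sg+Gen', 'kala+N+Pl+Par']
```

Under these assumptions, asking for a part of speech with no entry (for example `generation_strings("olo", "kala", "ADV")`) simply returns an empty list.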

Tag definitions (TagSets, TagTransforms)

Unfortunately, it is not yet easy to use the Babel and gettext translation system to define what is displayed to users, so these labels are defined in YAML instead.

TagTransforms is a dictionary keyed by language pair, where each entry contains string pairs. Each string pair consists of a tag chunk from the output of a morphological tool (without the tag separator), followed by the string that will be displayed to users.

Each language pair consists of the source language of the dictionary or morphological tool, followed by the language of the user interface. The formatting of this pair definition is important (see below). If tags for the dictionary source language - user interface language pair are not available, the tags for the dictionary source language - dictionary target language pair are displayed instead. Aliases may also be useful here; see existing config files for examples.

TagTransforms:
  (olo, rus):
    "V": "v."
    "N": "s."
    "A": "adj."


  (liv, rus):
    "V": "v."
    "N": "s."
    "A": "adj."

NOTE: The parentheses, comma, and space are significant in the language pair definition. Quotes around the tag chunks on the left side are optional, but recommended to avoid any potential problem with conversion to strings.
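
The lookup and fallback described above can be sketched in Python as follows. This is a minimal illustration, not the project's code: `parse_pair`, `display_tag`, and the `TAG_TRANSFORMS` dictionary are hypothetical names. Applying the fallback per tag and returning the raw tag when no label is found are also assumptions of this sketch.

```python
import re

# Hypothetical in-memory copy of the TagTransforms section above; keys keep
# the "(source, target)" spelling used in the YAML file.
TAG_TRANSFORMS = {
    "(olo, rus)": {"V": "v.", "N": "s.", "A": "adj."},
    "(liv, rus)": {"V": "v.", "N": "s.", "A": "adj."},
}

# Parentheses, comma, and the single space are all required.
PAIR_KEY = re.compile(r"^\((\w+), (\w+)\)$")


def parse_pair(key):
    """Split a '(src, tgt)' key into a (src, tgt) tuple."""
    match = PAIR_KEY.match(key)
    if match is None:
        raise ValueError(f"malformed language pair: {key!r}")
    return match.group(1), match.group(2)


def display_tag(tag, source, ui_lang, dict_target, transforms=TAG_TRANSFORMS):
    """Look up the user-facing label for one tag chunk.

    Tries the (source, UI language) pair first, then falls back to the
    (source, dictionary target) pair. Returning the raw tag when neither
    pair has a label is an assumption of this sketch.
    """
    by_pair = {parse_pair(key): labels for key, labels in transforms.items()}
    for pair in ((source, ui_lang), (source, dict_target)):
        labels = by_pair.get(pair, {})
        if tag in labels:
            return labels[tag]
    return tag


# No (olo, fin) entry exists, so the (olo, rus) label is used instead.
print(display_tag("N", "olo", "fin", "rus"))  # s.
```

In this sketch a key written without the space, such as `"(olo,rus)"`, is rejected with a `ValueError`, which is one way to surface the formatting requirement early.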

TagSets

TagSets are not particularly relevant within the configuration file itself; they are meant as an aid for writing language-specific rules (see elsewhere in the documentation, or the docstrings for now). TagSets are grouped by the language they apply to; each tagset then consists of a name and the list of tags that belong to the set.

TagSets:
  sme:
    pos: ["N", "V", "A", "Pr", "Po", "Num"]
    type: ["NomAg", "G3", "aktor"]
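
To suggest how a language-specific rule might make use of such sets, here is a small Python sketch that sorts the tags of one analysis into the named tagsets. The `classify_tags` function and the `TAG_SETS` dictionary are hypothetical and exist only for this example; the real rule helpers are described in the docstrings.

```python
# Hypothetical in-memory copy of the TagSets section above.
TAG_SETS = {
    "sme": {
        "pos": ["N", "V", "A", "Pr", "Po", "Num"],
        "type": ["NomAg", "G3", "aktor"],
    },
}


def classify_tags(lang, analysis_tags, tag_sets=TAG_SETS):
    """Map each tagset name to the first analysis tag that belongs to it.

    Tagsets with no matching tag are left out of the result.
    """
    result = {}
    for name, members in tag_sets.get(lang, {}).items():
        for tag in analysis_tags:
            if tag in members:
                result[name] = tag
                break
    return result


print(classify_tags("sme", ["N", "NomAg", "Sg", "Nom"]))
# {'pos': 'N', 'type': 'NomAg'}
```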

Multiword lookups

The reader may be configured to allow multiword environments, so that each click expands the word selection to include surrounding material. This operation respects word boundaries only and performs no linguistic computation on the client side. It also results in more data being sent to the server.
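
The expansion step can be pictured with the following Python sketch, which grows a character range by one whole word on each side per call. It is only an illustration of the idea: the reader's client-side code is not shown in this document, and `expand_selection`, its parameters, and the `\w+` definition of a word are assumptions.

```python
import re

# Assumption: a "word" is any run of word characters.
WORD = re.compile(r"\w+")


def expand_selection(text, start, end, words_each_side=1):
    """Grow text[start:end] to include neighbouring whole words.

    Only word boundaries are consulted; no linguistic analysis is done.
    Returns the new (start, end) character offsets.
    """
    spans = [m.span() for m in WORD.finditer(text)]
    # Indices of the words that overlap the current selection.
    touched = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
    if not touched:
        return start, end
    first = max(touched[0] - words_each_side, 0)
    last = min(touched[-1] + words_each_side, len(spans) - 1)
    return spans[first][0], spans[last][1]


text = "one two three four five"
start, end = expand_selection(text, 8, 13)  # initial selection: "three"
print(text[start:end])  # two three four
```

Calling the function again on the returned range widens the selection by another word on each side, which is also why each click sends more text to the server.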