Referring to Concepts#

Searching using a Thesaurus must be:
  1. EXHAUSTIVE:
    • Taking advantage of all the specifics in the Thesaurus Hierarchies,
    • Augmented by words from Synonyms and Translations,
    • Augmented by equivalence relations between Thesauri,
  2. and PRECISE:
    • Helping the user to choose the right concepts in the hierarchies,
    • Using the code of the concepts (no ambiguities like those with usual word searches).

Furthermore:

  • Indexers should be able to communicate exact Concepts more easily than by typing a URI.
  • Where they are asked for a "value", indexers may want to write sentences using SKOS Concept references. With adequate "punctuation", they could express relations between different Concepts and Values (attributes).
  • The Semantic Web and Linked Data standards (RDF, RDFS, SKOS) should be supported to allow discovery/import from external sources and to allow distributed vocabularies.

Storing references in database fields:#

  • The stored value is the "about" of the Concept (authority entry): this ensures that the preferred terms, the synonyms and the translations can change without having to update the database records.
    • The "about" codes of Concepts are "hidden" in most places: they are dynamically translated into their preferred term in the user's language.
    • Ajax autocomplete is used for data field updates and for searches: the user types a few letters and proposals are made with terms (preferred or synonyms) in any language.
  • The "about" (code) for a Concept is very simple: it is the Scheme code (a SKOS Scheme is an Authority List), an underscore character ("_"), and the Concept code itself (letters, digits or underscores).
    • The identification of the Concept can be "relative" (a single Scheme is used to control the content of a field) or "absolute" (recommended: the vocabulary can evolve to allow values from foreign Schemes). In the absolute form the Scheme code is added in front, which gives:
      1. A Scheme is identified by a word (case sensitive) composed of letters and digits. For instance, language
      2. A Concept is identified ("relative", within a given Scheme) by a word that starts and ends with a letter or digit and is composed of letters, digits and underscores. For instance, fr_BE
      3. A Concept is identified ("absolute", out of the context of a Scheme) by the Scheme identifier, an underscore, and the Concept's "in context" identifier. For instance, language_fr_BE
  • The underscore is chosen because:
    • it is rarely used in real-world texts
    • it can tie parts together within a single word to be indexed by Lucene. Lucene tokenizers are slightly modified to leave words containing underscores untouched (this guarantees that codes are never stemmed!)
  • The usual encoding of SKOS Concepts (complete URIs) can be produced automatically for exchanges between computer applications (not involving humans).
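
The coding rules above can be sketched in a few lines of Python. This is only an illustration, not the ASKOSI implementation: the function names are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re
from typing import Tuple

# Assumed patterns, transcribed from the rules above (ASCII letters only).
SCHEME_RE = re.compile(r"[A-Za-z0-9]+")
# Relative Concept code: starts and ends with a letter or digit,
# with letters, digits and underscores in between.
CONCEPT_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9_]*[A-Za-z0-9])?")


def make_about(scheme: str, concept: str) -> str:
    """Build the absolute "about" from a Scheme code and a relative Concept code."""
    if not SCHEME_RE.fullmatch(scheme):
        raise ValueError(f"invalid Scheme code: {scheme!r}")
    if not CONCEPT_RE.fullmatch(concept):
        raise ValueError(f"invalid Concept code: {concept!r}")
    return f"{scheme}_{concept}"


def split_about(about: str) -> Tuple[str, str]:
    """Split an absolute "about" at its first underscore.

    The split is unambiguous because a Scheme code never contains an underscore.
    """
    scheme, sep, concept = about.partition("_")
    if not sep or not SCHEME_RE.fullmatch(scheme) or not CONCEPT_RE.fullmatch(concept):
        raise ValueError(f"invalid absolute code: {about!r}")
    return scheme, concept
```

For instance, make_about("language", "fr_BE") gives language_fr_BE, and split_about("language_fr_BE") gives back the pair ("language", "fr_BE").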
More precisely, a Concept reference can be composed of:
  • an optional prefix which specifies the ROLE (for example: does the following person act as an author? a composer? an illustrator?).
    This prefix code ends with an underscore ("_").
    • When displayed, prefixes are translated depending on the user's language.
    • When updated, allowed prefixes are presented in a menu.
  • one or more spaces
  • an "entry code" with:
    • a "scheme code" (authority list identifier) followed by an underscore;
      The scheme code can be a "notation list identifier" to use an alternate coding scheme for a given list (CAS and PubChemId for instance)
    • the entry code within the scheme
  • an optional suffix which specifies the QUANTITY (for example: is the preceding person the main author?), the weight or the confidence level.
    This suffix code begins with some spacing and an underscore ("_").
    • When displayed, suffixes are translated depending on the user's language.
    • When updated, allowed suffixes are presented in a menu.
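
As a rough sketch, the grammar above (optional prefix, entry code, optional suffix) could be recognised with one regular expression. The names ConceptReference and parse_reference are hypothetical, and the character set allowed in prefixes and suffixes (ASCII letters and digits) is an assumption, since the text does not specify it.

```python
import re
from typing import NamedTuple, Optional


class ConceptReference(NamedTuple):
    prefix: Optional[str]  # ROLE, written "role_" before the entry code
    entry: str             # entry code, relative or absolute
    suffix: Optional[str]  # QUANTITY / weight, written "_value" after the entry code


# Assumption: prefix and suffix codes are made of ASCII letters and digits.
REFERENCE_RE = re.compile(
    r"(?:(?P<prefix>[A-Za-z0-9]+)_\s+)?"                      # optional "prefix_ "
    r"(?P<entry>[A-Za-z0-9](?:[A-Za-z0-9_]*[A-Za-z0-9])?)"    # entry code
    r"(?:\s+_(?P<suffix>[A-Za-z0-9]+))?"                      # optional " _suffix"
)


def parse_reference(text: str) -> ConceptReference:
    """Parse one Concept reference such as 'illustrator_ person_1234 _top'."""
    match = REFERENCE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a Concept reference: {text!r}")
    return ConceptReference(match["prefix"], match["entry"], match["suffix"])
```

With this sketch, parse_reference("illustrator_ person_1234 _top") yields the prefix illustrator, the entry code person_1234 and the suffix top; a bare entry code such as person_1234 parses with no prefix and no suffix.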

Examples:

  • language_fr_BE means the French language as written in Belgium.
  • illustrator_ person_1234 _top means "this contributor is the person with code 1234, acts as an illustrator and is the "top" contributor".
  • This design makes it easy to embed "concept references" in the text fields of many software packages: DSpace, JSPWiki, Solr+Blacklight, etc.
  • This simple syntax/grammar should make it possible to write adequate indexing sentences (like, for instance, the PRECIS indexing system explained by Barbara H. Kwaśnik).
  • Concept References and Authority control are VERY important for integration within the Semantic Web (the so-called "Web 3.0"): you leave basic "text search" and enter a world of structured relations...

More examples:#

In http://www.WindMusic.org:

  1. to denote that a recording features a flute as a solo instrument, we can write (in the context of the scheme Instruments): solo_ flute
  2. to denote two trumpets: trumpet _2
  3. to specify the role of an author: illustrator_ person_123456 (123456 being a numerical code for a given person)

In the Belgium Poison Centre:

  1. to denote an article about a given substance: thSubstances_1234567 or CAS_10_820_30 (CAS is a Notation, not a Scheme) or about a plant: thPlants_7654321 (references out of the context of a given Scheme and using a numerical ID).
  2. to denote a serial: journal_123456 or ISSN_1234_4321 (ISSN is a Notation, not a Scheme)

Indexing text or data containing references#

  • Lucene, a widely used open source search engine, is used in our application. The basic principle (one reference = one word = one index token) could be implemented with other search engines as well.
  • The "about" (code for a Concept) is always indexed.
    • The coding of each Concept "about" has been chosen so that Lucene treats it as one single word: this guarantees the PRECISION of searches.
      • The Lucene "pluggable" tokenizers are modified to accept words containing an underscore and leave them untouched.
    • The "notations" (codes for a Concept in alternative coding schemes) could be added as a configurable option.
  • The indexed values are all the preferred and alternative labels for a Concept: this makes full-text retrieval more exhaustive.
    • Hidden labels could be added as a configurable option.
  • Indexes can be defined to recursively include all the Broaders of a Concept: this ensures that searches on a generic concept (when the ConceptScheme is arranged along a hierarchy, as in a thesaurus) retrieve all the specifics.
  • ASKOSI keeps a count of the uses of each Concept: those counts are shown in displays to give users clues about how much data is linked to each Concept, and they also support horizontal searches whenever possible.
  • Each reference to a Concept can be enriched by a prefix (to specify, for instance, a "role") and a suffix (to specify, for instance, a quantity or a level of importance): Lucene proximity search then makes it possible to find the uses of a Concept with a given role and/or quantity.
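
The expansion of one reference into its indexed terms (code, labels, and all Broaders) can be sketched as follows. The vocabulary here is invented purely for illustration: the codes, labels and hierarchy are not taken from any real scheme, and index_terms is a hypothetical name rather than an ASKOSI function.

```python
from typing import Dict, List, Set

# Hypothetical in-memory vocabulary; codes, labels and hierarchy are invented.
LABELS: Dict[str, List[str]] = {
    "instruments_flute": ["flute", "flauta"],
    "instruments_woodwind": ["woodwind"],
    "instruments_wind": ["wind instrument"],
}
BROADER: Dict[str, List[str]] = {
    "instruments_flute": ["instruments_woodwind"],
    "instruments_woodwind": ["instruments_wind"],
}


def all_broaders(about: str) -> Set[str]:
    """Collect every ancestor of a Concept by following "broader" links transitively."""
    seen: Set[str] = set()
    stack = list(BROADER.get(about, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # guards against cycles in the hierarchy
            seen.add(current)
            stack.extend(BROADER.get(current, []))
    return seen


def index_terms(about: str) -> Set[str]:
    """Terms to index for one reference: its code, its labels, its Broaders' codes."""
    terms = {about, *LABELS.get(about, [])}
    terms |= all_broaders(about)
    return terms
```

Because the codes of the Broaders are indexed along with the Concept itself, a search on the generic code instruments_wind in this toy vocabulary also retrieves records indexed with instruments_flute.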
In the above example (illustrator_ person_1234 _top), Lucene can answer several kinds of queries. A search can be made for:
  • "illustrator_ person_1234 _top"
  • "illustrator_ person_1234"
  • "person_1234 _top"

    • A search can also be made on any synonym, translation or broader of "person_1234".
    • An EOL distance is specified to Lucene to separate occurrences of the same field, so that a proximity search does not match across two occurrences.
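
The behaviour described above can be imitated in plain Python, without Lucene, to show the principle. This is a simplified sketch under assumptions: the tokenizer keeps any word containing an underscore untouched and lowercases the others (a stand-in for normal text analysis), and POSITION_GAP is a hypothetical value playing the role of the EOL distance.

```python
import re
from typing import Dict, List, Set, Tuple

TOKEN_RE = re.compile(r"\w+")  # \w includes the underscore
POSITION_GAP = 100             # hypothetical distance between field occurrences


def tokenize(text: str) -> List[str]:
    """Leave words containing an underscore untouched; lowercase the others."""
    return [w if "_" in w else w.lower() for w in TOKEN_RE.findall(text)]


def index_field(occurrences: List[str]) -> List[Tuple[int, str]]:
    """Assign token positions, leaving a gap between successive occurrences."""
    postings: List[Tuple[int, str]] = []
    base = 0
    for occurrence in occurrences:
        tokens = tokenize(occurrence)
        for offset, token in enumerate(tokens):
            postings.append((base + offset, token))
        base += len(tokens) + POSITION_GAP
    return postings


def phrase_match(postings: List[Tuple[int, str]], phrase: str) -> bool:
    """True if the phrase tokens occur at consecutive positions."""
    wanted = tokenize(phrase)
    if not wanted:
        return False
    positions: Dict[str, Set[int]] = {}
    for position, token in postings:
        positions.setdefault(token, set()).add(position)
    return any(
        all(start + i in positions.get(token, set()) for i, token in enumerate(wanted))
        for start in positions.get(wanted[0], set())
    )
```

Indexing the two occurrences "illustrator_ person_1234 _top" and "author_ person_5678" with index_field, the three phrase queries listed above all match, while "_top author_" does not, because the gap keeps the two occurrences apart.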

Conclusion#

  • A mapping mechanism may be used to map this encoding to regular SKOS Scheme and Concept URIs, but this encoding frees humans from typing or seeing "http://..."
  • To go further and integrate this in a wiki like JSPWiki, this notation may have to be extended with capabilities equivalent to those of Semantic MediaWiki (relating a part of the wiki page text as an attribute of the relation between the wiki page and a concept).
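
One possible shape for such a mapping mechanism is sketched below. It is an assumption, not the actual ASKOSI mapping: the base URI under example.org and the table SCHEME_BASE are placeholders, and the URI layout (Scheme URI, a slash, the Concept code) is only one of many conventions.

```python
from urllib.parse import quote, unquote

# Hypothetical table: Scheme code -> Scheme URI (placeholder URIs).
SCHEME_BASE = {
    "language": "http://example.org/vocab/language",
}


def about_to_uri(about: str) -> str:
    """Map an absolute code such as language_fr_BE to a Concept URI."""
    scheme, sep, concept = about.partition("_")
    if not sep or not concept:
        raise ValueError(f"not an absolute code: {about!r}")
    if scheme not in SCHEME_BASE:
        raise ValueError(f"unknown Scheme: {scheme!r}")
    return f"{SCHEME_BASE[scheme]}/{quote(concept)}"


def uri_to_about(uri: str) -> str:
    """Map a Concept URI back to its short absolute code."""
    for scheme, base in SCHEME_BASE.items():
        prefix = base + "/"
        if uri.startswith(prefix):
            return f"{scheme}_{unquote(uri[len(prefix):])}"
    raise ValueError(f"no known Scheme for URI: {uri!r}")
```

Under these assumptions, language_fr_BE maps to http://example.org/vocab/language/fr_BE and back, so the complete URIs are only needed for exchanges between applications.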

This page (revision-1) was last changed on 16 August 2010 09:40 by (unknown author)