Referring to Concepts#
Searching using a Thesaurus must be:
- EXHAUSTIVE:
- Taking advantage of all the specifics in the Thesaurus Hierarchies,
- Augmented by words from Synonyms and Translations,
- Augmented by equivalence relations between Thesauri,
- and PRECISE:
- Helping the user to choose the right concepts in the hierarchies,
- Using the code of the concepts (no ambiguities like those with usual word searches).
Furthermore:
- Indexers should be allowed to communicate exact Concepts more easily than by typing a URI.
- Where they are asked for a "value", indexers may want to write sentences using SKOS Concept references. With adequate "punctuation", they could express relations between different Concepts and Values (attributes).
- The Semantic Web and Linked Data standards (RDF, RDFS, SKOS) should be supported to allow discovery/import from external sources and to allow distributed vocabularies.
Storing references in database fields:#
- the stored value is the "about" of the Concept (authority entry): this ensures that the preferred terms, the synonyms, the translations can change without having to update the database records.
- The Concepts' "about" codes are "hidden" in most places: they are dynamically translated into their preferred term in the user's language.
- Ajax Autocomplete is used for data field updates and for searches: the user types a few letters and proposals are made with terms (preferred or synonyms) in any language.
- the "about" (code) for a Concept is very simple: it is the Scheme code (a SKOS Scheme is an Authority List), an underscore character ("_"), and the Concept code itself (letters, digits or underscores).
- The identification of the concept can be relative (a single Scheme is used to control the content of a field) or "absolute" (recommended: the vocabulary can evolve to allow values from foreign schemes). In the absolute form, the code of the Scheme is added in front, which gives:
- a Scheme is identified by a word (case sensitive) composed of letters and digits. For instance, language
- a Concept is identified ("relative", within a given Scheme) by a word starting and ending with a letter or digit and composed of letters, digits and underscores. For instance, fr_BE
- a Concept is identified ("absolute", out of the context of a Scheme) by the Scheme identifier, an underscore, and the Concept's "in context" identifier. For instance, language_fr_BE
- The underscore is chosen because:
- it is rarely used in real-world texts
- it can tie parts together within a single word to be indexed by Lucene. Lucene tokenizers are slightly modified to leave words containing underscores untouched (this guarantees that codes are never stemmed!)
- The usual encoding of SKOS Concepts (complete URIs) can be produced automatically for exchanges between computer applications (not involving humans)
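The identifier rules above can be sketched in a few lines of Python. This is only an illustration, not the ASKOSI implementation: the function names are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re
from typing import Tuple

# Scheme code: letters and digits only, case sensitive (ASCII assumed).
SCHEME_RE = re.compile(r"[A-Za-z0-9]+")
# Relative Concept code: starts and ends with a letter or digit,
# with letters, digits and underscores in between.
CONCEPT_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9_]*[A-Za-z0-9])?")


def make_about(scheme: str, concept: str) -> str:
    """Build the absolute "about" from a Scheme code and a relative Concept code."""
    if not SCHEME_RE.fullmatch(scheme):
        raise ValueError(f"invalid Scheme code: {scheme!r}")
    if not CONCEPT_RE.fullmatch(concept):
        raise ValueError(f"invalid Concept code: {concept!r}")
    return f"{scheme}_{concept}"


def split_about(about: str) -> Tuple[str, str]:
    """Split an absolute "about" at its first underscore.

    The split is unambiguous because a Scheme code never contains an underscore.
    """
    scheme, separator, concept = about.partition("_")
    if not separator:
        raise ValueError(f"not an absolute reference: {about!r}")
    make_about(scheme, concept)  # re-use the validation above
    return scheme, concept


print(make_about("language", "fr_BE"))
print(split_about("language_fr_BE"))
```

Because only the Concept part may contain underscores, the first underscore always marks the end of the Scheme code.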
More precisely, a Concept reference can be composed of:
- an optional prefix which specifies the ROLE (example: does the following person act as an author? a composer? an illustrator?).
This prefix code ends with an underscore ("_").
- When displayed, prefixes are translated depending on the user's language.
- When updated, allowed prefixes are presented in a menu.
- one or more spaces
- an "entry code" with:
- a "scheme code" (authority list identifier) followed by an underscore;
The scheme code can be a "notation list identifier" to use an alternate coding scheme for a given list (CAS and PubChemId for instance)
- the entry code within the scheme
- an optional suffix which specifies the QUANTITY (example: is the preceding person the main author?), the weight or the confidence level.
This suffix code begins with some spacing and an underscore ("_").
- When displayed, suffixes are translated depending on the user's language.
- When updated, allowed suffixes are presented in a menu.
Examples:
- language_fr_BE means the French language as written in Belgium.
- illustrator_ person_1234 _top means "this contributor is the person with code 1234, acts as an illustrator and is the "top" contributor".
- This design makes it easy to embed "concept references" in the text fields of many software packages: DSpace, JSPWiki, SolR+BlackLight, etc.
- This simple syntax/grammar should make it possible to write adequate indexing sentences.
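As a rough illustration of the grammar described above (optional role prefix, entry code, optional suffix), a reference could be parsed as follows. The sketch makes several assumptions that the text does not state: prefix and suffix codes are taken to be made of ASCII letters and digits, only "absolute" entry codes (scheme, underscore, code) are handled, and the names ConceptReference and parse_reference are invented for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed grammar: [role_ <spaces>] scheme_code [<spaces> _suffix]
REFERENCE_RE = re.compile(
    r"(?:(?P<role>[A-Za-z0-9]+)_\s+)?"                      # optional prefix, ends with "_"
    r"(?P<scheme>[A-Za-z0-9]+)_"                            # scheme code and underscore
    r"(?P<code>[A-Za-z0-9](?:[A-Za-z0-9_]*[A-Za-z0-9])?)"   # entry code within the scheme
    r"(?:\s+_(?P<suffix>[A-Za-z0-9]+))?"                    # optional suffix, starts with "_"
)


@dataclass(frozen=True)
class ConceptReference:
    role: Optional[str]
    scheme: str
    code: str
    suffix: Optional[str]

    @property
    def about(self) -> str:
        """The stored "about": scheme code, underscore, entry code."""
        return f"{self.scheme}_{self.code}"


def parse_reference(text: str) -> ConceptReference:
    """Parse one reference such as 'illustrator_ person_1234 _top'."""
    match = REFERENCE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a concept reference: {text!r}")
    return ConceptReference(
        match.group("role"),
        match.group("scheme"),
        match.group("code"),
        match.group("suffix"),
    )


reference = parse_reference("illustrator_ person_1234 _top")
print(reference.role, reference.about, reference.suffix)
```

Relative entry codes (a bare code checked against the Scheme that controls the field) would need the field's Scheme as an extra input, which this sketch leaves out.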


- Concept References and Authority control are VERY important for integration within the Semantic Web (the so-called "Web 3.0"): you leave basic "text search" behind and enter a world of structured relations.
More examples:#
In http://www.WindMusic.org:
- to denote that a recording uses a flute as a solo instrument, we can write (in the context of scheme Instruments): solo_ flute
- to denote two trumpets: trumpet _2
- to specify the role of an author: illustrator_ person_123456 (123456 being a numerical code for a given person)
In the Belgium Poison Centre:
- to denote an article about a given substance: thSubstances_1234567 or CAS_10_820_30 (CAS is a Notation, not a Scheme) or about a plant: thPlants_7654321 (references out of the context of a given Scheme and using a numerical ID).
- to denote a serial: journal_123456 or ISSN_1234_4321 (ISSN is a Notation, not a Scheme)
Indexing text or data containing references#
- Being the major open source search engine, Lucene is used in our application. The basic principle (one reference = one word = one indexation token) could be parameterised with other search engines as well.
- the "about" (code for a Concept) is always indexed.
- The coding for each Concept "about" has been chosen so that Lucene treats it as one single word: this guarantees the PRECISION of searches.
- The Lucene "pluggable" tokenizers are modified to accept and leave untouched words containing an underscore
- the "notations" (code for a Concept in alternative coding schemes) could be added as a parameterised option.
- the indexed values are all the preferred and alternative labels for a Concept: this ensures that full-text retrieval becomes more exhaustive
- hidden labels could be added as a parameterised option
- indexes can be defined to include recursively all the Broaders of a Concept: this ensures that searches on a generic concept (when the ConceptScheme is arranged along a hierarchy like in a thesaurus) retrieve all the specifics.
- ASKOSI keeps the count of the uses of each Concept: those counts are provided in displays to give clues to users about how much data is linked to each Concept and also provide horizontal searches whenever possible.
- Each reference to a Concept can be enriched by a prefix (to specify, for instance, a "role") and a suffix (to specify, for instance, a quantity, a level of importance): Lucene proximity search then allows finding the use of a Concept with a given role and/or quantity.
In the above example (illustrator_ person_1234 _top), Lucene can answer different kinds of queries. A search can be made for:
- "illustrator_ person_1234 _top"
- "illustrator_ person_1234"
- "person_1234 _top"
- A search can also be made on any synonyms or translations or broaders of "person_1234".
- An EOL distance is specified to Lucene to separate occurrences of the same field.
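The principle "one reference = one word = one indexation token" can be illustrated with a toy model in plain Python. It does not use the Lucene API and is not the ASKOSI code: the BROADER table, the Concept codes and the function names are hypothetical, and stacking the broaders at the same position as the Concept is an assumption about how the recursive index could be laid out.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy: direct broaders of each Concept "about".
BROADER: Dict[str, List[str]] = {
    "instrument_piccolo": ["instrument_flute"],
    "instrument_flute": ["instrument_woodwind"],
}


def with_broaders(about: str) -> Set[str]:
    """Return the token itself and, recursively, all of its broader Concepts."""
    seen: Set[str] = set()
    stack = [about]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER.get(current, []))
    return seen


def index_field(text: str) -> List[Set[str]]:
    """Split on whitespace only, so a code containing underscores stays one token.

    Each position holds a set of terms: the token plus, for a known Concept,
    its broaders (assumed to share the position of the Concept).
    """
    return [with_broaders(token) for token in text.split()]


def phrase_match(positions: List[Set[str]], phrase: str) -> bool:
    """True if the words of the phrase occur at consecutive positions."""
    words = phrase.split()
    for start in range(len(positions) - len(words) + 1):
        if all(word in positions[start + i] for i, word in enumerate(words)):
            return True
    return False


field = index_field("solo_ instrument_piccolo _2")
print(phrase_match(field, "solo_ instrument_piccolo"))   # role + Concept
print(phrase_match(field, "solo_ instrument_woodwind"))  # role + generic Concept
print(phrase_match(field, "solo_ _2"))                   # not adjacent
```

Preferred and alternative labels could be added to each position in the same way, so that a search on a synonym or translation matches the same reference.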
Conclusion#
- A mapping mechanism may be used to map this encoding to regular SKOS Scheme and Concept URIs, but the encoding frees humans from typing or seeing "http://..."
- To go further and integrate this in a Wiki like JspWiki, this notation may have to be improved with capabilities equivalent to those in Semantic MediaWiki (relate a part of wiki page text as being an attribute of the relation between the Wiki page and a concept).
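One possible shape for the mapping mechanism mentioned above is a table of base URIs, one per Scheme. The sketch below is an assumption made for illustration only: the base URI is a placeholder under example.org, and the function names are hypothetical.

```python
from typing import Dict

# Hypothetical base URI per Scheme code (placeholder values).
SCHEME_BASE_URIS: Dict[str, str] = {
    "language": "http://example.org/schemes/language/",
}


def about_to_uri(about: str) -> str:
    """Expand an absolute "about" such as language_fr_BE into a Concept URI."""
    scheme, separator, code = about.partition("_")
    if not separator or scheme not in SCHEME_BASE_URIS:
        raise ValueError(f"no URI mapping for: {about!r}")
    return SCHEME_BASE_URIS[scheme] + code


def uri_to_about(uri: str) -> str:
    """Shorten a Concept URI back to its "about" code."""
    for scheme, base in SCHEME_BASE_URIS.items():
        if uri.startswith(base) and len(uri) > len(base):
            return f"{scheme}_{uri[len(base):]}"
    raise ValueError(f"no Scheme registered for: {uri!r}")


uri = about_to_uri("language_fr_BE")
print(uri)
print(uri_to_about(uri))
```

With such a table, complete URIs are produced only for exchanges between applications, while stored values and indexed tokens keep the short code.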