The vocabulary quality issues defined in the following sections should be applicable to any SKOS vocabulary. Some of the issues are taken from already existing research (see section "Related Work"). Others reflect our thoughts when investigating real-world thesauri (see the data repository).
If not stated otherwise, we treat the SKOS vocabularies as fully entailed RDFS graphs. We also enrich the vocabularies by entailment of owl:inverseOf properties as well as instances of owl:TransitiveProperty and owl:SymmetricProperty.
Some controlled vocabularies contain natural-language literals without any indication of which language has been used. Language tags might also not conform to language tag standards, such as RFC 3066.
Iteration over all triples in the vocabulary whose predicate is (a subproperty of) rdfs:label or skos:note.
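The iteration described above could be sketched as follows. This is a minimal illustration, not the actual implementation: triples are modelled as plain tuples whose object is a (text, language tag) pair, the predicate set LABEL_OR_NOTE is a hand-picked assumption standing in for the entailed subproperties, and the regular expression checks only the RFC 3066 syntax, not whether a tag is registered.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, followed by optional
# "-"-separated subtags of 1-8 letters or digits (syntax check only).
RFC3066 = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Assumption: a hand-picked stand-in for rdfs:label, skos:note and their subproperties.
LABEL_OR_NOTE = {"rdfs:label", "skos:prefLabel", "skos:altLabel",
                 "skos:note", "skos:definition"}

def find_bad_language_tags(triples):
    """Return (subject, predicate, text) for literals with a missing or malformed tag."""
    bad = []
    for subject, predicate, (text, lang) in triples:
        if predicate not in LABEL_OR_NOTE:
            continue
        if lang is None or not RFC3066.fullmatch(lang):
            bad.append((subject, predicate, text))
    return bad

# Hypothetical sample data.
triples = [
    ("ex:cat", "skos:prefLabel", ("cat", "en")),
    ("ex:cat", "skos:altLabel", ("Katze", None)),          # no language tag
    ("ex:dog", "skos:prefLabel", ("dog", "english_uk")),   # malformed tag
    ("ex:dog", "skos:definition", ("A domestic animal", "en-GB")),
]
bad_literals = find_bad_language_tags(triples)
```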
Incomplete Language Coverage
Description
Some concepts in a thesaurus are labeled in only one language, some in multiple languages. It may be desirable to have each concept labeled in every language that is used for the other concepts. This is not always possible, but incomplete language coverage for some concepts can indicate shortcomings of the vocabulary.
Iteration over all concepts in the vocabulary and creation of a global set of language tags appearing in the vocabulary. In a second iteration, each concept having a set of language tags that is not equal to the global language tag set is returned.
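The two passes described above can be sketched in a few lines. The input format (a dictionary from concept to the set of language tags found on its literals) and the sample identifiers are assumptions made for illustration.

```python
def incomplete_language_coverage(concept_langs):
    """Map each incompletely covered concept to the language tags it lacks."""
    # First pass: global set of language tags appearing in the vocabulary.
    all_langs = set().union(*concept_langs.values())
    # Second pass: concepts whose tag set differs from the global set.
    return {concept: all_langs - langs
            for concept, langs in concept_langs.items()
            if langs != all_langs}

# Hypothetical sample data.
concept_langs = {
    "ex:cat": {"en", "de", "fr"},
    "ex:dog": {"en", "de"},
    "ex:bird": {"en"},
}
missing = incomplete_language_coverage(concept_langs)
```

Returning the missing tags, rather than only the concept, is a design choice of this sketch; it makes the report directly actionable for a vocabulary editor.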
No Common Language
Description
Checks whether all concepts share at least one common language, i.e., whether every concept has at least one literal in the same language.
Example
Preliminary ideas on computation
Undocumented Concepts
Description
The SKOS "standard" defines a number of properties useful for documenting the meaning of the concepts in a thesaurus (section 7) also in a human-readable form. Intense use of these properties leads to a well-documented thesaurus which should also improve its quality.
Example
Library of Congress Thesaurus for Graphic Materials offers a high coverage of documentation properties
Implementation
Iteration over all concepts in the vocabulary, returning those that use none of skos:note, skos:changeNote, skos:definition, skos:editorialNote, skos:example, skos:historyNote, or skos:scopeNote.
Overlapping Labels
Description
This is a generalization of a recommendation in the SKOS primer, that “no two concepts have the same preferred lexical label in a given language when they belong to the same concept scheme”. Overlapping labels could indicate missing disambiguation information and thus lead to problems in autocompletion applications.
Iteration over all authoritative concepts, collecting their respective labels. In a second pass, the similarity of all possible label pairs is checked by a similarity function. Concept labels whose similarity value reaches a given threshold are considered conflicting and are returned. In the current implementation, the similarity function is string equality with a threshold equal to 1.
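A minimal sketch of the pairwise comparison follows. It reports label pairs whose similarity reaches the threshold, which for string equality and a threshold of 1 means identical labels. Restricting the comparison to labels with the same language tag is an assumption of this sketch, as are the function names and the sample data.

```python
from itertools import combinations

def string_equality(a, b):
    """Similarity function: 1.0 for identical strings, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

def overlapping_labels(labels, similarity=string_equality, threshold=1.0):
    """labels: list of (concept, text, lang). Return conflicting label pairs."""
    conflicts = []
    for (c1, t1, l1), (c2, t2, l2) in combinations(labels, 2):
        # Only labels of different concepts in the same language can clash.
        if c1 != c2 and l1 == l2 and similarity(t1, t2) >= threshold:
            conflicts.append((c1, c2, t1, l1))
    return conflicts

# Hypothetical sample data.
labels = [
    ("ex:jaguar_animal", "Jaguar", "en"),
    ("ex:jaguar_car", "Jaguar", "en"),
    ("ex:cat", "Cat", "en"),
    ("ex:jaguar_car", "Jaguar", "de"),
]
conflicts = overlapping_labels(labels)
```

Passing the similarity function as a parameter keeps the door open for fuzzier measures, at the price of a quadratic number of comparisons.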
Missing Labels
Description
To make the vocabulary more convenient for humans to use, instances of SKOS classes (Concept, ConceptScheme, Collection) should be labeled using e.g., prefLabel, altLabel, rdfs:label, dc:title.
Example
Preliminary ideas on computation
Unprintable Characters in Labels
Description
Preferred, alternative, or hidden labels (skos:prefLabel, skos:altLabel, skos:hiddenLabel) contain characters that are not alphanumeric characters or blanks.
Example
Newline characters that have been left over from automated vocabulary conversion or invalid user input.
Preliminary ideas on computation
A SPARQL query would be sufficient to find labels containing characters that belong to the Unicode general categories "Zl", "Zp" and "C".
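Outside of SPARQL, the same category test can be sketched with Python's standard unicodedata module. The function name and the sample labels are hypothetical; the check flags line separators (Zl), paragraph separators (Zp), and every category starting with "C" (control, format, surrogate, private use, unassigned).

```python
import unicodedata

def unprintable_characters(label):
    """Return (index, category) for each character in Zl, Zp or any C* category."""
    hits = []
    for index, char in enumerate(label):
        category = unicodedata.category(char)
        if category in ("Zl", "Zp") or category.startswith("C"):
            hits.append((index, category))
    return hits

# Hypothetical labels: one with a leftover newline, one clean.
with_newline = unprintable_characters("Graphic\nMaterials")
clean = unprintable_characters("Graphic Materials")
```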
Empty Labels
Description
Labels also need to contain textual information to be useful, thus we find all SKOS labels with length 0 (after removing whitespace).
Example
Preliminary ideas on computation
Ambiguous Notation References
Description
Concepts within the same concept scheme should not have identical skos:notation literals.
Example
Preliminary ideas on computation
Structural Issues
SKOS is based on RDF, which is a graph-based data model. Therefore we can concentrate on the vocabulary's graph-based structure for assessing the quality of SKOS vocabularies and apply graph- and network-analysis techniques.
Orphan Concepts
Description
An orphan concept is a concept without any associative or hierarchical relations. It might have attached literals like e.g., labels, but is not connected to any other resource, lacking valuable context information. A controlled vocabulary that contains many orphan concepts is less usable for search and retrieval use cases, because, e.g., no hierarchical query expansion can be performed on search terms to find documents with more general content.
Iteration over all concepts in the vocabulary, returning those that have no associated resources via (subproperties of) skos:semanticRelation.
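As a rough sketch of this check, assume the vocabulary is available as a set of concept identifiers plus a list of (subject, predicate, object) tuples. The predicate set below is a hand-written stand-in for skos:semanticRelation and its subproperties; all sample identifiers are hypothetical.

```python
# Assumption: hand-written stand-in for skos:semanticRelation and its subproperties.
SEMANTIC_RELATIONS = {
    "skos:semanticRelation", "skos:broader", "skos:narrower", "skos:related",
    "skos:broaderTransitive", "skos:narrowerTransitive",
    "skos:mappingRelation", "skos:exactMatch", "skos:closeMatch",
    "skos:broadMatch", "skos:narrowMatch", "skos:relatedMatch",
}

def orphan_concepts(concepts, triples):
    """Return the concepts that take part in no semantic relation at all."""
    connected = set()
    for subject, predicate, obj in triples:
        if predicate in SEMANTIC_RELATIONS:
            connected.add(subject)
            connected.add(obj)
    return set(concepts) - connected

# Hypothetical sample data: ex:lonely only has a label.
concepts = {"ex:cat", "ex:mammal", "ex:lonely"}
triples = [
    ("ex:cat", "skos:broader", "ex:mammal"),
    ("ex:lonely", "skos:prefLabel", "lonely"),
]
orphans = orphan_concepts(concepts, triples)
```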
Disconnected Concept Clusters
Description
By checking the connectivity of the graph, it is possible to identify all weakly connected components. These components form "islands" in the vocabulary and might be caused by incomplete data acquisition, "forgotten" test data, outdated terms and the like.
Example
The dmGeo vocabulary consists of 5 weakly connected components. It was available at http://www.dismarc.org but now seems to be offline. Weakly connected components can also be found in the LVAk thesaurus.
Implementation
Creation of an undirected graph that includes all non-orphan concepts as nodes and all semantic relations as edges. Tarjan's algorithm then finds and returns all weakly connected components.
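To illustrate the idea, the following sketch finds the components of such an undirected graph. Note that it uses a plain breadth-first search instead of the algorithm named above, which is a simplification chosen for brevity; the edge list format and sample names are assumptions.

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """edges: iterable of (concept, concept) pairs; direction is ignored."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in neighbours[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

# Hypothetical sample data: two "islands".
edges = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:x", "ex:y")]
components = weakly_connected_components(edges)
```

Orphan concepts never appear in the edge list, so they are excluded automatically, in line with the description above.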
Cyclic Hierarchical Relations
Description
Although perfectly consistent with the SKOS data model, cyclic relations may reveal a logical problem in the thesaurus. Consider the following example: "decision" -> "problem resolution" -> "problem" (-> "decision": here the cycle is closed). The concepts are connected using skos:broader relationships (indicated with "->"). Due to the fact that a thesaurus is in many cases a product of consensus between the contributors (or just the decision of one dedicated thesaurus manager), it will be almost impossible to automatically resolve the cycle (i.e. deleting an edge).
Construction of a graph having all concepts as nodes and the set of edges being skos:broader relations.
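One possible way to detect cycles in such a graph is a depth-first search that reports every back edge, sketched below. The text above does not say which cycle detection technique is used, so this is an assumption; the sketch reports one cycle per back edge rather than enumerating all elementary cycles, and its recursion would need to be made iterative for very deep hierarchies.

```python
def find_cycles(broader):
    """broader: dict mapping a concept to its directly broader concepts."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {}
    cycles = []

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for parent in broader.get(node, ()):
            state = colour.get(parent, WHITE)
            if state == GREY:
                # Back edge: the cycle is the path segment starting at `parent`.
                cycles.append(path[path.index(parent):] + [parent])
            elif state == WHITE:
                visit(parent, path)
        path.pop()
        colour[node] = BLACK

    for concept in list(broader):
        if colour.get(concept, WHITE) == WHITE:
            visit(concept, [])
    return cycles

# The cycle from the description, plus an unrelated hypothetical pair.
broader = {
    "decision": ["problem resolution"],
    "problem resolution": ["problem"],
    "problem": ["decision"],
    "tax": ["finance"],
}
cycles = find_cycles(broader)
```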
Valueless Associative Relations
Description
Two concepts are siblings, but are also connected by an associative relation. In that context, the associative relation is not necessary. See ISO_DIS_25964-1, 11.3.2.2.
Identification of all pairs of concepts that have the same broader or narrower concepts, i.e. they are "sibling terms". All siblings that are related by a skos:related property are returned.
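A simplified sketch of this check is given below. It only treats concepts with a shared broader concept as siblings; the shared-narrower case mentioned above is left out for brevity. The input format (lists of pairs) and the sample identifiers are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def valueless_related(broader_pairs, related_pairs):
    """broader_pairs: (child, parent) tuples; related_pairs: (a, b) tuples."""
    children = defaultdict(set)
    for child, parent in broader_pairs:
        children[parent].add(child)

    # Every unordered pair of concepts sharing a parent is a sibling pair.
    siblings = set()
    for kids in children.values():
        for a, b in combinations(sorted(kids), 2):
            siblings.add(frozenset((a, b)))

    related = {frozenset(pair) for pair in related_pairs}
    return sorted(tuple(sorted(pair)) for pair in related & siblings)

# Hypothetical sample data.
broader_pairs = [("ex:cat", "ex:mammal"), ("ex:dog", "ex:mammal"),
                 ("ex:oak", "ex:tree")]
related_pairs = [("ex:dog", "ex:cat"), ("ex:cat", "ex:oak")]
valueless = valueless_related(broader_pairs, related_pairs)
```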
Solely Transitively Related Concepts
Description
skos:broaderTransitive and skos:narrowerTransitive are, according to the SKOS reference document, "not used to make assertions", so they should not be the only relations hierarchically relating two concepts.
Example
The NAICS thesaurus contains 2189 concepts that are related directly by skos:broaderTransitive.
Implementation
Identification of all concept pairs that are related by skos:broaderTransitive or skos:narrowerTransitive properties but not by their skos:broader and skos:narrower subproperties.
Unidirectionally Related Concepts
Description
Reciprocal relations (e.g., broader/narrower, related, hasTopConcept/topConceptOf) should be included in the controlled vocabularies, e.g., to achieve better search results using SPARQL in systems without reasoner support.
Example
Implementation
This issue is checked WITHOUT inference of owl:inverseOf properties. We iterate over all triples and check for each property if an inverse property is defined in the SKOS ontology and if the respective statement using this property is included in the vocabulary. If not, the resources associated with this property are returned.
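The check can be illustrated with a small lookup table of reciprocal properties. The table below is a hand-written subset and an assumption of this sketch; a real implementation would read the inverse definitions from the SKOS ontology. skos:related is listed as its own reciprocal because it is symmetric.

```python
# Assumption: hand-written subset of reciprocal SKOS properties.
RECIPROCAL = {
    "skos:broader": "skos:narrower",
    "skos:narrower": "skos:broader",
    "skos:related": "skos:related",
    "skos:hasTopConcept": "skos:topConceptOf",
    "skos:topConceptOf": "skos:hasTopConcept",
}

def unidirectional_relations(triples):
    """Return triples whose reciprocal statement is not asserted."""
    asserted = set(triples)
    return [(s, p, o) for s, p, o in triples
            if p in RECIPROCAL and (o, RECIPROCAL[p], s) not in asserted]

# Hypothetical sample data: only cat/mammal is stated in both directions.
triples = [
    ("ex:cat", "skos:broader", "ex:mammal"),
    ("ex:mammal", "skos:narrower", "ex:cat"),
    ("ex:dog", "skos:broader", "ex:mammal"),
    ("ex:cat", "skos:related", "ex:dog"),
]
one_way = unidirectional_relations(triples)
```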
Omitted Top Concepts
Description
A vocabulary should provide "entry points" to the data to provide “efficient access” (SKOS primer) and guidance for human users.
For every ConceptScheme in the controlled vocabulary, a SPARQL query is issued finding resources that are associated with this ConceptScheme by one of the properties skos:hasTopConcept or skos:topConceptOf. TODO: extend notion of top concepts also by concepts having no broader concept (as suggested in [Abdul]).
Top Concepts Having Broader Concepts
Description
Concepts "internal to the tree" should not be indicated as top concepts, as pointed out in [Allemang2011].
A SPARQL query finds all top concepts (being defined by one of the properties skos:hasTopConcept or skos:topConceptOf) having associated a broader concept.
Hierarchical Redundancy
Description
As stated in the SKOS reference document, skos:broader and skos:narrower are not transitive properties. However, they are subproperties of skos:broaderTransitive and skos:narrowerTransitive, which enables inference of a "transitive closure". This, in fact, leaves it up to the user to decide whether a vocabulary's hierarchical structure is seen as transitive or not. In the former case, this check can be useful. It finds pairs of concepts (A,B) that are directly hierarchically related but for which there also exists a hierarchical path through a concept C that connects A and B.
Example
Concept A has a broader concept B. If a concept C exists such that A broader C and C broader B, the direct hierarchical relation A broader B is considered redundant.
Implementation
These structures can be found by a single SPARQL query.
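As an alternative illustration outside of SPARQL, the sketch below finds redundant direct relations on an in-memory dictionary of asserted skos:broader links. The data structure, function names, and sample concepts are assumptions; a direct link is reported when its target is also reachable through another directly broader concept.

```python
def redundant_broader(broader):
    """broader: dict mapping a concept to the set of its asserted broader concepts."""
    def ancestors(start):
        # All concepts reachable from `start` via one or more broader links.
        seen, stack = set(), list(broader.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(broader.get(node, ()))
        return seen

    redundant = []
    for concept, parents in broader.items():
        for target in parents:
            # Redundant if another direct parent already leads to `target`.
            if any(target in ancestors(via) for via in parents if via != target):
                redundant.append((concept, target))
    return sorted(redundant)

# Hypothetical sample data: poodle -> animal is implied via dog.
broader = {
    "ex:poodle": {"ex:dog", "ex:animal"},
    "ex:dog": {"ex:animal"},
}
redundant = redundant_broader(broader)
```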
Reflexive Relations
Description
Concepts that are related to themselves.
Example
Implementation
These structures can be found by a single SPARQL query.
data is provided using standard formats (e.g., RDF which is obviously the case for SKOS vocabularies)
linked resources are dereferencable and provide further information
data linked to and from other resources
The issues introduced in this section can be used to create computable metrics for measuring data linkage.
Missing In-Links
Description
The usage of its concepts can be an indicator for a vocabulary's quality. Usage can be determined by the number of external resources referencing these concepts.
For each authoritative concept in the vocabulary, a SPARQL query (against, e.g., the Sindice endpoint) is issued that returns all triples in which the concept shows up as the object. An estimate of the number of other vocabularies referencing the concept can be obtained by examining whether the host part of the returned triple subject URIs differs from the publishing host of the vocabulary. Concepts for which no such external references can be found are returned.
Missing Out-Links
Description
SKOS concepts can define links to other concepts within one and the same vocabulary, to concepts in other vocabularies, or to external resources on the Web. These external links are essential to, for example,
connect the vocabulary with other Web resources and benefit from other people's knowledge about the contained terms (by, e.g., using the link as starting point for a web crawling application)
act as some kind of bridge, connecting previously unconnected (unrelated) domains
provide information on the context of a term, serving a documentation purpose
For each authoritative concept in the vocabulary, a SPARQL query is issued that returns all IRIs that occur as subject or object in triples where this concept is involved. All IRIs that are HTTP URIs and refer to a non-authoritative resource for the concept are counted. Concepts with a count that equals zero are returned.
Broken Links
Description
If concepts link to other resources (link targets) on the Web, it is important that these resources are dereferencable and return the response code 200 after possible redirections.
A SPARQL query is issued that finds all HTTP URIs being part (as subject, predicate, or object) of a triple in the vocabulary. The found URIs are then dereferenced and returned if the HTTP response code (after possible redirections) is other than 200.
Undefined SKOS Resources
Description
The vocabulary should not invent any new terms within the SKOS namespace or use “deprecated” SKOS elements like those defined in Appendix D of the SKOS reference.
A SPARQL query finds all IRIs that appear in one of the vocabulary's triples in combination with a "deprecated" predicate. "Invented" new terms are found by a second SPARQL query that selects all IRIs in the vocabulary's RDF triples that belong to the SKOS namespace but are not defined in the SKOS ontology. All terms found by the two queries are returned.
HTTP URI Scheme Violation
Description
URIs should be dereferencable. C. Bizer, How to Publish Linked Data on the Web: "In the context of Linked Data, we restrict ourselves to using HTTP URIs only and avoid other URI schemes such as URNs and DOIs."
Example
In the CFR Thesaurus (a thesaurus in the legal domain by Cornell University), a concept has been identified by a file:// URI.
Implementation
A SPARQL query is used to find all IRIs that occur as subject in the vocabulary's RDF triples. If their protocol identifier is other than http or https, the resource is returned.
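The scheme test itself is straightforward once the subject IRIs are available. The sketch below uses Python's urllib.parse on an in-memory triple list; the function name and all sample IRIs are hypothetical.

```python
from urllib.parse import urlparse

def non_http_subjects(triples):
    """Return subject IRIs whose scheme is neither http nor https."""
    return sorted({subject for subject, _, _ in triples
                   if urlparse(subject).scheme.lower() not in ("http", "https")})

# Hypothetical sample data.
triples = [
    ("http://example.org/concept/1", "skos:prefLabel", "one"),
    ("https://example.org/concept/2", "skos:prefLabel", "two"),
    ("file:///home/user/thesaurus.rdf#c3", "skos:prefLabel", "three"),
    ("urn:example:concept:4", "skos:prefLabel", "four"),
]
violations = non_http_subjects(triples)
```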
SKOS Semi-Formal Consistency Issues
This category defines issues that relate to specific design decisions of the SKOS ontology. Some of them are also semi-formally expressed in the SKOS reference documentation.
Relation Clashes
Description
Covers condition S27 from the SKOS reference document, that has not been defined formally.
In a first step, all pairs of concepts are found that are associatively connected, using a SPARQL query. In the second step, a graph is created, containing only hierarchically related concepts and the respective relations. For each concept pair from the first step, we check for a path in the graph from step two. If such a path is found, a clash has been identified and the causing concepts are returned.
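The two steps can be sketched as follows, assuming the associative pairs and the hierarchical graph have already been extracted into plain Python structures. Checking for a path in both directions is an assumption of this sketch; the names and sample data are hypothetical.

```python
def has_path(broader, start, goal):
    """True if `goal` is reachable from `start` via one or more broader links."""
    seen, stack = set(), list(broader.get(start, ()))
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(broader.get(node, ()))
    return False

def relation_clashes(related_pairs, broader):
    """Return associatively related pairs that are also hierarchically connected."""
    return [(a, b) for a, b in related_pairs
            if has_path(broader, a, b) or has_path(broader, b, a)]

# Hypothetical sample data: animal/cat are related AND hierarchically linked.
broader = {"ex:cat": {"ex:mammal"}, "ex:mammal": {"ex:animal"}}
related_pairs = [("ex:animal", "ex:cat"), ("ex:cat", "ex:dog")]
clashes = relation_clashes(related_pairs, broader)
```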
Mapping Clashes
Description
Covers condition S46 from the SKOS reference document, that has not been defined formally.
Example
Implementation
Can be solved by issuing a SPARQL query.
Inconsistent Preferred Labels
Description
According to the SKOS reference document, "A resource has no more than one value of skos:prefLabel per language tag".
Example
For the concept http://dbpedia.org/resource/Income_tax, the STW thesaurus mappings define two German prefLabels: "Einkommensteuer" and "Einkommensteuer (Deutschland)".
Implementation
A SPARQL query is used to find concepts with at least two prefLabels. In a second step, the language tags of these prefLabels are analyzed and an ambiguity is detected if they are equal.
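The second step, grouping preferred labels by concept and language tag, might look like the sketch below. The input format is an assumption, and the English label in the sample data is made up for contrast; the two German labels are those from the example above.

```python
from collections import defaultdict

def inconsistent_pref_labels(pref_labels):
    """pref_labels: iterable of (concept, text, lang) for skos:prefLabel values."""
    by_concept_and_lang = defaultdict(set)
    for concept, text, lang in pref_labels:
        by_concept_and_lang[(concept, lang)].add(text)
    # More than one distinct prefLabel per (concept, language) is an ambiguity.
    return {key: sorted(texts)
            for key, texts in by_concept_and_lang.items() if len(texts) > 1}

income_tax = "http://dbpedia.org/resource/Income_tax"
pref_labels = [
    (income_tax, "Einkommensteuer", "de"),
    (income_tax, "Einkommensteuer (Deutschland)", "de"),
    (income_tax, "Income tax", "en"),   # hypothetical label, added for contrast
]
ambiguous = inconsistent_pref_labels(pref_labels)
```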
Disjoint Labels Violation
Description
Covers condition S13 from the SKOS reference document (section 5.4) stating that "skos:prefLabel, skos:altLabel and skos:hiddenLabel are pairwise disjoint properties".
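A SPARQL query collects all labels of all concepts, building an in-memory structure. This structure is then checked for entries that violate disjointness, i.e., literals that are assigned to the same concept through more than one of the three label properties.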
Mapping Relations Misuse
Description
According to the SKOS reference documentation, mapping relations (e.g., skos:broadMatch or skos:relatedMatch) should be asserted to concepts being members of different concept schemes. This check finds concepts that are related by a mapping property and are either members of the same concept scheme or members of no concept scheme at all.
Example
The concept labeled "jaguar" is a member of the concept scheme labeled "animals". Furthermore, the concept "cat" is a member of the same concept scheme and "jaguar" is related to "cat" by skos:broadMatch. Thus, this relation can be considered a misuse of a mapping relation.