RDFa Guide - gchq/LD-Explorer GitHub Wiki

RDFa is a W3C recommendation created by the RDFa working group that allows for the annotation of XML-based document types (such as HTML) with RDF via attributes. This means document authors can add more meaning to their documents, opening up more avenues for machine readability. That machine readability makes a range of intelligent functionality possible:

  • Events in a document could be automatically added to calendars.
  • Contact information could be added to address books.
  • Content could be indexed by search engines more effectively.
  • Data could be scraped/extracted from a web page for further querying and analysis.
  • Data would have increased connectivity to other data.

Increased data connectivity is perhaps the most compelling improvement that RDFa enables. No one would argue that the web isn't already a connected resource. It was, in fact, designed largely around the idea that pages can "connect" to each other through the <a> tag and an appropriate href. We call these links, and every web user encounters hundreds of them whenever they browse the internet. Without any semantic meaning, though, a link can only ever tell us that two pages are connected and not a whole lot more. RDFa extends that connectivity to the data level, allowing you to express not only the nature of a particular link, but also various facts about the things being linked together.

Publications

The RDFa working group published several documents in support of their charter.

  • RDFa Primer - an informal introduction to the topic and the only document most people will ever need to read.
  • RDFa Core - the RDFa specification.
  • RDFa Lite - the RDFa lite specification (a subset of the core specification).
  • HTML+RDFa - Extends the core specification to include several HTML-only scenarios.
  • XHTML+RDFa - Extends the core specification to include several XHTML-only scenarios.

Anyone who isn't trying to implement RDFa themselves in a host language is unlikely to require anything but the information from the primer, although much of that document is covered by this one.

Example

RDFa works using a combination of attributes and the natural hierarchy of XML-based documents (i.e. elements being nested within each other) to describe a graph that conveys semantic meaning. It is often, ironically, difficult to describe what we mean when we say "semantic meaning" so let's explore the topic by way of example. Consider the following HTML snippet:

<div>
	<h1>Spiderman</h1>
	<p>
		Spider-Man is a superhero appearing in American comic books published by Marvel Comics. He first
		appeared in issue 15 of Amazing Fantasy in 1962.
	</p>
</div>

As a human, you are able to parse this and understand what this content means. You may even be able to act on it, or connect it to other information you already possess (that's the beauty of graphs). A computer, on the other hand, only sees this...

<div>
	<h1>HEADING</h1>
	<p>TEXT TEXT TEXT TEXT TEXT TEXT</p>
</div>

It's not true to say there is no meaning being communicated here - the document has been marked up with HTML elements which are famously the things that contribute the "semantic" part of a web page. The topic gets muddied further by the fact that Semantic Markup is a term that gets used liberally in the web profession to describe HTML best practice (particularly when relating to accessibility).

While these HTML elements do add meaning to a document, they only relate to the format of the document; they don't tie the content to any real-world concepts. In the above example, a machine is able to understand that there's a heading and a paragraph, but it has no idea that the text is related to Spiderman.

Using the new attributes introduced in the RDFa specification alongside existing RDF vocabularies, the document can be annotated to allow for machines to extract and understand the relevant information from the content of the page. If we were to add RDFa to the above example, it might look like this...

<div vocab="https://schema.org/" typeof="Person" resource="http://dbpedia.org/resource/Spider-Man">
	<h1 property="name">Spiderman</h1>
	<p>
		Spider-Man is a superhero appearing in American comic books published by Marvel Comics. He first
		appeared in
		<span
			typeof="ComicStory"
			resource="https://www.marvel.com/comics/issue/16926/amazing_fantasy_1962_15"
		>
			<span property="name">issue 15 of Amazing Fantasy</span>
			in
			<span property="datePublished">1962</span>.
		</span>
	</p>
</div>

These additions make no difference to the user. In fact, the user won't even be aware they exist once the page is rendered in their browser unless they inspect the source code. From a machine readability perspective, however, this changes everything! The addition of these attributes has made it possible to automate the creation of the following graph from the document.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <https://schema.org/> .
@prefix dbr: <http://dbpedia.org/resource/> .

dbr:Spider-Man
   rdf:type schema:Person ;
   schema:name "Spiderman" .

<https://www.marvel.com/comics/issue/16926/amazing_fantasy_1962_15>
   rdf:type schema:ComicStory ;
   schema:name "issue 15 of Amazing Fantasy" ;
   schema:datePublished "1962" .

For reasons already documented in our RDF Guide, the fact that we're using RDF carries all the usual benefits, including data interoperability and flexibility straight out of the box.
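
To make the extraction step concrete, here is a minimal sketch of how a processor might walk the annotated snippet and produce the triples in the graph above. It is not a conforming RDFa processor: the TinyRDFaParser class and extract_triples helper are hypothetical names, it uses only Python's standard library, and it assumes well-formed markup with explicit closing tags. It understands only vocab, resource, typeof and property, and ignores prefix, href, and the other attributes.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TinyRDFaParser(HTMLParser):
    """Toy extractor: understands only vocab, resource, typeof and property."""

    def __init__(self):
        super().__init__()
        self.triples = []
        # One (vocab, subject) frame per open element, so scope follows nesting.
        self.scope = [("", None)]
        # Open property elements: [depth, subject, predicate, collected text].
        self.pending = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        vocab, subject = self.scope[-1]
        vocab = attrs.get("vocab", vocab)
        subject = attrs.get("resource", subject)
        if "typeof" in attrs:
            self.triples.append((subject, RDF_TYPE, vocab + attrs["typeof"]))
        if "property" in attrs:
            predicate = vocab + attrs["property"]
            self.pending.append([len(self.scope), subject, predicate, []])
        self.scope.append((vocab, subject))

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for entry in self.pending:
            entry[3].append(data)

    def handle_endtag(self, tag):
        self.scope.pop()
        if self.pending and self.pending[-1][0] == len(self.scope):
            _, subject, predicate, parts = self.pending.pop()
            literal = " ".join("".join(parts).split())  # collapse whitespace
            self.triples.append((subject, predicate, literal))


def extract_triples(html):
    parser = TinyRDFaParser()
    parser.feed(html)
    parser.close()
    return parser.triples


page = """
<div vocab="https://schema.org/" typeof="Person" resource="http://dbpedia.org/resource/Spider-Man">
  <h1 property="name">Spiderman</h1>
  <p>He first appeared in
    <span typeof="ComicStory" resource="https://www.marvel.com/comics/issue/16926/amazing_fantasy_1962_15">
      <span property="name">issue 15 of Amazing Fantasy</span>
      in <span property="datePublished">1962</span>.
    </span>
  </p>
</div>
"""

triples = extract_triples(page)
for triple in triples:
    print(triple)
```

Each printed tuple is a (subject, predicate, object) triple. A real RDFa library handles many cases this sketch skips, such as blank nodes, IRI-valued objects and language tags.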

The Attributes Themselves

There are six types of attribute according to the RDFa specification.

  • Syntax attributes: prefix, vocab.
  • Subject attributes: about.
  • Predicate attributes: property, rel, rev.
  • Resource attributes: resource, href, src.
  • Literal attributes: datatype, content, xml:lang or lang.
  • Macro attributes: typeof, inlist.

Rather than go over every one of these, we'll briefly cover the five attributes that are part of RDFa Lite. RDFa Lite is another W3C recommendation, a subset of RDFa that covers the "day-to-day needs" of most web authors. The attributes covered by RDFa Lite are property, resource, vocab, typeof, and prefix.

property

The property attribute is used to define the "predicate" portion of an RDF triple. The behavior of this attribute changes depending on the element it's applied to and any other attributes that are applied alongside it. When applied to most elements, any text content within the element will be assumed to be the "object literal" portion of an RDF triple, for example...

<div>
  <span property="http://xmlns.com/foaf/0.1/name">Bob</span>
</div>

Other attributes can affect how a property attribute is processed, as can the element type. When applied to an HTML link, for example, in the absence of certain other attributes the resource specified in the href attribute will be used for the object portion of the triple and will be read as an IRI, not a literal.

<div>
  <a property="http://xmlns.com/foaf/0.1/homepage" href="http://www.example.com/bobs-homepage">
    Bob
  </a>
</div>

Typically for links you would instead use the rel attribute in place of property; in this example the effect would be exactly the same.

Multiple properties can be specified in the same property attribute by separating them with a space. You might want to do this if you have multiple vocabularies in play, or if you want to specify that one resource is related to another in multiple ways.
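
As a rough illustration of the space-separated form, the sketch below shows how one property attribute holding two predicates would yield two triples that share a subject and object. The triples_for helper is a hypothetical name, and the choice of foaf:name plus schema:name is only an example.

```python
def triples_for(subject, property_attr, literal):
    """Emit one triple per space-separated predicate in a property attribute."""
    return [(subject, predicate, literal) for predicate in property_attr.split()]


# Equivalent to:
# <span property="http://xmlns.com/foaf/0.1/name https://schema.org/name">Bob</span>
triples = triples_for(
    "http://www.example.com/bob",
    "http://xmlns.com/foaf/0.1/name https://schema.org/name",
    "Bob",
)

for triple in triples:
    print(triple)
```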

resource

The resource attribute allows you to specify the IRI that acts as the subject for any property attributes on child nodes. The following example now indicates that "Bob" is the foaf:name of the resource identified by http://www.example.com/bob.

<div resource="http://www.example.com/bob">
  <span property="http://xmlns.com/foaf/0.1/name">Bob</span>
</div>

Multiple resource attributes can be added to a particular document, allowing you to describe multiple resources on the same page.

If we hadn't included a resource attribute at all in the above example then the subject of any property attributes would be inferred based on other factors. If the document exists at a remote URL then that address would be used. If no such location information is available, blank nodes are used (these are un-named resources that exist only in the context of the current graph).

vocab

This attribute saves users having to write out the full URI of a particular vocabulary if a portion of a document is all using the same vocab. Any unqualified properties nested under a vocab annotation will be attributed to that vocabulary.

<div vocab="http://xmlns.com/foaf/0.1/" resource="http://www.example.com/bob">
  <span property="name">Bob</span>
</div>

In the above example, the property name will have the foaf vocabulary IRI prepended, so it is read as http://xmlns.com/foaf/0.1/name.
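
The prepending step can be pictured as simple string concatenation. The following is a simplified sketch rather than the full term-resolution rules from the specification; resolve_term is a hypothetical helper, and it treats anything containing a colon as already qualified.

```python
FOAF = "http://xmlns.com/foaf/0.1/"


def resolve_term(term, vocab):
    """Return the full IRI for a property or type name."""
    if ":" in term:
        # Already a full IRI (or a prefixed name), so leave it untouched.
        return term
    return vocab + term


print(resolve_term("name", FOAF))    # http://xmlns.com/foaf/0.1/name
print(resolve_term("Person", FOAF))  # http://xmlns.com/foaf/0.1/Person
```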

typeof

The typeof attribute specifies the rdf:type of any associated resource. Expanding on our "Bob" example from the previous attribute, we could indicate that Bob is a person by adding this attribute to the enclosing div.

<div vocab="http://xmlns.com/foaf/0.1/" resource="http://www.example.com/bob" typeof="Person">
  <span property="name">Bob</span>
</div>

Note that we did not need to specify that we're using the Person class from the foaf vocabulary - that's implied by the vocab attribute that already exists on the element.

prefix

The prefix attribute can be used to specify shorthand prefixes for IRI values that are used within the rest of the document. This is largely analogous to how prefixes are typically used in RDF documents, and it carries the same benefits (increased readability, reduced repetition, reduced data size and so on).

This attribute also allows web authors to use multiple vocabularies within the same page without having to write out the full URI for every property and resource. The aforementioned vocab attribute only lets you specify a single vocab, which might be fine for many use cases but won't be suitable for all. Using prefix makes it much less painful to use several vocabularies at once.

The syntax for this attribute is a list of prefix: IRI pairs (the prefix name, a colon, then the IRI), separated by whitespace. Here's the example from above using prefix instead of vocab.

<html>
	<head></head>
	<body prefix="foaf: http://xmlns.com/foaf/0.1/ ex: http://www.example.com/">

		<div resource="ex:bob" typeof="foaf:Person">
			<span property="foaf:name">Bob</span>
		</div>

	</body>
</html>

This allows for the following graph to be extracted from the page.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://www.example.com/bob>
   rdf:type foaf:Person;
   foaf:name "Bob" .
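
The step from the prefix attribute to the full IRIs in this graph can be sketched as follows. Both helpers (parse_prefixes and expand_curie) are hypothetical names, and the sketch assumes the attribute value is well formed, i.e. alternating "name:" and IRI tokens.

```python
def parse_prefixes(prefix_attr):
    """Turn 'foaf: http://... ex: http://...' into a {prefix: IRI} mapping."""
    tokens = prefix_attr.split()
    # Even positions hold "name:", odd positions hold the matching IRI.
    return {name.rstrip(":"): iri for name, iri in zip(tokens[0::2], tokens[1::2])}


def expand_curie(value, prefixes):
    """Expand 'foaf:name' to a full IRI; leave unknown prefixes untouched."""
    prefix, _, reference = value.partition(":")
    if prefix in prefixes:
        return prefixes[prefix] + reference
    return value


prefixes = parse_prefixes("foaf: http://xmlns.com/foaf/0.1/ ex: http://www.example.com/")

print(expand_curie("ex:bob", prefixes))       # http://www.example.com/bob
print(expand_curie("foaf:Person", prefixes))  # http://xmlns.com/foaf/0.1/Person
print(expand_curie("foaf:name", prefixes))    # http://xmlns.com/foaf/0.1/name
```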

If you are using popular prefixes such as foaf, it might not be necessary for you to specify these prefixes at all. Conforming RDFa processors all understand something called the RDFa initial context, which is a set of pre-defined prefixes that authors can assume to already be in scope.

Relationship to HTML Microdata

RDFa is very similar in scope to an alternative standard called HTML Microdata. The goal of each is more or less exactly the same: to add contextual meaning to otherwise meaningless documents.

Microdata is much more widely used and much simpler than RDFa, but also much less flexible and much smaller in scope. It allows for only very basic facts to be added to a web page, but enough to cover most of the major use cases for the web like search engine indexing. It was created to allow for web authors to more easily add the things they need without having to engage with the deeper complexities of linked data.

Relationship to screen scraping

Screen scraping is a method of automatically extracting data directly from a web document, so the case can easily be made that RDFa is a form of screen scraping. It does, however, stand apart from the various other methods that are typically used to scrape in a few key ways.

Firstly, other methods of web scraping involve the manual setup of rules and scripts to identify which parts of a page contain the data you're interested in. These techniques are fairly brittle and can break if a document changes its structure (a common occurrence, such as a minor re-design of a website, could in theory break any scraping operations that have been going on). The same is not true of RDFa annotations - so long as the new design uses the same annotations then the same data extraction method could continue to be used regardless of what the document structure changes to.
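
The sketch below illustrates that robustness under simplifying assumptions. It keys on the property attribute rather than on element names or positions, so two differently structured layouts yield the same data. The PropertyCollector class, the extract helper, the jobTitle property and both layouts are invented for illustration, and the collector assumes each annotated element contains only text.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect the text of every element carrying a property attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self.current = None  # property whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self.current = prop
            self.values[prop] = ""

    def handle_data(self, data):
        if self.current:
            self.values[self.current] += data

    def handle_endtag(self, tag):
        self.current = None


def extract(html):
    collector = PropertyCollector()
    collector.feed(html)
    collector.close()
    return {prop: text.strip() for prop, text in collector.values.items()}


old_layout = '<div><h1 property="name">Bob</h1><p property="jobTitle">Engineer</p></div>'
new_layout = (
    '<table><tr><td><b property="jobTitle">Engineer</b></td>'
    '<td><span property="name">Bob</span></td></tr></table>'
)

print(extract(old_layout) == extract(new_layout))  # True
```

A scraper written against the first layout using element paths (for example "the first h1 inside the div") would return nothing for the second layout, whereas the annotation-based lookup is unaffected.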

Secondly, web scraping is a fairly controversial topic and legally a gray area, and many websites take action to explicitly prevent themselves being scraped (the terms and conditions of a particular site should mention this). If, on the other hand, a website allows itself to be scraped and actively encourages users to pull data down from it, then RDFa is a great approach they can use to make this easy for both themselves and their customers - it allows them to play an active role in helping their customers to scrape data in a future-proof way.

Relationship to Named Entity Recognition

Named Entity Recognition (NER) and adjacent techniques such as Relationship Extraction are methods of recognizing semantic meaning within unstructured content.

The methods used to perform NER vary depending on the document you're trying to extract data from, but if you happen to be extracting entities and relationships from a document that has been annotated with RDFa then the process becomes much simpler. In these scenarios, you'd use the term "extraction" rather than "recognition" as there is no need to recognize anything - the data provider has explicitly pointed out the data for you and told you all about what the relationships are.

On the other hand, if the document you are working with is completely unstructured and the provider hasn't given you any semantics within the document, then the task becomes a whole lot harder. You are still technically "extracting information" from a document, but you are doing so via methods you bring to the table yourself rather than ones that the document provider laid out for you. Unfortunately, in the current data climate, this is going to be the case for most content.

Unlike extracting facts from documents via RDFa attributes, the output of most NER operations is a "prediction" and therefore prone to error (it might be a very good prediction, but it is a prediction all the same). Techniques for performing NER on unstructured data range from regular expressions all the way through to machine learning and word vectors, along with various other NLP techniques.
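
At the simplest end of that range, a regular-expression approach might look like the sketch below. The patterns and the sample sentence are illustrative assumptions rather than a recommended NER method; the point is that the results are guesses, and one of them is wrong.

```python
import re

text = "He first appeared in Amazing Fantasy in 1962."

# Guess: a run of capitalised words is probably a named entity.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*")
# Guess: a four-digit number starting with 19 or 20 is probably a year.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

candidate_names = NAME_PATTERN.findall(text)
candidate_years = YEAR_PATTERN.findall(text)

print(candidate_names)  # ['He', 'Amazing Fantasy']
print(candidate_years)  # ['1962']
```

The pattern correctly picks out "Amazing Fantasy" but also reports "He", which is a false positive caused by sentence-initial capitalisation. With RDFa annotations in place, no such guessing would be needed.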
