Tutorial: Data Model Authoring in Meta-Metadata

Note: This is one of a series of tutorials for authoring meta-metadata wrappers.

Table of Contents

Overview
Information Resources and Types
Adding a Repository File
Adding a Wrapper
Adding Fields

Overview

Wrappers use the Meta-Metadata language to specify integrated information about different aspects of metadata: data model (data structures), extraction rules, semantic actions, and presentation rules. Each wrapper tells what the data structure is, how to form instances of it from web pages and web services, and how to present the metadata to users.

BigSemantics comes with a repository of wrappers for popular websites, including Google Search, Amazon products, and IMDB movies. However, there are still millions of websites that are not addressed by the existing wrappers, and you may want to use semantic data from one of them. To do that, you can author new wrappers for the website.

The process of writing a wrapper is usually iterative, adding pieces of information step by step: first determining what type to reuse and defining the data structure, then attaching extraction rules, then applying semantic actions and presentation rules if necessary. In this series of tutorials, we will go through this process and create a wrapper in the meta-metadata language for the site UrbanSpoon.

NOTE: You may need to delete mmdrepository/powerUser/urbanSpoon.xml and mmdrepository/repositorySources/restaurant.xml to prevent some conflicts. After getting through the tutorials, you can use git reset to revert local changes to get them back.

Information Resources and Types

There are many different types of information pages on UrbanSpoon: pages for individual restaurants, pages for places, pages for user profiles, and so on.

We call the data structure describing a restaurant or a user profile an information resource. It is essentially metadata about a real-world entity, just like a library catalog entry for a real book. The difference between restaurants and user profiles is captured by types: there could be one type for restaurants, which includes restaurant name, genre, cuisines, and ratings, and another type for user profiles, which includes name, photos, favorites, and reviews.

Inheritance

Information resource types are related to each other. Restaurants and hotels are different, yet they share many properties: location, price, service quality, reviews, and so on. Conceptually, each of them is a specific type of service; in other words, type "restaurant" and type "hotel" are both subtypes of type "service", which carries the common properties. This is where the idea of inheritance comes from.

Therefore, when creating a new wrapper, the first step is to determine the information resource type that your wrapper will address, and which base type it can inherit from. Determining the base type can help reuse fields already defined, and promote interoperability. For example, an application working with "service" can automatically work with "restaurant" or "hotel" since the latter two share common properties with "service".
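As a rough illustration of this hierarchy, the sketch below expresses it with the meta_metadata element and the extends attribute that are introduced later in this tutorial. The "service" and "hotel" types, all field names, and the repository name and package are hypothetical and chosen only for illustration. In the rest of this tutorial, "restaurant" extends compound_document directly rather than a "service" type.

```xml
<!-- Hypothetical sketch: type names, field names, repository name, and package are assumptions. -->
<meta_metadata_repository name="inheritance_example" package="ecologylab.bigsemantics.generated.library.tutorial.example">

  <!-- Base type holding the properties shared by all services -->
  <meta_metadata name="service" extends="compound_document" comment="A generic service">
    <scalar name="price_range" scalar_type="String" comment="Typical price range" />
    <scalar name="service_quality" scalar_type="String" comment="Overall service quality" />
  </meta_metadata>

  <!-- Each subtype inherits the fields of service and adds its own -->
  <meta_metadata name="restaurant" extends="service" comment="A restaurant">
    <scalar name="cuisine" scalar_type="String" comment="Main cuisine served" />
  </meta_metadata>

  <meta_metadata name="hotel" extends="service" comment="A hotel">
    <scalar name="room_count" scalar_type="Int" comment="Number of rooms" />
  </meta_metadata>

</meta_metadata_repository>
```

An application written against "service" would only rely on price_range and service_quality, so it could handle instances of either subtype.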

The best practice is to start with the lowest level of information page and work up. For our example with UrbanSpoon, we will start with restaurant information pages such as this one. Note that such a type can be generalized to restaurant pages on any site, whether or not they are hosted by UrbanSpoon. Therefore, it makes more sense to have a base "restaurant" type and derive our type for UrbanSpoon restaurants from it, in case we add wrappers for other restaurant sites later.

To gain an understanding of the inheritance structure of our meta-metadata, launch the Wrapper Dev Assist and navigate to http://localhost:8080/mmdOntoVis/metametadata_ontology.html .

Adding a Repository File

The first step is to create a new repository file in XML for your meta-metadata definitions, and place it in the mmdrepository/repositorySources package in the BigSemanticsWrappers project. All the files in this package constitute the BigSemantics wrapper repository. Alternatively, you can add your definitions to an existing file in that folder if one already exists for your information source. Here, let's create a file called urbanSpoon.xml in mmdrepository/powerUser.

Each repository file must contain a root element meta_metadata_repository. It must have the following attributes:

name - The name of the repository.
package - The default package to place classes generated by the MetaMetadataCompiler (which we will address later in this series of tutorials), for wrappers in this repository.

Now your urbanSpoon.xml will contain:

<meta_metadata_repository name="urban_spoon" package="ecologylab.bigsemantics.generated.library.tutorial.urbanspoon">

</meta_metadata_repository>

Adding a Wrapper

In a repository file, each wrapper is written as an element meta_metadata inside the root element. The element must have the following attributes:

name - A unique name for the wrapper.
parser - The type of parser to be used, either "direct" or "xpath". Use "direct" for sources represented by XML that can map directly to metadata, and "xpath" for HTML documents that need XPath expressions for extraction of metadata.

Besides these two required attributes, the following two attributes are also commonly used:

extends - The name of the base type, that is, another wrapper defined by a meta_metadata element. The current wrapper will build upon the base one, inheriting all of its fields. New fields can be further added to the current wrapper.
comment - A helpful comment to increase readability.

As we have discussed before, we will start with a "restaurant" type. We first define the meta_metadata element like this:

<meta_metadata name="restaurant" extends="compound_document" comment="The restaurant type"> 

</meta_metadata>

compound_document - A built-in base type for a URL addressable web document. It has basic fields such as title, location (the URL), and description.
name="restaurant" - A unique and informative name.
extends="compound_document" - This extends the compound_document type to inherit its fields.

Again, with this wrapper we are defining a generic type (or data structure) for all kinds of restaurants; urbanspoon.com is just one of many information sources that provide data of this type.

Because this is the generic restaurant type, the parser attribute is excluded -- different sources may use different extraction methods. It will be added later when we create a wrapper specific to UrbanSpoon.
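For contrast, a site-specific wrapper would carry the parser attribute. The following is only a sketch of what such a wrapper could look like: the name urban_spoon_restaurant is a hypothetical choice, and the extraction rules that an "xpath" wrapper needs are left out because they are covered later in this series.

```xml
<!-- Hypothetical sketch: the wrapper name is an assumption, not taken from the repository. -->
<meta_metadata name="urban_spoon_restaurant" extends="restaurant" parser="xpath"
  comment="Restaurant pages on UrbanSpoon">
  <!-- All fields are inherited from restaurant; extraction rules would be attached here. -->
</meta_metadata>
```

Keeping the parser on the derived wrapper leaves the generic "restaurant" type free to be reused by sources that need a different extraction method.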

Adding Fields

Now we will add some fields inside the wrapper "restaurant" we just created, to represent metadata for a restaurant. These fields will be things like the name of the restaurant, its user rating, an image of the restaurant, and so on.

Fields are defined using one of three elements:

scalar for scalar fields, such as strings, integers, or URLs.
composite for fields that contain an object with multiple fields. The data structure must be defined by a wrapper.
collection for fields containing multiple values, for example, a list of strings or a list of restaurants.

Frequently used attributes for each field type are described below:

for scalar:

name - A unique (within the object) name for the field.
scalar_type - The type of the data, like String, ParsedURL or Int.
comment - A helpful comment. This attribute is not required.

for composite:

name - A unique (within the object) name for the field.
type - The type of the field, defined by a wrapper.
comment - A helpful comment. This attribute is not required.

for collection:

name - A unique (within the object) name for the field.
child_type - The type of values contained within the collection. If values are scalars, use child_scalar_type instead.
comment - A helpful comment. This attribute is not required.
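To make the three element kinds concrete, here is a minimal sketch that uses each of them once, plus a collection of scalars. The wrapper name field_examples and every field name in it are hypothetical and exist only to show where each attribute goes.

```xml
<!-- Hypothetical sketch: the wrapper name and all field names are illustrative assumptions. -->
<meta_metadata name="field_examples" extends="compound_document" comment="Shows each kind of field">
  <!-- scalar: a single value of a scalar type -->
  <scalar name="opening_hours" scalar_type="String" comment="Opening hours as text" />
  <!-- composite: a nested object whose structure is defined by another wrapper -->
  <composite name="owner_page" type="compound_document" comment="Web page of the owner" />
  <!-- collection of objects: child_type names the wrapper for each element -->
  <collection name="related_pages" child_type="compound_document" comment="Related web pages" />
  <!-- collection of scalars: child_scalar_type is used instead of child_type -->
  <collection name="keywords" child_scalar_type="String" comment="Keywords describing the page" />
</meta_metadata>
```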

To decide which fields we want to gather, let's take a look at the restaurant page and see what is available and what could be useful. As a rule, it is better to capture a piece of information even if you are not sure it will be needed. If you are authoring a wrapper to be used with the BigSemantics Service, MICE, or the MetadataLoader, it is a good idea to disable JavaScript before loading the source page, because our current extraction method cannot access data created by JavaScript.

I have boxed in blue the information which I think would be good to have:

Restaurant name
Phone number
Website
A picture of the restaurant
Rating
Price of an average entrée
List of food genres
A map showing the restaurant's location

The complete wrapper for generic restaurants will look like:

<meta_metadata name="restaurant" extends="compound_document" comment="The restaurant class">
  <scalar name="phone" scalar_type="String" comment="Phone number of the restaurant" />
  <scalar name="pic" scalar_type="ParsedURL" hide="true" comment="A picture from the restaurant" />
  <scalar name="link" scalar_type="ParsedURL" comment="Link to the restaurant's website" />
  <scalar name="rating" scalar_type="String" comment="Rating of the restaurant" />
  <scalar name="price_range" scalar_type="String" comment="Price range of the restaurant" />
  <scalar name="map" scalar_type="ParsedURL" hide="true"
    comment="Map image of the restaurant's location or link to a directions page" />
  <collection name="genres" child_type="compound_document" comment="The genres of food offered" />
</meta_metadata>

Some key things to observe from this class:

The restaurant name is missing; that information will use the title field inherited from the compound_document class.
phone, rating, and price_range are of scalar type String.
pic, link, and map are of scalar type ParsedURL; each holds the actual URL of an image or website.
pic and map have the attribute hide="true", because the URLs of these images don't need to be displayed to the user.
genres is a collection, meaning it will hold a list of values. This is to accommodate restaurants with many food genres. Here, we use the built-in type compound_document for the elements without defining a dedicated wrapper; however, we can add a wrapper for this type later.
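If a dedicated genre type is added later, the change could look roughly like the sketch below, placed inside the repository file created earlier. The type name food_genre is hypothetical; the other restaurant fields stay as shown above and are omitted here for brevity.

```xml
<meta_metadata_repository name="urban_spoon" package="ecologylab.bigsemantics.generated.library.tutorial.urbanspoon">

  <!-- Hypothetical sketch: food_genre is an assumed name for a dedicated genre type. -->
  <meta_metadata name="food_genre" extends="compound_document" comment="A genre of food">
    <!-- The inherited title field can hold the genre name. -->
  </meta_metadata>

  <meta_metadata name="restaurant" extends="compound_document" comment="The restaurant class">
    <!-- Other fields as in the complete wrapper; only the collection changes. -->
    <collection name="genres" child_type="food_genre" comment="The genres of food offered" />
  </meta_metadata>

</meta_metadata_repository>
```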

Note: To use your newly defined wrapper in an application, you need to compile it into source code with the MetaMetadataCompiler. We will address this in later tutorials.

Hopefully now you have learned how a data structure is defined using a wrapper. The next tutorial will cover information extraction.