Tutorial: Data Model Authoring in Meta Metadata - ecologylab/BigSemanticsWrapperRepository GitHub Wiki
Note: This is one of a series of tutorials for authoring meta-metadata wrappers.
Wrappers use the Meta-Metadata language to specify integrated information about different aspects of metadata: data model (data structures), extraction rules, semantic actions, and presentation rules. Each wrapper tells what the data structure is, how to form instances of it from web pages and web services, and how to present the metadata to users.
BigSemantics comes with a repository of wrappers for popular websites, including Google Search, Amazon products, and IMDB movies. However, there are still millions of websites that are not addressed by the existing wrappers, and you may want to use semantic data from one of them. To do that, you can author new wrappers for the website.
The process of writing a wrapper is usually iterative, adding pieces of information step by step: first determining what type to reuse and defining the data structure, then attaching extraction rules, then applying semantic actions and presentation rules if necessary. In this series of tutorials, we will go through this process and create a wrapper in the meta-metadata language for the site UrbanSpoon.
NOTE: You may need to delete mmdrepository/powerUser/urbanSpoon.xml and mmdrepository/repositorySources/restaurant.xml to prevent conflicts. After finishing the tutorials, you can use git reset to revert your local changes and get them back.
There are many different types of information pages on UrbanSpoon: pages for individual restaurants, pages for places, pages for user profiles, and so on.
We call the data structure describing a restaurant or a user profile an information resource. It is essentially metadata about a real-world entity, just like a library catalog entry for a real book. Restaurants and user profiles are distinguished by type: there could be a type for restaurants that includes restaurant name, genre, cuisines, and ratings, and another type for user profiles that includes name, photos, favorites, and reviews.
- Inheritance
Therefore, when creating a new wrapper, the first step is to determine the information resource type that your wrapper will address, and which base type it can inherit from. Determining the base type can help reuse fields already defined, and promote interoperability. For example, an application working with "service" can automatically work with "restaurant" or "hotel" since the latter two share common properties with "service".
The best practice is to start with the lowest level of information page and work up. For our example with UrbanSpoon, we will start with individual restaurant information pages. Note that such a type generalizes to restaurant pages on other sites, not only those hosted by UrbanSpoon. Therefore, it makes more sense to have a base "restaurant" type and derive our type for UrbanSpoon restaurants from it, in case we write wrappers for other restaurant sites later.
To gain an understanding of the inheritance structure of our meta-metadata, launch the Wrapper Dev Assist and navigate to http://localhost:8080/mmdOntoVis/metametadata_ontology.html .
The first step is to create a new repository file in XML for your meta-metadata definitions, and place it in the mmdrepository/repositorySources package in the BigSemanticsWrappers project. All the files in this package constitute the BigSemantics wrapper repository. Alternatively, you can add your definitions to an existing file in that folder if one already exists for your information source. Here, let's create a file called urbanSpoon.xml in mmdrepository/powerUser.
Each repository file must contain a root element meta_metadata_repository. It must have the following attributes:
- name - The name of the repository.
- package - The default package to place classes generated by the MetaMetadataCompiler (which we will address later in this series of tutorials), for wrappers in this repository.
urbanSpoon.xml will contain:
<meta_metadata_repository name="urban_spoon" package="ecologylab.bigsemantics.generated.library.tutorial.urbanspoon">
</meta_metadata_repository>
In a repository file, each wrapper is written as an element meta_metadata inside the root element. The element must have the following attributes:
- name - A unique name for the wrapper.
- parser - The type of parser to be used, either "direct" or "xpath". Use "direct" for sources represented by XML that can map directly to metadata, and "xpath" for HTML documents that need XPath expressions for extraction of metadata.
- extends - The name of the base type, that is, another wrapper defined by a meta_metadata element. The current wrapper builds upon the base one, inheriting all of its fields. New fields can then be added to the current wrapper.
- comment - A helpful comment to increase readability.
For the generic restaurant type, we begin with a meta_metadata element like this:
<meta_metadata name="restaurant" extends="compound_document" comment="The restaurant type">
</meta_metadata>
- compound_document - A built-in base type for a URL-addressable web document. It has basic fields such as title, location (the URL), and description.
- name="restaurant" - A unique and informative name.
- extends="compound_document" - This extends the compound_document type to inherit its fields.
Because this is the generic restaurant type, the parser attribute is excluded -- different sources may use different extraction methods. It will be added later when we create a wrapper specific to UrbanSpoon.
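As a rough sketch of where this is heading, the repository file might eventually hold both the generic type and a site-specific wrapper nested inside the root element. The name urban_spoon_restaurant and its comment are assumptions made for illustration. Only attributes described above are used, and extraction rules are left for the later tutorials.

```xml
<meta_metadata_repository name="urban_spoon" package="ecologylab.bigsemantics.generated.library.tutorial.urbanspoon">

  <!-- Generic restaurant type: no parser attribute, since sources may differ. -->
  <meta_metadata name="restaurant" extends="compound_document" comment="The restaurant type">
  </meta_metadata>

  <!-- Hypothetical site-specific wrapper (name assumed for illustration).
       It inherits every field of restaurant and declares the parser for HTML pages. -->
  <meta_metadata name="urban_spoon_restaurant" extends="restaurant" parser="xpath" comment="A restaurant page on UrbanSpoon">
  </meta_metadata>

</meta_metadata_repository>
```

Keeping the parser on the derived wrapper means a wrapper for another restaurant site could extend restaurant in the same way and choose its own extraction method.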
Now we will add some fields inside the wrapper "restaurant" we just created, to represent metadata for a restaurant. These fields will be things like the name of the restaurant, its user rating, an image of the restaurant, and so on.
Fields are defined using one of three elements:
- scalar for scalar fields, such as strings, integers, or URLs.
- composite for fields that contain an object with multiple fields. The data structure must be defined by a wrapper.
- collection for fields containing multiple values, for example, a list of strings or a list of restaurants.
Attributes for scalar:
- name - A unique (within the object) name for the field.
- scalar_type - The type of the data, like String, ParsedURL or Int.
- comment - A helpful comment. This attribute is not required.
Attributes for composite:
- name - A unique (within the object) name for the field.
- type - The type of the field, defined by a wrapper.
- comment - A helpful comment. This attribute is not required.
Attributes for collection:
- name - A unique (within the object) name for the field.
- child_type - The type of values contained within the collection. If values are scalars, use child_scalar_type instead.
- comment - A helpful comment. This attribute is not required.
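To make the three field kinds concrete, here is a small sketch with one field of each kind, plus a collection of scalars. The wrapper name example_place and all field names are hypothetical. compound_document is used as the composite and child type only because it is the one built-in type introduced so far.

```xml
<meta_metadata name="example_place" extends="compound_document" comment="Hypothetical type illustrating the three field kinds">
  <!-- scalar: a single value of a scalar type -->
  <scalar name="phone" scalar_type="String" comment="A phone number" />
  <!-- composite: a nested object whose structure is defined by another wrapper -->
  <composite name="homepage" type="compound_document" comment="The place's own web page" />
  <!-- collection of objects: child_type names a wrapper -->
  <collection name="related_pages" child_type="compound_document" comment="Other pages about this place" />
  <!-- collection of scalars: child_scalar_type is used instead of child_type -->
  <collection name="keywords" child_scalar_type="String" comment="Free-form keywords" />
</meta_metadata>
```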
Looking at a restaurant page on UrbanSpoon, the following information seems worth capturing:
- Restaurant name
- Phone number
- Website
- A picture of the restaurant
- Rating
- Price of an average entrée
- List of food genres
- A map showing the restaurant's location
<meta_metadata name="restaurant" extends="compound_document" comment="The restaurant class">
<scalar name="phone" scalar_type="String" comment="Phone number of the restaurant" />
<scalar name="pic" scalar_type="ParsedURL" hide="true" comment="A picture from the restaurant" />
<scalar name="link" scalar_type="ParsedURL" comment="Link to the restaurant's website" />
<scalar name="rating" scalar_type="String" comment="Rating of the restaurant" />
<scalar name="price_range" scalar_type="String" comment="Price range of the restaurant" />
<scalar name="map" scalar_type="ParsedURL" hide="true"
comment="Map image of the restaurant's location or link to a directions page" />
<collection name="genres" child_type="compound_document" comment="The genres of food offered" />
</meta_metadata>
Some key things to observe from this class:
- The restaurant name is missing; that information will use the title field inherited from the compound_document class.
- phone, rating, and price_range are of scalar type String.
- pic, link, and map are of scalar type ParsedURL; they hold actual URLs to an image or website.
- pic and map have the attribute hide="true", because the URLs of these images don't need to be displayed to the user.
- genres is a collection, meaning it will hold a list of values. This is to accommodate restaurants with many food genres. Here, we use the built-in type compound_document without further wrapping this type; however, we can add a wrapper for this type later.
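One way the genres field could be refined later is sketched below. The genre wrapper is hypothetical and not part of the repository described here. It only shows how a dedicated type could replace compound_document as the child_type.

```xml
<!-- Hypothetical wrapper for a food genre; it adds no fields of its own
     and relies on the title inherited from compound_document. -->
<meta_metadata name="genre" extends="compound_document" comment="A genre of food">
</meta_metadata>

<meta_metadata name="restaurant" extends="compound_document" comment="The restaurant class">
  <!-- other fields as defined above -->
  <!-- The collection now refers to the hypothetical genre type. -->
  <collection name="genres" child_type="genre" comment="The genres of food offered" />
</meta_metadata>
```

If a genre only ever needs a name, a collection of strings declared with child_scalar_type="String" would be a simpler alternative.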
Hopefully now you have learned how a data structure is defined using a wrapper. The next tutorial will cover information extraction.