Wrapper Authoring Tutorial - ecologylab/BigSemanticsWrapperRepository GitHub Wiki

The following is a work in progress covering basic tasks in authoring meta-metadata wrappers.

Table of Contents

Overview
Setup
Your First Wrapper
Selectors
Extracting Scalar Fields
Extracting Composite Fields
Extracting Collections

Overview

Wrappers conjoin information about different aspects of metadata: data structure, extraction rules, semantic actions, and presentation rules.

BigSemantics comes with a repository of wrappers for popular websites, including Google Search, Amazon products, and IMDB movies. However, there are still millions of websites that are not addressed by the existing wrappers, and you may want to use semantic data from one of them. To do that, you can author new wrappers for the website.

The process of writing a wrapper is usually iterative, adding pieces of information step by step: first determining what type to reuse and defining the data structure, then attaching extraction rules, then applying semantic actions and presentation rules if necessary.

Setup

We assume you have already downloaded the Wrapper Repository and have set up the Wrapper Dev Assist.

We recommend using the latest version of Chrome to find xpaths for the information to be extracted. We also recommend installing a utility like the Quick Javascript Switcher to easily disable JavaScript, as our current extraction methods don't support information created by JavaScript.

You should be familiar with the basics of XML and xpaths.

Your First Wrapper

For this tutorial, we are going to make a wrapper for microcenter.com, an electronics store (due to potential changes in site layout, the xpaths used here may not work).

Before we get started on your first wrapper, you should be aware that each wrapper has a type. These types allow for inheritance between wrappers, and they're part of our ongoing effort to provide a cohesive ontology for our metadata. Our wrapper will be extracting information on microcenter's products, so it makes sense for us to inherit from the already-made product class in productsAndServices.xml. For more details on inheritance, see inheritance.

Open up ./BigSemanticsWrapperRepository/BigSemanticsWrappers/MmdReposiory/mmdrespository/repositorySources/productsAndServices.xml in your preferred .xml editor (we use Eclipse).

There are a lot of wrappers in there, but for now focus on the first tag:

<meta_metadata_repository name="product_and_service"
  package="ecologylab.bigsemantics.generated.library.product_and_service">

It may look intimidating, but all it does is declare that it contains metadata wrappers in a repository named "product_and_service" and name the package it's in. Every metadata wrapper must be in a repository, and each repository must declare itself to be in a package. See metadata repositories for details on how to create your own.

Now it's time to add your own wrapper! Add

<meta_metadata name="microcenter_product" extends="product" parser="xpath">  
</meta_metadata>

inside the <meta_metadata_repository> tag (but not inside another meta_metadata tag). This meta_metadata tag is essentially an individual metadata wrapper. Its first attribute is self-explanatory: we must give the wrapper a name. The second, extends="product", means that the type microcenter_product inherits all fields from product, and it can also add new fields.
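
As a rough sketch of the placement described above, the new wrapper sits directly inside the repository tag, as a sibling of the wrappers already in the file. The comment stands in for those existing wrappers, which are not reproduced here.

```xml
<meta_metadata_repository name="product_and_service"
  package="ecologylab.bigsemantics.generated.library.product_and_service">

  <!-- Existing wrappers, such as product, stay where they are. -->

  <!-- New wrapper: a direct child of the repository tag,
       not nested inside another meta_metadata tag. -->
  <meta_metadata name="microcenter_product" extends="product" parser="xpath">
  </meta_metadata>

</meta_metadata_repository>
```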

Let's see what it gets from product. Product includes a scalar field (a field with one value) for a model and a collection called detailed specifications. Product also extends from commodity and compound_document. Without listing every field microcenter_product inherits, we'll focus on the key ones:

title - The name of your product. Inherited from document via compound_document.
description - A description or summary of the product. Inherited from document via compound_document.
price - The price of a product.
detailed_specifications - This collection will enable you to extract detailed information on a product.

Selectors

Now that we've made a wrapper, we need to tell the service when to use it. We do this by adding selectors to our wrapper. Selectors match certain URLs based on domains, regex, and more.

For the microcenter site, we see that URLs for products are of the form http://www.microcenter.com/product/'product number'/'product name'. Because of this, we will use the selector url_path_tree, which matches all URLs that begin with the same path.

The form for using a url_path_tree selector is

<selector url_path_tree="path"/>

In this case, add the following to the microcenter_product wrapper:

<selector url_path_tree="http://www.microcenter.com/product/"/>
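
Assuming the selector is placed as a direct child of the wrapper (the text above only says that selectors are added to the wrapper), the wrapper at this point might look like this:

```xml
<meta_metadata name="microcenter_product" extends="product" parser="xpath">
  <!-- Matches every URL that begins with this path -->
  <selector url_path_tree="http://www.microcenter.com/product/"/>
</meta_metadata>
```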

Extracting Scalar Fields

Since we already have some great fields, let's talk about how to get information into them! Most data extraction is done with xpaths.

Open up http://www.microcenter.com/product/433221/MacBook_Air_MD711LL-B_116_Laptop_Computer_-_Silver in Chrome. Activate the web inspector by pressing CTRL+SHIFT+I. We are going to find ourselves a title.

Note that the xpaths used in this tutorial may no longer work if the website is changed.

Click on the small magnifying glass in the upper-left corner of the web inspector and then click on the title. As you can see in the elements panel, it is stored in a div with the attribute 'itemprop' equal to 'name'. Using the Chrome Web Inspector's console, test out entering $x("//div[@itemprop='name']"). As of May 28th, 2014, the xpath returns exactly one element, which is precisely how many we need.

To include these extraction rules in the wrapper, add

<scalar name="title">
  <xpath>//div[@itemprop='name']</xpath>
</scalar>

to the microcenter_product meta_metadata tag.

To test whether our title extraction works, save the productsAndServices.xml file, launch Assist App, and press "Update Backend With New Wrappers". Then, load http://localhost:8080/interactiveSemantics/testLocal.html in your browser. Copy and paste the URL for the microcenter product page we've been using into the search box and press "Show Metadata".

You should see that our title has been properly extracted and that an auto-generated description is available.

Adding the next two fields, description and price, works similarly. For fields inherited from an ancestor, the process is as simple as:

Find the element on the page.
Inspect the element.
Make an xpath to the element.
Add a field in the meta_metadata wrapper.
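
Following the steps above, the rules for description and price might look like the sketch below. Both xpaths are hypothetical placeholders rather than values taken from the live page; check each one in the console with $x(...) before relying on it.

```xml
<!-- Inherited fields: no scalar_type is needed, only an extraction rule. -->
<scalar name="description">
  <!-- Hypothetical xpath: replace with the one found in the inspector -->
  <xpath>//div[@itemprop='description']</xpath>
</scalar>
<scalar name="price">
  <!-- Hypothetical xpath: replace with the one found in the inspector -->
  <xpath>//span[@itemprop='price']</xpath>
</scalar>
```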

In cases where you want to add data for which no field has been defined by an ancestor, you must additionally add a scalar_type. The MacBook Air product page states that it is "Available for In-Store Pickup Only.", but no current field represents this inventory data. We can easily represent this by adding the field:

<scalar name="inventory_status" scalar_type="String">
  <xpath>//p[@class='inventory']</xpath>
</scalar>

We use the attribute scalar_type="String" because inventory status is stored as a String. You may also use int and float as scalar_type values. See scalar types for complete documentation.

In this case, we see that inventory_status may be useful for other products, not just microcenter's. Adding the field

<scalar name="inventory_status" scalar_type="String"/>

to the product meta_metadata wrapper will benefit future meta_metadata authors and application developers. Once added, you may remove the attribute 'scalar_type="String"' from the inventory_status tag in the microcenter wrapper.
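
After that change, the two declarations would divide the work roughly as sketched here. These are fragments that go inside the two existing wrappers; the rest of each wrapper is omitted.

```xml
<!-- Added inside the existing product wrapper:
     declares the field and its type once for all descendants. -->
<scalar name="inventory_status" scalar_type="String"/>

<!-- Inside microcenter_product: the type is now inherited,
     so only the extraction rule remains. -->
<scalar name="inventory_status">
  <xpath>//p[@class='inventory']</xpath>
</scalar>
```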

Extracting Composite Fields

Unlike simple scalar fields, composite fields have a type that must be defined by another meta_metadata wrapper. Composites are much more powerful, and provide the ability to group related information together.
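
As a minimal sketch only: the form below assumes a composite tag with a type attribute, modeled on the collection form shown in the next section, and it is not confirmed by this tutorial. The field name manufacturer is hypothetical, and compound_document is used as the type simply because it is known to have a title.

```xml
<!-- Hypothetical composite field; tag and attribute names are assumptions. -->
<composite name="manufacturer" type="compound_document">
  <!-- Root xpath locating the element that holds the grouped information -->
  <xpath>root xpath</xpath>
  <!-- Scalars of the composite's type, extracted relative to the root -->
  <scalar name="title">
    <xpath>some relative xpath</xpath>
  </scalar>
</composite>
```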

Extracting Collections

Collections allow us to extract multiple scalars or composites with similar xpaths at once. Each collection must specify either a child_type or child_scalar_type, depending on whether its children are composites or scalars.

The microcenter product has a list of key features. Although we could manually create fields for "key_feature_1", etc., it's much faster and more flexible to use a collection.

First, identify the xpath all the collection's elements have in common. The number of results returned by this xpath will be the number of child elements included in the collection.

The xml for this example is quite simple:

<collection name="key_features" child_scalar_type="String">
  <xpath>//div[@itemprop='description']/ul/li</xpath>
</collection>

It extracts the text it finds at all nodes returned from the xpath and puts each string in a scalar in the collection.

Collections of composites are extracted similarly to composites. They are of the form:

<collection name="collection_name" child_type="some_type">
  <xpath>root xpath</xpath>
  <scalar name="some_scalar_in_the_type">
    <xpath>some relative xpath</xpath>
  </scalar>
</collection>

For this tutorial, we will focus on using the detailed_specifications collection inherited from product. As you can see, the microcenter product contains a list of tables, each of which contains label/value pairs. The detailed_specifications collection is designed to extract information in this (fairly common) form.

detailed_specifications contains elements of the type product_specs, which is defined as follows:

<meta_metadata name="product_specs" extends="compound_document" parser="xpath">
  <collection name="specifications" child_type="compound_document"/>
</meta_metadata>

Essentially, detailed_specifications is a collection of collections of compound documents, and each compound document has a title and description.
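
To make that nesting concrete, here is a structural sketch with placeholder xpaths. It assumes that child types do not need to be restated because both collections are inherited, and that each label maps to title and each value to description; neither assumption is confirmed above.

```xml
<!-- Outer collection: one product_specs element per table (assumed) -->
<collection name="detailed_specifications">
  <xpath>xpath matching each table</xpath>
  <!-- Inner collection: one compound_document per label/value pair (assumed) -->
  <collection name="specifications">
    <xpath>relative xpath matching each row</xpath>
    <scalar name="title">
      <xpath>relative xpath to the label</xpath>
    </scalar>
    <scalar name="description">
      <xpath>relative xpath to the value</xpath>
    </scalar>
  </collection>
</collection>
```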

Be aware that the following xpaths are quite complicated; if you are uncomfortable with them, focus on how the fields from detailed_specifications are included rather than on the xpaths themselves.