GSoC 2017: Krishanu Konar progress
This page contains the weekly progress reports for my GSoC-2017 project on List-Extractor
Abstract:
This project aims to build upon the existing list-extractor project, created by Federica during GSoC 2016. The project focuses on extracting relevant but hidden data that lies inside lists on Wikipedia pages. Wikipedia, the world's largest encyclopedia, holds a humongous amount of information in the form of text. While key facts and figures are encapsulated in a resource's infobox, and some detailed statistics are presented as tables, a lot of data is also present in the form of lists, which are quite unstructured and therefore difficult to turn into semantic relationships. The main objective of the project is to create a tool that can extract information from Wikipedia lists and form appropriate RDF triples that can be inserted into the DBpedia dataset.
The blog posts related to my project can be found here.
Have Questions? Post your queries on the DBpedia support page here.
For a detailed explanation of List-Extractor, refer to the documentation in the docs folder. The sample generated datasets can be found here.
Architecture
![List-Extractor Architecture](images/List_Extractor_Architecture.jpg)
The Extractor has 3 main parts:

- Request Handler: Selects the resource(s) depending on the user's options and makes the corresponding resource requests to the JSONpedia service for list data.
- JSONpedia Service: Provides the resource's information in a well-structured JSON format, which the mapping functions use to form appropriate triples from the list data. Currently JSONpedia Live is used, which is a web service and is hence susceptible to being overloaded by a large volume of requests. One objective of this year's project is to overcome this bottleneck by using the JSONpedia library instead of the Live service.
- Mapper: The set of modules that consume the JSON received from the JSONpedia service and produce appropriate triples which can then be serialized. The first step is cleaning the JSON dictionary so that only meaningful list data remains. This data is then passed to a mapping selector module, which, using the mapping rules (formed in accordance with the DBpedia ontology), selects the mapper functions that need to be applied to the elements. The mapper functions then form the triples, which are serialized into an RDF graph. A minimal sketch of this stage follows below.
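To make the flow concrete, here is a minimal, self-contained sketch of the mapper stage. The rule table, helper names and the `map_lists` function are illustrative only; the real logic lives in `mapper.py` and `mapping_rules.py`.

```python
import rdflib

DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"

# Toy "mapping rules": section-header keyword -> ontology property.
RULES = {"bibliography": "notableWork", "discography": "notableWork"}

def map_lists(resource, sections):
    """sections: {header: [list items]}, already cleaned from the JSONpedia JSON."""
    g = rdflib.Graph()
    subj = rdflib.URIRef(DBR + resource)
    for header, items in sections.items():
        for keyword, prop in RULES.items():
            if keyword in header.lower():            # selector step
                for item in items:                   # mapper step
                    g.add((subj, rdflib.URIRef(DBO + prop), rdflib.Literal(item)))
    return g

# Example: one list item found under a "Bibliography" section header.
g = map_lists("William_Gibson", {"Bibliography": ["Neuromancer (1984)"]})
print(g.serialize(format="turtle"))
```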
Updated Deliverables
![Updated Deliverables](images/updated_deliverables.jpg)
Outcomes and Impacts:
- Better extraction for the existing domains (Writer, Actor).
- Many different built-in mappers added to extract triples from various domains, increasing the coverage of the list-extractor.
- A custom rules generator added, which allows users to add mapping rules and mapper functions themselves, making the extractor more scalable.
- This allows users to extract triples from any Wikipedia article, once they configure the mapping rules.
- Hence, the list-extractor becomes a much more generalized and user-friendly tool, creating more triples for the DBpedia dataset.
Progress Record
Community Bonding:
8th - 16th May:
Going over the existing code again to grasp the finer details of the extractor and understand its complete working.
17th - 24th May:
Currently exploring the possible new domains that can be added to the list-extractor.
Explored a few potential domains containing lists which could be added:
- Musical Artists
- Musical Bands
- Educational Institutes
- Written Work (Magazines, Newspapers etc.)
24th - 27th May:
Analysed the usage of the JSONpedia Live service in the existing code, in order to use the library programmatically instead of the live service. Wrote sample Java code that uses the JSONpedia library and emulates the results of the existing extractor code. Full integration to be done in later weeks.
Coding Period:
Week 1 (30th May - 4th Jun):
- Made slight tweaks to the existing code.
- Added a new method to clean some of the junk values that were observed during extraction.
- Also added a method that stores the statistical results of all extractions in a CSV for evaluation (see the sketch after this list).
- Added the `musicalArtist` domain to the existing code, which was already part of my GSoC warm-up task.
- As discussed with mentors, I'm currently looking at ways to make the list-extractor more scalable. I'll also look for potential problems in the existing code and improve it wherever required.
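A minimal sketch of how such per-extraction statistics might be appended to a CSV. The function name and column layout are illustrative; the actual evaluation file may differ.

```python
import csv
import os

def log_extraction_stats(csv_path, resource, total_elems, mapped_elems, triples):
    """Append one row of extraction statistics; writes a header on first use.
    (Illustrative only -- the real evaluation CSV columns may differ.)"""
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["resource", "list_elements", "mapped_elements", "triples"])
        writer.writerow([resource, total_elems, mapped_elems, triples])

# Example usage:
log_extraction_stats("evaluation.csv", "William_Gibson", 120, 95, 210)
```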
Week 2 (5th - 11th Jun):
- [5 Jun]: Added support for the German language in all 3 initial domains (Actor, Writer, MusicalArtist).
- [6-7 Jun]: After analysing many articles from different domains, I realised that while several domains have intersecting sections, a generalized template for them is not possible. So, I have changed my approach a bit.
- [8-9 Jun]: From now on, I'll focus on writing mapping functions that can extract list elements from a given section. Later, domains can be added to `mapping_rules.py`, including the various sections that might exist in the domain's articles.
  - Made a major change in the selection of mapper functions: it is now possible to add multiple mappers to a domain, effectively increasing the number of extracted elements and hence the accuracy (see the sketch after this list).
- [10 Jun]: Added support for the Spanish language in all 3 initial domains (Actor, Writer, MusicalArtist).
- Another task for next week is discussing an approach with Luca for potentially coming up with a common template/mapping rules, to make a more effective and scalable extractor.
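The multi-mapper selection described above could look roughly like this. The rule tables and keyword lists are made up for illustration; the real structures in `mapping_rules.py` differ.

```python
# A domain now maps to a *list* of mapper names instead of a single one.
MAPPING = {
    "MusicalArtist": ["DISCOGRAPHY", "CONCERT_TOURS", "HONORS"],
    "Writer": ["BIBLIOGRAPHY", "HONORS"],
}

# Each mapper advertises the section-header keywords it can handle.
MAPPER_KEYWORDS = {
    "DISCOGRAPHY": ["discography", "albums"],
    "CONCERT_TOURS": ["tours"],
    "HONORS": ["awards", "honors", "honours"],
    "BIBLIOGRAPHY": ["bibliography", "works"],
}

def select_mapping(domain, section_header):
    """Return *all* mappers applicable to a section, not just the first."""
    header = section_header.lower()
    return [name for name in MAPPING.get(domain, [])
            if any(kw in header for kw in MAPPER_KEYWORDS[name])]

print(select_mapping("MusicalArtist", "Awards and honors"))  # ['HONORS']
```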
Week 3 (12th - 18th Jun):
- [12-13 Jun]: Adding the `EducationalInstitution` domain.
  - Added the extractor's architecture details to the Wiki (should've done that a lot earlier).
- [14-15 Jun]: Looking at different domains within `Person` in order to generalize the extractor to work on this superclass.
  - Added a few more section headers to include other subclasses like `Painter`, `Architect` etc.
  - Changed various functions to support generalized domains (e.g. year_mapper, role etc.); the extractor now captures all the years in which a person won the same award/honor.
  - Manually went through ~50 wiki pages of subdomains to find new potential sections containing lists: `Architect`, `Astronaut`, `Ambassador`.
  - Work halted because dbpedia.org was under maintenance [14 Jun].
  - Added general Career and Works mappers for the domain; analysed more subdomains: `Athlete`, `BusinessPerson`, `Chef`, `Celebrity`, `Coach`.
- [16-17 Jun]: Had a meeting with Luca to discuss ways to merge the mapping rules of the list- and table-extractor projects; another meeting is scheduled next week after discussing with mentors.
  - Finished writing mapper functions for the `EducationalInstitution` domain.
  - Started looking at the `PeriodicalLiterature` domain; added some rules for `Magazines`.
Week 4 (19th - 24th Jun):
- [19-20 Jun]: Started working on mapper functions for `PeriodicalLiterature`.
  - Re-wrote `year_mapper()` to also extract months (if present) along with the dates, and to try to extract the time period covered by a particular element (start date - end date); a sketch of the idea follows after this list.
- [21 Jun]: Finished mappers and rules for `PeriodicalLiterature`; tested on `Magazines`, `Newspapers` and `AcademicJournals`.
- [22 Jun]: Merged all progress into master; this is now the most recent stable running version.
  - I'll now start merging the different `*_mapper()` functions into more generalised mapping functions, trying to reduce redundant code and make the whole structure more general. I'll also merge my newly created/modified functions with Federica's existing ones and restructure the code wherever required.
  - Replaced the existing `year_mapper` with the new mapper in each module; added the newly written `quote_mapper` resource extractor to the URI-extracting process.
- [23 Jun]: Optimised the code a bit.
  - Improved the awards mapper a bit to differentiate honorary degrees.
  - As discussed with Luca, I'll now start working on a module that creates a new settings file and allows the user to select the mapping functions used for a domain during extraction. This will increase support for unmatched domains.
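A rough, runnable sketch of the re-written `year_mapper()` idea, including the optional ontology-override dictionary added later in Week 9. Month capture is elided for brevity, and the property names are examples; the real function in `mapper.py` handles many more formats and languages.

```python
import re

# Matches "(1984-1990)", "(May 1984 - June 1990)", "(2001-present)" ...
PERIOD_RE = re.compile(r"\((?:\w+\s+)?(\d{4})\s*[-–]\s*(?:\w+\s+)?(\d{4}|present)\)")
# Matches "(1984)" or "(May 1984)".
YEAR_RE = re.compile(r"\((?:\w+\s+)?(\d{4})\)")

def year_mapper(list_item, ontology_overrides=None):
    """Return (property, value), where value is a single year or a
    (start, end) period. `ontology_overrides` mirrors the optional dict
    added in Week 9, e.g. {"year_property": "releaseYear"} for albums."""
    prop = (ontology_overrides or {}).get("year_property", "activeYear")
    m = PERIOD_RE.search(list_item)
    if m:
        return prop, (m.group(1), m.group(2))     # time period
    m = YEAR_RE.search(list_item)
    if m:
        return prop, m.group(1)                   # single year
    return None

print(year_mapper("The Sprawl trilogy (1984–1988)"))  # ('activeYear', ('1984', '1988'))
```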
Eval Week & Week 5 (26th Jun - 7th Jul):
- [Eval week]: Came up with a structure and a plan for how to implement the rulesGenerator, and assessed its impact on the current code.
  - Wrote the skeleton of the rulesGenerator.
- [1-3 Jul]: Coded up the prototype of the rulesGenerator.
  - Will now run the generator and analyse how it can be fully integrated into the project.
  - Read the remarks on my first evaluation; as suggested, I'll try to make the proposal more specific with timetables and outcomes.
- [4-5 Jul]: Added the newly structured mapping rules to the list extractor.
  - The extractor can now accept an optional command-line argument to select the class of mapper functions.
  - Both `listExtractor.py` and `rulesGenerator.py` can now work together using/updating the `settings.json` file; both programs reload the settings after any change to `settings.json` to remain up to date.
- [6-7 Jul]: Started working on custom mappers using rulesGenerator.py.
  - Completed the interface for accepting a mapper-function dictionary.
  - Completed the functionalities of `rulesGenerator.py`.
  - Next week, I'll code up a general mapper function in `mapper.py` that can use the `settings.json` and `custom_mappers.json` files to create a totally user-defined list-extractor module! (A hypothetical sketch of these two files follows below.)
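For illustration, the two files might be shaped roughly like this, shown as Python literals. The key names are hypothetical; the real schema produced by `rulesGenerator.py` may differ.

```python
# settings.json: domain -> list of mapper functions to apply.
settings = {
    "MAPPING": {
        "CUSTOM_MUSICAL_ARTIST": ["CUSTOM_ARTIST_MAPPER"],
        "CUSTOM_WRITER": ["CUSTOM_BIBLIOGRAPHY_MAPPER"],
    }
}

# custom_mappers.json: user-defined mapper settings that
# map_user_defined_mappings() interprets at run time.
custom_mappers = {
    "CUSTOM_ARTIST_MAPPER": {
        "headers": {"en": ["discography", "concert tours"]},
        "ontology_property": "notableWork",
        "extract_years": True,
    }
}
```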
Week 6 (10th - 15th Jul):
- [10-11 Jul]: Started working on `map_user_defined_mappings()`, which emulates the mapper functions using the properties present in `custom_mappers.json`, generated with `rulesGenerator.py` (a sketch follows after this list).
- [12-13 Jul]: Integrated all the rules and methods written so far with the main list-extractor tool.
  - Ran unit tests with the user-generated rules and mappers.
  - [Testing single resources]: Tested the existing `MusicalArtist` and `Writer` domains with the user-defined `CUSTOM_MUSICAL_ARTIST` and `CUSTOM_WRITER` classes, which use `map_user_defined_mappings()` to extract triples. These classes use the user-defined `CUSTOM_ARTIST_MAPPER` and `CUSTOM_BIBLIOGRAPHY_MAPPER` settings defined in `custom_mappers.json`.

    ```
    python listExtractor.py s Taylor_Swift en -c CUSTOM_MUSICAL_ARTIST
    python listExtractor.py s William_Gibson en -c CUSTOM_WRITER
    ```
  - [Testing new domains]: Added mapping rules for the `MusicGenre` domain using `rulesGenerator.py`, and used `MUSIC_GENRE_MAPPER` as one of the mapper functions for the domain.

    ```
    python listExtractor.py a MusicGenre en
    ```
- Next week, I'll start working on using JSONpedia as a library instead of the JSONpedia Live web service.
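A hypothetical sketch of `map_user_defined_mappings()`, interpreting a mapper entry shaped like the `CUSTOM_ARTIST_MAPPER` literal shown under Week 5. The real implementation in `mapper.py` also resolves resource URIs and years.

```python
import rdflib

DBO = "http://dbpedia.org/ontology/"

def map_user_defined_mappings(graph, subject, header, items, mapper_settings):
    """Apply one user-defined mapper to a section's list items, driven purely
    by the JSON settings (illustrative sketch, not the real implementation)."""
    if not any(h in header.lower() for h in mapper_settings["headers"]["en"]):
        return 0                                   # section not covered by this mapper
    prop = rdflib.URIRef(DBO + mapper_settings["ontology_property"])
    for item in items:
        graph.add((subject, prop, rdflib.Literal(item)))
    return len(items)
```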
Week 7 (17th - 22nd Jul):
- [17 Jul]: Started working on the JSONpedia integration.
  - Writing a Java program that takes as command-line input the resource, filters and extractors that the JSONpedia library requires.
- [18 Jul]: Took a day off for personal reasons.
- [19-20 Jul]: Completed writing the wrapper for the JSONpedia library.
  - The wrapper takes the language, resource name, processors and filters as command-line input and makes the related JSONpedia calls. The results are printed on stdout.
  - The main idea is to pipe the output of this wrapper into the list-extractor and use `json.loads()` to turn it into a valid Python dictionary that can be processed by the list extractor.
- [20-21 Jul]: Started writing the Python modules for utilizing the wrapper.
  - Need to re-write `WikiParser.json_convert()` and `WikiParser.find_page_redirects()` to use the wrapper.
  - Will complete a crude implementation soon.
Week 8 (Eval week) (24th - 29th Jul):
- Completed the implementation of `WikiParser.json_convert()` and `WikiParser.find_page_redirects()` using the JSONpedia wrapper.
- A problem was encountered while testing the modules: the wrapper (or the JSONpedia library) wasn't working properly.
- The output of the JSONpedia Live service and of the JSONpedia library had a minor difference, explained below.

The expected output after applying the required filters:

```
{
  "filter": "object_filter(@type=section,)>null",
  "result": [{ ......
.......
}
```

Instead, the following output was observed:

```
{
  "filter": object_filter(@type=section,)>null,
  "result": [{ ......
.......
}
```
- The missing quotes around the `filter` value break the JSON syntax and hence break the current code. A defensive workaround is sketched below.
- I contacted Michele, the creator of JSONpedia, for support on this issue; we'll look at the problem over the weekend.
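Until the upstream fix landed, a defensive workaround on the consumer side was possible: quote the bare `filter` value before parsing. This is a sketch assuming the output shape shown above; the proper fix went into JSONpedia itself.

```python
import json
import re

def parse_jsonpedia_output(raw):
    """Quote the unquoted "filter" value on its line (if any) so that
    json.loads() accepts the text. Already-valid output passes through
    unchanged, since the pattern requires an unquoted value."""
    fixed = re.sub(
        r'("filter":\s*)([^"\n].*?)(,?\s*)$',
        lambda m: m.group(1) + json.dumps(m.group(2)) + m.group(3),
        raw, count=1, flags=re.MULTILINE)
    return json.loads(fixed)
```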
Week 9 (31 Jul - 5 Aug):
- Looked into the JSONpedia issue and found a bug in the JSONpedia filters.
- Wrote a fix for the filter problem and sent a pull request with the fix to JSONpedia.
- Ran the list-extractor with the wrapper and completed the preliminary tests successfully.
- Made a small improvement to the time-period extractor: added an extra optional dictionary parameter so that the user can pass the correct ontology property in special cases (like `releaseYear` instead of `activeYear` for movies or musical albums).
- Created and finalized the ontology classes/properties that the extractor will use to form triples (pending approval).
- Started extracting the dataset for `musicalArtist`.
- Started the documentation.
Week 10 (7 - 11 Aug):
- Completed internal documentation for the whole project.
- Fixed minor bugs present in the code.
Week 11 (14 - 19 Aug):
- Started creating Sphinx documentation for the list extractor.
- Modified the internal documentation for use with Sphinx.
- Created sample datasets for results and finalized the project.
Notes on Issues/Improvements required:
- Honors/awards are too unstructured to be generalised; improvement is needed to increase the efficiency/correctness of these triples. [solved]
- Honorary degrees; differentiate them? [Yes, differentiated]
- Writing new mapping rules in `mapping_rules.py` breaks the structure; create a separate file for these? [new `settings.json` created]
- Time periods are present in a lot of elements; write a better `year_mapper()`? [new `year_mapper()` created]
- `Athlete` achievements are much different from other classes of `Person`; they might require a separate mapping function.
- `dbr:property` values for mappings need to be added/improved to match the existing DBpedia ontology.
Changes Summary:
- Improved extraction; changed `select_mapping()` to support handling multiple sections.
- Evaluation method added.
- Pre-defined domains now include:
  - Person: `Writer`, `Actor`, `MusicalArtist`, `Athlete`, `Politician`, `Manager`, `Coach`, `Celebrity` etc.
  - EducationalInstitution: `University`, `School`, `College`, `Library`
  - PeriodicalLiterature: `Magazines`, `Newspapers`, `AcademicJournals`
  - Group
  - Organisation
- Year mapper added; extracts years (if present) from any list.
- A new tool, the Rules Generator, added; users can now add user-defined mapping rules for different domains.
- User-defined mapper functions added; users can add customised mapper functions.
- JSONpedia Live web-service dependency removed; library support added.
- Fixed a minor bug in the JSONpedia library itself; submitted a pull request for the same.
- New sample datasets generated.
Results:
Following are some sample results from using the extractor on different domains. More detailed evaluation statistics are present in `evaluation.csv`.

| Topic & Language | # Resources | # Statements | Evaluation Accuracy |
|---|---|---|---|
| Actors (2016) | 6,621 | 110,797 | 77% |
| Actors (2017) | 6,606 | 134,013 | 79.08% |
| Musical Artist | 52,759 | 1,340,800 | 75.77% |
| Band | 34,505 | 867,984 | 84.57% |
| University | 20,343 | 250,167 | 49.29% |
| Newspapers | 6,861 | 17,546 | 52.37% |
For a comparative analysis, we can look at the `Actors` dataset, for which results are available from the previous year. The accuracy of the extractor has improved (accuracy is defined as the ratio of list elements that successfully contributed to the generation of a triple to the total number of list elements present). We also see that, despite there being fewer resources than the previous year, the list extractor was able to generate about 23k more triples from the same domain.
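In code form, the reported accuracy is just this ratio. The helper and the counts below are made up purely to show the arithmetic; the real counters are collected during extraction.

```python
def evaluation_accuracy(mapped_elements, total_elements):
    """Share of list elements that contributed to at least one triple,
    as defined above."""
    return 100.0 * mapped_elements / total_elements

# Made-up counts, purely to show the arithmetic behind a figure like 79.08%:
print(round(evaluation_accuracy(7908, 10000), 2))  # -> 79.08
```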
This can be due to many factors. One of them could be people adding new list entries to the Wikipedia resources, causing the number to increase; this, of course, cannot be influenced by us. From a programmer's perspective, the major additions in this year's project were the new `year_mapper()`, which helped extract time periods from the list elements, and the changed `select_mapping()` method, which previously allowed only one mapper function per domain. The newer version of `select_mapping()` allows several mapping functions to be used with a single domain, allowing more sections to be considered for extraction and, consequently, creating more triples from the existing list elements.
These are crude results and would definitely improve, for the following reason: dataset generation with this tool requires a continuous, uninterrupted internet connection to work properly. During the creation of the final few datasets, I faced many connection problems (which were beyond my control), and hence many resources were not processed and were skipped. Generating these datasets again with a stable internet connection might improve the performance by ~5-10%. Also, the accuracy is quite low for the latter domains, mainly because most of the resources within them are very unstructured and hence very difficult to extract from. The accuracy can be improved by improving the domain knowledge: going through several Wikipedia pages of the domain, understanding the underlying structure, and refining the mapping rules accordingly.
Goals and Challenges
There were 3 main goals as proposed in my GSoC proposal:
- Creation of new datasets.
- Making the extractor more scalable, so that users can easily add their own rules and extract triples from different domains.
- Removing the JSONpedia Live Service bottleneck by integrating the existing JSONpedia library with the list-extractor.
All three goals were achieved by the project (at least to some extent).
New sample datasets were created for domains like `MusicalArtist`, `Actor`, `Band`, `University`, `Magazine`, `Newspaper` etc. All the sample datasets created with the list-extractor combined were generated by processing about 1.3 million list elements, producing about 2.8 million triples.
The extractor was also made more scalable: several more common mapper functions were added, and the selection of mapper functions was made more flexible for every domain by shifting the `MAPPING` dict to `settings.json` and allowing multiple mapper functions for a single domain. But a bigger impact on scalability came from the creation of the `rulesGenerator`, which now allows users to create their own mapping rules and mapper functions from an interactive console program, without having to write any code! A sample domain, `MusicGenre`, was used to test the rulesGenerator, and the results/datasets are also present. Although the domain did not have much information that could be extracted, this still demonstrated the ability of the rulesGenerator, a tool usable by people who are not programmers or who don't know much about the inner workings of the extractor, to generate triples and produce decent results.
The third goal was also achieved. The dependency on the JSONpedia Live service was removed, and the JSONpedia library is now used for obtaining the JSON representation of a resource. This was achieved by writing a wrapper (`jsonpedia_wrapper.jar`) around the actual JSONpedia library, so that it can be driven easily by the list-extractor. The JSONpedia wrapper is a command-line program that takes some command-line parameters and outputs the retrieved JSON. The wrapper can be run on its own using the following command:

```
java -jar jsonpedia_wrapper.jar -l [language] -r [resource_name] -p [processors] -f [filters]
```

So the list-extractor simply forks another process that runs the JSONpedia wrapper with the parameters provided by the list-extractor, and the output is piped back to the list-extractor, which converts it to a Python dictionary using the `json.loads()` method, completely emulating the previous behavior and eliminating the bottleneck. Hence, all the mentioned goals were achieved. A minimal sketch of this fork-and-pipe step follows.
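A minimal Python sketch of that step, assuming the wrapper jar sits in the working directory; the real call site lives inside the extractor's parsing module, and the example arguments are hypothetical.

```python
import json
import subprocess

def call_jsonpedia_wrapper(language, resource, processors, filters,
                           jar_path="jsonpedia_wrapper.jar"):
    """Fork the wrapper with the flags listed above and parse its stdout
    (sketch; error handling and retries are omitted)."""
    proc = subprocess.run(
        ["java", "-jar", jar_path,
         "-l", language, "-r", resource, "-p", processors, "-f", filters],
        stdout=subprocess.PIPE, check=True)
    return json.loads(proc.stdout.decode("utf-8"))

# Hypothetical usage:
# data = call_jsonpedia_wrapper("en", "William_Gibson", "Extractors", "@type=section")
```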
The main challenge remained the same as last year: the extreme variability of lists. Unfortunately, no real standard, structure or consistency exists across resources, and multiple formats are in use, with different meanings depending on the user who edited the page. The strong dependence on the topic, as well as the use of unrestricted natural language, makes it impossible to find a precise general rule for extracting semantic information without knowing in advance the kind of list and the resource type. Hence, knowledge of the domain is extremely important for writing a good set of mapping rules and mapper functions, which requires the user to go through hundreds of Wikipedia pages of the same domain to uncover its finer structure and relationships; this is very time-consuming and exhausting. Apart from the heterogeneity, there are unfortunately several Wikipedia pages with bad or wrong formatting, which is obviously reflected in the impurity of the extracted data. These are the main challenges present in the tool in general. Ironically, the feature that makes Wikipedia great, being openly accessible and modifiable by everyone, is also the root cause of our biggest challenges, which reminds me of a popular phrase from the Holy Bible: "The Lord giveth, and the Lord taketh away."
Future Work
Scalability and the JSONpedia web dependence were taken care of, and tools were created or upgraded to help with them. As future work, we can add support for more languages in both the existing and new domains: exploring each domain by going through Wikipedia resources in different languages and creating the corresponding mapping rules and functions, hence allowing more triples to be created from various domains and added to the dataset.
Mentors:
- Marco Fossati
- Emanuele Storti