GSoC 2017: Krishanu Konar progress
This page contains the weekly progress reports for my GSoC-2017 project on List-Extractor
Abstract:
This project aims to build upon the existing list-extractor project, created by Federica during GSoC 2016. The project focuses on extracting relevant but hidden data that lies inside lists on Wikipedia pages. Wikipedia, the world's largest encyclopedia, holds a humongous amount of information in the form of text. While key facts and figures are encapsulated in a resource's infobox, and some detailed statistics are presented as tables, a lot of data is also present in the form of lists, which are quite unstructured and therefore difficult to turn into semantic relationships. The main objective of the project is to create a tool that can extract information from Wikipedia lists and form appropriate RDF triples that can be inserted into the DBpedia dataset.
The blog posts related to my project can be found here.
Have Questions? Post your queries on the DBpedia support page here.
For a detailed explanation of List-Extractor, refer to the documentation in the docs folder. The sample generated datasets can be found here.
Architecture
![List-Extractor Architecture](images/List_Extractor_Architecture.jpg)
The Extractor has 3 main parts:

- Request Handler: Selects the resource(s) depending on the user's options and makes the corresponding resource requests to the JSONpedia service for list data.
- JSONpedia Service: Provides the resource's information in a well-structured JSON format, which the mapping functions use to form appropriate triples from the list data. Currently JSONpedia Live is used, which is a web service and is hence susceptible to being overloaded by a large volume of requests. One objective of this year's project is to overcome this bottleneck by using the JSONpedia library instead of the Live service.
- Mapper: The set of modules that consume the JSON received from the JSONpedia service and produce appropriate triples which can then be serialized. The first step is cleaning the JSON dictionary so that only meaningful list data remains. This data is then passed to a mapping selector module, which, using the mapping rules (formed in accordance with the DBpedia ontology), selects the mapper functions that need to be applied to the elements. The mapper functions then form the triples, which are serialized into an RDF graph. A minimal sketch of this stage follows below.
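To make the flow concrete, here is a minimal, self-contained sketch of the mapper stage. The rule table, helper names and the `map_lists` function are illustrative only; the real logic lives in `mapper.py` and `mapping_rules.py`.

```python
import rdflib

DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"

# Toy "mapping rules": section-header keyword -> ontology property.
RULES = {"bibliography": "notableWork", "discography": "notableWork"}

def map_lists(resource, sections):
    """sections: {header: [list items]}, already cleaned from the JSONpedia JSON."""
    g = rdflib.Graph()
    subj = rdflib.URIRef(DBR + resource)
    for header, items in sections.items():
        for keyword, prop in RULES.items():
            if keyword in header.lower():            # selector step
                for item in items:                   # mapper step
                    g.add((subj, rdflib.URIRef(DBO + prop), rdflib.Literal(item)))
    return g

# Example: one list item found under a "Bibliography" section header.
g = map_lists("William_Gibson", {"Bibliography": ["Neuromancer (1984)"]})
print(g.serialize(format="turtle"))
```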
Updated Deliverables
![Updated Deliverables](images/updated_deliverables.jpg)
Outcomes and Impacts:
- Better extraction for the existing domains (Writer, Actor).
- Many different built-in mappers added to extract triples from various domains, increasing the coverage of the list-extractor.
- A custom rules generator added, which allows users to add mapping rules and mapper functions themselves, making the extractor more scalable.
- This allows users to extract triples from any Wikipedia article, once they configure the mapping rules.
- Hence, the list-extractor becomes a much more generalized and user-friendly tool, creating more triples for the DBpedia dataset.
Progress Record
Community Bonding:
8th - 16th May:
Going over the existing code again to grasp the finer details of the extractor and understand its complete working.
17th - 24th May:
Currently exploring the possible new domains that can be added to the list-extractor.
Explored a few potential domains containing lists which could be added:
- Musical Artists
- Musical Bands
- Educational Institutes
- Written Work (Magazines, Newspapers etc.)
24th - 27th May:
Analysed the usage of the JSONpedia Live service in the existing code, in order to use the library programmatically instead of the live service. Wrote sample Java code that uses the JSONpedia library and emulates the results of the existing extractor code. Full integration to be done in later weeks.
Coding Period:
Week 1 (30th May - 4th Jun):
- Made slight tweaks to the existing code.
- Added a new method to clean some of the junk values that were observed during extraction.
- Also added a method that stores the statistical results of all extractions in a CSV for evaluation (see the sketch after this list).
- Added the `musicalArtist` domain to the existing code, which was already part of my GSoC warm-up task.
- As discussed with mentors, I'm currently looking at ways to make the list-extractor more scalable. I'll also look for potential problems in the existing code and improve it wherever required.
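A minimal sketch of how such per-extraction statistics might be appended to a CSV. The function name and column layout are illustrative; the actual evaluation file may differ.

```python
import csv
import os

def log_extraction_stats(csv_path, resource, total_elems, mapped_elems, triples):
    """Append one row of extraction statistics; writes a header on first use.
    (Illustrative only -- the real evaluation CSV columns may differ.)"""
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["resource", "list_elements", "mapped_elements", "triples"])
        writer.writerow([resource, total_elems, mapped_elems, triples])

# Example usage:
log_extraction_stats("evaluation.csv", "William_Gibson", 120, 95, 210)
```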
Week 2 (5th - 11th Jun):
- [5 Jun]: Added support for the German language in all 3 initial domains (Actor, Writer, MusicalArtist).
- [6-7 Jun]: After analysing many articles from different domains, I realised that while several domains have intersecting sections, a generalized template for them is not possible. So, I have changed my approach a bit.
- [8-9 Jun]: From now on, I'll focus on writing mapping functions that can extract list elements from a given section. Later, domains can be added to `mapping_rules.py`, including the various sections that might exist in the domain's articles.
  - Made a major change in the selection of mapper functions: it is now possible to add multiple mappers to a domain, effectively increasing the number of extracted elements and hence the accuracy (see the sketch after this list).
- [10 Jun]: Added support for the Spanish language in all 3 initial domains (Actor, Writer, MusicalArtist).
- Another task for next week is discussing an approach with Luca for potentially coming up with a common template/mapping rules, to make a more effective and scalable extractor.
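The multi-mapper selection described above could look roughly like this. The rule tables and keyword lists are made up for illustration; the real structures in `mapping_rules.py` differ.

```python
# A domain now maps to a *list* of mapper names instead of a single one.
MAPPING = {
    "MusicalArtist": ["DISCOGRAPHY", "CONCERT_TOURS", "HONORS"],
    "Writer": ["BIBLIOGRAPHY", "HONORS"],
}

# Each mapper advertises the section-header keywords it can handle.
MAPPER_KEYWORDS = {
    "DISCOGRAPHY": ["discography", "albums"],
    "CONCERT_TOURS": ["tours"],
    "HONORS": ["awards", "honors", "honours"],
    "BIBLIOGRAPHY": ["bibliography", "works"],
}

def select_mapping(domain, section_header):
    """Return *all* mappers applicable to a section, not just the first."""
    header = section_header.lower()
    return [name for name in MAPPING.get(domain, [])
            if any(kw in header for kw in MAPPER_KEYWORDS[name])]

print(select_mapping("MusicalArtist", "Awards and honors"))  # ['HONORS']
```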
Week 3 (12th - 18th Jun):
- [12-13 Jun]: Adding the `EducationalInstitution` domain.
  - Added the extractor's architecture details to the Wiki (should've done that a lot earlier).
- [14-15 Jun]: Looking at different domains within `Person` in order to generalize the extractor to work on this superclass.
  - Added a few more section headers to include other subclasses like `Painter`, `Architect` etc.
  - Changed various functions to support generalized domains (e.g. year_mapper, role etc.); the extractor now captures all the years in which a person won the same award/honor.
  - Manually went through ~50 wiki pages of subdomains to find new potential sections containing lists: `Architect`, `Astronaut`, `Ambassador`.
  - Work halted because dbpedia.org was under maintenance [14 Jun].
  - Added general Career and Works mappers for the domain; analysed more subdomains: `Athlete`, `BusinessPerson`, `Chef`, `Celebrity`, `Coach`.
- [16-17 Jun]: Had a meeting with Luca to discuss ways to merge the mapping rules of the list- and table-extractor projects; another meeting is scheduled next week after discussing with mentors.
  - Finished writing mapper functions for the `EducationalInstitution` domain.
  - Started looking at the `PeriodicalLiterature` domain; added some rules for `Magazines`.
Week 4 (19th - 24th Jun):
- [19-20 Jun]: Started working on mapper functions for `PeriodicalLiterature`.
  - Re-wrote `year_mapper()` to also extract months (if present) along with the dates, and to try to extract the time period covered by a particular element (start date - end date); a sketch of the idea follows after this list.
- [21 Jun]: Finished mappers and rules for `PeriodicalLiterature`; tested on `Magazines`, `Newspapers` and `AcademicJournals`.
- [22 Jun]: Merged all progress into master; this is now the most recent stable running version.
  - I'll now start merging the different `*_mapper()` functions into more generalised mapping functions, trying to reduce redundant code and make the whole structure more general. I'll also merge my newly created/modified functions with Federica's existing ones and restructure the code wherever required.
  - Replaced the existing `year_mapper` with the new mapper in each module; added the newly written `quote_mapper` resource extractor to the URI-extracting process.
- [23 Jun]: Optimised the code a bit.
  - Improved the awards mapper a bit to differentiate honorary degrees.
  - As discussed with Luca, I'll now start working on a module that creates a new settings file and allows the user to select the mapping functions used for a domain during extraction. This will increase support for unmatched domains.
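A rough, runnable sketch of the re-written `year_mapper()` idea, including the optional ontology-override dictionary added later in Week 9. Month capture is elided for brevity, and the property names are examples; the real function in `mapper.py` handles many more formats and languages.

```python
import re

# Matches "(1984-1990)", "(May 1984 - June 1990)", "(2001-present)" ...
PERIOD_RE = re.compile(r"\((?:\w+\s+)?(\d{4})\s*[-–]\s*(?:\w+\s+)?(\d{4}|present)\)")
# Matches "(1984)" or "(May 1984)".
YEAR_RE = re.compile(r"\((?:\w+\s+)?(\d{4})\)")

def year_mapper(list_item, ontology_overrides=None):
    """Return (property, value), where value is a single year or a
    (start, end) period. `ontology_overrides` mirrors the optional dict
    added in Week 9, e.g. {"year_property": "releaseYear"} for albums."""
    prop = (ontology_overrides or {}).get("year_property", "activeYear")
    m = PERIOD_RE.search(list_item)
    if m:
        return prop, (m.group(1), m.group(2))     # time period
    m = YEAR_RE.search(list_item)
    if m:
        return prop, m.group(1)                   # single year
    return None

print(year_mapper("The Sprawl trilogy (1984–1988)"))  # ('activeYear', ('1984', '1988'))
```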
Eval Week & Week 5 (26th Jun - 7th Jul):
- [Eval week]: Came up with a structure and a plan for how to implement the rulesGenerator, and assessed its impact on the current code.
  - Wrote the skeleton of the rulesGenerator.
- [1-3 Jul]: Coded up the prototype of the rulesGenerator.
  - Will now run the generator and analyse how it can be fully integrated into the project.
  - Read the remarks on my first evaluation; as suggested, I'll try to make the proposal more specific with timetables and outcomes.
- [4-5 Jul]: Added the newly structured mapping rules to the list extractor.
  - The extractor can now accept an optional command-line argument to select the class of mapper functions.
  - Both `listExtractor.py` and `rulesGenerator.py` can now work together using/updating the `settings.json` file; both programs reload the settings after any change to `settings.json` to remain up to date.
- [6-7 Jul]: Started working on custom mappers using rulesGenerator.py.
  - Completed the interface for accepting a mapper-function dictionary.
  - Completed the functionalities of `rulesGenerator.py`.
  - Next week, I'll code up a general mapper function in `mapper.py` that can use the `settings.json` and `custom_mappers.json` files to create a totally user-defined list-extractor module! (A hypothetical sketch of these two files follows below.)
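For illustration, the two files might be shaped roughly like this, shown as Python literals. The key names are hypothetical; the real schema produced by `rulesGenerator.py` may differ.

```python
# settings.json: domain -> list of mapper functions to apply.
settings = {
    "MAPPING": {
        "CUSTOM_MUSICAL_ARTIST": ["CUSTOM_ARTIST_MAPPER"],
        "CUSTOM_WRITER": ["CUSTOM_BIBLIOGRAPHY_MAPPER"],
    }
}

# custom_mappers.json: user-defined mapper settings that
# map_user_defined_mappings() interprets at run time.
custom_mappers = {
    "CUSTOM_ARTIST_MAPPER": {
        "headers": {"en": ["discography", "concert tours"]},
        "ontology_property": "notableWork",
        "extract_years": True,
    }
}
```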
Week 6 (10th - 15th Jul):
- [10-11 Jul]: Started working on `map_user_defined_mappings()`, which emulates the mapper functions using the properties present in `custom_mappers.json`, generated with `rulesGenerator.py` (a sketch follows after this list).
- [12-13 Jul]: Integrated all the rules and methods written so far with the main list-extractor tool.
  - Ran unit tests with the user-generated rules and mappers.
  - [Testing single resources]: Tested the existing `MusicalArtist` and `Writer` domains with the user-defined `CUSTOM_MUSICAL_ARTIST` and `CUSTOM_WRITER` classes, which use `map_user_defined_mappings()` to extract triples. These classes use the user-defined `CUSTOM_ARTIST_MAPPER` and `CUSTOM_BIBLIOGRAPHY_MAPPER` settings defined in `custom_mappers.json`.

    ```
    python listExtractor.py s Taylor_Swift en -c CUSTOM_MUSICAL_ARTIST
    python listExtractor.py s William_Gibson en -c CUSTOM_WRITER
    ```
  - [Testing new domains]: Added mapping rules for the `MusicGenre` domain using `rulesGenerator.py`, and used `MUSIC_GENRE_MAPPER` as one of the mapper functions for the domain.

    ```
    python listExtractor.py a MusicGenre en
    ```
- Next week, I'll start working on using JSONpedia as a library instead of the JSONpedia Live web service.
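A hypothetical sketch of `map_user_defined_mappings()`, interpreting a mapper entry shaped like the `CUSTOM_ARTIST_MAPPER` literal shown under Week 5. The real implementation in `mapper.py` also resolves resource URIs and years.

```python
import rdflib

DBO = "http://dbpedia.org/ontology/"

def map_user_defined_mappings(graph, subject, header, items, mapper_settings):
    """Apply one user-defined mapper to a section's list items, driven purely
    by the JSON settings (illustrative sketch, not the real implementation)."""
    if not any(h in header.lower() for h in mapper_settings["headers"]["en"]):
        return 0                                   # section not covered by this mapper
    prop = rdflib.URIRef(DBO + mapper_settings["ontology_property"])
    for item in items:
        graph.add((subject, prop, rdflib.Literal(item)))
    return len(items)
```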
Week 7 (17th - 22nd Jul):
- [17 Jul]: Started working on the JSONpedia integration.
  - Writing a Java program that takes as command-line input the resource, filters and extractors that the JSONpedia library requires.
- [18 Jul]: Took a day off for personal reasons.
- [19-20 Jul]: Completed writing the wrapper for the JSONpedia library.
  - The wrapper takes the language, resource name, processors and filters as command-line input and makes the related JSONpedia calls. The results are printed on stdout.
  - The main idea is to pipe the output of this wrapper into the list-extractor and use `json.loads()` to turn it into a valid Python dictionary that can be processed by the list extractor.
- [20-21 Jul]: Started writing the Python modules for utilizing the wrapper.
  - Need to re-write `WikiParser.json_convert()` and `WikiParser.find_page_redirects()` to use the wrapper.
  - Will complete a crude implementation soon.
Week 8 (Eval week) (24th - 29th Jul):
- Completed the implementation of `WikiParser.json_convert()` and `WikiParser.find_page_redirects()` using the JSONpedia wrapper.
- A problem was encountered while testing the modules: the wrapper (or the JSONpedia library) wasn't working properly.
- The output of the JSONpedia Live service and of the JSONpedia library had a minor difference, explained below.

The expected output after applying the required filters:

```
{
  "filter": "object_filter(@type=section,)>null",
  "result": [{ ......
.......
}
```

Instead, the following output was observed:

```
{
  "filter": object_filter(@type=section,)>null,
  "result": [{ ......
.......
}
```
- The missing quotes around the `filter` value break the JSON syntax and hence break the current code. A defensive workaround is sketched below.
- I contacted Michele, the creator of JSONpedia, for support on this issue; we'll look at the problem over the weekend.
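Until the upstream fix landed, a defensive workaround on the consumer side was possible: quote the bare `filter` value before parsing. This is a sketch assuming the output shape shown above; the proper fix went into JSONpedia itself.

```python
import json
import re

def parse_jsonpedia_output(raw):
    """Quote the unquoted "filter" value on its line (if any) so that
    json.loads() accepts the text. Already-valid output passes through
    unchanged, since the pattern requires an unquoted value."""
    fixed = re.sub(
        r'("filter":\s*)([^"\n].*?)(,?\s*)$',
        lambda m: m.group(1) + json.dumps(m.group(2)) + m.group(3),
        raw, count=1, flags=re.MULTILINE)
    return json.loads(fixed)
```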
Week 9 (31 Jul - 5 Aug):
- Looked into the JSONpedia issue and found a bug in the JSONpedia filters.
- Wrote a fix for the filter problem and sent a pull request with the fix to JSONpedia.
- Ran the list-extractor with the wrapper and completed the preliminary tests successfully.
- Made a small improvement to the time-period extractor: added an extra optional dictionary parameter so that the user can pass the correct ontology property in special cases (like `releaseYear` instead of `activeYear` for movies or musical albums).
- Created and finalized the ontology classes/properties that the extractor will use to form triples (pending approval).
- Started extracting the dataset for `musicalArtist`.
- Started the documentation.
Week 10 (7 - 11 Aug):
- Completed internal documentation for the whole project.
- Fixed minor bugs present in the code.
Week 11 (14 - 19 Aug):
- Started creating Sphinx documentation for the list extractor.
- Modified the internal documentation for use with Sphinx.
- Created sample datasets for results and finalized the project.
Notes on Issues/Improvements required:
- Honors/awards are too unstructured to be generalised; improvement is needed to increase the efficiency/correctness of these triples. [solved]
- Honorary degrees; differentiate them? [Yes, differentiated]
- Writing new mapping rules in `mapping_rules.py` breaks the structure; create a separate file for these? [new `settings.json` created]
- Time periods are present in a lot of elements; write a better `year_mapper()`? [new `year_mapper()` created]
- `Athlete` achievements are much different from other classes of `Person`; they might require a separate mapping function.
- `dbr:property` values for mappings need to be added/improved to match the existing DBpedia ontology.
Changes Summary:
- Improved extraction; changed `select_mapping()` to support handling multiple sections.
- Evaluation method added.
- Pre-defined domains now include:
  - Person: `Writer`, `Actor`, `MusicalArtist`, `Athlete`, `Politician`, `Manager`, `Coach`, `Celebrity` etc.
  - EducationalInstitution: `University`, `School`, `College`, `Library`
  - PeriodicalLiterature: `Magazines`, `Newspapers`, `AcademicJournals`
  - Group
  - Organisation
- Year mapper added; extracts years (if present) from any list.
- A new tool, the Rules Generator, added; users can now add user-defined mapping rules for different domains.
- User-defined mapper functions added; users can add customised mapper functions.
- JSONpedia Live web-service dependency removed; library support added.
- Fixed a minor bug in the JSONpedia library itself; submitted a pull request for the same.
- New sample datasets generated.
Results:
Following are some sample results from using the extractor on different domains. More detailed evaluation statistics are present in `evaluation.csv`.

| Topic & Language | # Resources | # Statements | Evaluation Accuracy |
|---|---|---|---|
| Actors (2016) | 6,621 | 110,797 | 77% |
| Actors (2017) | 6,606 | 134,013 | 79.08% |
| Musical Artist | 52,759 | 1,340,800 | 75.77% |
| Band | 34,505 | 867,984 | 84.57% |
| University | 20,343 | 250,167 | 49.29% |
| Newspapers | 6,861 | 17,546 | 52.37% |
For a comparative analysis, we can look at the `Actors` dataset, for which results are available from the previous year. The accuracy of the extractor has improved (accuracy is defined as the ratio of list elements that successfully contributed to the generation of a triple to the total number of list elements present). We also see that, despite there being fewer resources than the previous year, the list extractor was able to generate about 23k more triples from the same domain.
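In code form, the reported accuracy is just this ratio. The helper and the counts below are made up purely to show the arithmetic; the real counters are collected during extraction.

```python
def evaluation_accuracy(mapped_elements, total_elements):
    """Share of list elements that contributed to at least one triple,
    as defined above."""
    return 100.0 * mapped_elements / total_elements

# Made-up counts, purely to show the arithmetic behind a figure like 79.08%:
print(round(evaluation_accuracy(7908, 10000), 2))  # -> 79.08
```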
This can be due to many factors. One of them could be people adding new list entries to the Wikipedia resources, causing the number to increase; this, of course, cannot be influenced by us. From a programmer's perspective, the major additions in this year's project were the new `year_mapper()`, which helped extract time periods from the list elements, and the changed `select_mapping()` method, which previously allowed only one mapper function per domain. The newer version of `select_mapping()` allows several mapping functions to be used with a single domain, allowing more sections to be considered for extraction and, consequently, creating more triples from the existing list elements.
These are crude results and would definitely improve, for the following reason: dataset generation with this tool requires a continuous, uninterrupted internet connection to work properly. During the creation of the final few datasets, I faced many connection problems (which were beyond my control), and hence many resources were not processed and were skipped. Generating these datasets again with a stable internet connection might improve the performance by ~5-10%. Also, the accuracy is quite low for the latter domains, mainly because most of the resources within them are very unstructured and hence very difficult to extract from. The accuracy can be improved by improving the domain knowledge: going through several Wikipedia pages of the domain, understanding the underlying structure, and refining the mapping rules accordingly.
Goals and Challenges
There were 3 main goals as proposed in my GSoC proposal:
- Creation of new datasets.
- Making the extractor more scalable, so that users can easily add their own rules and extract triples from different domains.
- Removing the JSONpedia Live Service bottleneck by integrating the existing JSONpedia library with the list-extractor.
All three goals were achieved by the project (at least to some extent).
New sample datasets were created for domains like `MusicalArtist`, `Actor`, `Band`, `University`, `Magazine`, `Newspaper` etc. All the sample datasets created with the list-extractor combined were generated by processing about 1.3 million list elements, producing about 2.8 million triples.
The extractor was also made more scalable: several more common mapper functions were added, and the selection of mapper functions was made more flexible for every domain by shifting the `MAPPING` dict to `settings.json` and allowing multiple mapper functions for a single domain. But a bigger impact on scalability came from the creation of the `rulesGenerator`, which now allows users to create their own mapping rules and mapper functions from an interactive console program, without having to write any code! A sample domain, `MusicGenre`, was used to test the rulesGenerator, and the results/datasets are also present. Although the domain did not have much information that could be extracted, this still demonstrated the ability of the rulesGenerator, a tool usable by people who are not programmers or who don't know much about the inner workings of the extractor, to generate triples and produce decent results.
The third goal was also achieved. The dependency on the JSONpedia Live service was removed, and the JSONpedia library is now used for obtaining the JSON representation of a resource. This was achieved by writing a wrapper (`jsonpedia_wrapper.jar`) around the actual JSONpedia library, so that it can be driven easily by the list-extractor. The JSONpedia wrapper is a command-line program that takes some command-line parameters and outputs the retrieved JSON. The wrapper can be run on its own using the following command:

```
java -jar jsonpedia_wrapper.jar -l [language] -r [resource_name] -p [processors] -f [filters]
```

So the list-extractor simply forks another process that runs the JSONpedia wrapper with the parameters provided by the list-extractor, and the output is piped back to the list-extractor, which converts it to a Python dictionary using the `json.loads()` method, completely emulating the previous behavior and eliminating the bottleneck. Hence, all the mentioned goals were achieved. A minimal sketch of this fork-and-pipe step follows.
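A minimal Python sketch of that step, assuming the wrapper jar sits in the working directory; the real call site lives inside the extractor's parsing module, and the example arguments are hypothetical.

```python
import json
import subprocess

def call_jsonpedia_wrapper(language, resource, processors, filters,
                           jar_path="jsonpedia_wrapper.jar"):
    """Fork the wrapper with the flags listed above and parse its stdout
    (sketch; error handling and retries are omitted)."""
    proc = subprocess.run(
        ["java", "-jar", jar_path,
         "-l", language, "-r", resource, "-p", processors, "-f", filters],
        stdout=subprocess.PIPE, check=True)
    return json.loads(proc.stdout.decode("utf-8"))

# Hypothetical usage:
# data = call_jsonpedia_wrapper("en", "William_Gibson", "Extractors", "@type=section")
```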
The main challenge remained the same as last year: the extreme variability of lists. Unfortunately, no real standard, structure or consistency exists across resources, and multiple formats are in use, with different meanings depending on the user who edited the page. The strong dependence on the topic, as well as the use of unrestricted natural language, makes it impossible to find a precise general rule for extracting semantic information without knowing in advance the kind of list and the resource type. Hence, knowledge of the domain is extremely important for writing a good set of mapping rules and mapper functions, which requires the user to go through hundreds of Wikipedia pages of the same domain to uncover its finer structure and relationships; this is very time-consuming and exhausting. Apart from the heterogeneity, there are unfortunately several Wikipedia pages with bad or wrong formatting, which is obviously reflected in the impurity of the extracted data. These are the main challenges present in the tool in general. Ironically, the feature that makes Wikipedia great, being openly accessible and modifiable by everyone, is also the root cause of our biggest challenges, which reminds me of a popular phrase from the Holy Bible: "The Lord giveth, and the Lord taketh away."
Future Work
Scalability and the JSONpedia web dependence were taken care of, and tools were created or upgraded to help with them. As future work, we can add support for more languages in both the existing and new domains: exploring each domain by going through Wikipedia resources in different languages and creating the corresponding mapping rules and functions, hence allowing more triples to be created from various domains and added to the dataset.
Mentors:
- Marco Fossati
- Emanuele Storti