GSoC2012 Progress (Dirk) - dbpedia-spotlight/dbpedia-spotlight GitHub Wiki
Repository
Proposal
REST service: http://160.45.137.73:2222/rest/topic?text="Here comes the text" (if it is not running, just contact me)
The simple approach, which flattened the wiki category hierarchy using the distance of a wiki category to manually defined main categories of each topic, worked okay but was not satisfying. This holds both for Wikipedia's main_topic_categorization and for the main wiki portals, on which I tested the approach.
Clustering Wikipedia's categories using TF-IDF vectors extracted for each of them resulted in a much better flattening of the categories to their main topics, but there was still too much confusion within the clusters, because clustering does not know which topics it should discriminate. Hierarchical clustering, i.e. in this case clustering the still too fuzzy clusters again, was the next step to remove confusion, and the results did get less fuzzy, but some clusters still did not match any 'good' topic and of course there were too many clusters to label by hand. Therefore I implemented an automatic cluster labeling mechanism, which worked well. Still, as appropriate as clustering seemed for the task of flattening Wikipedia's categories, it left too much confusion.
The last approach of this kind was a sort of semi-supervised procedure which evolved from the clustering approach. Wikipedia categories are often strongly related to a certain topic by keywords in their titles, e.g. Rocky_Mountains -> keywords: rocky, mountains -> topic: geography, because 'mountain' is a keyword for geography. This can be exploited by labeling such categories with their obvious topics and then training a topical classifier on these obvious examples (categories) for each topic. Note that over 200k obvious category assignments (to topics) were possible. After training, the classifier could be applied to the yet unassigned wiki categories to estimate the topic each of them should belong to. The procedure could then be repeated with the newly assigned categories as additional training examples.
With categories assigned to topics, it is then easy to assign topics to DBpedia resources by evaluating the categories they belong to. This made it possible to split the occurrences extracted by dbpedia-spotlight by their main topics, which allowed me to generate a topically labeled corpus consisting of dbpedia-spotlight occs.
My last approach was similar to the previous one, except that it works on the resources themselves, so categories are not needed at all. In the initial step, resource occurrences were assigned to specific topics by matching topical keywords against the resource's title. This procedure implicitly generates an initial corpus for training a topical classifier. The classifier can then be used to assign further resource occurrences to topics by calculating the probability that an occurrence's context belongs to a specific topic.
The proposed model, a simple topical classifier on top of LDA features, gave good results on the 20 Newsgroups dataset, but performance on Wikipedia was really bad. So I tried naive Bayes multinomial from Weka, which is always a good choice for topical classification, and it worked pretty well on both 20 Newsgroups and my Wikipedia corpus. The advantages of naive Bayes multinomial are that it is already incremental and also really fast. There are two options on how to use this model. One is to train it on the N different topics we chose as classes (concurrently), i.e. as a single-label classifier. The other is to train a separate model for each topic with the classes 'topic' and 'not-topic', which yields a model (actually consisting of N models) that can do multi-label classification.
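As a rough illustration, the incremental single-label variant can be used with Weka along the following lines; this is only a sketch, the corpus file name is assumed, and the corpus is assumed to be an already vectorized .arff as produced by TextCorpusToInputCorpus (see the workflow section below). The multi-label variant simply trains one such model per topic with the classes 'topic' and 'not-topic', which is what WekaMultiLabelClassifier does.

```scala
import java.io.File
import weka.classifiers.bayes.NaiveBayesMultinomialUpdateable
import weka.core.converters.ArffLoader

object IncrementalTopicClassifierSketch {
  def main(args: Array[String]): Unit = {
    // read the (already vectorized) corpus incrementally instead of loading it all at once
    val loader = new ArffLoader()
    loader.setFile(new File("train.corpus.arff")) // assumed path
    val structure = loader.getStructure
    structure.setClassIndex(structure.numAttributes - 1)

    // build the model on the header only, then update it one instance at a time
    val model = new NaiveBayesMultinomialUpdateable()
    model.buildClassifier(structure)

    var instance = loader.getNextInstance(structure)
    while (instance != null) {
      model.updateClassifier(instance)
      instance = loader.getNextInstance(structure)
    }

    // classifying a new, already vectorized paragraph would then look like:
    //   val topicIndex = model.classifyInstance(someInstance).toInt
    //   val topic = structure.classAttribute.value(topicIndex)
  }
}
```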
- branching/forking main repository, setup ide, learning scala
- scripts for:
- converting tsv files (category, text) to vowpal input format
- converting text to vectors (stopwords, stemming, no numbers etc)
- converting vowpal prediction files to .arff files
- tests on 20 newsgroup dataset with online lda (vowpal wabbit), mlda and slda with reasonable performance
- flattening of wikipedia hierarchy
- extracting occs from wikipedia dump
- getting to know and coordinate with other GSoC students
- sorting of occs, categories_articles (by article)
- decision: train the classifier on paragraphs. The assigned topic should not come from the article containing the occurrence, but from the occurrence's resource, because this way the initial corpus is generated specifically from the occurrences in wikipedia -> the topical classifier should produce better topics for disambiguation (but results could be worse for general-purpose topical classification, which is not the goal)
- split occs.tsv by top categories
- extracting corpus from split occs files
- transforming corpus to vowpal input with vocabulary cut and wordcount transformations
- training of lda model and testing features on weka's classifiers => no classification possible
- possible explanations:
- the corpus is not good for topical classification -- applying a classifier directly to bag-of-words features will show whether this is the case -- solution: a new corpus, possibly derived from clusters over wikipedia
- lda does not work for wikipedia -- the test is the same: other classifiers doing better on bag-of-words would confirm this
- tests on the wikipedia corpus with naive bayes (incremental) and MaxEnt went well => applying plain lda does not give the expected results; the generated wikipedia corpus can be used for training topical classifiers
- in the next few days different incremental classifiers should be evaluated and the best should be chosen
- naive bayes (following the paper from [1]) seems like a good candidate (performance similar to MaxEnt)
[1] http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
20Newsgroup, on testset:
Algorithm | Examples in corpus per category | Accuracy |
---|---|---|
Naive Bayes (incremental) | 10k | 80% |
MaxEnt (OpenNLP) | 10k | 78% |
Wikipedia, on testset:
Note that direct classification accuracy is not a good measure here, because occurrences of different topics can appear in the same paragraph or text.
Algorithm | Examples in corpus per category | Accuracy |
---|---|---|
Naive Bayes (incremental) | 10k | 62% |
MaxEnt (OpenNLP) | 10k | 64% |
Naive Bayes (incremental) | 100k | 45% |
MaxEnt (OpenNLP) | 100k | 49% |
- How to proceed: use naive bayes, because it is really fast, robust and incremental
- Future work: implement HSLDA, but for now bayes is fine
- implementation of a simple REST service
- decision: rather try content-oriented wikipedia portals as topics for classification (see: wikiportals), except for general_references and people
- adaptations of the indexing part to allow multiple categories under one topic (note: the workflow changed slightly)
- indexing of new chosen topics (see May 30)
Results on new corpus from portals:
Algorithm | Examples in corpus per category | Accuracy |
---|---|---|
Naive Bayes (incremental) | 100k | 57% |
- last approach for flattening of the wikipedia hierarchy: clustering over wikipedia categories
- FileOccurenceCategorySource class for iterating over extracted occs and their set of categories
- new TextToWordVector implementation, using lucene
- corpus extraction of wordvectors for each dbpedia category (details see below in workflow part, top category selection)
- decision: clustering with vowpal wabbit's fast online lda
- conversion of extracted data to vowpal input format
- clustering
- manual cluster labeling
- some clusters are still fuzzy and would need reclustering
- reclustering would result in too many new clusters
- decision: label clusters automatically
- idea: take the keywords extracted from the names of the dbpedia categories that are members of the same cluster and compare them to the keywords of the topics (see the sketch after this list)
- problem: get typical keywords for topics...
- solution: wikipedia contains two categories whose pages provide topical outlines and/or indexes, which implicitly gives me a keyword set for each topic
- extraction of topic keywords from wikipedia
- topic selection, see here
- implementation of automatic hierarchical clustering using lda (vowpal wabbit) and automatic cluster labeling
- idea: if label assignment is too fuzzy, cluster the specific cluster again
- clustering with implicit flattening of wikipedia's categories to their topics
- splitting occs by topics
- extracting training corpus for training of classifier
- training classifier
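A minimal sketch (not the actual implementation) of the automatic cluster labeling idea mentioned above: tokenize the names of all dbpedia categories in a cluster (the real pipeline additionally stems and removes stopwords) and label the cluster with the topic whose keyword set has the largest weighted overlap.

```scala
def labelCluster(clusterCategories: Seq[String],
                 topicKeywords: Map[String, Set[String]]): Option[String] = {
  // e.g. "Rocky_Mountains" -> tokens "rocky", "mountains"
  val tokenCounts = clusterCategories
    .flatMap(_.toLowerCase.split("[_\\s]+"))
    .groupBy(identity)
    .mapValues(_.size)

  // overlap score between the cluster's tokens and each topic's keyword set
  val scores = topicKeywords.map { case (topic, keywords) =>
    topic -> tokenCounts.filterKeys(keywords.contains).values.sum
  }
  val (bestTopic, bestScore) = scores.maxBy(_._2)
  // a cluster whose best overlap is still zero stays unlabeled (too fuzzy -> recluster)
  if (bestScore > 0) Some(bestTopic) else None
}
```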
- just an idea: maybe work would have been easier if
- I trained a model on newsfeeds in the first place (because they already provide topical metainformation)
- and split occs by topics using the trained model, i.e. determine the topic(s) of an occ and put the occ into the corresponding topic split
- new idea of flattening: take the obvious categories for each topic, train a topical classifier on them, and assign yet unassigned categories to topics utilizing the model (repeat these steps until convergence)
- why? using classification rather than clustering makes the whole process supervised, thus steering the results towards the topics we actually aim for (a sketch of the procedure follows after this entry)
- implementation of new idea
- Flattening the hierarchy in a semi-supervised fashion
- results are much clearer
- implementation was much easier (and shorter)
- much more intuitive procedure
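A schematic sketch of the semi-supervised flattening described in this entry (not the actual FlattenHierarchyByTopics code): seed with categories whose names contain obvious topic keywords, train a topical classifier on them, assign further categories whose prediction confidence passes a threshold, and repeat until no new assignments are made. `trainClassifier` stands in for the real training step (naive Bayes multinomial on the categories' word vectors).

```scala
def flattenByTopics(categories: Set[String],
                    topicKeywords: Map[String, Set[String]],
                    trainClassifier: Map[String, String] => (String => (String, Double)),
                    threshold: Double = 0.8): Map[String, String] = {
  // 1. obvious assignments: a topic keyword occurs in the category name
  var assigned: Map[String, String] = categories.flatMap { cat =>
    val tokens = cat.toLowerCase.split("[_\\s]+").toSet
    topicKeywords.collectFirst { case (topic, kws) if tokens.exists(kws.contains) => cat -> topic }
  }.toMap

  var converged = false
  while (!converged) {
    // 2. train on everything assigned so far; the returned function yields (topic, confidence)
    val classify = trainClassifier(assigned)
    // 3. adopt confident predictions for the still unassigned categories
    val newlyAssigned = (categories -- assigned.keySet).flatMap { cat =>
      val (topic, confidence) = classify(cat)
      if (confidence >= threshold) Some(cat -> topic) else None
    }.toMap
    converged = newlyAssigned.isEmpty
    assigned ++= newlyAssigned
  }
  assigned
}
```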
- trained final model before midterm evaluations
- implemented and executed splitting of the occs not yet assigned to topics, utilizing the trained model
- implemented multilabel topical classification (training a model for each topic)
- training of the model (can be found on biggy, deployed on rest server)
- implementation of another occ splitting method, which is direct (i.e. no wiki categories are involved anymore) and semi-supervised (steps below; a small sketch of the initial assignment follows after this list)
- assign occurrences of resources that obviously belong to a certain topic (by matching each topic's keywords against resource names)
- train a topical classifier on these assigned occurrences
- split occurrences using the trained model
- repeat steps 2-3 for specified number of iterations
- again much shorter and simpler than splitting procedure before
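A small sketch (not the actual SplitOccsSemiSupervised code) of step 1 of this splitting method: an occurrence is assigned directly when its resource title obviously matches a topic's keyword set. The subsequent train/split/repeat loop works analogously to the category-flattening sketch further above, except that the classifier is trained on the occurrences' textual contexts.

```scala
case class Occ(resourceTitle: String, context: String)

def obviousTopic(occ: Occ, topicKeywords: Map[String, Set[String]]): Option[String] = {
  // e.g. resource title "Rocky_Mountains" -> tokens "rocky", "mountains"
  val tokens = occ.resourceTitle.toLowerCase.split("[_\\s]+").toSet
  topicKeywords.collectFirst { case (topic, keywords) if tokens.exists(keywords.contains) => topic }
}
```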
- splitting occurrences
- meeting with pablo
- training on the new occurrences split -> results look really good
- implementation of rss-feed and incremental training, not yet tested
- integration of wikifeed
- tested implemented feed framework within dbpedia spotlight
- indexing split occs on biggy (split index)
- discussion with pablo about KBA TREC as evaluation corpus for dbpedia spotlight live
- sketch of possible workflow of our "KBA"-system
- refactoring
- documentation
- sketch of whole indexing and training part
- testing
old part:
- ExtractCandidateMap
- Args= $INDEX_CONFIG_FILE
- ExtractOccsFromWikipedia
- Args= $INDEX_CONFIG_FILE | output/occs.tsv
- sort -t $'\t' -k2,2 -k1,1 output/occs.tsv > output/occs.uriSorted.tsv
Either handpicked or from clustering. This part explains creation of category (word vector) corpus for clustering.
- ExtractCategoryCorpus -- write temporary vector file
- Args = "extract" | path to sorted articles_categories.nt | path to sorted occs | output path of temporary wordvectors | offset
- there was a problem between the 29,200,000th and the 29,700,000th occ -- solution: extract until 29,200,000 occs, then extract again starting from the (offset=) 29,700,000th occ
- sort temporary vectors by first column
- ExtractCategoryCorpus -- merge temporary vectors to vowpal's input file (file conversions can be easily done from there)
- Args = "merge" | path to sorted temporary vectors | output path
- Note: the vectors are TF-IDF vectors (see the example input lines after this list)
- VowpalWCScalingFilter (this is not necessary anymore, because WriteCategoryCorpus already does that)
- Args = input file (output from WriteCategoryCorpus) | output file | scaling factor (eg 10000 worked well)
- scales word counts down by a given factor (new count = old count / factor)
- shuf -o shuffled.scaled.category.corpus scaled.category.corpus
- download and install vowpal wabbit and set the path to the executable in indexing.properties
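For reference, vowpal wabbit expects one document per line in its plain text input format, roughly like the following (feature names and values here are made up; in this workflow each line would correspond to one category's scaled word vector):

```
| mountain:12.4 range:7.1 colorado:3.0
| painting:9.2 museum:4.5 impressionism:2.1
```

The token before the ':' is the word (or its dictionary index) and the value its weight; labels before the '|' are not needed for LDA.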
- FlattenWikipediaHierarchy (OLD)
- Args= articles_categories.nt | outputDir | maximal depth [ | categories file path ]
- categories file structure: each line: "topiclabel=category1,category2..." (eg "culture_arts=Culture,Arts")
- maximal depth is the maximal distance allowed from a top category to one of its subcategories; every category that cannot be reached from one of the top categories (given by the topics' categories) within this max depth is assigned to a new topic 'others'
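A schematic sketch (not the actual FlattenWikipediaHierarchy code) of this distance-based flattening: breadth-first search from each topic's top categories over the subcategory links, up to maxDepth; categories never reached within that depth end up in the extra topic "others".

```scala
import scala.collection.mutable

def flattenByDistance(subcategories: Map[String, Seq[String]],   // category -> direct subcategories
                      topCategories: Map[String, Seq[String]],   // topic label -> its top categories
                      allCategories: Set[String],
                      maxDepth: Int): Map[String, String] = {
  val assignment = mutable.Map[String, String]()
  for ((topic, tops) <- topCategories) {
    var frontier = tops.toSet
    var depth = 0
    while (frontier.nonEmpty && depth <= maxDepth) {
      // the first topic to reach a category keeps it (this ambiguity is one reason
      // why the approach was not satisfying)
      frontier.foreach(cat => assignment.getOrElseUpdate(cat, topic))
      frontier = frontier.flatMap(cat => subcategories.getOrElse(cat, Seq.empty))
                         .filterNot(assignment.contains)
      depth += 1
    }
  }
  allCategories.map(cat => cat -> assignment.getOrElse(cat, "others")).toMap
}
```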
- FlattenHierarchyByClusters (OLD)
- Args= vowpal input file created by WriteCategoryCorpus | vowpal rest-input file created by WriteCategoryCorpus | rest.categories.list from WriteCategoryCorpus | categories.list from WriteCategoryCorpus | path where flattened hierarchy should be written | path to temporary working directory
- this process will cluster the input and automatically assign labels to clusters or recluster if label assignment is too fuzzy
- FlattenHierarchyByTopics
- best approach, see June 22-26 above for closer description
- Args= indexing properties | path to training corpus | path to training corpus' categories | path to evaluation corpus | path to evaluation corpus' categories | path to temporary dir | confidence threshold for assigning a category to a topic (should be high, prob. at least 0.8)
- SplitOccsByTopics
- Args= indexing.properties | path to sorted occs file | path to output directory
- SplitOccsSemiSupervised
- Args= indexing.properties | path to (sorted) occs file | temporary path (same partition as output) | minimal confidence of assigning an occ to a topic | number of iterations | path to output directory
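A hypothetical example invocation (the package name and all paths are assumptions, following the mvn scala:run pattern used for WekaMultiLabelClassifier below):

```
mvn scala:run -DmainClass=org.dbpedia.spotlight.topic.SplitOccsSemiSupervised "-DaddArgs=conf/indexing.properties|output/occs.uriSorted.tsv|/tmp/topics|0.8|3|output/splitOccs"
```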
- [GenerateOccTopicCorpus](https://github.com/D9891/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/topic/GenerateOccTopicCorpus.scala)
- Args= indexing.properties | path to split occs | number of examples per category (<= 0 for the maximum number) | output file (corpus.tsv) [| 'false' if generating without the flattened hierarchy]
- shuf corpus.tsv > shuffled_corpus.tsv
- [TextCorpusToInputCorpus](https://github.com/D9891/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/topic/convert/TextCorpusToInputCorpus.scala)
- Args= -i input-file/directory, -o outputdir, -d dictionary file, -c categoryinfo file, -a if tagged (each line prefixed with its topic, e.g. "topic23\tthis is an example ..."), -s if testset, -t if data should be transformed, -ct type of corpus ('arff' or default: 'vowpal')
- java -cp weka.jar weka.classifiers.bayes.NaiveBayesMultinomialUpdateable -t training.arff -T test.arff -d model.dat > weka.out
- mvn scala:run -DmainClass=org.dbpedia.spotlight.topic.WekaMultiLabelClassifier "-DaddArgs=/.../train.corpus.arff|/.../multilabel-model"
- RunTrecKBA
- Args: spotlight configuration | trec corpus dir | trecJudgmentsFile | training start date (yyyy-MM-dd-hh) | end date | minimal confidence of assigning a topic to a list of resources | path to trec target entity classifier model dir | evaluation folder (evaluation is run if the folder exists, otherwise no evaluation) | clear (optional, start training from scratch except for the topical classifier)