# Using the database-backed core (GSoC 2012, Jo)
The current version of the database-backed core can be found in my branch: `dbpedia-spotlight-db`.
## Data Import
### Data Sources
There are `Source` objects for the following datasets in `org.dbpedia.spotlight.db.io`:
Data | Required datasets (Pig only) | Required objects |
---|---|---|
SurfaceForm | `sfCounts`, `phrasesCounts` | |
DBpediaResource | `uriCounts`, `instanceTypes.tsv` | WikipediaToDBpediaClosure |
CandidateMap | `pairCounts` | WikipediaToDBpediaClosure, SurfaceFormStore, ResourceStore |
Tokens | `token_counts*.tsv` | |
TokenOccurrences | `token_counts*.tsv` | WikipediaToDBpediaClosure, TokenStore |
- The file `instanceTypes.tsv` contains DBpedia, Schema.org and Freebase types for each DBpediaResource; it is produced by `types.sh`.
- Each source can be created either from the legacy TSV files or from the Pig output (note, however, that the Pig versions are more up-to-date).
`WikipediaToDBpediaClosure` converts a Wikipedia URL to the DBpedia format and then follows the transitive closure of redirects in DBpedia to the final URI. If the DBpedia resource is a disambiguation page, it throws a `NotADBpediaResourceException`. This class requires the DBpedia triple files `redirects_en.nt` and `disambiguations_en.nt`.
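For illustration, resolving a single Wikipedia URL might look like the following sketch (the package, the constructor signature, and the method name `wikipediaToDBpediaURI` are assumptions based on the description above, not a verified API):

```scala
import java.io.FileInputStream
// Package and constructor signature are assumed here for illustration.
import org.dbpedia.spotlight.db.WikipediaToDBpediaClosure
import org.dbpedia.spotlight.exceptions.NotADBpediaResourceException

val closure = new WikipediaToDBpediaClosure(
  new FileInputStream("data/redirects_en.nt"),      // redirect triples
  new FileInputStream("data/disambiguations_en.nt") // disambiguation triples
)

try {
  // Follows the transitive closure of redirects to the final DBpedia URI.
  val uri = closure.wikipediaToDBpediaURI("http://en.wikipedia.org/wiki/Berlin")
  println(uri)
} catch {
  // Thrown when the URL resolves to a disambiguation page.
  case e: NotADBpediaResourceException => println("not a resource: " + e.getMessage)
}
```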
### Indexers
The interfaces for indexing the data sources are specified in `org.dbpedia.spotlight.model` in the index module. There are currently two indexers implementing the interfaces:
- in-memory indexer (`org.dbpedia.spotlight.db.MemoryStoreIndexer`, uses Kryo for serialization; see the sketch after this list)
- disk-based indexer (`org.dbpedia.spotlight.db.JDBMStoreIndexer`, uses JDBM3; this is still in development since I focused on having a running in-memory version first)
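To illustrate the serialization mechanism behind the in-memory indexer, here is a minimal, self-contained Kryo round trip (plain Kryo usage, not code from `MemoryStoreIndexer` itself; the file path is made up):

```scala
import java.io.{FileInputStream, FileOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

val kryo = new Kryo()

// Serialize an object graph to disk...
val out = new Output(new FileOutputStream("data/example.kryo"))
kryo.writeObject(out, Array("Berlin", "Germany"))
out.close()

// ...and load it back into memory.
val in = new Input(new FileInputStream("data/example.kryo"))
val restored = kryo.readObject(in, classOf[Array[String]])
in.close()
```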
### Running the import
Currently, there are two Scala objects for running the import: `org.dbpedia.spotlight.db.ImportPig` and `org.dbpedia.spotlight.db.ImportTSV`. They need to have the correct data paths set and can be run with:
```bash
mvn exec:java -pl index -Dexec.mainClass=org.dbpedia.spotlight.db.ImportPig
```

(needs `mvn package` first, but has less overhead) or:

```bash
mvn scala:run -DmainClass=org.dbpedia.spotlight.db.ImportPig
```
The full Pig-based import takes about 1.5 hours for the in-memory version (mainly due to reading the token occurrence file) and 6-7 hours for the disk-based version.
### Creating the in-memory version
When creating the in-memory version, the import should be run with enough heap space (when starting it through Maven, the heap size can be passed via `MAVEN_OPTS`, e.g. `MAVEN_OPTS="-Xmx12G"`). SurfaceForms, DBpediaResource and CandidateMap can be imported with `-Xmx5G` or `-Xmx6G`, but TokenOccurrences should be run with at least `-Xmx12G`.
The resulting serialized files require ~7 GB of memory when fully loaded (`-Xmx10G` worked well for me). The following files will be written to disk:
```
135M sf.mem
187M res.mem
204M candmap.mem
 19M tokens.mem
4.4G context.mem
```
The data (except for `context.mem`) can be downloaded here. Memory consumption after loading each store (the stores are loaded one after the other):
Store | Used heap space |
---|---|
1. MemorySurfaceFormStore | 798MB |
2. MemoryResourceStore | 1526MB |
3. MemoryCandidateMapStore | 2188MB |
4. MemoryTokenStore | 2016MB |
5. MemoryContextStore | 6762MB |
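The numbers above can be reproduced with a simple measurement after loading each store, e.g. (a sketch using the standard `Runtime` API; the `MemoryStore` loading call is taken from the usage example further below, and its package is an assumption):

```scala
import java.io.FileInputStream
import org.dbpedia.spotlight.db.memory.MemoryStore // package is an assumption

// Rough used-heap measurement; forcing GC first makes the number more stable.
def usedHeapMB: Long = {
  System.gc()
  val rt = Runtime.getRuntime
  (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
}

val sfStore = MemoryStore.loadSurfaceFormStore(new FileInputStream("data/sf.mem"))
println("after MemorySurfaceFormStore: " + usedHeapMB + " MB")
```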
## Using the data
All data stores follow the interfaces in `org.dbpedia.spotlight.db.model`. The elements in a data store can usually be queried by their internal ID or by their name (e.g. the URI without the prefix for DBpedia resources):
### Interfaces
`ResourceStore`

```scala
def getResource(id: Int): DBpediaResource
def getResourceByName(name: String): DBpediaResource
```

`SurfaceFormStore`

```scala
def getSurfaceForm(surfaceform: String): SurfaceForm
```

`CandidateMapStore`

```scala
def getCandidates(surfaceform: SurfaceForm): Set[Candidate]
```

`TokenStore`

```scala
def getToken(token: String): Token
def getTokenByID(id: Int): Token
```

`ContextStore`

```scala
def getContextCount(resource: DBpediaResource, token: Token): Int
def getContextCounts(resource: DBpediaResource): Map[Token, Int]
```
### Using the in-memory stores
The in-memory stores can be used as follows:

```scala
// Load the resource store first: the candidate map references it.
val resStore = MemoryStore.loadResourceStore(new FileInputStream("data/res.mem"))
val sfStore  = MemoryStore.loadSurfaceFormStore(new FileInputStream("data/sf.mem"))
val candMap  = MemoryStore.loadCandidateMapStore(new FileInputStream("data/candmap.mem"), resStore)
// [...]
```
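Once the stores are loaded, a lookup built from the interface methods listed above might look like this (`"Berlin"` is just an example surface form):

```scala
// Look up a surface form and retrieve its candidate resources
// via the stores loaded above.
val sf = sfStore.getSurfaceForm("Berlin")
val candidates = candMap.getCandidates(sf)
candidates.foreach(println)
```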
Disk-based stores can be used like this:

```scala
val diskContext = new DiskContextStore("data/context.disk")
```
## Database-backed TF*ICF disambiguator
The `ParagraphDisambiguator` `DBTwoStepDisambiguator` relies only on the store interfaces defined above and uses TF*ICF as the measure of context similarity.
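This page does not spell the measure out; TF*ICF is the analogue of TF*IDF with candidate resources in place of documents, following the inverse candidate frequency idea from the DBpedia Spotlight paper. My reading of it, stated in terms of the `getContextCount` interface above (a sketch, not a formula confirmed by this page):

```latex
% Inverse candidate frequency of token t for a surface form sf,
% where R_sf is the set of candidate resources of sf and n(t) is the
% number of candidates in R_sf whose context contains t:
\mathrm{ICF}(t) = \log \frac{|R_{sf}|}{n(t)}

% TF*ICF weight of token t for candidate resource r, with c(t, r)
% the context count returned by getContextCount(r, t):
\mathrm{TFICF}(t, r) = c(t, r) \cdot \mathrm{ICF}(t)
```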
## Performance and early Results
The following table shows the runtime on the Wikify dataset (this is not a thorough evaluation but an indication, so the table shows only a single run each). TF*ICF was calculated only for the best k candidates for each surface form, as ranked by the prior probability of the candidate, P(res|sf); a sketch of this pruning step follows the table.
k | dataset | time |
---|---|---|
0 (uses only prior) | Wikify, 50 paragraphs, 706 disambiguations | 6 sec |
10 | Wikify, 50 paragraphs, 706 disambiguations | 18 sec |
25 | Wikify, 50 paragraphs, 706 disambiguations | 47 sec |
50 | Wikify, 50 paragraphs, 706 disambiguations | 109 sec |
100 | Wikify, 50 paragraphs, 706 disambiguations | 244 sec |
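For reference, the pruning step described above could look roughly like the following sketch (the count fields `support` and `annotatedCount` used to estimate P(res|sf) are hypothetical names, not confirmed by this page):

```scala
import org.dbpedia.spotlight.model.{Candidate, SurfaceForm}

// Estimate the prior P(res|sf) from annotation counts: how often the
// surface form was annotated with this resource, divided by its total
// annotated count. The field names here are hypothetical.
def prior(candidate: Candidate, sf: SurfaceForm): Double =
  candidate.support.toDouble / sf.annotatedCount

// Keep only the k candidates with the highest prior; TF*ICF is then
// computed for these k candidates only (k=0 means prior-only).
def bestKByPrior(sf: SurfaceForm, candidates: Set[Candidate], k: Int): List[Candidate] =
  candidates.toList.sortBy(c => -prior(c, sf)).take(k)
```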
## Evaluation
Update: Note that the test dataset still contains disambiguation pages and not all redirects are resolved, so the final results will be slightly better than the results below. I will update the results once I have rebased my branch and can run the latest evaluation.
The accuracy and global MRR using only P(res|sf) derived from the Pig data, on the Wikify dataset:
```
Disambiguator: Database-backed 2 Step TF*ICF disambiguator (k=0)
Correct URI not found = 115 / 706 = 0.163
Accuracy = 528 / 706 = 0.748
Global MRR: 0.7808735541769214
```
UPDATE: after resolving redirects and excluding disambiguation pages:

```
Corpus: MilneWitten
Number of occs: 706 (original), 638 (processed)
Disambiguator: Database-backed 2 Step TF*ICF disambiguator (k=0)
Correct URI not found = 58 / 638 = 0.091
Accuracy = 526 / 638 = 0.824
Global MRR: 0.7724539606704485
```
and when using only TF*ICF:
```
Disambiguator: Database-backed 2 Step TF*ICF disambiguator
Correct URI not found = 123 / 706 = 0.174
Accuracy = 356 / 706 = 0.504
Global MRR: 0.6227541921588945
```
The accuracy for TF*ICF alone is very low; it is likely that there are still issues with the calculation of the TF*ICF score. Evaluations including TF*ICF will be added here as soon as I have re-estimated the weights for mixing the prior and the TF*ICF score.
## Issues and TODO
- re-estimate the weights for the disambiguator; try combining the scores using a log-linear model?
- the disk-based stores still need some work
- `WikipediaToDBpediaClosure` should ultimately be moved to Pig
- check and improve the performance of the TF*ICF calculation