# Using the database-backed core (GSoC 2012, Jo)
The current version of the database-backed core can be found in my branch: `dbpedia-spotlight-db`.
## Data Import
### Data Sources
There are `Source` objects for the following datasets in `org.dbpedia.spotlight.db.io`:
Data | Required datasets (Pig only) | Required objects |
---|---|---|
SurfaceForm | `sfCounts`, `phrasesCounts` | |
DBpediaResource | `uriCounts`, `instanceTypes.tsv` | WikipediaToDBpediaClosure |
CandidateMap | `pairCounts` | WikipediaToDBpediaClosure, SurfaceFormStore, ResourceStore |
Tokens | `token_counts*.tsv` | |
TokenOccurrences | `token_counts*.tsv` | WikipediaToDBpediaClosure, TokenStore |
- The file `instanceTypes.tsv` contains DBpedia, Schema.org and Freebase types for each DBpediaResource; it is produced by `types.sh`.
- Each source can be created either from the legacy TSV files or from the Pig output (note, however, that the Pig versions are more up-to-date).
`WikipediaToDBpediaClosure` converts a Wikipedia URL to the DBpedia format and then follows the transitive closure of redirects in DBpedia to the final URI. If the DBpedia resource is a disambiguation page, it throws a `NotADBpediaResourceException`. This class requires the DBpedia triple files `redirects_en.nt` and `disambiguations_en.nt`.
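For illustration, resolving a single Wikipedia URL might look like the following sketch (the package, the constructor signature, and the method name `wikipediaToDBpediaURI` are assumptions based on the description above, not a verified API):

```scala
import java.io.FileInputStream
// Package and constructor signature are assumed here for illustration.
import org.dbpedia.spotlight.db.WikipediaToDBpediaClosure
import org.dbpedia.spotlight.exceptions.NotADBpediaResourceException

val closure = new WikipediaToDBpediaClosure(
  new FileInputStream("data/redirects_en.nt"),      // redirect triples
  new FileInputStream("data/disambiguations_en.nt") // disambiguation triples
)

try {
  // Follows the transitive closure of redirects to the final DBpedia URI.
  val uri = closure.wikipediaToDBpediaURI("http://en.wikipedia.org/wiki/Berlin")
  println(uri)
} catch {
  // Thrown when the URL resolves to a disambiguation page.
  case e: NotADBpediaResourceException => println("not a resource: " + e.getMessage)
}
```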
### Indexers
The interfaces for indexing the data sources are specified in `org.dbpedia.spotlight.model` in the index module. There are currently two indexers implementing the interfaces:
- in-memory indexer (`org.dbpedia.spotlight.db.MemoryStoreIndexer`, uses Kryo for serialization; see the sketch after this list)
- disk-based indexer (`org.dbpedia.spotlight.db.JDBMStoreIndexer`, uses JDBM3; this is still in development since I focused on having a running in-memory version first)
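To illustrate the serialization mechanism behind the in-memory indexer, here is a minimal, self-contained Kryo round trip (plain Kryo usage, not code from `MemoryStoreIndexer` itself; the file path is made up):

```scala
import java.io.{FileInputStream, FileOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

val kryo = new Kryo()

// Serialize an object graph to disk...
val out = new Output(new FileOutputStream("data/example.kryo"))
kryo.writeObject(out, Array("Berlin", "Germany"))
out.close()

// ...and load it back into memory.
val in = new Input(new FileInputStream("data/example.kryo"))
val restored = kryo.readObject(in, classOf[Array[String]])
in.close()
```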
### Running the import
Currently, there are two Scala objects for running the import: `org.dbpedia.spotlight.db.ImportPig` and `org.dbpedia.spotlight.db.ImportTSV`. They need to have the correct data paths set and can be run with:
```bash
mvn exec:java -pl index -Dexec.mainClass=org.dbpedia.spotlight.db.ImportPig
```

(needs `mvn package` first, but has less overhead) or:

```bash
mvn scala:run -DmainClass=org.dbpedia.spotlight.db.ImportPig
```
The full Pig-based import takes about 1.5 hours for the in-memory version (mainly due to reading the token occurrence file) and 6-7 hours for the disk-based version.
### Creating the in-memory version
When creating the in-memory version, the import should be run with enough heap space (when starting it through Maven, the heap size can be passed via `MAVEN_OPTS`, e.g. `MAVEN_OPTS="-Xmx12G"`). SurfaceForms, DBpediaResource and CandidateMap can be imported with `-Xmx5G` or `-Xmx6G`, but TokenOccurrences should be run with at least `-Xmx12G`.
The resulting serialized files require ~7 GB of memory when fully loaded (`-Xmx10G` worked well for me). The following files will be written to disk:
```
135M sf.mem
187M res.mem
204M candmap.mem
 19M tokens.mem
4.4G context.mem
```
The data (except for `context.mem`) can be downloaded here. Memory consumption after loading each store (the stores are loaded one after the other):
Store | Used heap space |
---|---|
1. MemorySurfaceFormStore | 798MB |
2. MemoryResourceStore | 1526MB |
3. MemoryCandidateMapStore | 2188MB |
4. MemoryTokenStore | 2016MB |
5. MemoryContextStore | 6762MB |
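The numbers above can be reproduced with a simple measurement after loading each store, e.g. (a sketch using the standard `Runtime` API; the `MemoryStore` loading call is taken from the usage example further below, and its package is an assumption):

```scala
import java.io.FileInputStream
import org.dbpedia.spotlight.db.memory.MemoryStore // package is an assumption

// Rough used-heap measurement; forcing GC first makes the number more stable.
def usedHeapMB: Long = {
  System.gc()
  val rt = Runtime.getRuntime
  (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
}

val sfStore = MemoryStore.loadSurfaceFormStore(new FileInputStream("data/sf.mem"))
println("after MemorySurfaceFormStore: " + usedHeapMB + " MB")
```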
## Using the data
All data stores follow the interfaces in `org.dbpedia.spotlight.db.model`. The elements in a data store can usually be queried by their internal ID or by their name (e.g. the URI without the prefix for DBpedia resources):
### Interfaces
`ResourceStore`

```scala
def getResource(id: Int): DBpediaResource
def getResourceByName(name: String): DBpediaResource
```

`SurfaceFormStore`

```scala
def getSurfaceForm(surfaceform: String): SurfaceForm
```

`CandidateMapStore`

```scala
def getCandidates(surfaceform: SurfaceForm): Set[Candidate]
```

`TokenStore`

```scala
def getToken(token: String): Token
def getTokenByID(id: Int): Token
```

`ContextStore`

```scala
def getContextCount(resource: DBpediaResource, token: Token): Int
def getContextCounts(resource: DBpediaResource): Map[Token, Int]
```
### Using the in-memory stores
The in-memory stores can be used as follows:

```scala
// Load the resource store first: the candidate map references it.
val resStore = MemoryStore.loadResourceStore(new FileInputStream("data/res.mem"))
val sfStore  = MemoryStore.loadSurfaceFormStore(new FileInputStream("data/sf.mem"))
val candMap  = MemoryStore.loadCandidateMapStore(new FileInputStream("data/candmap.mem"), resStore)
// [...]
```
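Once the stores are loaded, a lookup built from the interface methods listed above might look like this (`"Berlin"` is just an example surface form):

```scala
// Look up a surface form and retrieve its candidate resources
// via the stores loaded above.
val sf = sfStore.getSurfaceForm("Berlin")
val candidates = candMap.getCandidates(sf)
candidates.foreach(println)
```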
Disk-based stores can be used like this:

```scala
val diskContext = new DiskContextStore("data/context.disk")
```
## Database-backed TF*ICF disambiguator
The `ParagraphDisambiguator` `DBTwoStepDisambiguator` relies only on the store interfaces defined above and uses TF*ICF as the measure of context similarity.
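This page does not spell the measure out; TF*ICF is the analogue of TF*IDF with candidate resources in place of documents, following the inverse candidate frequency idea from the DBpedia Spotlight paper. My reading of it, stated in terms of the `getContextCount` interface above (a sketch, not a formula confirmed by this page):

```latex
% Inverse candidate frequency of token t for a surface form sf,
% where R_sf is the set of candidate resources of sf and n(t) is the
% number of candidates in R_sf whose context contains t:
\mathrm{ICF}(t) = \log \frac{|R_{sf}|}{n(t)}

% TF*ICF weight of token t for candidate resource r, with c(t, r)
% the context count returned by getContextCount(r, t):
\mathrm{TFICF}(t, r) = c(t, r) \cdot \mathrm{ICF}(t)
```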
## Performance and early Results
The following table shows the runtime on the Wikify dataset (this is not a thorough evaluation but an indication, so the table shows only a single run each). TF*ICF was calculated only for the best k candidates for each surface form, as ranked by the prior probability of the candidate, P(res|sf); a sketch of this pruning step follows the table.
k | dataset | time |
---|---|---|
0 (uses only prior) | Wikify, 50 paragraphs, 706 disambiguations | 6 sec |
10 | Wikify, 50 paragraphs, 706 disambiguations | 18 sec |
25 | Wikify, 50 paragraphs, 706 disambiguations | 47 sec |
50 | Wikify, 50 paragraphs, 706 disambiguations | 109 sec |
100 | Wikify, 50 paragraphs, 706 disambiguations | 244 sec |
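For reference, the pruning step described above could look roughly like the following sketch (the count fields `support` and `annotatedCount` used to estimate P(res|sf) are hypothetical names, not confirmed by this page):

```scala
import org.dbpedia.spotlight.model.{Candidate, SurfaceForm}

// Estimate the prior P(res|sf) from annotation counts: how often the
// surface form was annotated with this resource, divided by its total
// annotated count. The field names here are hypothetical.
def prior(candidate: Candidate, sf: SurfaceForm): Double =
  candidate.support.toDouble / sf.annotatedCount

// Keep only the k candidates with the highest prior; TF*ICF is then
// computed for these k candidates only (k=0 means prior-only).
def bestKByPrior(sf: SurfaceForm, candidates: Set[Candidate], k: Int): List[Candidate] =
  candidates.toList.sortBy(c => -prior(c, sf)).take(k)
```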
## Evaluation
Update: Note that the test dataset still contains disambiguation pages and not all redirects are resolved, so the final results will be slightly better than the results below. I will update the results once I have rebased my branch and can run the latest evaluation.
The accuracy and global MRR using only P(res|sf) derived from the Pig data, on the Wikify dataset:
```
Disambiguator: Database-backed 2 Step TF*ICF disambiguator (k=0)
Correct URI not found = 115 / 706 = 0.163
Accuracy = 528 / 706 = 0.748
Global MRR: 0.7808735541769214
```
UPDATE: after resolving redirects and excluding disambiguation pages:

```
Corpus: MilneWitten
Number of occs: 706 (original), 638 (processed)
Disambiguator: Database-backed 2 Step TF*ICF disambiguator (k=0)
Correct URI not found = 58 / 638 = 0.091
Accuracy = 526 / 638 = 0.824
Global MRR: 0.7724539606704485
```
and when using only TF*ICF:
```
Disambiguator: Database-backed 2 Step TF*ICF disambiguator
Correct URI not found = 123 / 706 = 0.174
Accuracy = 356 / 706 = 0.504
Global MRR: 0.6227541921588945
```
The accuracy for TF*ICF alone is very low; it is likely that there are still issues with the calculation of the TF*ICF score. Evaluations including TF*ICF will be added here as soon as I have re-estimated the weights for mixing the prior and the TF*ICF score.
## Issues and TODO
- re-estimate the weights for the disambiguator; try combining the scores using a log-linear model?
- the disk-based stores still need some work
- `WikipediaToDBpediaClosure` should ultimately be moved to Pig
- check and improve the performance of the TF*ICF calculation