Downloads

In order to run DBpedia Spotlight on your server, you need to download our software and required data, which will vary in size depending on the kind of annotations that you need.

Table of Contents: Software; Data (Release 0.4, Release 0.5); Configuration

Software

The latest source code is available from the project's Git repository and can be browsed online.

Please refer to our installation instructions for more detailed information on how to install DBpedia Spotlight.

Data

Since we rely on data extracted from the entire Wikipedia, we cannot embed the dataset into our software distribution. We therefore provide here a list of required files in different sizes to suit many needs. You can also build these files yourself if you desire (see index module).

As our development progresses, the system may require different datasets to enable more sophisticated algorithms. Therefore, we have organized this section according to the software release tags. Please make sure to use the data required by the release that you have downloaded. Because trunk changes frequently, consult the discussion list if your build breaks; it may need a recently generated dataset.

Furthermore, we rely on some DBpedia datasets for the generation of dictionaries and resource types, among other information. They can be downloaded at http://dbpedia.org/downloads.

Release 0.4

If you would like to run DBpedia Spotlight on your server, you will need data from the files below:

Disambiguation index (Lucene) compact (tar.gz), large (tar.gz)
Spotter lexicon (~LingPipe dictionary) small (gz), medium (gz), large (gz)
Spot selection model: (tar.gz)

Depending on your application, you may also be able to benefit from the DBpedia Lexicalizations Dataset produced as a result of this project. This dataset is not required to run DBpedia Spotlight, since the Disambiguation Index and the Spotter Lexicon contain all the necessary information, including lexicalizations.

DBpedia Lexicalizations dataset n-quads.tar.gz

This release used data from DBpedia 3.6.

If you are running indexing, you will also need our stopwords file (or you can create your own):

stopwords_en.list

Release 0.5

If you would like to run DBpedia Spotlight on your server, you will need data from the two files below:

Disambiguation index (Lucene) compact (tar.gz), large (tar.gz)
Spotter lexicon (~LingPipe dictionary) small (gz), medium (gz), large (gz)

Optional files for associated components:

Spot selection cooccurrence model: (tar.gz)
OpenNLP models for NERSpotter and OpenNLPNGramSpotter (tar.gz)

If you are running indexing, you will also need our stopwords file (or you can create your own):

stopwords_en.list

This release uses data from DBpedia 3.7.

Configuration

First, download and decompress the disambiguation index and the spotter lexicon, for example:

    wget http://spotlight.dbpedia.org/download/release-0.5/context-index-compact.tgz
    tar zxvf context-index-compact.tgz
    wget http://spotlight.dbpedia.org/download/release-0.4/surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary.gz
    gunzip surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary.gz
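
Because these files are large, an interrupted download can leave a corrupt archive behind. The following is a minimal sketch of an integrity check that could be run before unpacking. The function name `check_archive` is hypothetical and not part of DBpedia Spotlight; it relies only on the standard `tar tzf` and `gzip -t` commands.

```shell
#!/bin/sh
# Hypothetical helper (not part of DBpedia Spotlight): check that a downloaded
# archive is intact. Returns 0 if the file passes the test, non-zero otherwise.
check_archive() {
    file="$1"
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
    case "$file" in
        *.tgz|*.tar.gz)
            # List the archive contents without extracting; fails on a corrupt file
            tar tzf "$file" > /dev/null 2>&1 ;;
        *.gz)
            # gzip -t tests the compressed stream without writing any output
            gzip -t "$file" 2> /dev/null ;;
        *)
            echo "unknown archive type: $file" >&2; return 1 ;;
    esac
}

# Example use with the file names from the commands above:
#   check_archive context-index-compact.tgz && tar zxvf context-index-compact.tgz
```

If the check fails, downloading the file again is usually the simplest remedy.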

Now you just need to change the server.properties file to point to your newly extracted files:

    org.dbpedia.spotlight.index.dir = index-withSF-withTypes-compressed
    org.dbpedia.spotlight.spot.dictionary = surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary
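
If you prefer to script this change rather than edit the file by hand, the sketch below shows one possible approach. The function `set_property` is a hypothetical helper, not something shipped with DBpedia Spotlight, and it assumes that server.properties uses simple one-line `key = value` entries like the two shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DBpedia Spotlight): set "key = value" in a
# properties file, replacing an existing entry or appending one if it is absent.
set_property() {
    props="$1"; key="$2"; value="$3"
    # Escape dots in the key so they match literally in the patterns below
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^[[:space:]]*${key_re}[[:space:]]*=" "$props"; then
        # '|' is the sed delimiter, so this sketch assumes the value contains
        # no '|', '&', or backslash characters
        sed "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key} = ${value}|" "$props" > "$props.tmp" \
            && mv "$props.tmp" "$props"
    else
        printf '%s = %s\n' "$key" "$value" >> "$props"
    fi
}

# Example use with the values from the listing above:
#   set_property server.properties org.dbpedia.spotlight.index.dir index-withSF-withTypes-compressed
#   set_property server.properties org.dbpedia.spotlight.spot.dictionary surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary
```

The helper writes to a temporary file and then moves it into place, so an error in the `sed` step leaves the original server.properties untouched.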

More info on how to: