Data Sources - SeanTater/uncc2014watsonsim GitHub Wiki
Since this is a knowledge-based application, it needs serious sources of knowledge to power it, so there is a lot of data to download. We try to make it easy to set up: there are just two steps. Depending on your connection and machine, though, these two steps could take quite some time.
The Local Indexes (Indri, Lucene and Jena)
We distribute our data on Dropbox, but occasionally we get throttled. Report an issue (on the right) if there are problems. To install the indexes, download the data/ archive and extract it into the already existing data/ directory, overwriting files if necessary.
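As a rough illustration of the extraction step, here is a minimal Python sketch. It assumes the archive is in a format that `shutil.unpack_archive` can detect from the file extension (zip, tar, tar.gz and similar) and that its contents belong directly under data/. The function name `extract_data_archive` and the file name in the usage note are hypothetical, not part of the project.

```python
import shutil
from pathlib import Path


def extract_data_archive(archive: str, data_dir: str = "data") -> list[str]:
    """Unpack `archive` into `data_dir`, overwriting files that already exist.

    Hypothetical helper: the archive name and layout are assumptions.
    """
    target = Path(data_dir)
    # The data/ directory should already exist; create it just in case.
    target.mkdir(parents=True, exist_ok=True)
    # unpack_archive infers the format from the archive's file extension.
    shutil.unpack_archive(archive, extract_dir=target)
    # Return the relative paths of all files now present, for a quick sanity check.
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )
```

For example, `extract_data_archive("watsonsim-data.zip", "data")` would unpack a (hypothetically named) download into data/ and return the list of files found there afterwards.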
The Relational Database
Depending on the scale of the installation, you may prefer to use the Postgres database backup, but we recommend using the SQLite database-in-a-file until you are sure you have outgrown it. (Unpack it first though; it's compressed to save bandwidth.)
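To check that the SQLite file unpacked correctly, something like the following Python sketch may help. It assumes the download is gzip-compressed, which may not match the actual archive format, and the function names and paths are hypothetical. It decompresses the file and lists the tables SQLite can see.

```python
import gzip
import shutil
import sqlite3


def unpack_sqlite(compressed_path: str, db_path: str) -> None:
    """Decompress a gzip-compressed database file (gzip is an assumption)."""
    with gzip.open(compressed_path, "rb") as src, open(db_path, "wb") as dst:
        # Stream the bytes so large databases do not have to fit in memory.
        shutil.copyfileobj(src, dst)


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in the SQLite database, sorted."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

If `list_tables` returns an empty list or raises an error, the file is probably still compressed or the download was incomplete.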
What we are using
Our data comes from several sources, including:
- Wikipedia, Wiktionary and Wikiquote XML dumps, from which we gather and use:
  - Articles, both as whole texts and split into paragraphs
  - Links, including source, target, label, and count
  - Redirects, target and source
- Wikipedia pageview statistics, used as a scorer
- Full texts of Shakespeare
- Bing Search (several thousand queries are cached in the database, to reduce traffic and to enable reproducibility)
- The DBpedia ontologies, as well as English labels, instance types, and short abstracts
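The query caching mentioned for Bing Search can be pictured as a read-through cache: look the query up in the database first and only go to the network on a miss. The Python sketch below shows the idea under stated assumptions. The table name `search_cache`, its columns, and the `fetch` callback are all hypothetical and do not reflect the project's actual schema or code.

```python
import sqlite3
from typing import Callable


def cached_search(
    conn: sqlite3.Connection, query: str, fetch: Callable[[str], str]
) -> str:
    """Return the stored result for `query`, calling `fetch` only on a cache miss."""
    # Hypothetical cache table; the real database layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_cache "
        "(query TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    row = conn.execute(
        "SELECT result FROM search_cache WHERE query = ?", (query,)
    ).fetchone()
    if row is not None:
        # Cache hit: no network traffic, and the same answer on every run.
        return row[0]
    # Cache miss: ask the search backend once and remember the answer.
    result = fetch(query)
    conn.execute(
        "INSERT INTO search_cache (query, result) VALUES (?, ?)", (query, result)
    )
    conn.commit()
    return result
```

Because repeated queries are served from the database, two runs over the same questions see identical search results, which is what makes experiments reproducible.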
We have also pre-indexed these sources in the archive:
- Using Indri search on all the articles and paragraphs
- Using Lucene on articles, paragraphs, and DBpedia labels
- Using Jena on everything else we downloaded from DBpedia
- Using PostgreSQL and SQLite for the relational database dump, both containing many tables indexed in several ways.
In Development
We're looking into better ways to synchronize the data we need for the project, and we think we can do it with BitTorrent Sync instead of Dropbox to avoid problems with bandwidth. Once that is in place, you should be able to load the external data archive using a BitTorrent Sync link.