Datasets - derlin/bda-lsa-project GitHub Wiki
## Full wikidump

The full wikidump is available at https://dumps.wikimedia.org/enwiki. We used `enwiki-20170201-pages-articles-multistream.xml.bz2`, which is 57.5 GB once uncompressed.
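For reference, a dump like this can be parsed into a Spark DataFrame with the third-party spark-xml package; the sketch below is a minimal, hypothetical example (the row tag and the selected columns are assumptions about the dump's XML layout, not this project's actual loading code):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: parse the decompressed dump with the spark-xml
// package (com.databricks:spark-xml on the classpath).
val path = "enwiki-20170201-pages-articles-multistream.xml"

val spark = SparkSession.builder()
  .appName("wikidump-load")
  .getOrCreate()

val docTexts = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "page")            // one DataFrame row per <page> element
  .load(path)
  .select("title", "revision.text")    // keep only the fields needed downstream
```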
## Samples

We tried to get a sample of the full wikidump using two methods: `randomSplit` and `takeSample`.
```scala
// keep roughly 10% of the data (randomSplit weights are relative, and the
// resulting split sizes are approximate)
val percent = 0.1
val sampleDF = docTexts.randomSplit(Array(percent, 1 - percent))(0)
```
```scala
// take exactly numSamples rows without replacement; note that takeSample is
// an RDD method and returns a local Array, not a DataFrame
val numSamples = 1000
val sampleRows = docTexts.rdd.takeSample(false, numSamples)
```
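Spark's built-in `DataFrame.sample` can also be used directly; a minimal sketch, where the fraction and seed are illustrative values:

```scala
// Bernoulli sampling: keeps roughly `fraction` of the rows, so the resulting
// count is approximate; the seed makes the sample reproducible
val fraction = 0.1
val seed = 42L
val sampleDF = docTexts.sample(withReplacement = false, fraction = fraction, seed = seed)
```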
## Custom dump

The approaches above operate on a very big dataset, and a random sample gives no guarantee of containing linked articles, i.e. articles sharing common concepts. For the exercise, we therefore decided to create our own sample using the Wikipedia export tool.
The list of categories as well as the list of pages are available in the repo, and the final wikidump is also in the repo under the name `wikidump-1500.xml`.
The dataset contains 1520 articles.