Wikidump 1500

Preprocessing

To process our wikidump, we first have to (a) convert the XML into plain text, (b) tokenize, remove stopwords, and lemmatize the content of each article, and (c) run the TF-IDF algorithm to compute the relative frequency of each word.

To convert the XML into plain text (part a), we ran our XmlToParquetWriter. Part (b) uses the Stanford NLP utilities, more specifically through the spark-nlp library.
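
As a rough illustration, lemmatization with Stanford CoreNLP usually follows the pattern below. This is a minimal sketch only: the helper name, the stopword set and the exact filtering rules are assumptions, not necessarily what our code does.

    import java.util.Properties
    import scala.collection.JavaConverters._
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
    import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}

    // build a CoreNLP pipeline doing tokenization, sentence splitting,
    // POS tagging and lemmatization
    def createNLPPipeline(): StanfordCoreNLP = {
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      new StanfordCoreNLP(props)
    }

    // hypothetical helper: turn one article into a sequence of lowercase lemmas,
    // dropping stopwords, short tokens and non-alphabetic tokens
    def plainTextToLemmas(text: String, stopWords: Set[String],
                          pipeline: StanfordCoreNLP): Seq[String] = {
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      for {
        sentence <- doc.get(classOf[SentencesAnnotation]).asScala
        token    <- sentence.get(classOf[TokensAnnotation]).asScala
        lemma     = token.get(classOf[LemmaAnnotation]).toLowerCase
        if lemma.length > 2 && !stopWords(lemma) && lemma.forall(Character.isLetter)
      } yield lemma
    }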

The TF-IDF computation (part c) is trickier, since it assumes we already know how many words we want in our dictionary (see the CountVectorizer.setVocabSize(numTerms) method). So we first ran some analysis on our dataset.
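
For context, here is roughly how such a TF-IDF step is wired with Spark ML. This is a sketch only: lemmatized (a DataFrame with one array of lemmas per article), the column names and numTerms are assumptions, not our actual code.

    import org.apache.spark.ml.feature.{CountVectorizer, IDF}

    // numTerms: the desired vocabulary size, e.g. 3000.
    // CountVectorizer keeps only the numTerms most frequent terms.
    val cvModel = new CountVectorizer()
      .setInputCol("lemmas")
      .setOutputCol("rawFeatures")
      .setVocabSize(numTerms)
      .fit(lemmatized)
    val termFreqs = cvModel.transform(lemmatized)

    // rescale the raw counts by inverse document frequency
    val idfModel = new IDF()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .fit(termFreqs)
    val docTermMatrix = idfModel.transform(termFreqs)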

Terms analysis

The Scala script spark-shell-scripts/preprocessing-analysis.scala contains the code we used to get a grasp of the content of our articles. The README in src/main/scala/bda/lsa/preprocessing explains some of the results.
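
The script itself is not reproduced here, but the gist of the TF/DF computation looks like the following sketch, assuming a DataFrame lemmatized with an array column lemmas (one row per article):

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // TF: total number of occurrences of each term across the corpus
    val tf = lemmatized
      .select(explode($"lemmas").as("term"))
      .groupBy("term")
      .agg(count("*").as("TF"))

    // DF: number of distinct documents containing each term
    val df = lemmatized
      .withColumn("docId", monotonically_increasing_id())
      .select($"docId", explode($"lemmas").as("term"))
      .distinct()
      .groupBy("term")
      .agg(count("*").as("DF"))

    val termStats = tf.join(df, "term")
    termStats.orderBy(desc("TF")).show(5)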

After tokenization, lemmatization and filtering, the whole corpus contains 70'638 unique words. Some are redundant, for example _route_ and _routes_; others are Chinese or other non-English words. The most frequent ones are:

| term   | TF   | DF   |
|--------|------|------|
| use    | 8092 | 1094 |
| woman  | 7834 | 391  |
| also   | 7352 | 1310 |
| one    | 6610 | 1098 |
| people | 4942 | 821  |

Most words are not frequent at all. Looking at the TF (term frequency), the median is 2:

| stat   | value       |
|--------|-------------|
| count  | 70'638      |
| min    | 1           |
| max    | 8092        |
| avg    | 20.64172259 |
| median | 2           |
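
These statistics can be computed directly on the joined frame, for example (same hypothetical termStats frame as in the sketch above):

    import org.apache.spark.sql.functions._

    // overall statistics on the TF column
    termStats.agg(
      count("*").as("count"),
      min("TF").as("min"),
      max("TF").as("max"),
      avg("TF").as("avg")
    ).show()

    // a relative error of 0 makes approxQuantile return the exact median
    val Array(median) = termStats.stat.approxQuantile("TF", Array(0.5), 0.0)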

The next table is more informative, showing that about half of the terms appear only once:

| TF     | count  | percent |
|--------|--------|---------|
| == 1   | 34'584 | 49%     |
| > 10   | 11'141 | 16%     |
| > 50   | 4'106  | 6%      |
| > 100  | 2'476  | 4%      |
| > 1000 | 221    | 0%      |

For the DF (document frequency), the proportion of words appearing in only one document is even higher (59%, vs. 49% of words with a TF of 1):

| DF     | count  | percent |
|--------|--------|---------|
| == 1   | 41'559 | 59%     |
| > 1    | 29'079 | 41%     |
| > 10   | 8'189  | 12%     |
| > 50   | 2'599  | 4%      |
| > 100  | 1'427  | 2%      |
| > 1000 | 4      | 0%      |

241 words appear more than 10 times in the corpus, but only in one document.
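
Counts like the ones above boil down to simple filters over the same hypothetical termStats frame, for example:

    import spark.implicits._

    // share of the vocabulary above each TF threshold
    val total = termStats.count().toDouble
    for (t <- Seq(1L, 10L, 50L, 100L, 1000L)) {
      val n = termStats.filter($"TF" > t).count()
      println(f"TF > $t%d: $n%d (${100 * n / total}%.0f%%)")
    }

    // frequent terms confined to a single document (241 in our corpus)
    termStats.filter($"TF" > 10 && $"DF" === 1).count()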

Summary:

about half of the words appear only once. All in all, a vocabulary size of around 3'000 seems appropriate: it lies between the 4'106 terms with TF > 50 and the 2'476 terms with TF > 100, so it will select words appearing at least around 50 times and, in practice, in more than one document (only 13 words with TF > 50 have a DF < 2).

docTermMatrix

Now that we have found an a priori adequate vocabulary size, we can run the DocTermMatrixWriter:

    spark-submit --class bda.lsa.preprocessing.DocTermMatrixWriter \
            target/scala-2.11/bda-project-lsa-assembly-1.0.jar  \
            3000