# Wikidump 1500
## Preprocessing
To process our wikidump, we first have to (a) convert the XML into plain text, (b) tokenize, remove stopwords and lemmatize the content of each article, and (c) run the TF-IDF algorithm to get the relative frequency of each word.
To convert the XML into text, we ran our `XmlToParquetWriter`. Part (b) uses the Stanford NLP utilities, more specifically the spark-nlp library.
The creation of the TF-IDF matrix (part c) is trickier, since it assumes we know how many words we want in our dictionary (see the `CountVectorizer.setVocabSize(numTerms)` method); a sketch of this step is shown below. So we first ran some analysis on our dataset to choose that number.
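For reference, here is a minimal sketch of part (c) with Spark ML; the `docs` DataFrame and its `terms` column are assumptions standing in for the output of parts (a) and (b):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// `docs` is assumed to have one row per article, with a `terms` column holding
// the tokenized, stopword-filtered and lemmatized words from part (b).
val numTerms = 3000 // the vocabulary size we need to choose (see below)

val countVectorizer = new CountVectorizer()
  .setInputCol("terms")
  .setOutputCol("tf")
  .setVocabSize(numTerms) // keep only the `numTerms` most frequent words
val cvModel = countVectorizer.fit(docs)
val termFrequencies = cvModel.transform(docs)

// Re-weight the raw counts by inverse document frequency.
val idf = new IDF().setInputCol("tf").setOutputCol("tfidf")
val docTermMatrix = idf.fit(termFrequencies).transform(termFrequencies)
```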
## Terms analysis
The Scala script `spark-shell-scripts/preprocessing-analysis.scala` contains the code we used to get a feel for the content of our articles. The README in `src/main/scala/bda/lsa/preprocessing` explains some of the results.
After tokenisation/lemmatisation/filtering, the whole corpus contains 70'638 unique words. Some are redundant, for example `route` and `routes`; others are Chinese or non-English words. The most frequent ones are (see the sketch after the table):
term | TF | DF |
---|---|---|
use | 8092 | 1094 |
woman | 7834 | 391 |
also | 7352 | 1310 |
one | 6610 | 1098 |
people | 4942 | 821 |
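These counts come from a simple aggregation over the tokenized articles; here is a sketch, with the same assumed `docs` DataFrame and column names as above:

```scala
import org.apache.spark.sql.functions._

// One row per (article, term) occurrence.
val occurrences = docs.select(col("title"), explode(col("terms")).as("term"))

// TF = total number of occurrences, DF = number of distinct articles containing the term.
val stats = occurrences.groupBy("term").agg(
  count("*").as("TF"),
  countDistinct("title").as("DF")
)
stats.orderBy(desc("TF")).show(5)
```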
Most words are not frequent at all. Concerning the TF (term frequency), the median is 2 (see the sketch after the table):
stat | value |
---|---|
count | 70638 |
min | 1 |
max | 8092 |
avg | 20.64172259 |
median | 2 |
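These statistics can be reproduced from the `stats` DataFrame of the previous sketch, for instance:

```scala
import org.apache.spark.sql.functions.{avg, count, max, min}

// Basic statistics over the term frequencies.
stats.agg(count("TF"), min("TF"), max("TF"), avg("TF")).show()

// approxQuantile with a relative error of 0 returns the exact median.
val Array(median) = stats.stat.approxQuantile("TF", Array(0.5), 0.0)
```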
The next table is more telling, showing that about half of the terms appear only once:
TF | count | percent |
---|---|---|
== 1 | 34'584 | 49% |
> 10 | 11'141 | 16% |
> 50 | 4'106 | 6% |
> 100 | 2'476 | 4% |
> 1000 | 221 | 0% |
For the DF (document frequency), the number of words appearing in only one document is a bit higher; both distribution tables can be reproduced with the threshold filters sketched below:
DF | count | percent |
---|---|---|
== 1 | 41'559 | 59% |
> 1 | 29'079 | 41% |
> 10 | 8'189 | 12% |
> 50 | 2'599 | 4% |
> 100 | 1'427 | 2% |
> 1000 | 4 | 0% |
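The bucket counts in the two tables above boil down to simple threshold filters over the same `stats` DataFrame, for example:

```scala
import org.apache.spark.sql.functions.col

val total = stats.count().toDouble

// Terms appearing exactly once.
println("TF == 1: " + stats.filter(col("TF") === 1).count())

// Terms above each threshold, with their share of the vocabulary.
for (threshold <- Seq(10, 50, 100, 1000)) {
  val n = stats.filter(col("TF") > threshold).count()
  println(f"TF > $threshold: $n (${n / total * 100}%.0f%%)")
}
// The DF table is obtained the same way, filtering on the DF column instead.
```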
241 words appear more than 10 times in the corpus, but only in one document.
Summary: most words appear only once or twice. All in all, a vocabulary size of around 3'000 seems appropriate; it will select words appearing at least 50 times across more than one document (only 13 words with TF > 50 have a DF < 2).
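The 241 and 13 figures above can be checked directly against the assumed `stats` DataFrame:

```scala
import org.apache.spark.sql.functions.col

// Words appearing more than 10 times but confined to a single document: 241 in our corpus.
stats.filter(col("TF") > 10 && col("DF") === 1).count()

// Words passing a TF threshold of 50 while appearing in a single document: 13 in our corpus.
stats.filter(col("TF") > 50 && col("DF") < 2).count()
```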
## docTermMatrix
Now that we have found an a priori adequate vocabulary size, we can run `DocTermMatrixWriter`, passing 3'000 as the vocabulary size:
```
spark-submit --class bda.lsa.preprocessing.DocTermMatrixWriter \
    target/scala-2.11/bda-project-lsa-assembly-1.0.jar \
    3000
```