# Wikidump 1500
## Preprocessing
To process our wikidump, we first have to (a) convert the XML into plain text, (b) tokenize, remove stopwords and lemmatize the content of each article, and (c) run the TF-IDF algorithm to get the relative frequency of each word.
To convert the XML into text, we ran our `XmlToParquetWriter`. Part (b) uses the Stanford NLP utilities, more specifically the spark-nlp library.
The creation of the TF-IDF matrix (part c) is trickier, since it assumes we know how many words we want in our dictionary (see the `CountVectorizer.setVocabSize(numTerms)` method); a sketch of this step is shown below. So we first ran some analysis on our dataset to choose that number.
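For reference, here is a minimal sketch of part (c) with Spark ML; the `docs` DataFrame and its `terms` column are assumptions standing in for the output of parts (a) and (b):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// `docs` is assumed to have one row per article, with a `terms` column holding
// the tokenized, stopword-filtered and lemmatized words from part (b).
val numTerms = 3000 // the vocabulary size we need to choose (see below)

val countVectorizer = new CountVectorizer()
  .setInputCol("terms")
  .setOutputCol("tf")
  .setVocabSize(numTerms) // keep only the `numTerms` most frequent words
val cvModel = countVectorizer.fit(docs)
val termFrequencies = cvModel.transform(docs)

// Re-weight the raw counts by inverse document frequency.
val idf = new IDF().setInputCol("tf").setOutputCol("tfidf")
val docTermMatrix = idf.fit(termFrequencies).transform(termFrequencies)
```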
## Terms analysis
The Scala script `spark-shell-scripts/preprocessing-analysis.scala` contains the code we used to get a feel for the content of our articles. The README in `src/main/scala/bda/lsa/preprocessing` explains some of the results.
After tokenisation/lemmatisation/filtering, the whole corpus contains 70'638 unique words. Some are redundant, for example `route` and `routes`; others are Chinese or non-English words. The most frequent ones are (see the sketch after the table):
term | TF | DF |
---|---|---|
use | 8092 | 1094 |
woman | 7834 | 391 |
also | 7352 | 1310 |
one | 6610 | 1098 |
people | 4942 | 821 |
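These counts come from a simple aggregation over the tokenized articles; here is a sketch, with the same assumed `docs` DataFrame and column names as above:

```scala
import org.apache.spark.sql.functions._

// One row per (article, term) occurrence.
val occurrences = docs.select(col("title"), explode(col("terms")).as("term"))

// TF = total number of occurrences, DF = number of distinct articles containing the term.
val stats = occurrences.groupBy("term").agg(
  count("*").as("TF"),
  countDistinct("title").as("DF")
)
stats.orderBy(desc("TF")).show(5)
```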
Most words are not frequent at all. Concerning the TF (term frequency), the median is 2 (see the sketch after the table):
stat | value |
---|---|
count | 70638 |
min | 1 |
max | 8092 |
avg | 20.64172259 |
median | 2 |
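These statistics can be reproduced from the `stats` DataFrame of the previous sketch, for instance:

```scala
import org.apache.spark.sql.functions.{avg, count, max, min}

// Basic statistics over the term frequencies.
stats.agg(count("TF"), min("TF"), max("TF"), avg("TF")).show()

// approxQuantile with a relative error of 0 returns the exact median.
val Array(median) = stats.stat.approxQuantile("TF", Array(0.5), 0.0)
```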
The next table is more telling, showing that about half of the terms appear only once:
TF | count | percent |
---|---|---|
== 1 | 34'584 | 49% |
> 10 | 11'141 | 16% |
> 50 | 4'106 | 6% |
> 100 | 2'476 | 4% |
> 1000 | 221 | 0% |
For the DF (document frequency), the number of words appearing in only one document is a bit higher; both distribution tables can be reproduced with the threshold filters sketched below:
DF | count | percent |
---|---|---|
== 1 | 41'559 | 59% |
> 1 | 29'079 | 41% |
> 10 | 8'189 | 12% |
> 50 | 2'599 | 4% |
> 100 | 1'427 | 2% |
> 1000 | 4 | 0% |
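The bucket counts in the two tables above boil down to simple threshold filters over the same `stats` DataFrame, for example:

```scala
import org.apache.spark.sql.functions.col

val total = stats.count().toDouble

// Terms appearing exactly once.
println("TF == 1: " + stats.filter(col("TF") === 1).count())

// Terms above each threshold, with their share of the vocabulary.
for (threshold <- Seq(10, 50, 100, 1000)) {
  val n = stats.filter(col("TF") > threshold).count()
  println(f"TF > $threshold: $n (${n / total * 100}%.0f%%)")
}
// The DF table is obtained the same way, filtering on the DF column instead.
```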
241 words appear more than 10 times in the corpus, but only in one document.
Summary: most words appear only once or twice. All in all, a vocabulary size of around 3'000 seems appropriate; it will select words appearing at least 50 times across more than one document (only 13 words with TF > 50 have a DF < 2).
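The 241 and 13 figures above can be checked directly against the assumed `stats` DataFrame:

```scala
import org.apache.spark.sql.functions.col

// Words appearing more than 10 times but confined to a single document: 241 in our corpus.
stats.filter(col("TF") > 10 && col("DF") === 1).count()

// Words passing a TF threshold of 50 while appearing in a single document: 13 in our corpus.
stats.filter(col("TF") > 50 && col("DF") < 2).count()
```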
## docTermMatrix
Now that we have found an a priori adequate vocabulary size, we can run `DocTermMatrixWriter`, passing 3'000 as the vocabulary size:
```
spark-submit --class bda.lsa.preprocessing.DocTermMatrixWriter \
    target/scala-2.11/bda-project-lsa-assembly-1.0.jar \
    3000
```