Markov Clustering Analysis

This page is a walk-through of the steps taken to analyze clusters generated using the Markov Clustering (MCL) algorithm for bibliometric data.

For a detailed explanation on running MCL, please see this page.


MCL

There are two main parameters for running MCL on an input graph - Inflation and Expansion.

We used four different values for inflation - 2.0, 3.0, 4.0, and 6.0 - while keeping expansion constant at 2.0. Also see the MCL page for additional options.

  1. Clusters with unshuffled edges:   This is the original output of the Markov Clustering algorithm that we wish to analyze by way of conductance and coherence. The input is a paired edge-list, where each pair consists of either an article and another article it cites or two articles that have been co-cited.
  2. Clusters with shuffled edges:   Once the algorithm has finished assigning clusters, we can obtain a randomized clustering of the original edge-list by shuffling the edges. This serves as a baseline comparison for the MCL output. This option is natively available in MCL.
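
As an illustration, the sweep over inflation values can be scripted from Python. This is only a minimal sketch: it assumes the mcl binary is installed and on the PATH, that the edge list is in mcl's label (ABC) format, and the file names are hypothetical.

```python
# Minimal sketch: run MCL once per inflation value (expansion left at its default of 2).
# Assumes the `mcl` binary is on the PATH and the edge list is in label (ABC) format,
# i.e. one "source target weight" triple per line. File names are illustrative.
import subprocess

EDGE_LIST = "citation_edges.abc"  # hypothetical input file

for inflation in (2.0, 3.0, 4.0, 6.0):
    out_file = f"clusters_I{inflation}.txt"
    subprocess.run(
        ["mcl", EDGE_LIST, "--abc", "-I", str(inflation), "-o", out_file],
        check=True,
    )
```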

Text-Based Analysis

We analyzed the titles and abstracts of all articles in every cluster of size >= 10 in each clustering.

Text Pre-processing

Python libraries used: NLTK and TextBlob

  1. Lemmatization based on parts-of-speech (POS) tagging
    • We use lemmatization of tokens rather than stemming in order to preserve the token and its context. Tokens are lemmatized based on POS tags (see the sketch after this list). The four POS considered are:
      • adjective
      • noun
      • verb
      • adverb
    • For all other parts of speech (including tokens that cannot be classified at all), the token is mapped to 'noun' by default.
  2. Stop-word removal
    • We use a stop-word list of 513 tokens comprising basic NLTK stop-words, PubMed stop-words, and a selection drawn from the 500 most frequent tokens in our dataset.
  3. Unigrams/Bigrams
    • We tested using bigrams as a more stringent criterion for checking the topic-homogeneity of each cluster.
    • In general, it is preferred to use unigrams (single word tokens) for computing text-similarity and Jensen-Shannon Divergence (JSD). Using bigrams leads to the following issues:
      • If we retain all tokens for our textual analysis, we end up with an extremely large text corpus, which ultimately leads to computational issues. Moreover, most bigrams occur only once per cluster (more than 95% of the time) and therefore add little to the analysis.
      • If we instead limit the size of the corpus by setting a threshold, such as retaining only tokens that occur more than once per cluster, some documents (articles) within clusters end up empty after pre-processing, so no JSD can be computed for those articles during the random JSD computation.
      • Notably, for the MCL bigrams output (co-citation data), removing bigrams with frequency less than two per cluster never produced an empty vector for any article in its cluster; empty vectors were only found while computing random JSD values.
      • However, with unigrams, even after removing tokens that show up only once per cluster, we are able to retain a significant proportion of the tokens for analysis.
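
The sketch below outlines the pre-processing steps described above (POS-based lemmatization with a noun fallback, followed by stop-word removal) using NLTK. It uses only NLTK's basic English stop-word list; the full 513-token list and the exact tokenizer settings are project-specific, so treat this as an approximation rather than the actual pipeline.

```python
# Minimal sketch of the pre-processing pipeline: POS-tag the tokens, map the
# Penn Treebank tags to the four WordNet POS classes (defaulting to noun),
# lemmatize, and drop stop-words.
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads, if not already present:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("wordnet"); nltk.download("stopwords")

TAG_TO_WORDNET = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                  "V": wordnet.VERB, "R": wordnet.ADV}
STOP_WORDS = set(stopwords.words("english"))  # stand-in for the 513-token list
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Return lemmatized, stop-word-filtered unigrams for one title + abstract."""
    tokens = nltk.word_tokenize(text.lower())
    lemmas = []
    for token, tag in nltk.pos_tag(tokens):
        # Any tag outside adjective/noun/verb/adverb falls back to noun.
        pos = TAG_TO_WORDNET.get(tag[0], wordnet.NOUN)
        lemma = lemmatizer.lemmatize(token, pos=pos)
        if lemma.isalpha() and lemma not in STOP_WORDS:
            lemmas.append(lemma)
    return lemmas

# Bigrams, if desired, can be formed from the unigram stream:
# bigrams = list(nltk.bigrams(preprocess("Some title and abstract text")))
```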

Exploratory Analysis

Python file

For the top 10 clusters (ncf_20) by size, we computed the cosine similarity of word vectors based on:

  • Text-matching:
    • We identified the top 30 most frequent tokens in each cluster.
  • Topic Modeling using Latent Dirichlet Allocation (LDA):
    • Single-topic assumption - In order to check whether MCL produces clusters of articles on the same or similar topics, we ran LDA under the assumption of a single underlying topic distribution for each cluster (see the sketch after this list).
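
A rough sketch of both checks is shown below, using scikit-learn for term-frequency vectors with cosine similarity and gensim for the single-topic LDA. The toy documents and variable names are illustrative, and the exact vector construction used in the analysis (e.g. restricting to the top 30 tokens per cluster) is omitted.

```python
# Minimal sketch of the exploratory checks: cosine similarity between
# term-frequency vectors and LDA restricted to a single topic.
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cluster_docs = [
    "protein folding simulation molecular dynamics",
    "molecular dynamics of protein structure",
]

# Text-matching: cosine similarity between term-frequency vectors.
vectors = CountVectorizer().fit_transform(cluster_docs)
print(cosine_similarity(vectors))

# Topic modeling: LDA with one underlying topic for the cluster.
tokenized = [doc.split() for doc in cluster_docs]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
lda = models.LdaModel(bow_corpus, num_topics=1, id2word=dictionary, random_state=0)
print(lda.show_topic(0, topn=5))
```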

Jensen-Shannon Divergence

  • Jensen-Shannon Divergence (JSD) is a measure of dissimilarity between two probability distributions (0 <= JSD <= 1), where 0 means that the two distributions are identical.
  • In accordance with Boyack et al., we first calculated article-level token probabilities, i.e., the frequency of each token relative to the total count of all tokens in the article, computed after concatenating and pre-processing the title and abstract.
  • Next, we computed cluster-level token probabilities as the frequency of each token relative to the total count of all tokens in the entire cluster.
  • JSD was then computed between each article and the cluster for all clusters within each clustering. For each cluster, the Cluster JSD is the arithmetic mean of all article-cluster JSDs.
  • This was done for three types of text pre-processing for co-citation data:
    • Bigrams - tokens with frequency 1 in the whole cluster removed
    • Bigrams - all tokens retained
    • Unigrams - tokens with frequency 1 in the whole cluster removed
  • For direct-citation data, we used unigrams.
  • JSD was computed using the jensenshannon function of the scipy.spatial.distance module. Note that this function returns the Jensen-Shannon distance, which is the square root of the Jensen-Shannon divergence.
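
A minimal sketch of the article-to-cluster JSD computation described above is given below. It assumes each article's pre-processed title and abstract is already available as a list of tokens (and, per the note on empty documents, that every article has at least one token left after pre-processing); function and variable names are illustrative.

```python
# Minimal sketch: article-level and cluster-level token distributions, then the
# mean article-to-cluster JSD for one cluster.
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def token_probabilities(tokens, vocabulary):
    """Relative frequency of each vocabulary token in the given token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return np.array([counts[t] / total for t in vocabulary])

def cluster_jsd(article_token_lists):
    """Mean JSD between each article and its cluster's token distribution."""
    cluster_tokens = [t for tokens in article_token_lists for t in tokens]
    vocabulary = sorted(set(cluster_tokens))
    cluster_dist = token_probabilities(cluster_tokens, vocabulary)
    divergences = []
    for tokens in article_token_lists:
        article_dist = token_probabilities(tokens, vocabulary)
        # scipy returns the Jensen-Shannon *distance*; square it for the divergence.
        divergences.append(jensenshannon(article_dist, cluster_dist, base=2) ** 2)
    return float(np.mean(divergences))
```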

Coherence

  • As per Boyack et al., for cluster i, coherence (coh_i) is measured by:
    • coh_i = JSD(rand)_i - JSD(actual)_i, where JSD(rand)_i for a cluster of size n is based on the average JSD of 50 clusters of size n drawn randomly from the corpus (5000 iterations were used in the original paper).
    • Note that coherence lies between -1 and 1.
  • Python file for computing coherence
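
The sketch below illustrates the coherence calculation, reusing the cluster_jsd helper from the JSD sketch above. It assumes the corpus is available as a list of per-article token lists for the whole clustering; the 50 random draws follow the description above, and the sampling scheme is illustrative rather than the project's exact procedure.

```python
# Minimal sketch: coherence of one cluster as random JSD minus actual JSD.
# Requires the `cluster_jsd` helper from the JSD sketch to be in scope.
import random

def coherence(cluster_articles, corpus, n_random=50, seed=0):
    """coh_i = JSD(rand)_i - JSD(actual)_i for one cluster of size n."""
    rng = random.Random(seed)
    actual = cluster_jsd(cluster_articles)
    size = len(cluster_articles)
    # Average JSD of n_random clusters of the same size drawn at random.
    random_jsds = [cluster_jsd(rng.sample(corpus, size)) for _ in range(n_random)]
    return sum(random_jsds) / n_random - actual
```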

Computational Efficiency

Text pre-processing and term-frequency computation are memory-intensive operations. We used the swifter and multiprocessing libraries for parallel processing.
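
As an illustration, swifter exposes a .swifter accessor on pandas objects that parallelizes apply-style operations where it helps. The sketch below assumes a pandas DataFrame with a text column and a placeholder preprocess function standing in for the pipeline sketched earlier; all names are illustrative.

```python
# Minimal sketch: parallelize a per-article operation with swifter.
import pandas as pd
import swifter  # noqa: F401  (importing registers the .swifter accessor)

def preprocess(text):
    # Placeholder for the lemmatization / stop-word pipeline described above.
    return text.lower().split()

df = pd.DataFrame({"text": ["Example title and abstract text."]})
df["tokens"] = df["text"].swifter.apply(preprocess)
print(df["tokens"].iloc[0])
```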
