Text mining unstructured stackoverflow data
Example of how to perform text mining on an unstructured dataset
The task I am addressing in this example is to predict the tags corresponding to a new question. The data I am using is available from Stack Overflow (it can also be accessed via a torrent) as an XML file. The data contains questions (including titles and body text) and the tags corresponding to those questions.
Extracting the relevant data
A script to convert the XML file to SQL is available on Meta Stack Overflow: http://meta.stackoverflow.com/questions/28221/scripts-to-convert-data-dump-to-other-formats. After converting the data to SQL format, I used SQLite to extract the relevant fields and export them to CSV. I then did all the analysis in R.
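The extraction step can also be scripted from R itself. The sketch below is only illustrative: it assumes the converted SQLite database is named stackoverflow.db and exposes a posts table with Title, Body and Tags columns (the actual file name and schema depend on the conversion script used).

```r
library(RSQLite)

# Connect to the SQLite database produced by the conversion script
# (assumed file, table and column names -- adjust to the real schema)
con <- dbConnect(SQLite(), "stackoverflow.db")

# Pull only the fields needed for the analysis: title, body and tags
posts <- dbGetQuery(con, "SELECT Title, Body, Tags FROM posts WHERE Title IS NOT NULL")
dbDisconnect(con)

# Export to CSV for the analysis in R
write.csv(posts, "questions.csv", row.names = FALSE)
```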
Cleaning and Pre-processing the data
As a first step I cleaned and pre-processed the tags and titles. Pre-processing the titles involved removing numbers, punctuation, empty strings and extra whitespace, and converting all text to lowercase. This was done using R's text mining package tm. I also deleted certain words which I thought would not be meaningful for the analysis, such as stopwords and prepositions. I downloaded a list of single-word prepositions from Wikipedia and processed these in R before removing them from the titles. Removing these words also makes the program more efficient by reducing the file size, which is especially handy when looping over tag-word combinations in large files. A sketch of this step is shown below.
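Here is a minimal sketch of the cleaning step with tm; the titles vector and the prepositions list are assumptions standing in for the actual data.

```r
library(tm)

# titles: a character vector of question titles (assumed to be loaded already)
corpus <- Corpus(VectorSource(titles))

corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase everything
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # standard stopwords

# prepositions: a character vector built from the Wikipedia list (assumed)
corpus <- tm_map(corpus, removeWords, prepositions)
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces

clean_titles <- sapply(corpus, as.character)
```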
Organizing the data
The second step was to organize the data into dictionaries. I created two lists, one for tags and one for words, whose elements count how many times each tag and each word appears in the dataset. Since R does not have a built-in dictionary type, I also created a tag-word list of lists, which records how many times each tag-word combination occurs. So, for example, the word "caribbean" appears 25 times in the data, the word "travel" appears 351 times, and the combination "caribbean travel" appears 12 times.
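One way to build these counts with named lists is sketched below; questions is an assumed list in which each element holds a question's cleaned title words and its tags.

```r
`%||%` <- function(a, b) if (is.null(a)) b else a  # default for missing keys

tag_counts  <- list()
word_counts <- list()
pair_counts <- list()  # list of lists: pair_counts[[tag]][[word]]

for (q in questions) {
  words <- unique(q$words)  # cleaned title words for this question (assumed)
  tags  <- unique(q$tags)   # tags for this question (assumed)

  for (w in words) word_counts[[w]] <- (word_counts[[w]] %||% 0) + 1

  for (t in tags) {
    tag_counts[[t]] <- (tag_counts[[t]] %||% 0) + 1
    if (is.null(pair_counts[[t]])) pair_counts[[t]] <- list()
    for (w in words) {
      pair_counts[[t]][[w]] <- (pair_counts[[t]][[w]] %||% 0) + 1
    }
  }
}
```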
Solving the problem
The most obvious way to approach this problem is to treat it as an association between words (from titles) and tags. For each tag-word combination I first computed a pairwise support measure: the fraction of questions which contain both tag A and word B, i.e. the number of questions in which A and B appear together (A ∩ B) divided by the total number of questions. This measures the relevance of the association. I also computed confidence, a stronger measure of association: the fraction of questions which contain both tag A and word B, divided by the fraction of questions that contain word B. Normalizing by the fraction of titles that contain the word measures the strength of the association. Using both measures we can generate association rules for an algorithm that predicts tags from titles.
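Given the count dictionaries built above, both measures can be computed directly. The sketch below reuses the assumed names from the previous snippets (questions, pair_counts, word_counts, %||%).

```r
N <- length(questions)  # total number of questions

# Support: fraction of questions containing both the tag and the word
support <- function(tag, word) {
  (pair_counts[[tag]][[word]] %||% 0) / N
}

# Confidence: P(tag | word) = support(tag, word) / support(word),
# which simplifies to the pair count over the word count
confidence <- function(tag, word) {
  pair <- pair_counts[[tag]][[word]] %||% 0
  wc   <- word_counts[[word]] %||% 0
  if (wc == 0) 0 else pair / wc
}
```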
The table below illustrates the importance of computing both measures for generating association rules, since there can be very large differences between the two metrics (additional measures can also be computed, for example lift, leverage and conviction). The word "visa" and the tag "europe" appear together in 1.6 percent of all questions, and if a question title contains the word "visa" we can be only 3.5 percent confident that it will have the tag "europe". Both support and confidence are low for this tag-word combination. By contrast, while the word "passport" and the tag "visas" appear together in about 5 percent of the questions, we can be 38 percent confident that a question with the word "passport" in it will have the tag "visas".
| Word | Tag | Support | Confidence |
|----------|-----------|---------|------------|
| visa | europe | 1.6% | 3.5% |
| visa | schengen | 15.1% | 32.4% |
| passport | visas | 4.5% | 38.2% |
| wildlife | australia | 0.1% | 25.0% |
Prediction Algorithm
Just for illustration I’ve written a prediction algorithm which uses a simple association rule (confidence > 50%). However, more complex associations (beyond pairwise) with varying support and confidence thresholds could improve the prediction accuracy, for example the two-stage Apriori algorithm: in the first stage, frequent itemsets are generated subject to a minimum support constraint; in the second stage, high-confidence association rules are generated from each frequent itemset subject to a minimum confidence constraint.
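A hedged sketch of such a rule is shown below; predict_tags is a hypothetical name, the cleaning is inlined in simplified form rather than reusing the full tm pipeline, and pair_counts and confidence come from the earlier snippets.

```r
# Predict tags for a new question title using the simple pairwise rule:
# suggest any tag whose confidence with some title word exceeds the threshold.
predict_tags <- function(question, threshold = 0.5) {
  # Simplified inline cleaning: lowercase, strip non-letters, split on whitespace
  words <- strsplit(gsub("[^a-z ]", "", tolower(question)), "\\s+")[[1]]
  words <- words[words != ""]

  candidates <- character(0)
  for (w in words) {
    for (t in names(pair_counts)) {
      if (confidence(t, w) > threshold) {
        candidates <- c(candidates, t)
      }
    }
  }
  unique(candidates)
}
```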
Run the code
Test out the algorithm: run the entire script and, where the tag prediction function is called, enter a travel-related question in inverted commas. The default question is “What is the weather like in London in February?” My simple algorithm predicts that the tags for this question will be “weather” and “London”.
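With the sketch above, the call would look something like this (the function name and the output are illustrative):

```r
predict_tags("What is the weather like in London in February?")
# Illustrative output, assuming the counts were built from the travel data:
# [1] "weather" "london"
```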