Content Filter - wywfalcon/twitter-healthcare-analysis GitHub Wiki

Content Filter

Spam has been a problem since the beginning of the World Wide Web. How do we counter it? And how do we even know what counts as spam? One approach is supervised learning: by labeling a sample of tweets as spam or not spam, we can train a model to predict whether unlabeled tweets are spam.
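As a toy illustration of the supervised-learning idea (this is not the project's actual code, and all tweets and labels below are made up), a small Naive Bayes classifier can learn from a handful of hand-labeled examples:

```python
from collections import Counter
import math

# Hypothetical hand-labeled tweets: 1 = spam, 0 = not spam.
labeled = [
    ("win free followers click here", 1),
    ("free prize click now", 1),
    ("my grandmother beat cancer today", 0),
    ("raising awareness for breast cancer research", 0),
]

def train(data):
    """Count words per class and class frequencies (Naive Bayes fit)."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter()
    for text, label in data:
        priors[label] += 1
        counts[label].update(text.split())
    return counts, priors

def predict(text, counts, priors):
    """Return 1 (spam) or 0 (not spam) via Laplace-smoothed log-likelihoods."""
    vocab = set(counts[0]) | set(counts[1])
    total = sum(priors.values())
    best, best_score = 0, float("-inf")
    for label in (0, 1):
        score = math.log(priors[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for word in text.split():
            score += math.log((counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

counts, priors = train(labeled)
print(predict("click here for free followers", counts, priors))  # -> 1 (spam)
```

The project works at the topic level rather than classifying raw tweets directly, but the principle is the same: labeled examples in, predictions on unlabeled data out.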

How it works

Go to the project folder Twitter-Healthcare-Analysis\src\Twitter-Filter

Usage

Generate topics from the tweets file

$ gen_topics.py [downloadedTweetsFile]

Label each topic as 0 (not spam) or 1 (spam) by adding a zeroth column to topic_keys.txt, then save the file as topic_keys_labeled.txt

Original

0	0.1	levraihoroscope yeux film larmes devant triste ...
3	0.1	cancer_hn eres cncer pones sabes mejor capaz ...
...

Modified

0	0	0.1	levraihoroscope yeux film larmes devant triste ...
0	3	0.1	cancer_hn eres cncer pones sabes mejor capaz ...
...

Generate training data

$ gen_training_data.py [sameFileAsBefore]

Filter the raw tweets (make sure the number of tweets in the training set matches the number in the unfiltered set)

$ gen_filtered_tweets.py [newUnfilteredTweetsFile] [trainingFolder] [destinationFolder]

Explanation

  1. The program first extracts the tweets and dumps them into a folder containing a master file with all the tweet messages, plus one file per tweet
  2. It then creates a MALLET file, which is used to group tweets together into topics
  3. The user is then asked to label the topics by adding a zeroth column of 0 (not spam) or 1 (spam) to topic_keys.txt and saving the file as topic_keys_labeled.txt
  4. Based on these labels, the program extracts the features and label for each tweet
  5. From these files, we can create the matrices used in the machine learning models
  6. We repeat the same process with the unfiltered set
  7. The program then uses the matrices from the training set and the generated data from the unfiltered set to predict whether or not each tweet in the unfiltered set is spam.
  8. The resulting matrices are then converted back to a JSON file with the spam removed.
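The core of steps 4-7 can be sketched in miniature. The project trains machine-learning models on the feature matrices; the stand-in below uses a simple threshold on spam-topic probability mass instead, and all topic labels and distributions are made up:

```python
# Topic id -> 0/1 label, as read from topic_keys_labeled.txt (assumed values).
topic_labels = {0: 0, 1: 1, 2: 0}

# Tweet id -> topic distribution, i.e. each tweet's feature vector
# (illustrative numbers, as MALLET's doc-topics output would provide).
doc_topics = {
    "tweet_a": [0.7, 0.2, 0.1],
    "tweet_b": [0.1, 0.8, 0.1],
}

def is_spam(dist, topic_labels):
    """Predict spam when most probability mass sits on spam-labeled topics."""
    spam_mass = sum(p for t, p in enumerate(dist) if topic_labels[t] == 1)
    return spam_mass > 0.5

# Keep only tweets predicted not-spam (step 8 would serialize these to JSON).
filtered = {tid: d for tid, d in doc_topics.items()
            if not is_spam(d, topic_labels)}
print(sorted(filtered))  # -> ['tweet_a']
```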

Precision and Recall

This test filters tweets that are in English. Note that the scores are worse than they would otherwise be because the language filter is applied after topic modeling, which does not differentiate between languages when grouping words into topics. To evaluate the filtered positives, a language detector is used, which may introduce small biases of its own.

Sample Size: 1056

Actual Positives: 357
Actual Negatives: 699
Selected Positives: 263
Selected Negatives: 793

True Positives: 178
False Positives: 85
True Negatives: 614
False Negatives: 179

Precision: 178 / (178+85) = 0.6768
Recall: 178 / (178+179) = 0.4986
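The two scores follow directly from the confusion matrix above:

```python
# Confusion-matrix counts from the test run above.
tp, fp, tn, fn = 178, 85, 614, 179

precision = tp / (tp + fp)  # fraction of selected positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that were found

print(round(precision, 4), round(recall, 4))  # -> 0.6768 0.4986
```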