# Content Filter
Spam has been a problem since the beginnings of the World Wide Web. How do we counter it, and how do we decide what counts as spam in the first place? One approach is supervised learning: by labeling a set of tweets as spam or not spam, we can train a model to predict whether unlabeled tweets are spam.
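To make the supervised-learning idea concrete, here is a minimal sketch of a hand-rolled Naive Bayes text classifier trained on labeled tweets. The training tweets and label names (`spam`/`ham`) are hypothetical illustrations, not the project's actual data or model.

```python
# Minimal sketch of supervised spam labeling: a tiny Naive Bayes text
# classifier built from hand-labeled tweets (hypothetical examples).
from collections import Counter
import math

def train(labeled_tweets):
    """labeled_tweets: list of (text, label) pairs, label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in labeled_tweets:
        for word in text.lower().split():
            counts[label][word] += 1
        totals[label] += 1
    return counts, totals

def predict(model, text):
    counts, totals = model
    n = sum(totals.values())
    vocab = len(set().union(*counts.values()))
    best, best_score = None, float("-inf")
    for label in counts:
        denom = sum(counts[label].values()) + vocab
        score = math.log(totals[label] / n)  # log prior
        for word in text.lower().split():
            # add-one smoothing so unseen words don't zero out the score
            score += math.log((counts[label][word] + 1) / denom)
        if score > best_score:
            best_score, best = score, label
    return best

model = train([
    ("win free followers now", "spam"),
    ("click here free prize", "spam"),
    ("my doctor recommended a new treatment", "ham"),
    ("hospital visit went well today", "ham"),
])
print(predict(model, "free followers prize"))  # → spam
print(predict(model, "doctor visit today"))    # → ham
```

The project itself labels topics rather than individual tweets, which scales better: a handful of topic labels propagates to thousands of tweets.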
## How it works
Go to the project folder `Twitter-Healthcare-Analysis\src\Twitter-Filter`.
## Usage
Generate topics from the tweets file:

```shell
$ gen_topics.py [downloadedTweetsFile]
```
Label each topic as 0 (not spam) or 1 (spam) by adding a zeroth column to `topic_keys.txt`, and save the result as `topic_keys_labeled.txt`.
Original:

```
0.1 levraihoroscope yeux film larmes devant triste ...
3 0.1 cancer_hn eres cncer pones sabes mejor capaz ...
...
```

Modified:

```
0 0.1 levraihoroscope yeux film larmes devant triste ...
0 3 0.1 cancer_hn eres cncer pones sabes mejor capaz ...
...
```
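A short sketch of how the labeled file can be parsed: the zeroth column is the spam label, the remaining numeric columns are MALLET's topic id and weight, and everything after them is the topic's top keywords. The helper name and the second sample line (a spam topic) are hypothetical; the first sample line comes from the example above.

```python
# Parse topic_keys_labeled.txt lines of the form:
#   <label> <topic_id> <weight> <keyword> <keyword> ...
# where label is 0 (not spam) or 1 (spam).
def parse_labeled_topics(lines):
    topics = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        label = int(fields[0])  # 0 = not spam, 1 = spam
        # Skip the remaining numeric columns (topic id, weight).
        i = 1
        while i < len(fields):
            try:
                float(fields[i])
                i += 1
            except ValueError:
                break
        topics.append((label, fields[i:]))
    return topics

sample = [
    "0 3 0.1 cancer_hn eres cncer pones sabes mejor capaz",
    "1 7 0.1 followers free win click",  # hypothetical spam topic
]
for label, keywords in parse_labeled_topics(sample):
    print(label, keywords[:3])
```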
Generate the training data:

```shell
$ gen_training_data.py [sameFileAsBefore]
```
Filter the raw tweets (make sure the number of tweets in the training set matches the number in the unfiltered set):

```shell
$ gen_filtered_tweets [newUnfilteredTweetsFile] [trainingFolder] [destinationFolder]
```
## Explanation
- The program first extracts the tweets and dumps them into a folder containing a master file with all the tweet messages, plus one file per tweet
- It then creates a MALLET file, which is used to group tweets together into topics
- The user is then asked to label the topics by adding a zeroth column of 0 (not spam) or 1 (spam) to `topic_keys.txt` and saving the file as `topic_keys_labeled.txt`
- Based on the labels, the program will extract the features and labels for each tweet
- From these files, we can create the matrices used in the machine learning models
- We repeat the same process with the unfiltered set
- The program then uses the matrices from the training set and the generated data from the unfiltered set to predict whether or not each tweet in the unfiltered set is spam.
- The resulting matrices are then converted back to a JSON file with the spam removed.
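The final filtering step above can be sketched as follows: each unfiltered tweet carries a topic assignment, topic labels come from the labeled topics file, and spam tweets are dropped before writing the JSON back out. The function name, the `topic` field, and the JSON layout here are assumptions for illustration, not the project's actual schema.

```python
# Sketch: drop tweets whose assigned topic is labeled spam, then
# serialize the survivors back to JSON (hypothetical schema).
import json

def filter_spam(tweets, topic_labels):
    """tweets: list of dicts with 'text' and 'topic';
    topic_labels: {topic_id: 0 (not spam) | 1 (spam)}."""
    return [t for t in tweets if topic_labels.get(t["topic"], 0) == 0]

tweets = [
    {"text": "my hospital visit went well", "topic": 3},
    {"text": "win free followers now", "topic": 7},
]
topic_labels = {3: 0, 7: 1}  # 0 = not spam, 1 = spam

kept = filter_spam(tweets, topic_labels)
print(json.dumps(kept))  # only the non-spam tweet survives
```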
## Precision and Recall
This test filters tweets that are in English. Note that the scores suffer because the language filter is applied after topic modeling, which does not distinguish between languages when grouping words into topics. To evaluate the filtered (positive) tweets, a language detector is used, which may introduce small biases of its own.
- Sample size: 1056
- Actual positives: 352
- Actual negatives: 704
- Selected positives: 263
- Selected negatives: 793
- True positives: 178
- False positives: 85
- True negatives: 614
- False negatives: 179

Precision: 178 / (178 + 85) = 0.6768

Recall: 178 / (178 + 179) = 0.4986
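The precision and recall figures can be reproduced directly from the confusion-matrix counts reported above:

```python
# Precision = TP / (TP + FP); Recall = TP / (TP + FN),
# using the counts reported in this section.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, tn, fn = 178, 85, 614, 179
print(round(precision(tp, fp), 4))  # → 0.6768
print(round(recall(tp, fn), 4))     # → 0.4986
```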