Improving Accuracy by LOOKING AT THE DATA
My Project
Remember the Myers-Briggs personality test we all took freshman year? So my project is an artificial intelligence algorithm that looks at the subreddit for each Myers-Briggs personality type and uses those users' visit patterns (what subreddits they visit and how often) to predict a given user's personality type.
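To make this concrete, here is a minimal sketch of the general idea (not my actual pipeline): assume each personality "model" is just a weighting over its most-visited subreddits, and a user is assigned whichever model lines up best with their own visit counts. The functions and numbers below are made up purely for illustration.

def score(user_visits, model):
    # dot-product-style overlap between a user's visit counts and a model's weights
    return sum(count * model.get(subreddit, 0.0)
               for subreddit, count in user_visits.items())

def predict(user_visits, models):
    # return the personality whose model best matches this user's visits
    return max(models, key=lambda personality: score(user_visits, models[personality]))

# hypothetical models and user, just to show the shapes involved
models = {
    "ENFJ": {"AskReddit": 0.4, "relationships": 0.3, "enfj": 0.3},
    "ISTP": {"AskReddit": 0.5, "motorcycles": 0.3, "istp": 0.2},
}
print(predict({"motorcycles": 12, "AskReddit": 30}, models))   # -> ISTP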
I started this project over the summer at the Naval Research Lab, but my eight-week internship only allowed me to look at the ENFJ and ISTP personalities. I tested users as a simple binary between these two and had a pretty high accuracy, around 85%-90%.
So this year when I started, I figured that this high accuracy would translate, at least mostly, to testing all 16 personalities. I even had more data points for most of the other personalities. Unfortunately, when I first ran the initial testing straight on the raw data points (no cleaning of the data), I had an accuracy of about 0.01%. Amazing, right? Nope. For reference, random guessing across 16 types would already get you around 6%, so something was clearly broken, not just harder.
First Solution - The Users
So my first thought was that perhaps the problem was users who post in multiple personality subreddits and therefore get added into the calculation for multiple personality models. I figured I should try to determine what personality these 'repeat' users actually are, and if I couldn't figure that out, throw those users out.
import os

redditors = {}   # user -> the first personality folder they were found in
repeats = {}     # user -> list of personality folders they were found in

for path in types:                       # types: one folder of user logs per personality subreddit
    for filename in os.listdir(path):
        user = filename.split(".")
        if len(user) > 1 and user[1] == "log":
            user = user[0]
            if user not in redditors:
                redditors[user] = path
            elif user in repeats:
                repeats[user].append(path)
            else:
                repeats[user] = [redditors[user], path]
This segment collects users that have posted in more than one personality subreddit into a dictionary (repeats), mapping each such user to the list of personality folders they appear in.
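For a made-up example of what the two dictionaries end up holding:

# hypothetical contents after the loop above
# redditors = {"alice": "ENFJ", "bob": "ENFJ"}   # every user and the first folder they were seen in
# repeats   = {"bob": ["ENFJ", "INTP", "ISTP"]}  # users seen in more than one folder (handled next)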
# out, output, and repeat are file handles opened earlier; count tracks how many repeat users get thrown out
for user in repeats:
    mystr = user + ":"
    fill = {}        # personality folder -> number of appearances for this user
    top = 0          # highest count
    top2 = 0         # second-highest count
    topp = ""        # personality with the highest count
    for typ in repeats[user]:
        if typ in fill:
            fill[typ] += 1
        else:
            fill[typ] = 1
    if len(fill) > 1:
        for typ in fill:
            mystr += typ + "," + str(fill[typ]) + "|"
            if typ in fill and fill[typ] > top:
                top2 = top
                top = fill[typ]
                topp = typ
            elif fill[typ] > top2:   # also keep the runner-up count up to date
                top2 = fill[typ]
        if (top - top2) > 3:
            print user, top - top2
            out.write(user + "\n")
            output.write(user + "," + topp + "\n")
        else:
            repeat.write(mystr + "\n")
            count += 1
            redditors.pop(user, None)
This sets it so that if the user's most-visited personality subreddit shows up more than three times more often than their next most-visited one, the user is classified as that personality; otherwise they are discarded.
Expecting this to have solved my problems, I ran it again, only to receive an accuracy of about 5%, which is still roughly chance level for 16 classes.
Second Solution - The Subreddits
Realizing that the solution to my problem went beyond the repeat users, I figured it had to be something about the way I was actually building my models for each personality. I looked at the subreddits that I used to "define" a personality and noticed that several of the most-viewed subreddits were the same between personalities. This wasn't much of a problem since the level of visits was different... except for AskReddit. Every single personality model visited this subreddit exponentially more than any other subreddit. So in my aggregation of the data (the calculation that builds the models), I took AskReddit out of consideration.
import operator

def getTop(users, path):
    # users: dict of user -> list of subreddits that user posted in
    top = {}
    for user in users:
        posts = users[user]
        for post in posts:
            if post != 'AskReddit':      # AskReddit dominates every personality, so skip it
                if post in top:
                    top[post] += 1
                else:
                    top[post] = 1
    # keep the 50 most-visited subreddits as this personality's model
    sorted_top = sorted(top.items(), key=operator.itemgetter(1), reverse=True)
    top = [x[0] for x in sorted_top[:50]]
    return top
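To show how this fits together, here is a hypothetical driver around getTop (loadUsers is a made-up helper that would return each user's list of subreddits for a folder): it rebuilds one top-50 model per personality and prints how much the models overlap, which is exactly the kind of check that made AskReddit stand out.

# hypothetical driver: build one "top 50" model per personality folder,
# then look at the overlap between every pair of models
models = {}
for path in types:                       # types: list of personality folders, e.g. ["ENFJ", "ISTP", ...]
    users = loadUsers(path)              # made-up helper: returns {user: [subreddit, subreddit, ...]}
    models[path] = set(getTop(users, path))

for a in models:
    for b in models:
        if a < b:                        # visit each pair once
            shared = models[a] & models[b]
            print(a + " & " + b + " share " + str(len(shared)) + " of their top 50 subreddits")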
This increased my accuracy by A LOT (granted, it's still not great). But I believe the key takeaway is CHECK YOUR DATA!!!!!! Your model needs to be built off an ACCURATE dataset, and your initially collected dataset might not be accurate.