Recommendation Engine - drink-this/drink-this-backend GitHub Wiki

About the Engine

The recommendation engine takes a memory-based approach to collaborative filtering and uses Euclidean distance to compare user data. This article has an OK primer if you scroll down to the section on Collaborative Filtering, and this other article has some great descriptions of, and comparisons with, content-based filtering. Basically, collaborative filtering means that we compare the user who requests the recommendation to other users, find people with similar rating patterns, and then return a drink that those similar users have rated highly.

This is considered a memory-based approach, as opposed to a model-based approach. This article does a good job explaining the difference between the two, which will also help you understand why we're not using training data, etc.

We chose Euclidean distance, but it's not the only measure we could have used to determine similarity between users. This article has a good explainer on Euclidean distance, as well as cosine similarity, which was the other measure we considered.
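To make the two measures concrete, here is a minimal plain-Ruby sketch of both, using no gems. The users `alice` and `bob`, their ratings, and the helper names are hypothetical and are not part of the engine's code.

```ruby
# Hypothetical helpers for illustration only; the engine itself calls
# scikit-learn's pairwise euclidean_distances.

# Straight-line distance between two equal-length rating vectors.
# Smaller means more similar; identical vectors are 0.0 apart.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine of the angle between two rating vectors.
# Larger means more similar. Assumes neither vector is all zeros.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x**2 }) }
  dot / (norm.call(a) * norm.call(b))
end

# Made-up ratings for three drinks; 0.0 means "not rated".
alice = [5.0, 3.0, 0.0]
bob   = [4.0, 3.0, 1.0]

distance   = euclidean_distance(alice, bob)
similarity = cosine_similarity(alice, bob)
puts format("distance: %.2f, cosine similarity: %.2f", distance, similarity)
```

The two measures point in opposite directions: a low Euclidean distance and a high cosine similarity both indicate users with similar tastes.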

Super Important Resources For Building This

Algorithm Walkthrough (OUTDATED)

Convert the data to a DataFrame, set the name as an index, and make nil values into 0s

csv_data = Pandas.read_csv("./db/data/cocktail_ratings.csv") #replace with our ratings data

df = csv_data.set_index('name').fillna(0) #will be user_id instead of name

initial_data_frame
The initial DataFrame

Get the pairwise Euclidean distances between every row (user) of the DataFrame and every other row

euclidean = Numpy.round(sklearn.metrics.pairwise.euclidean_distances(df,df),2)

euclidean_distances
Matrix of euclidean distances
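For readers who want to see what the pairwise call produces without pandas or scikit-learn, the following plain-Ruby sketch builds the same kind of user-by-user table as a nested hash. The users, ratings, and the `pairwise_distances` helper are hypothetical.

```ruby
# Straight-line distance between two equal-length rating vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Builds distances[user][other_user] => distance rounded to 2 places,
# comparing every user's row with every user's row (including their own).
def pairwise_distances(ratings)
  ratings.transform_values do |row|
    ratings.transform_values do |other_row|
      euclidean_distance(row, other_row).round(2)
    end
  end
end

# Made-up ratings for three drinks; 0.0 means "not rated".
ratings = {
  "Max" => [5.0, 0.0, 3.0],
  "Ana" => [4.0, 2.0, 3.0],
  "Lee" => [1.0, 5.0, 0.0]
}

distances = pairwise_distances(ratings)
distances.each { |user, row| puts "#{user}: #{row}" }
```

The result is symmetric, and the diagonal (each user compared with their own row) is always 0.0.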

Make the array of euclidean distances into a new DataFrame (pairwise, so the columns and rows are both users)

similar = Pandas.DataFrame.new(data=euclidean, index=df.index,columns=df.index)

comparison_of_euclidean_distances_by_user
As you can see, the distance between each user and themselves is 0.0, whereas the distances to other users vary

Get the distances between the user requesting the recommendation and each other user (and make it into a DataFrame)

max = similar.loc['Max'].sort_values(0,ascending=true)[1..5] # 'Max' stands in for the requester's user_id; after the ascending sort, position 0 is the requester (distance 0.0), so the slice starts at 1

df_max = Pandas.DataFrame.new(data=max, index=df.index)

comparisons_of_user_to_max
Distance from each user to the requester
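The neighbour selection can be sketched in plain Ruby as follows. This version removes the requester by name instead of by position, which is an assumption of the sketch and not what the walkthrough code does; the distances and the `nearest_neighbors` helper are hypothetical.

```ruby
# Returns the k users closest to the requester, nearest first,
# as a hash of user => distance. The requester is excluded by name,
# so another user at distance 0.0 cannot push them out of position.
def nearest_neighbors(distances, requester, k)
  distances
    .reject { |user, _| user == requester }
    .min_by(k) { |_, distance| distance }
    .to_h
end

# Made-up distances from "Max" to every user, including Max.
from_max = { "Max" => 0.0, "Ana" => 2.24, "Lee" => 7.07, "Sam" => 3.0 }

neighbors = nearest_neighbors(from_max, "Max", 2)
puts neighbors.inspect
```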

Prep the cocktail data to merge with the user distances ((dis)similarity) data

pivoted = Pandas.melt(df.reset_index(),id_vars='name',value_vars=df.keys)

pivoted_data
Cocktail rating data pivoted into rows
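As a rough picture of what the melt step does, this plain-Ruby sketch turns a wide table (one row per user, one column per drink) into long rows of name, variable, and value. The data and the `melt` helper are hypothetical, and the row order differs from pandas, which emits all rows for the first drink before moving to the next.

```ruby
# Converts { user => [ratings...] } into one hash per (user, drink) pair,
# using the same column names pandas' melt produces by default.
def melt(ratings, drinks)
  ratings.flat_map do |name, row|
    drinks.zip(row).map do |drink, value|
      { name: name, variable: drink, value: value }
    end
  end
end

# Made-up drinks and ratings; 0 means "not rated".
drinks  = ["Mojito", "Negroni"]
ratings = { "Max" => [5, 0], "Ana" => [4, 2] }

long_rows = melt(ratings, drinks)
long_rows.each { |row| puts row.inspect }
```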

Strip out the 0s, which represent non-existent ratings, and merge the distance data with the cocktail data

scraped_pivot = pivoted[pivoted.value != 0]

total = df_max.reset_index().merge(scraped_pivot).dropna()

Create a comparison metric (weighted rating) based on euclidean distance (ed_adjusted * rating)

total['weightedRating'] = (1 / (total['Max'] + 1)) * total['value'] # parentheses keep the + 1 in the denominator

Note: We take the reciprocal of the distance (dividing 1 by it) because SMALLER distances mean MORE similarity, but we need a LARGER number for MORE similarity. We add 1 to the denominator to avoid division by 0 and to make the highest level of similarity equal to 1. Here is a post where one of the comments explains this quite well.
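The weighting described in the note can be sketched in plain Ruby as below. The helper names are hypothetical. One thing to watch for: in Ruby, as in Python, `1 / d + 1` parses as `(1 / d) + 1`, so the denominator needs its own parentheses.

```ruby
# Maps a distance to a weight in (0, 1]: distance 0.0 gives 1.0,
# and the weight shrinks toward 0 as the distance grows.
def similarity_weight(distance)
  1.0 / (1.0 + distance)
end

# A neighbour's rating scaled by how similar they are to the requester.
def weighted_rating(distance, rating)
  similarity_weight(distance) * rating
end

puts similarity_weight(0.0)     # identical user
puts similarity_weight(3.0)     # more distant user
puts weighted_rating(1.0, 4.0)  # rating of 4 from a user 1.0 away
```

With this weighting, a rating from a near-identical user counts almost in full, while the same rating from a distant user contributes much less.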

merged_cocktail_and_user_data
Weighted rating shows the rating multiplied by the inverse-distance similarity to the requester

Add up the distances and aggregate the data by cocktail

similarity_weighted = total.groupby('variable').sum()[['Max','weightedRating']]

weighted_ratings
Distances are added to prep for taking an average
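The group-and-sum step looks roughly like this in plain Ruby. The rows, the key names `:distance` and `:weighted_rating` (standing in for the 'Max' and 'weightedRating' columns), and the `sum_by_drink` helper are all hypothetical.

```ruby
# Groups long-format rows by drink and sums the distance and
# weighted-rating columns within each group.
def sum_by_drink(rows)
  rows.group_by { |row| row[:variable] }.transform_values do |group|
    {
      distance: group.sum { |row| row[:distance] },
      weighted_rating: group.sum { |row| row[:weighted_rating] }
    }
  end
end

# Made-up merged rows: one per (neighbour, drink) rating.
rows = [
  { variable: "Mojito",  distance: 1.0, weighted_rating: 2.0 },
  { variable: "Mojito",  distance: 3.0, weighted_rating: 1.0 },
  { variable: "Negroni", distance: 1.0, weighted_rating: 2.5 }
]

totals = sum_by_drink(rows)
puts totals.inspect
```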

Remove drinks the requester has already rated (so we don't recommend something they've already tried)

max_pivot = pivoted[pivoted.name == 'Max'] # won't be 'Max', will be user_id (current_user.id?)

max_pivot
Strip out rows from the original pivoted rating data for the requester

reset_sw = similarity_weighted.reset_index()

reset_sw
Reset the index of the weighted data to ensure a good merge

unknown_to_user = reset_sw.merge(max_pivot).set_index('variable')

unknown_to_user = unknown_to_user[unknown_to_user.value == 0]

unknown_to_user
Take only the rows where the requester has a 0 (no rating yet)

Finish calculating the weighted average inverse euclidean distance

empty = Pandas.DataFrame.new()

empty['weightedAvgRecScore'] = unknown_to_user['Max']/unknown_to_user['weightedRating']

empty
weightedAvgRecScore is calculated by dividing the summed distances for each cocktail by its summed weighted ratings

Make the final (best) recommendation

recommendation = empty[empty.weightedAvgRecScore == empty.weightedAvgRecScore.max()]

recommendation
Return the most highly rated cocktail in the list of recommendations
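The last two stages (dropping drinks the requester has already rated, then taking the top score) can be sketched in plain Ruby as follows. The scores, ratings, and the `recommend` helper are hypothetical. Unlike the DataFrame filter above, which keeps every row tied for the maximum, this sketch returns a single drink.

```ruby
# Picks the highest-scoring drink the requester has not rated yet.
# A missing rating or a rating of 0 both count as "not rated".
# Returns nil when every scored drink has already been rated.
def recommend(scores, requester_ratings)
  unrated = scores.select { |drink, _| requester_ratings.fetch(drink, 0).zero? }
  best = unrated.max_by { |_, score| score }
  best&.first
end

# Made-up recommendation scores and the requester's own ratings.
scores      = { "Mojito" => 1.3, "Negroni" => 2.1, "Daiquiri" => 0.4 }
max_ratings = { "Mojito" => 0, "Negroni" => 5, "Daiquiri" => 0 }

recommendation = recommend(scores, max_ratings)
puts recommendation.inspect
```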