Recommendation Engine - drink-this/drink-this-backend GitHub Wiki
The recommendation engine takes a memory-based approach to collaborative filtering and uses Euclidean distance to compare user data. This article has an OK primer if you scroll down to the section on Collaborative Filtering, and this other article has some great descriptions of, and comparisons with, content-based filtering. Basically, collaborative filtering means that we compare the user who requests the recommendation to other users, find people with similar rating patterns, then return to the requester a drink that those similar users have rated highly.
This is considered a memory-based approach, as opposed to a model-based approach. This article does a good job explaining the difference between the two, which will also help you understand why we're not using training data, etc.
We chose Euclidean distance, but it's not the only measure we could have used to determine similarity between users. This article has a good explainer on Euclidean distance, as well as cosine similarity, which was the other measure we considered.
- https://medium.com/swlh/how-to-build-simple-recommender-systems-in-python-647e5bcd78bd
- https://github.com/mrkn/pycall.rb
- https://www.practicalai.io/using-scikit-learn-machine-learning-library-in-ruby-using-pycall/
- https://readysteadycode.com/howto-execute-python-code-with-ruby
- https://github.com/maxhumber/BRE/blob/master/distance.ipynb
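To make the two measures concrete, here is a minimal plain-Ruby sketch that compares two users' rating vectors both ways. The users, ratings, and helper names are hypothetical and chosen only for illustration; the engine itself uses pandas and scikit-learn through PyCall.

```ruby
# Ratings for two hypothetical users over the same four drinks (0 = not rated).
alice = [5.0, 3.0, 0.0, 4.0]
bob   = [4.0, 0.0, 2.0, 5.0]

# Euclidean distance: square root of the summed squared differences.
# Smaller means more similar; identical vectors give 0.0.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine similarity: dot product divided by the product of the vector lengths.
# Larger means more similar; it compares direction rather than magnitude.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  length = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (length.call(a) * length.call(b))
end

puts euclidean_distance(alice, bob).round(2) # 3.87
puts cosine_similarity(alice, bob).round(2)  # 0.84
```

Note that the two measures point in opposite directions: a low distance and a high cosine similarity both mean "these users rate alike".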
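require 'pycall' # assumed setup (not shown in the original): bind the Python modules used below
Pandas = PyCall.import_module('pandas')
Numpy = PyCall.import_module('numpy')
PyCall.import_module('sklearn.metrics.pairwise') # load the submodule so sklearn.metrics.pairwise resolves
sklearn = PyCall.import_module('sklearn')
csv_data = Pandas.read_csv("./db/data/cocktail_ratings.csv") # replace with our ratings data
df = csv_data.set_index('name').fillna(0) # will be user_id instead of name; unrated drinks become 0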

The initial DataFrame
euclidean = Numpy.round(sklearn.metrics.pairwise.euclidean_distances(df,df),2)

Matrix of euclidean distances
Make the array of euclidean distances into a new DataFrame (pairwise, so the columns and rows are both users)
similar = Pandas.DataFrame.new(data: euclidean, index: df.index, columns: df.index)

As you can see, the distance between each user and themselves is 0.0, whereas the distances to other users vary
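The pairwise matrix can be sketched in plain Ruby as a hash of hashes. This is only an illustration with hypothetical users and ratings; the `euclidean_distance` helper is an assumption standing in for scikit-learn's `euclidean_distances`.

```ruby
# Hypothetical ratings, keyed by user name (0 = not rated).
ratings = {
  'Max'  => [5.0, 0.0, 3.0],
  'Ana'  => [4.0, 2.0, 3.0],
  'Josh' => [1.0, 5.0, 0.0]
}

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Build a users-by-users table: similar[u][v] is the distance from u to v,
# rounded to two decimals like the Numpy.round call above.
similar = ratings.transform_values do |row|
  ratings.transform_values { |other_row| euclidean_distance(row, other_row).round(2) }
end

similar.each { |user, distances| puts "#{user}: #{distances}" }
```

The diagonal is 0.0 for every user, and the table is symmetric: the distance from Max to Ana equals the distance from Ana to Max.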
Get the distances between the user requesting the recommendation and each other user (and make them into a DataFrame)
max = similar.loc['Max'].sort_values(ascending: true)[1..5] # the five closest users; position 0 is the requester (distance 0.0)
# here, 'Max' stands in for the USER_ID of the user requesting the rec
df_max = Pandas.DataFrame.new(data: max, index: df.index)

Distance from each user to the requester
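The neighbor selection (sort ascending, then slice `[1..5]`) can be sketched in plain Ruby as follows. The names and distances are hypothetical.

```ruby
# Hypothetical row of the distance table for the requester, 'Max'.
distances_from_max = {
  'Max' => 0.0, 'Ana' => 2.24, 'Josh' => 7.07, 'Lee' => 3.0,
  'Kim' => 1.41, 'Sam' => 5.2, 'Pat' => 4.12
}

# Sort ascending (closest first), skip position 0 (the requester, at
# distance 0.0 in this data), and keep the next five users.
nearest = distances_from_max.sort_by { |_user, distance| distance }[1..5].to_h

puts nearest.keys.inspect # ["Kim", "Ana", "Lee", "Pat", "Sam"]
```

One thing to watch: skipping position 0 assumes the requester sorts first. If another user has identical ratings (also distance 0.0), removing the requester by name before sorting would be a sturdier choice.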
pivoted = Pandas.melt(df.reset_index(), id_vars: 'name', value_vars: df.keys)

Cocktail rating data melted into rows (one row per user and cocktail)
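As a rough picture of what the melt step produces, here is a plain-Ruby sketch that turns a wide table (one row per user, one column per drink) into long rows. The data is hypothetical; the `:name`, `:variable`, and `:value` keys mirror the column names that `melt` produces by default.

```ruby
# Wide format: one entry per user, one key per drink (0 = not rated).
ratings = {
  'Max' => { 'Mojito' => 5, 'Negroni' => 0 },
  'Ana' => { 'Mojito' => 4, 'Negroni' => 2 }
}

# Long format: one row per (user, drink) pair.
pivoted = ratings.flat_map do |name, row|
  row.map { |drink, rating| { name: name, variable: drink, value: rating } }
end

# Keep only real ratings, dropping the 0 placeholders.
rated_only = pivoted.reject { |row| row[:value].zero? }

rated_only.each { |row| puts row.inspect }
```

The long format makes it straightforward to join each rating with its rater's distance from the requester.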
scraped_pivot = pivoted[pivoted.value != 0]
total = df_max.reset_index().merge(scraped_pivot).dropna()
total['weightedRating'] = (1 / (total['Max'] + 1)) * total['value'] # weight = 1 / (distance + 1)
Note: We take the inverse of the distance (one divided by the distance) because SMALLER distances mean MORE similarity, but we need a LARGER number for MORE similarity. We add 1 to the denominator to avoid division by 0, which also makes the highest level of similarity (a distance of 0) equal to 1. Here is a post where one of the comments explains this quite well.

Weighted rating shows the rating multiplied by the inverse-distance weight of the user who gave it
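A few worked numbers may help show how the inverse-distance weight behaves. This is a minimal sketch; `similarity_weight` is a hypothetical helper name for the `1 / (distance + 1)` expression.

```ruby
# Convert a distance into a similarity weight in the range (0, 1].
def similarity_weight(distance)
  1.0 / (distance + 1.0)
end

puts similarity_weight(0.0) # 1.0  (identical ratings)
puts similarity_weight(1.0) # 0.5
puts similarity_weight(3.0) # 0.25

# A rating of 4 from a user at distance 3.0 contributes 0.25 * 4 = 1.0.
weighted_rating = similarity_weight(3.0) * 4
puts weighted_rating # 1.0
```

The parentheses around the denominator matter: without them, `1 / d + 1` evaluates as `(1 / d) + 1`, which still divides by zero when the distance is 0.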
similarity_weighted = total.groupby('variable').sum()[['Max','weightedRating']]

Distances are added to prep for taking an average
Remove drinks the requester has already rated (so we don't recommend something they've already tried)
max_pivot = pivoted[pivoted.name == 'Max'] # won't be 'Max', will be user_id (current_user.id?)

Strip out rows from the original pivoted rating data for the requester
reset_sw = similarity_weighted.reset_index()

Reset the index of the weighted data to ensure a good merge
unknown_to_user = reset_sw.merge(max_pivot).set_index('variable')
unknown_to_user = unknown_to_user[unknown_to_user.value == 0]

Take only the rows where the requester has a 0 (no rating yet)
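The filtering idea, keeping only the drinks the requester has not rated yet, can be sketched in plain Ruby. The scores and ratings below are hypothetical.

```ruby
# Hypothetical aggregated values per drink, computed from the neighbors.
candidate_scores = { 'Mojito' => 4.2, 'Negroni' => 3.1, 'Paloma' => 4.8 }

# The requester's own ratings (0 = not rated yet).
requester_ratings = { 'Mojito' => 5, 'Negroni' => 0, 'Paloma' => 0 }

# Keep only drinks where the requester's rating is 0; a drink missing
# from the requester's ratings is treated as unrated too.
unknown_to_user = candidate_scores.select do |drink, _score|
  requester_ratings.fetch(drink, 0).zero?
end

puts unknown_to_user.inspect # {"Negroni"=>3.1, "Paloma"=>4.8}
```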
empty = Pandas.DataFrame.new()
empty['weightedAvgRecScore'] = unknown_to_user['Max']/unknown_to_user['weightedRating']

weightedAvgRecScore calculated by dividing the sums of the distances for each cocktail by their weighted ratings
recommendation = empty[empty.weightedAvgRecScore == empty.weightedAvgRecScore.max()]

Return the most highly rated cocktail in the list of recommendations
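For comparison, a common way to score candidates in this kind of recommender is the similarity-weighted average: the sum of weight × rating divided by the sum of the weights, with weight = 1 / (distance + 1). The plain-Ruby sketch below implements that conventional formulation with hypothetical data. It is an assumption offered for illustration and differs from the pandas expression above, which divides the summed distances by the summed weighted ratings.

```ruby
# Distance from the requester to each neighbor (hypothetical values).
distances = { 'Ana' => 1.0, 'Kim' => 3.0 }

# Neighbors' ratings; a missing drink means "not rated".
neighbor_ratings = {
  'Ana' => { 'Paloma' => 5, 'Negroni' => 2 },
  'Kim' => { 'Paloma' => 3, 'Negroni' => 4, 'Daiquiri' => 4 }
}

# Drinks the requester has already rated are excluded from the results.
already_rated = ['Negroni']

weighted_sums = Hash.new(0.0) # drink => sum of weight * rating
weight_sums   = Hash.new(0.0) # drink => sum of weights

neighbor_ratings.each do |user, drinks|
  weight = 1.0 / (distances.fetch(user) + 1.0)
  drinks.each do |drink, rating|
    next if already_rated.include?(drink)
    weighted_sums[drink] += weight * rating
    weight_sums[drink]   += weight
  end
end

# Weighted average per drink, then pick the highest-scoring one.
scores = weighted_sums.map { |drink, total| [drink, total / weight_sums[drink]] }.to_h
recommendation, score = scores.max_by { |_drink, value| value }

puts "#{recommendation}: #{score.round(2)}" # Paloma: 4.33
```

Dividing by the sum of the weights keeps each score on the original rating scale, so a drink rated by many neighbors does not win simply because it has more ratings.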