# Methodology
## Preprocessing
- The dataset has the property that each product has at least 5 reviews/ratings and each user has given at least 5 reviews/ratings.
- To split the data into training and test sets, we moved at least one review per user and at least one review per product into the test set; the remaining reviews form the training set (a sketch of this split is shown below).
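A minimal sketch of such a split, assuming the reviews are in a pandas DataFrame with `user_id`, `product_id`, and `rating` columns (the column and function names are ours, not from the project code):

```python
import pandas as pd

def split_train_test(reviews, user_col="user_id", item_col="product_id", seed=42):
    """Hold out at least one review per user and at least one per product for testing."""
    shuffled = reviews.sample(frac=1.0, random_state=seed)

    # One review per user goes to the test set...
    per_user = shuffled.groupby(user_col).head(1)
    # ...and one review per product, drawn from the remaining rows.
    remaining = shuffled.drop(per_user.index)
    per_item = remaining.groupby(item_col).head(1)

    test = pd.concat([per_user, per_item])
    train = shuffled.drop(test.index)
    return train, test
```

Products whose reviews were all selected in the per-user pass are already covered in the test set, so every user and every product ends up with at least one test review.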
## Experiments
We compared and evaluated the following recommendation models:
- Global Average Model
- Baseline Model
- Collaborative Filtering
- Latent Dirichlet Allocation (LDA)
- Hidden Factors as Topics (HFT)
### Global Average Model
We started with this approach to see what results a model this straightforward can achieve. It simply predicts the global average rating for every user-item pair.
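A minimal sketch of this predictor, assuming the training ratings are available as a flat array (the class and parameter names are ours, not from the project code):

```python
import numpy as np

class GlobalAverageModel:
    """Predict the mean of all training ratings for every user-item pair."""

    def fit(self, train_ratings):
        # train_ratings: array-like of observed ratings from the training set
        self.global_mean = np.mean(train_ratings)
        return self

    def predict(self, user_id, item_id):
        # The prediction ignores the user and item entirely.
        return self.global_mean
```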
### Baseline Model
This model improves on the first one: in addition to the global average, it also incorporates a user bias and an item bias when making a prediction.
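One common way to estimate these biases is as mean deviations from the global average; the sketch below assumes that formulation and a list of `(user, item, rating)` triples, since the wiki does not show the project's actual implementation:

```python
import numpy as np
from collections import defaultdict

class BaselineModel:
    """Predict mu + b_u + b_i, where biases are mean deviations from the global average."""

    def fit(self, triples):
        # triples: a list of (user_id, item_id, rating) tuples from the training set
        ratings = [r for _, _, r in triples]
        self.mu = np.mean(ratings)

        user_devs, item_devs = defaultdict(list), defaultdict(list)
        for u, i, r in triples:
            user_devs[u].append(r - self.mu)
            item_devs[i].append(r - self.mu)
        self.b_u = {u: np.mean(d) for u, d in user_devs.items()}
        self.b_i = {i: np.mean(d) for i, d in item_devs.items()}
        return self

    def predict(self, u, i):
        # Unseen users or items fall back to a zero bias.
        return self.mu + self.b_u.get(u, 0.0) + self.b_i.get(i, 0.0)
```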
### Collaborative Filtering
In this approach we used user-user collaborative filtering, with Pearson correlation as the similarity measure between two users. The ratings of the 10 nearest neighbors are used both to predict ratings and to generate recommendations.
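A sketch of this scheme, assuming ratings are stored as a nested dict `{user_id: {item_id: rating}}` (the data structure and function names are ours, not from the project code):

```python
import numpy as np

def pearson(u_ratings, v_ratings):
    """Pearson correlation over the items both users have rated."""
    common = set(u_ratings) & set(v_ratings)
    if len(common) < 2:
        return 0.0
    u = np.array([u_ratings[i] for i in common])
    v = np.array([v_ratings[i] for i in common])
    u_c, v_c = u - u.mean(), v - v.mean()
    denom = np.linalg.norm(u_c) * np.linalg.norm(v_c)
    return float(u_c @ v_c / denom) if denom > 0 else 0.0

def predict_rating(target_user, item, ratings_by_user, k=10):
    """Predict target_user's rating for item from the k most similar users who rated it."""
    sims = []
    for other, other_ratings in ratings_by_user.items():
        if other == target_user or item not in other_ratings:
            continue
        sim = pearson(ratings_by_user[target_user], other_ratings)
        sims.append((sim, other_ratings[item]))
    neighbors = sorted(sims, key=lambda s: s[0], reverse=True)[:k]

    num = sum(sim * rating for sim, rating in neighbors)
    den = sum(abs(sim) for sim, _ in neighbors)
    # Fall back to the target user's own mean rating when no usable neighbor exists.
    if den == 0:
        return float(np.mean(list(ratings_by_user[target_user].values())))
    return num / den
```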
### Latent Dirichlet Allocation (LDA)
Every word in a review text belongs to one or more topics. With LDA we can only observe the text and its words, not the topics themselves, so we tried to recover these hidden topics from the review text; their distribution is shown below.

![LDA topic distribution](https://github.com/GetRecced/IR670_Spring2018/blob/master/Images/Data%20Analysis/LDA_Topics.png)

Once we have the word distribution per topic, we calculate the topic distribution for each document. Here, all the reviews of one product are treated as one document; similarly, all the reviews written by one user are treated as one document.
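One way to fit such a model is with gensim's `LdaModel`; the wiki does not say which library the project used, so the sketch below (including the tokenization assumption, topic count, and pass count) is illustrative only:

```python
from gensim import corpora, models

def fit_lda(documents, num_topics=5):
    """Fit LDA on a list of documents, each given as a list of tokens.

    Here a "document" would be all reviews of one product (or all reviews by one
    user), concatenated and tokenized upstream.
    """
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)

    # Per-document topic distributions and the top words of each topic.
    doc_topics = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
    topic_words = [lda.show_topic(t, topn=10) for t in range(num_topics)]
    return lda, doc_topics, topic_words
```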
### Hidden Factors as Topics (HFT)
The HFT model takes advantage of both ratings and reviews by combining a latent factor model with Latent Dirichlet Allocation. For each user and item we learn 5 latent factors, which can be interpreted as user preferences and product properties respectively. These factors are learned from the review text in addition to the global average rating and the user and item biases.
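As a rough sketch of how the two sides fit together (following the standard HFT formulation of McAuley & Leskovec, 2013; the symbol and function names below are ours, not from the project code): the rating side is a biased latent-factor prediction, and the review side ties each item's factors to a topic distribution through a softmax.

```python
import numpy as np

K = 5  # number of latent factors / topics, matching the 5 factors mentioned above

def hft_predict(mu, b_u, b_i, gamma_u, gamma_i):
    """Rating side of HFT: global average + user bias + item bias + factor interaction."""
    return mu + b_u + b_i + float(np.dot(gamma_u, gamma_i))

def item_topic_distribution(gamma_i, kappa=1.0):
    """Review side of HFT: a softmax links the item's K factors to a K-topic distribution.

    kappa controls how peaked the topic distribution is; it is learned jointly with
    the factors and biases when optimizing the combined rating + review objective.
    """
    z = np.exp(kappa * np.asarray(gamma_i))
    return z / z.sum()
```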