Experimental Evaluation - sschia/magres GitHub Wiki
To carry out the experiments, we used a dataset corresponding to the Yelp Dataset Challenge, which contains information about Yelp, a popular LBSN. This type of LBSN is based on users' check-ins: users are encouraged to share their current locations, such as restaurants or museums. With information about the real-time location of users, a person can discover friends (from their social network) near their location to enable social activities in the physical world, for example, by inviting people to dine or go shopping. Users can also add comments to places, which other users can read as tips. In this type of service, actions such as check-ins, adding a comment, and publishing photos must be associated with a location.
The dataset contains check-ins from various cities around the world, but for the purpose of the experiments we only considered check-ins belonging to the state of Arizona. We chose to sample the dataset because (i) most datasets used for testing location-based recommender systems (LBRS) are extracted from single major cities, such as New York, Los Angeles, Tokyo, etc. (Yang et al., 2013), and (ii) we want to recommend POIs to a group based on their geo-location relationships, and information outside their home city is not very relevant for our approach. We decided to use the Arizona data in particular because it is large and significant enough for recommendation. Also, we only considered users with at least 9 check-ins (or ratings). The resulting dataset contains 19,000 users. Regarding the distribution of check-ins, the dataset contains 497,029 check-ins (ratings) on 33,136 POIs generated by the selected users. Places in this LBSN are organized into a three-level hierarchy. The first level, the most generic one, contains categories such as Food, Arts & Entertainment, and Active Life. The second level contains categories such as Bagels, Specialty Food, Art Galleries, Casinos, Diving, and Fitness & Instruction. The third level contains categories such as Brewpubs, Candy Stores, Pasta Shops, Outdoor Movies, Ranches, and Dance Studios. In this dataset, places can belong to several categories simultaneously. Regarding the graph processed in the dataset, the social graph consists of 298,361 nodes (users) and 2,102,231 edges (friendship relationships between users).
In order to evaluate the effectiveness of MAGReS, we compared the recommendations produced by traditional group recommendation techniques against those produced by agent negotiation. As baselines for this comparison, we implemented two recommender systems based on traditional approaches commonly used in the literature, namely:
- TRADGRec-PA: a GRS that uses a traditional preference aggregation approach, as proposed in (Masthoff, 2015; Felfernig et al., 2018).
- TRADGRec-RA: a GRS based on the aggregation of recommendations produced for each group member. This baseline is based on the approach proposed in chapter 2 (Section 2.4) of (Felfernig et al., 2018), which generates a recommendation containing k items for each group member and then combines all those recommendations into a single recommendation. To achieve this, the group preference (or rating) of each candidate (an item recommended to a group member) is computed by aggregating the preferences (either existing or predicted) of each group member over the candidate, and then the k candidates with the highest group preference values are selected to be part of the group recommendation.
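As a rough illustration of the recommendation-aggregation idea, the merging step could be sketched as follows. This is a minimal sketch: the function and variable names are ours, and for brevity a missing preference defaults to 0 instead of being predicted by the SUR as in the actual approach.

```python
def aggregate_recommendations(member_recs, k):
    """Merge per-member recommendation lists into one group recommendation.

    member_recs: one dict per group member mapping item -> (existing or
    predicted) rating.  The group preference of a candidate is the average
    of the members' ratings for it (Average aggregation strategy).
    NOTE: a member with no rating for a candidate contributes 0 here; in
    the actual approach that rating would be predicted by the SUR.
    """
    candidates = set()
    for recs in member_recs:
        candidates.update(recs)
    group_pref = {
        item: sum(recs.get(item, 0.0) for recs in member_recs) / len(member_recs)
        for item in candidates
    }
    # Keep the k candidates with the highest group preference value.
    return sorted(group_pref, key=group_pref.get, reverse=True)[:k]
```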
As an additional baseline, we also implemented in MAGReS a protocol known as the One-Step protocol (Rosenschein and Zlotkin, 1994). The One-Step protocol is a variant of MCP in which the whole negotiation occurs in a single round. The agents simply exchange their proposals (one proposal each) and seek an “agreement”, but there is no concession. The proposal each agent makes is selected from a pool of candidate proposals generated by asking a SUR for a list of recommendations for the user represented by the agent. To choose a proposal, the agents use the Egocentric initial proposal strategy (see Section [subsec:MCP]). To select the agreement, all the proposals are evaluated by all the agents using their utility functions, and the proposal with the highest product of utilities is selected. It is important to notice that when using this protocol the negotiation will rarely end in a conflict, since for that to happen none of the agents can be able to make a proposal (i.e., the SUR must be unable to generate recommendations for all the users represented by the agents, which can happen if none of the group members has rated enough items, but is rare).
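The agreement selection of the One-Step protocol (pick, among the exchanged proposals, the one with the highest product of utilities) can be sketched as follows; the names are illustrative, and the utility functions are assumed to map items to [0, 1]:

```python
def one_step_agreement(proposals, utilities):
    """Select the One-Step agreement.

    proposals: the items proposed by the agents (one each).
    utilities: one utility function per agent, item -> [0, 1].
    Returns the proposal maximizing the product of all agents' utilities,
    or None when no agent could propose anything (a conflict, which is rare).
    """
    if not proposals:
        return None

    def utility_product(item):
        prod = 1.0
        for u in utilities:
            prod *= u(item)
        return prod

    return max(proposals, key=utility_product)
```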
In this section we detail the metrics used for assessing the group recommendations generated by MAGReS and the baselines. It is important to notice that, as explained in (Harper and Konstan, 2015; Trattner et al., 2018), most of the datasets used to evaluate group recommendations (including the one used for this article) are derived from single-user recommender systems (SURs), which implies that they lack group ratings. Because of this, there is no ground truth and thus it is not possible to follow a standard evaluation methodology such as the one for assessing individual recommendations, which normally divides the dataset into train and test parts and then computes metrics such as precision, recall, nDCG, and HitRate, among others. This issue has been documented in recent literature (Trattner et al., 2018). According to these related works, two approaches have been proposed to deal with it: (i) to “build” the ground truth (i.e., to synthesize a dataset containing group ratings) (Baltrunas et al., 2010); in this case, most authors create the groups using clustering techniques in order to group users that have rated the same items, which helps them minimize the number of ratings being predicted; (ii) to use custom metrics for assessing the recommendations (Bekkerman et al., 2006; Boratto et al., 2016; Nguyen and Ricci, 2017).
For our experiments we chose the second approach, as we intended to use randomly generated groups instead of clustering-based ones (as some authors do). In particular, for measuring the quality of the group recommendations in terms of satisfaction at the group and individual levels, we defined a set of metrics called satisfaction metrics. We also used two well-known error metrics, MAE and RMSE, but only to assess the quality of the individual recommendations produced by the SURs.
Group Satisfaction (GS): It measures the satisfaction of the group with respect to an item (or a list of items) being recommended. The satisfaction is equivalent to the estimated preference (or rating) of the user/group with respect to the item.
- item level: the group satisfaction for an item x_{j} is computed as GS(x_{j})=\frac{1}{n}\sum_{i=1}^{n}S_{i}(x_{j}), where n is the number of group members (\mid g\mid) in the group g and S_{i}(x_{j}) is the satisfaction of group member u_{i} with item x_{j}.
- recommendation level: the GS of a recommendation r consisting of k items, r=<x_{1},x_{2},...,x_{k}>, is computed as the average of the GS of every item in r.
Members Satisfaction Dispersion (MSD): It assesses how uniformly the group members are satisfied by either a single item x_{j} or a recommendation r. The lower the MSD, the more uniformly satisfied the group members are.
- item level: the MSD for an item x_{j} is computed as the standard deviation of the group members' satisfaction.
- recommendation level: the MSD for a recommendation r that consists of k items, r=<x_{1},x_{2},...,x_{k}>, is computed as the average of the MSD of every item in r.
Fairness: It is a metric proposed in (Trattner et al., 2018) for evaluating the recommendation of an item (x_{j}) to a group. It is defined as the percentage of group members satisfied by the recommendation. To determine which users are satisfied, the authors set a threshold th of 3.5 stars (out of 5 stars, the equivalent of 0.7 out of 1), and any group member with a satisfaction value above the threshold is considered satisfied. We kept th=0.7 and extended this metric so it can be applied to a recommendation r of k items: the fairness of a recommendation r of k items is computed as the average of the fairness of its items.
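The three satisfaction metrics above can be summarized in a small sketch. Two assumptions are ours: MSD uses the population standard deviation (the text does not specify which variant), and "above the threshold" is taken as a strict comparison.

```python
from statistics import mean, pstdev

def group_satisfaction(sat):
    """Item-level GS: average of the members' satisfactions S_i(x_j)."""
    return mean(sat)

def msd(sat):
    """Item-level MSD: standard deviation of the members' satisfactions.
    Assumption: population variant (the article does not say which)."""
    return pstdev(sat)

def fairness(sat, th=0.7):
    """Item-level fairness: share of members satisfied above threshold th."""
    return sum(1 for s in sat if s > th) / len(sat)

def recommendation_metric(item_sats, metric):
    """Recommendation-level score: average of the item-level scores."""
    return mean(metric(sat) for sat in item_sats)
```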
Error Metrics
These metrics were only used for evaluating the quality of the predictions made by the SURs when using different neighborhood selection strategies. We used two error metrics available in the Mahout framework: MAE (Mean Absolute Error) and RMSE (Root Mean Square Error). For both metrics, we took into account both the score of the metric and the coverage. The latter was measured as the number of predictions (those the SUR made during the evaluation of the metric) that succeeded over the total number of predictions attempted (which equals the number of ratings in the testing set of the evaluation).
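A minimal sketch of how MAE, RMSE and coverage can be computed together, assuming failed predictions are reported as None (this representation is ours, not Mahout's):

```python
import math

def mae_rmse_with_coverage(predictions):
    """predictions: list of (predicted, actual) rating pairs, where
    predicted is None when the SUR could not produce an estimate.
    Coverage is the share of test ratings for which a prediction succeeded;
    MAE and RMSE are computed over the successful predictions only."""
    ok = [(p, a) for p, a in predictions if p is not None]
    coverage = len(ok) / len(predictions)
    mae = sum(abs(p - a) for p, a in ok) / len(ok)
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in ok) / len(ok))
    return mae, rmse, coverage
```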
The objectives of the experiments reported in this article are the following:
- To compare the recommendations generated by MAGReS, TRADGRec-PA and TRADGRec-RA in the POI domain.
- To study how the use of LBSN information in the recommendation process can affect the recommendations generated by MAGReS and the baselines.
To carry out the experiments, we performed a series of steps:
- Generate the n groups involved in the test.
- Determine the number of items (k) to be recommended.
- Configure the environment, which involves: (a) selecting the Single-User Recommender (SUR) and the neighborhood selection strategies. (b) selecting the parameterization of each approach (MAGReS, TRADGRec-PA, TRADGRec-RA).
- Using each approach, generate a group recommendation containing k items for each of the n groups involved in the test.
- Evaluate the recommendations, by computing the metrics GS, MSD and fairness for each recommendation.
- Compare the approaches, by computing the average and standard deviation of the metrics computed in the previous step for each approach.
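The steps above can be condensed into a small evaluation loop. This is only a sketch: `recommend` and `evaluate` are hypothetical stand-ins for the actual recommenders (MAGReS and the baselines) and the metric computations.

```python
from statistics import mean, pstdev

def run_experiment(groups, approaches, k, evaluate):
    """groups: the n pre-generated groups; approaches: dict mapping an
    approach name to a callable (group, k) -> recommendation; evaluate:
    callable recommendation -> metric score (e.g. GS, MSD or fairness).
    Returns, per approach, the (average, standard deviation) of the score."""
    results = {}
    for name, recommend in approaches.items():
        scores = [evaluate(recommend(g, k)) for g in groups]
        results[name] = (mean(scores), pstdev(scores))
    return results
```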
The groups used in the experiments were randomly generated, but with a single restriction: the group members had to be direct friends in the social network (level 1 friendship relationships). At first we tried generating purely random groups, i.e., without using information from the LBSN, but we found that, given the sparsity of the dataset and the number of users, the probability of the group members sharing interests was low, and so was the quality of the recommendations for the groups. Because of this, we decided to use the LBSN information regarding user friendship relationships. We created 45 groups of 3, 4 and 5 people (15 groups of each size).
For this experiment we set the number of recommendations (k) to 10, as it is a common number of recommendations in the literature (top-10) (Felfernig et al., 2018; Karypis, 2001). In the case of MAGReS (whether using the One-Step or the Monotonic Concession protocol), we ran the protocol k times in order to produce the k recommendations. For a given run, we removed from the negotiation space the items agreed upon by the agents in the previous runs.
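The repeated-negotiation scheme for producing k items can be sketched as follows; the `negotiate` callable is a hypothetical stand-in for one full run of the One-Step or MCP protocol.

```python
def top_k_by_negotiation(negotiate, candidate_items, k):
    """Run the protocol k times; after each run, remove the agreed item
    from the negotiation space so the next round yields a new item.

    negotiate: callable taking the remaining item set and returning the
    agreed item, or None on conflict (hypothetical signature, for
    illustration only)."""
    space = set(candidate_items)
    recommendation = []
    for _ in range(k):
        agreed = negotiate(space)
        if agreed is None:
            # Conflict: the recommendation may end up with fewer than k items.
            break
        recommendation.append(agreed)
        space.discard(agreed)
    return recommendation
```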
The SUR used both by the agents in MAGReS (to build their candidate proposals) and by the traditional approaches TRADGRec-PA (to generate recommendations for the group user) and TRADGRec-RA (to generate the individual recommendations that are later aggregated into a single group recommendation) was implemented using the Mahout framework. In particular, we used the User-Based (UB) implementation of the recommender (a CF-based recommender) along with the Pearson (weighted) similarity metric. With regard to the neighborhood selection strategies, we analyzed the following:
- NearestN (NN): it is available in the Mahout framework and it does not use LBSN information.
- NearestNUserZone (NNUZ): it is an implementation of the strategy proposed in (Rios et al., 2018) and uses geolocation information from the LBSN to select the neighbors of a user. NNUZ only considers the geographical relationships among users, not their social relationships (such as friendship). This strategy works by first computing the “movement area” of each user (i.e., the area containing the places the user has rated and usually visits) and then selecting as neighbors of the target user those users whose movement area is close to the target user's movement area, i.e., within a certain radius (a parameter), independently of the ratings they have given to the same places (POIs, items). Regarding the performance of this strategy, even though it is necessary to compute all the distances among the user-user movement areas (a computation that can be performed offline), the number of neighbors decreases, which results in: (i) a smaller number of user-user similarity computations to perform (computing the distance between the target user and the rest of the users in the dataset is cheaper than computing the user-user similarity among them), and (ii) a potential improvement in the quality of the predictions (as only the more relevant neighbors, those with habits similar to the target user's, are taken into consideration).
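A simplified sketch of the NNUZ idea follows. Note the assumption: each user's "movement area" is reduced here to the centroid of their rated POIs, whereas the original strategy (Rios et al., 2018) computes an actual area, so this only approximates the idea.

```python
import math

def centroid(points):
    """Center of a set of (x, y) POI coordinates."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def nnuz_neighbors(target, users, radius_km, n):
    """users: dict user_id -> list of (x, y) coordinates of rated POIs
    (coordinates assumed to be in km for simplicity).
    Neighbors are the users whose movement-area centroid lies within
    radius_km of the target's, capped at n and sorted by distance.
    Ratings play no role in this selection step."""
    tx, ty = centroid(users[target])
    close = []
    for uid, places in users.items():
        if uid == target:
            continue
        ux, uy = centroid(places)
        d = math.hypot(ux - tx, uy - ty)
        if d <= radius_km:
            close.append((d, uid))
    return [uid for _, uid in sorted(close)[:n]]
```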
In both cases we tested 5 neighborhood sizes (N): 50, 100, 200, 300 and 500. For NNUZ we used three different values for the radius parameter (measured in kilometers): 1, 3 and 5. In the following table we detail the parameterizations used for the neighborhood selection strategies and the codenames (IDs) we will later use to refer to them.

Aside from the parameters of the SUR that MAGReS, TRADGRec-PA and TRADGRec-RA use, the number of parameters that can be tuned strongly depends on the approach. Table [tab:Parameters-for-each-approach] shows the most relevant parameters of each approach and their possible values.
- In TRADGRec-PA and TRADGRec-RA there is only one parameter that can be modified: the aggregation strategy. In the first case, the aggregation strategy is used to aggregate the individual (group members) preferences (or ratings), either given or predicted, when building the group preference profile (i.e. the group rating of an item is computed as the aggregation of the ratings of the group members over that item). In the second case, as explained at the beginning of Section [subsec:ExpSetting-Single-User-Recommender-(SUR)], the aggregation strategy is used also to aggregate preferences but when merging the recommendations built for each of the group members into a single recommendation (the group recommendation).
- In MAGReS the number of configurable parameters is larger, as there are several strategies that dictate the behavior of the agents (like the initial proposal strategy) and the protocol (like the agreement and concession strategies). Since there are many combinations of parameters, only those considered relevant for this article were used in the experiments.

Parameters for each approach
To easily identify the parameterizations of the approaches, we will use the following IDs: MAGReS [One-Step] and MAGReS [MCP] for the variants of MAGReS using the One-Step and MCP negotiation protocols, respectively (the rest of the parameters, such as the concession strategy, are fixed to the values set in the table), and TRADGRec-PA [Average] and TRADGRec-RA [Average] for the variants of TRADGRec-PA and TRADGRec-RA (respectively) that use the average aggregation technique.

One of the objectives of this work was to evaluate the negotiation-based group recommendation approach in the context of POI recommendation. For this, we conducted tests using the NN neighborhood selection strategy.
As can be seen in the following Table, independently of the size of the neighborhood (n), MAGReS [MCP] outperforms TRADGRec-PA [Average] and TRADGRec-RA [Average] in terms of GS (group satisfaction, see Section [subsec:Evaluation-criteria]), as all the recommendations produced by the former had a higher GS than those generated by the latter two. Additionally, it can be seen that MAGReS [One-Step] was able to outperform only TRADGRec-PA [Average], but failed to do so when compared to MAGReS [MCP] and TRADGRec-RA [Average], which can be explained by how the One-Step protocol works.

With regard to TRADGRec-PA [Average], it can be observed that it performed worse than any of the other approaches. This is caused by how the approach works: the recommendations are generated for a virtual user (the group user) who owns the group preference model computed by aggregating the individual preference models of the group members. Due to how CF (collaborative filtering) works, the more preferences a user has in her preference profile, the more likely it is to find neighbors during the rating/preference prediction process. Given that the group user's preference profile is made up of all the preferences of the group members, the group user is highly likely to have many more neighbors than the individual users (the group members). Then, when the GS of the group recommendation r (i.e., the items recommended to the group user) is computed, the S_{i}(x_{j}) (satisfaction of group member u_{i} with item x_{j}, see Section [subsec:Evaluation-criteria]) for each item x_{j}\in r has to be computed as well. The problem with TRADGRec-PA [Average] resides in that computation: when computing S_{i}(x_{j}) it might happen that either no neighbors can be found for group member u_{i} that have rated x_{j}, or that the number of neighbors found for u_{i} is not enough for the similarity metric to produce a feasible prediction, causing the prediction to be 0 (zero) and thus lowering the GS score of the group recommendation.
Regarding MSD and fairness, the following tables show that MAGReS [MCP] outperformed all the other approaches. This indicates that MAGReS [MCP] recommendations not only have a higher group satisfaction (GS) score, but also satisfy the group members more uniformly (because of the lower MSD score) and ensure that a higher percentage of the group members will be satisfied with the recommendation (due to the higher fairness score).


With regard to the three previous tables, it is important to notice that for NN [n=50], even though MAGReS [MCP] outperformed all the other approaches in terms of GS, MSD and fairness, the standard deviations were really high. For example, the average GS for MAGReS [MCP] is 0.5642 and the standard deviation is 0.4059, which denotes a high variation in the GS of the recommendations. When we investigated the situation, we noticed that for many groups (15 out of 45) the recommendations produced by MAGReS [MCP] were empty, which caused the GS, MSD and fairness of those group recommendations to be 0 (zero). If we exclude the empty recommendations, the averages and standard deviations for GS, MSD and fairness are 0.8463\pm0.0548, 0.0853\pm0.0982 and 0.9344\pm0.095, respectively. The main cause behind this problem was the rating predictions made by the SURs (the ones the agents use to build their candidate proposals and to evaluate the utility of the proposals they make and receive), as they were impacted by the low number of neighbors each user had (which caused many predictions to be zero). As a result, each agent had fewer candidate proposals to use during the negotiation, which increased the chances of the negotiations ending in conflict (and thus of the group recommendations being empty).
The results were validated through statistical analysis. For each neighborhood size (n), we first performed a Shapiro-Wilk test to determine whether the samples were normal. Given that in most cases there was at least one sample that did not follow the normal distribution, we performed a pairwise Wilcoxon Signed Ranks test (a non-parametric test) to compare each pair of recommendation techniques (for example, MAGReS [One-Step] versus MAGReS [MCP], or MAGReS [MCP] versus TRADGRec-RA [Average]). To do this, we first used a two-sided test to determine whether the two samples were significantly different from one another and then, if they were, we used two one-sided tests to determine which of the samples was greater or less than the other. For the experiments in Tables GS-NN and Fairness-NN we wanted to test whether one approach A (e.g., MAGReS [MCP]) was significantly better than another B (e.g., TRADGRec-PA [Average]) in terms of group satisfaction, so the null hypothesis was “the sample of approach A is not greater than the sample of approach B”. For the experiments reported in Table MSD-NN we wanted to test whether one approach A was significantly better than another B in terms of MSD, and therefore the null hypothesis was “the sample of approach A is not lower than the sample of approach B”. The statistical analysis confirmed our observations as, for each test, the null hypothesis was rejected at a significance level of \alpha=0.05 (95% confidence).
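For reference, the statistic underlying the Wilcoxon Signed Ranks test can be sketched in pure Python; in practice a statistics package such as scipy.stats.wilcoxon would be used, as it also provides the p-values needed for the one- and two-sided tests described above.

```python
def wilcoxon_statistic(a, b):
    """Wilcoxon signed-rank statistic W for paired samples a and b
    (e.g. the per-group GS scores of two recommendation approaches).
    Zero differences are dropped and ties on |difference| receive average
    ranks; only the statistic is computed here, not the p-value."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of equal |differences| and assign the average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```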
For the NearestNUserZone (NNUZ) neighborhood selection strategy we conducted the same tests as for the NN strategy (see Section [subsec:Results-NN]), and found similar results except when the zone radius (zr) parameter was set to 1 (i.e., 1 km). As can be seen in Tables GS-NNUZ, MSD-NNUZ and Fairness-NNUZ, MAGReS [MCP] outperformed all the other approaches in terms of GS, MSD and fairness for all the neighborhood sizes when the zr parameter was set to either 3 or 5. When zr was set to 1 we found that, regardless of the neighborhood size (n) chosen, the same thing that happened with NN [n=50] happened with NNUZ: many of the group recommendations were empty, which lowered the averages and increased the standard deviations for all the metrics (GS, MSD and fairness).



The results were validated through statistical analysis. For each neighborhood size (n) and zone radius (zr), we performed the same tests as in Section [subsec:Results-NN]: we tested normality first and then performed a pairwise Wilcoxon Signed Ranks test to compare each pair of recommendation approaches (for example, MAGReS [MCP] versus TRADGRec-RA [Average]) with respect to GS, MSD and fairness. The statistical analysis confirmed our observations as, for each test, the null hypothesis was rejected at a significance level of \alpha=0.05 (95% confidence).
In the previous sections we evaluated our approach with two different neighborhood selection strategies: one that does not use the information provided by the LBSN (NN) and one that does (NNUZ). Even though our observations confirmed that MAGReS [MCP] was able to outperform the other approaches in terms of GS, MSD and fairness, it is not possible to make a straightforward comparison among the average GS and MSD scores of the same approach (for example, MAGReS [MCP]) when different neighborhood selection strategies are used (which would be required to compare the performance of NN and NNUZ in terms of GS and MSD). This is due to the way CF algorithms work: changing the neighborhood selection strategy (or its parameters) changes the way the SUR (which uses CF to make predictions and recommendations) computes the rating prediction for a <user,item> pair, thus changing the S_i(x_j) used by both GS and MSD. As a result, to be able to evaluate the impact of using the LBSN data in the recommendation process, we decided to focus on the following factors:
- The average number of items recommended to each group.
- The quality of the recommendations produced by the recommenders, which was measured through the fairness of the recommendation. It is important to notice that even though the fairness computation “uses” S_{i}(x_{j}) (i.e., the satisfaction level of user u_{i} with respect to an item x_{j}), which depends on the SURs' rating predictions and is thus affected by changing the neighborhood selection strategy, it can still be used to compare recommendations generated with different neighborhood selection strategies. This is because, according to its equation, the fairness score of a recommendation measures the percentage of group members who were satisfied above a given threshold (independently of how that satisfaction value was computed).
- The quality of the (rating) predictions made by the SURs, which was measured using the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE).
- The trade-off between the quality of the predictions made by the SUR and the number of users that had to be explored to compute each neighborhood.

Regarding the number of items recommended to each group, Table [fig:Effective-Recs-NN-and-NNUZ] shows that MAGReS [One-Step] was the only approach that, independently of the neighborhood selection strategy and the size of the neighborhood, was able to generate the 10 (k=10) recommendations requested for each of the 45 groups involved in the test. This was expected because of how the One-Step protocol works (see Section [subsec:Baselines-definition]): as long as the user agents are able to make one proposal, one item will be recommended to the group. Also as expected, due to how NNUZ works, the number of neighbors per user was lower (see Table [tab:NeighborsPerUser]), which ultimately caused the number of items recommended to be lower than 10 when NNUZ was used (because of what we explained in Section [subsec:Results-NN] with regard to NN [n=50]). This was especially noticeable for zone radius (zr) values 1 and 3.
With regard to the quality of the recommendations, we observed that the fairness score of the recommendations generated when using NNUZ only improved for MAGReS [One-Step], MAGReS [MCP] and TRADGRec-RA when the neighborhood size (n) was 50 and the zr parameter was set, in most cases, to either 3 or 5. This shows that, when the number of neighbors is low, using the LBSN information during the neighbor selection process can help select “better neighbors” (those with interests closer to those of the target user), which in the end helps improve the quality of the recommendation. When n was set to 100, 200, 300 and 500, the fairness score of the recommendations produced when using NNUZ was, in most cases, lower than the fairness score of the recommendations produced when using NN. This was especially noticeable when the zr parameter was set to either 1 or 3 (in the case of TRADGRec-PA, this happened for every neighborhood size and zr value), and the reason behind it is the same one explained in Sections NN and NNUZ with regard to how the number of neighbors affects the number of items recommended, which later translates into many group recommendations having GS, MSD and fairness scores equal to 0 (zero). Additionally, as can be observed in Table [tab:Fairness-changes:-NNUZ-vs-NN], the parameterizations of NNUZ that produced the best results in terms of fairness were those in which the zr parameter was set to 5 (namely: NNUZ[n=50, zr=5], NNUZ[n=100, zr=5], NNUZ[n=200, zr=5], NNUZ[n=300, zr=5] and NNUZ[n=500, zr=5]), as for MAGReS [MCP] the fairness loss was not higher than 5%, and for TRADGRec-RA [Average] it was not higher than 14.32%.

The next factor we analyzed was how replacing a neighborhood selection strategy that does not use LBSN information (NN) with one that does (NNUZ) affected the quality of the SURs' predictions and recommendations. For this we used the MAE and RMSE evaluators provided by the Mahout framework. The table of ratings (or check-ins), which contains 497,029 entries, was split in two parts: 70% of the ratings (347,920 ratings) were used for training, and the remaining 30% (149,109 ratings) for testing. In the following Table we show the results of the tests for all the neighborhood sizes. As can be seen in that table, even though in most cases the best MAE and RMSE scores were achieved when using NN, the coverage was lower than when using NNUZ, so we can conclude that NNUZ was better in terms of MAE and RMSE. However, when n was set to 300 and 500, NN outperformed NNUZ in terms of MAE and RMSE, as not only were the error scores better but the coverage was also higher. Due to how NNUZ computes the neighborhood of a target user, many users had fewer neighbors than the number of neighbors requested (300 and 500); for example, when using NNUZ[n=500, zr=3] only 89 of the 180 users involved in the test had at least 500 neighbors. This caused many predictions to fail and therefore the coverage to be lower.

Finally, we analyzed the trade-off between the quality of the predictions made by the SUR and the cost of finding the neighbors for a target user. According to what can be seen in Table [tab:NeighborsPerUser], and what was explained in the previous paragraph regarding MAE and RMSE, it is possible to say that NNUZ helps select users more efficiently (it needs to explore fewer users to find relevant neighbors) while maintaining (and even improving) the quality of the rating predictions made by the SURs. Moreover, and particularly for MAGReS [One-Step], MAGReS [MCP] and TRADGRec-RA [Average], the loss in fairness score incurred when using NNUZ with the zr parameter set to 5 could be considered minimal (15% in the worst-case scenario) if we take into account that the neighbors are being selected more efficiently. The efficiency was measured in terms of the number of users (of the dataset) the selection strategy had to explore when computing the neighborhood of a target user. This is because once the users are selected, the user-user similarity computation has to be performed, and the larger the number of users, the more expensive that computation becomes (as the user-user similarity matrix gets bigger), which has a direct impact on the time it takes to make a recommendation to either a single user or a group. For example, to generate a group recommendation of 10 items to a group of 5 individuals, TRADGRec-RA needs 17.8 seconds when using NN[n=500] versus 3.9 seconds when using NNUZ[n=500, zr=5].
Overall, it is possible to say that, especially for small neighborhoods, using the information provided by the LBSN in the neighbor selection process proved helpful: not only can it help select “better” neighbors (which later translates into an improvement in the quality of the predictions and recommendations), but it also makes the selection process more efficient in terms of how many user-user similarity computations have to be made, all while maintaining (and even improving) the quality of the recommender's predictions.
