Evaluator
LensKit provides a flexible framework for conducting offline evaluations of recommenders. Currently, train-test evaluation of recommender prediction accuracy is supported; in the future, we will be adding additional evaluation capabilities.
Warm-up Example
Let's start with an example:
import org.grouplens.lenskit.knn.item.*
import org.grouplens.lenskit.baseline.*
import org.grouplens.lenskit.transform.normalize.*

trainTest {
    // split the MovieLens 100K ratings into train-test folds
    dataset crossfold("ml-100k") {
        source csvfile("u.data") {
            delimiter "\t"
            domain {
                minimum 1.0
                maximum 5.0
                precision 1.0
            }
        }
    }

    // personalized mean: user mean offsets over item mean ratings
    algorithm("PersMean") {
        bind ItemScorer to UserMeanItemScorer
        bind (UserMeanBaseline, ItemScorer) to ItemMeanRatingItemScorer
    }

    // item-item collaborative filtering with mean-centering normalization
    algorithm("ItemItem") {
        bind ItemScorer to ItemItemScorer
        bind UserVectorNormalizer to BaselineSubtractingUserVectorNormalizer
        within (UserVectorNormalizer) {
            bind (BaselineScorer, ItemScorer) to ItemMeanRatingItemScorer
        }
    }

    metric CoveragePredictMetric
    metric RMSEPredictMetric
    metric NDCGPredictMetric

    output "eval-results.csv"
}
Save this script as eval.groovy in the ML-100K data directory and run lenskit-eval (included in the LensKit binary distribution).
This script does a few things:
- Splits the MovieLens 100K data set into 5 partitions for cross-validation.
- Generates predictions for test user/item pairs using two algorithms: personalized mean and item-item CF.
- Evaluates these two algorithms with three metric families: coverage, RMSE, and nDCG.
- Writes the evaluation results to eval-results.csv, one row for each combination of algorithm and fold.
You can then load eval-results.csv into R, Excel, LibreOffice, or your favorite data analysis tool to inspect and plot the algorithm performance. So let's use R and draw a box plot of the per-user RMSE:
library(ggplot2)
library(data.table)
results = data.table(read.csv("eval-results.csv"))
ggplot(results[, list(RMSE=mean(RMSE.ByUser)), by=list(Algorithm, Partition)]) +
    aes(x=Algorithm, y=RMSE) +
    geom_boxplot()
Walking through the script
To run an evaluation, you need four basic things:
- Data to evaluate with.
- Algorithms to evaluate.
- Metrics to measure their performance.
- Somewhere to put the output.
In LensKit, the train-test evaluator builds and tests the algorithms on the data, measures their output with the metrics, and writes the results to a file. The outer block, trainTest, tells LensKit that we want to do a train-test evaluation. There are other commands as well, but we'll get to those later.
Input Data
At the beginning of the trainTest block, we have the following:
dataset crossfold("ml-100k") {
    source csvfile("u.data") {
        delimiter "\t"
        domain {
            minimum 1.0
            maximum 5.0
            precision 1.0
        }
    }
}
This piece of code loads the main ratings file from the data set and prepares it for cross-validation.
The first important piece is dataset. It's a directive provided by trainTest that adds a data set to the evaluation. You can have multiple data sets and evaluate on all of them at once. In fact, under the hood that is what this script is doing, because…
The crossfold command takes a data set and partitions it for cross-validation. The result is actually N separate train-test data sets, one for each fold. The crossfold command returns these data sets, and LensKit sees that dataset is being invoked with a list of data sets and adds them all to the evaluation.
The crossfolder operates on a data source. In this case it is a CSV file (actually tab-separated, but LensKit calls all delimited text files CSV files). The file name is u.data, the delimiter is \t, and the ratings are on a 1–5 star scale with a precision of 1 star (the domain block specifies the domain of ratings).
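As noted above, dataset can be given more than one data set. The following is a hypothetical sketch of evaluating two MovieLens data sets in a single trainTest block; the second source's file name and "::" delimiter are assumptions about the MovieLens 1M layout, not something this page verifies:

dataset crossfold("ml-100k") {
    source csvfile("u.data") {
        delimiter "\t"
        domain {
            minimum 1.0
            maximum 5.0
            precision 1.0
        }
    }
}
// hypothetical second data set; adjust the file name and delimiter to your data
dataset crossfold("ml-1m") {
    source csvfile("ratings.dat") {
        delimiter "::"
        domain {
            minimum 1.0
            maximum 5.0
            precision 1.0
        }
    }
}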
Specifying the Algorithms
Next comes a pair of algorithm blocks specifying the algorithms to test:
algorithm("PersMean") {
bind ItemScorer to UserMeanItemScorer
bind (UserMeanBaseline, ItemScorer) to ItemMeanRatingItemScorer
}
algorithm("ItemItem") {
bind ItemScorer to ItemItemScorer
bind UserVectorNormalizer to BaselineSubtractingUserVectorNormalizer
within (UserVectorNormalizer) {
bind (BaselineScorer, ItemScorer) to ItemMeanRatingItemScorer
}
}
Each algorithm has a name (‘PersMean’ and ‘ItemItem’). The algorithm configuration is based on the concept of bindings: binding component interfaces (e.g. ItemScorer) to the desired implementations (e.g. ItemItemScorer for item-item collaborative filtering).
The personalized mean (PersMean) algorithm operates by computing user and item average offsets from the global rating. It implements the prediction rule p(u,i) = μ + bᵢ + bᵤ, where μ is the global mean rating, bᵢ is the difference between the item's mean rating and the global mean, and bᵤ is the mean of the differences between the user's rating for each item and that item's mean. This is done by using UserMeanItemScorer, which scores items using a user average, as the ItemScorer, and telling it to use the item mean rating as the offset from which to compute user means (the UserMeanBaseline).
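As a quick worked example with made-up numbers: if the global mean is μ = 3.5, the item averages 0.3 stars above the global mean (bᵢ = 0.3), and the user tends to rate 0.2 stars below item means (bᵤ = −0.2), then the predicted rating is p(u,i) = 3.5 + 0.3 − 0.2 = 3.6.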
The item-item CF algorithm (ItemItem) uses standard item-item collaborative filtering. This is enabled by choosing ItemItemScorer as the item scorer implementation. It then sets up normalization, normalizing the ratings by subtracting item means prior to computing similarities and scores. This is done by the UserVectorNormalizer, which here is configured to subtract a baseline; the baseline, in turn, is set to the item mean rating. The default settings are used for the rest of the algorithm's parameters, such as the similarity function and neighborhood size.
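If you want to override one of those defaults, the configuration DSL's set directive can be used inside the algorithm block. The following is only a sketch: it assumes the NeighborhoodSize parameter (from org.grouplens.lenskit.knn, which would need to be imported), and the value is arbitrary:

algorithm("ItemItem-50") {
    bind ItemScorer to ItemItemScorer
    bind UserVectorNormalizer to BaselineSubtractingUserVectorNormalizer
    within (UserVectorNormalizer) {
        bind (BaselineScorer, ItemScorer) to ItemMeanRatingItemScorer
    }
    // assumed parameter class; 50 neighbors is an arbitrary illustrative choice
    set NeighborhoodSize to 50
}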
For more on configuring algorithms, see:
- Anatomy of an Algorithm (describes the core components common to many algorithms, as well as LensKit baselines)
- Configuration
- The documentation for various algorithm families in the manual
Metrics
Next, we set up three metrics:
metric CoveragePredictMetric
metric RMSEPredictMetric
metric NDCGPredictMetric
These metrics are each classes in the org.grouplens.lenskit.eval.metrics.predict package. The metric directive takes either a metric instance or a metric class; it will automatically instantiate the class using its default constructor.
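Given that rule, the following two lines should be equivalent ways of adding the RMSE metric; the second simply constructs the instance explicitly:

metric RMSEPredictMetric         // class reference, instantiated via its default constructor
metric new RMSEPredictMetric()   // explicit instance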
Each metric computes some measurement over the recommender's output and adds it to the evaluation output. Each metric can produce multiple measurements that will appear in separate columns in the output file. These metrics produce:
- CoveragePredictMetric: coverage and general counting statistics (you'll usually want to include it). These include:
  - NUsers, the number of users tested
  - NAttempted, the number of predictions attempted
  - NGood, the number of predictions made
  - Coverage, the fraction of attempted predictions actually made
- RMSEPredictMetric: computes the RMSE of predictions with respect to actual user ratings. It computes both per-user (RMSE.ByUser) and global (RMSE.ByRating) RMSE.
- NDCGPredictMetric: computes the nDCG of the prediction output, ranking items by prediction and computing the normalized discounted cumulative gain of this list using the user's rating as each item's gain.
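As a rough sketch of the usual nDCG formulation (the exact discount used by the implementation may differ): the discounted cumulative gain of the predicted ranking is DCG = Σᵢ rᵢ / log₂(max(i, 2)), where rᵢ is the user's rating of the item at rank i, and nDCG divides this by the DCG of the same items sorted in their ideal (rating-descending) order, yielding a value between 0 and 1.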
Output
Not a whole lot here, just a simple output setting:
output "eval-results.csv"
This directs the evaluator to write its output to the file eval-results.csv. This file contains the algorithm name, data set (name and partition), the wall clock time used to build and test the recommender, and the aggregate output of each of the metrics.
You can also set two additional output files:
- userOutput will write a file containing metric results for each test user. Use this if you want to post-process metric results on a user-by-user level.
- predictOutput writes each prediction (and its associated actual rating) to a CSV file. This allows you to compute your own prediction accuracy metrics externally.
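For example, all three output directives could be given together inside the trainTest block; the file names here are arbitrary:

output "eval-results.csv"             // aggregate results, one row per algorithm and fold
userOutput "eval-user-results.csv"    // per-user metric results
predictOutput "eval-predictions.csv"  // individual predictions with their actual ratings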
More about Scripts
The evaluation scripts are actually Groovy scripts, using an embedded domain-specific language (EDSL) for evaluating recommenders provided as a part of the LensKit evaluation framework. Simple scripts look a lot like sectioned key-value configuration files, but if you have more sophisticated evaluation needs, the full power of Groovy is available.
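For instance, because the script is plain Groovy, you can generate algorithm variants in a loop rather than writing each one out by hand. This sketch assumes the NeighborhoodSize parameter mentioned earlier and would go inside a trainTest block:

// hypothetical sketch: item-item variants with different neighborhood sizes
for (n in [20, 30, 50]) {
    algorithm("ItemItem-$n") {
        bind ItemScorer to ItemItemScorer
        bind UserVectorNormalizer to BaselineSubtractingUserVectorNormalizer
        within (UserVectorNormalizer) {
            bind (BaselineScorer, ItemScorer) to ItemMeanRatingItemScorer
        }
        set NeighborhoodSize to n
    }
}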
Running Scripts
Scripts can be run two ways: with the lenskit-eval script in the binary distribution (which invokes the org.grouplens.lenskit.eval.cli.EvalCLI class) or with the run-eval goal in the LensKit Maven plugin.
lenskit-eval is modeled after tools like Make and Ant. If you give it no arguments, it runs the script eval.groovy in the current directory. You can tell it to run a specific script file with the -f command line option.
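For example, to run a script with a different name (the file name here is arbitrary):

lenskit-eval -f my-eval.groovy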
Targets
LensKit eval scripts can also define targets to allow complex evaluations to be run in a piecewise fashion. A target is just like a target in other tools like Ant and make: it is a named sequence of tasks to run. Targets can also depend on other targets.
Here's a rewrite of the script above to use targets:
import org.grouplens.lenskit.knn.item.*
import org.grouplens.lenskit.baseline.*
import org.grouplens.lenskit.transform.normalize.*

// use the target method to define a target
def ml100k = target("crossfold") {
    crossfold("ml-100k") {
        source csvfile("ml-100k/u.data") {
            delimiter "\t"
            domain {
                minimum 1.0
                maximum 5.0
                precision 1.0
            }
        }
    }
}

target("evaluate") {
    // require the crossfold target to be run first
    // can also require it by name
    requires ml100k

    trainTest("item-item algorithm") {
        dataset ml100k

        algorithm("PersMean") {
            bind ItemScorer to UserMeanItemScorer
            bind (UserMeanBaseline, ItemScorer) to ItemMeanRatingItemScorer
        }

        algorithm("ItemItem") {
            bind ItemScorer to ItemItemScorer
            bind UserVectorNormalizer to BaselineSubtractingUserVectorNormalizer
            within (UserVectorNormalizer) {
                bind (BaselineScorer, ItemScorer) to ItemMeanRatingItemScorer
            }
        }

        metric CoveragePredictMetric
        metric RMSEPredictMetric
        metric NDCGPredictMetric

        output "eval-results.csv"
    }
}
defaultTarget "evaluate"
In this version, the actual tasks from before, trainTest and crossfold, are not run immediately. They are run when the targets containing them are run.
If you run lenskit-eval with no arguments, this script will run as before, because it specifies a default target of evaluate. But you can also run just the crossfold step, without the actual recommender evaluation:
lenskit-eval crossfold
The requires directive specifies that the evaluate target depends on the crossfold target (saved in the variable ml100k), so the crossfold target must be run first. You can depend on a target either by name (crossfold) or by object; the target command returns a target object that can be used for this purpose. The object can also be used to access the data returned by its last task: this is why dataset ml100k works, even though ml100k is a target. Its last task is crossfold, which returns a list of data sets, and dataset ml100k arranges for these data sets to be configured once the crossfold target has been run so its output is available.
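If you prefer to depend on the target by name rather than by the saved object, the requires line in the evaluate target would instead read:

requires "crossfold"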
Additional Cross-Folding Options
Crossfolding (the crossfold command) is implemented by CrossfoldTask. It supports several additional directives to control its behavior:
- source: the input data.
- partitions: the number of train-test splits to create.
- holdout N: hold out N items per user.
- holdoutFraction f: hold out a fraction f of each user's items.
- order: specify an ordering for user items prior to holdout. Can be either RandomOrder for random splitting or TimestampOrder for time-based splitting.
- name: a name for the data source, used for referring to the task and the default output names. The string parameter to the crossfold directive, if provided, sets the name.
- train: a format string taking a single integer, specifying the names of the training data output files, e.g. ml-100k.train.%d.csv. The default is name + ".train.%d.csv". The format string is applied to the number of the partition.
- test: same as train, but for the test set.
The crossfold task, when executed, returns a list of TTDataSets representing the different train-test partitions.
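Putting several of these directives together, a crossfold configuration might look like the following sketch. The partition count, holdout size, and file name patterns are arbitrary illustrative choices, and RandomOrder may need to be imported from the crossfold package:

crossfold("ml-100k") {
    source csvfile("ml-100k/u.data") {
        delimiter "\t"
        domain {
            minimum 1.0
            maximum 5.0
            precision 1.0
        }
    }
    partitions 5
    holdout 10
    order RandomOrder
    train "ml-100k.train.%d.csv"
    test "ml-100k.test.%d.csv"
}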
Top-N metrics
The metrics discussed above are all prediction accuracy metrics, evaluating the accuracy of the rating predictor either for ranking items or for predicting the user's rating for individual items. LensKit also supports metrics over recommendation lists; these are called Top-N metrics, though the recommendation list may be generated by some other means.
Configuring a top-N metric is a bit more involved than a prediction accuracy metric. It requires you to specify a few things:
- The length of recommendation list to consider
- The items to consider as candidates for recommendation
- The items to exclude from recommendation
- For some metrics, the items considered ‘good’ or ‘bad’
For example, to compute Top-N nDCG of 10-item lists over all items the user has not rated in the training set:
metric topNnDCG {
    listSize 10
    candidates ItemSelectors.allItems()
    exclude ItemSelectors.trainingItems()
}
As of LensKit 2.0.3, the following Top-N metrics are available:
- topNnDCG: normalized discounted cumulative gain
- topNLength: the actual length of the top-N list (to measure truncated lists due to low coverage)
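topNLength is configured with the same kind of block as topNnDCG; here is a hedged sketch under the assumption that it accepts the same listSize, candidates, and exclude directives:

metric topNLength {
    listSize 10
    candidates ItemSelectors.allItems()
    exclude ItemSelectors.trainingItems()
}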
Dumping graphs
Besides trainTest, the LensKit evaluator also supports a dumpGraph task that writes a GraphViz file diagramming the configuration of an algorithm:
dumpGraph {
    output "graph.dot"
    algorithm("PersMean") {
        bind ItemScorer to UserMeanItemScorer
        bind (UserMeanBaseline, ItemScorer) to ItemMeanRatingItemScorer
    }
}
Further Reading
Read more about how the evaluator works internally in Evaluator Internals.