Leave-Cluster-Out Crossvalidation is necessary for Scoring Functions derived on Diverse Protein Datasets
NIBR
In pharmaceutical research we are interested in the free energy of binding, a measure of the quality of a potential drug. There are various ways to calculate free energies. The simplest are scoring functions, which relate simple structural features of a ligand in its binding pocket to the measured free energy. Recently a new scoring function has been published that is based on atom pair counts: for each ligand atom, all protein atoms of a given type within a 12 Å sphere around that atom are counted. This gives 4 × 9 = 36 features: [C,N,O,S] in the protein times [C,N,O,F,P,S,Cl,Br,I] in the ligand. These features (or descriptors) are then correlated with the free energy of binding. The training set for this scoring function is a collection of protein-ligand crystal structure complexes for which measured free energies are available (the PDBbind database).
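For illustration, a minimal sketch of such a pair-count calculation is given below. The function name `pair_counts` and the data frames `protein` and `ligand` (columns `x`, `y`, `z`, `elem`) are assumptions for this example, not part of the published code:

```r
# Minimal sketch of the atom-pair-count descriptor (illustrative only).
# Assumes data frames 'protein' and 'ligand' with coordinates x, y, z
# and the element symbol in 'elem'.
pair_counts <- function(protein, ligand, cutoff = 12) {
    prot_types <- c("C", "N", "O", "S")
    lig_types  <- c("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")
    counts <- matrix(0, length(prot_types), length(lig_types),
                     dimnames = list(prot_types, lig_types))
    for (i in seq_len(nrow(ligand))) {
        lt <- as.character(ligand$elem[i])
        if (!lt %in% lig_types) next                       # ignore other elements
        d2 <- (protein$x - ligand$x[i])^2 + (protein$y - ligand$y[i])^2 +
              (protein$z - ligand$z[i])^2
        near <- as.character(protein$elem)[d2 <= cutoff^2] # atoms in the 12 A sphere
        for (pt in prot_types)
            counts[pt, lt] <- counts[pt, lt] + sum(near == pt)
    }
    as.vector(counts) # the 36 descriptors for one complex
}
```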
Despite its simplicity and its neglect of established physics such as quantum mechanical effects and specific atom environments, this scoring function has been remarkably successful at predicting free energies on a standard test set. The test set is a subset of the training set consisting of the three most different members of each protein family with at least four members. Thus for each sample in the test set there is at least one structurally similar sample in the training set.
In the present study it is shown that the prediction performance decreases drastically when the scoring function is built without any samples from the same family in the training set. The first plot shows predicted versus measured values after leave-cluster-out crossvalidation (i.e. leave-family-out crossvalidation) for each protein family with at least ten members (A-W), together with the pooled families with 4-9 members (X), 2-3 members (Y), and single members (Z).
The distribution of measured and predicted values shows that the spread of the predicted values is much narrower than that of the measured values, and the averages do not always agree. This illustrates that in most cases the samples within a cluster are not well differentiated. Additionally, the overall spread of activities is much larger than the spread within each cluster.
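This can be quantified directly from the `sum_preds` data frame assembled by the validation loop below, for example by comparing per-cluster standard deviations (a small sketch, to be run after the main script):

```r
# Per-cluster spread of measured vs. predicted activities
spread <- aggregate(cbind(measured, predicted) ~ cluster, data = sum_preds, FUN = sd)
print(spread)                  # the predicted column is systematically narrower
print(sd(sum_preds$measured))  # overall spread of the measured activities, for comparison
```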
The predictive performance in our stricter evaluation is much worse (overall averaged R² = 0.21) than the performance obtained with the standard procedure on the standard test set without leave-cluster-out crossvalidation (R² = 0.59). An analysis of the location of the different clusters in descriptor space clearly shows that it is possible to distinguish families by such simple descriptors. This is illustrated below by mapping the location of the families to two dimensions with multidimensional scaling: different families cluster in different regions.
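The averaged R² can be reproduced from `sum_preds` along the following lines (a sketch run after the main script; the exact averaging used for the published number may differ in detail):

```r
# Squared Pearson correlation within each cluster, averaged over clusters A-W
r2 <- sapply(split(sum_preds, sum_preds$cluster),
             function(d) cor(d$measured, d$predicted)^2)
print(mean(r2[LETTERS[1:23]])) # clusters A-W
```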
A random forest model is flexible enough to recognize protein families and to assign the free energy based on the cluster. However, this is a special kind of overfitting and clearly not the intended use of a scoring function. Leave-cluster-out crossvalidation is therefore necessary for these kinds of scoring functions.
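That the model really can recognize families is easy to verify: train a classification forest on the cluster labels instead of the activities. This quick check (not part of the original study) reuses the data_mat object assembled in the multidimensional-scaling section of the script below:

```r
# Can the 36 descriptors alone identify the protein family (clusters A-W)?
library(randomForest)
rf_fam <- randomForest(data_mat[, 1:36], factor(data_mat[, 37]), ntree = 500)
print(rf_fam) # a low OOB error rate means the families are separable in descriptor space
```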
The code below calculates all the random forest models and generates the three plots shown above. We intentionally do not give the plots titles; the more detailed explanations belong in the text or the figure captions.
##################################################
# Leave-cluster-out validation script for random forests in R
#
# Executed with the PDBbind09 dataset
##################################################
library(randomForest)
rawdata = read.csv("PDBbind_refined09_12A.csv", na.strings=c(".", "NA", "", "?"))
itarget = 1 # column of target in table
nfeats = ncol(rawdata) - 2 # target (column 1) plus the 81 features; excludes pdb code and cluster ID (the last two columns)
ndata = nrow(rawdata) # number of pdb complexes for training
itotal = seq(1,ndata,1)
totaldata = rawdata[ itotal, 1:nfeats]
rownames(totaldata) = as.character(rawdata[itotal, nfeats + 1]) # use the pdb codes as row names
Clusters = levels(factor(rawdata[[ncol(rawdata)]])) # Get cluster IDs (robust whether or not strings were read as factors)
# Select only the nonzero features (36 of the 81 atom-pair counts). The indices
# refer to the feature columns; since column 1 of totaldata is the target, they
# are shifted by one (without the shift the target itself would leak into the features)
selfeats = c(1:3,6, 10:12, 15, 19:21, 24, 28:30, 33, 37:39, 42, 46:48, 51, 55:57, 60, 64:66, 69, 73:75, 78) + 1
#################################################
# Iterate over each cluster #
#################################################
if (exists("sum_preds")) rm(sum_preds) # start with a clean prediction accumulator
for (cluster in Clusters) {
    # Assemble the test and training sets: all complexes of the current
    # cluster form the test set, all remaining complexes the training set
    in_cluster = rawdata[, ncol(rawdata)] == cluster
    testset = totaldata[in_cluster, ]
    trainset = totaldata[!in_cluster, ]
    ntrndata = nrow(trainset) # number of pdb complexes for training
    ntstdata = nrow(testset)  # number of pdb complexes for testing
    seed = 1
    itrain = seq(1, ntrndata, 1)
    nsample = ntrndata
    set.seed(seed)
    itrain = sample(itrain)[1:nsample] # shuffle the training complexes (not strictly necessary)
    trainTarget = trainset[itrain, itarget]
    trainFeats = trainset[itrain, selfeats]
    itest = seq(1, ntstdata, 1)
    testTarget = testset[itest, itarget]
    testFeats = testset[itest, selfeats]
    ##################################################
    # Calculate Random Forest                        #
    ##################################################
    RF_Score = randomForest(trainTarget ~ .,
                            data = trainFeats,
                            nodesize = 5,
                            ntree = 500,
                            mtry = 36, # must not exceed the 36 descriptors; the original mtry=40 is silently capped by randomForest
                            na.action = na.omit)
    rmse_OOB = round(sqrt(mean((RF_Score$y - RF_Score$predicted)^2)), digits = 3)
    OOB_corr_pear = format(round(cor(RF_Score$y, RF_Score$predicted), digits = 3))
    OOB_corr_spear = format(cor(RF_Score$y, RF_Score$predicted, method = "spearman"), digits = 3)
    ##################################################
    # RF-SCORE PREDICTION OF TEST DATASET            #
    ##################################################
    testPred = predict(RF_Score, newdata = testFeats)
    rmse = format(sqrt(mean((testTarget - testPred)^2)), digits = 3)
    sdev = format(sd(testTarget - testPred), digits = 3)
    fitpoints = na.omit(cbind(testTarget, testPred))
    fitcorr = format(round(cor(fitpoints[,1], fitpoints[,2]), digits = 3)) # Pearson correlation coefficient (R)
    sprcorr = format(cor(fitpoints[,1], fitpoints[,2], method = "spearman"), digits = 3) # Spearman rank correlation
    print(paste("Cluster", cluster, "RMSE_OOB=", rmse_OOB, "R(pe)=", OOB_corr_pear, "R(sp)=", OOB_corr_spear,
                "RMSE_vali=", rmse, "R(pe)=", fitcorr, "R(sp)=", sprcorr))
    preds = data.frame(identifier = rownames(testset),
                       measured = testTarget,
                       predicted = testPred,
                       cluster = rep(cluster, times = ntstdata))
    # Collect the predictions of all clusters in one data frame
    if (exists("sum_preds")) {
        sum_preds = rbind(sum_preds, preds)
    } else {
        sum_preds = preds
    }
}
# Evaluate results for clusters A-W only; their predictions form the first 693 rows of sum_preds
resM1 <- sum_preds[1:693, c(1,2,4)] # measured activities
resP1 <- sum_preds[1:693, c(1,3,4)] # predicted activities
colnames(resM1) <- c("identifier", "activity", "cluster")
colnames(resP1) <- c("identifier", "activity", "cluster")
resM1$type <- rep('measured', nrow(resM1))
resP1$type <- rep('predicted', nrow(resP1))
result1 <- rbind(resM1, resP1)
########################################
# Create Plots
########################################
library(ggplot2)
#############
# Create a plot that shows measured versus predicted activities for all the clusters
#############
lim <- range(result1$activity)
p <- qplot(measured, predicted, data = sum_preds) + facet_wrap(~ cluster) + xlim(lim[1], lim[2]) + ylim(lim[1], lim[2])
# theme()/element_blank() replace the long-removed opts()/theme_blank()
p + geom_abline(intercept = 0, slope = 1) + theme(panel.grid.minor = element_blank(), aspect.ratio = 1)
############
# Create a plot that compares the distribution of measured and predicted activities across Clusters A-W
############
p <- ggplot(result1, aes(factor(cluster), activity))
p + geom_boxplot(aes(fill = factor(type))) + xlab("Cluster") + ylab("Activity [pKi|pKd]") + theme(legend.position = "none")
#############################################
# Project the descriptor space into two dimensions by multidimensional
# scaling and plot the distribution of the clusters
#############################################
keepColumns <- c(2:82, 84) # the 81 feature columns plus the cluster ID
X <- rawdata[, keepColumns]
keepRows <- c(1:693) # keep clusters A-W; removes the X, Y, and Z blocks
X <- X[keepRows, ]
selfeats = c(1:3,6, 10:12, 15, 19:21, 24, 28:30, 33, 37:39, 42, 46:48, 51, 55:57, 60, 64:66, 69, 73:75, 78, 82) # the 36 nonzero features (unshifted, since X starts at the first feature) plus the cluster ID (column 82)
X = X[selfeats]
x_new = data.frame(scale(X[1:36], center = TRUE, scale = TRUE)) # standardize the 36 descriptors
ids = data.frame(X[37]) # cluster IDs
data_mat = cbind(x_new, ids)
loc <- cmdscale(dist(data_mat[, 1:36]), 2) # classical multidimensional scaling to 2 dimensions
loc = data.frame(loc)
loc = cbind(loc, data_mat[37])
colnames(loc) <- c("X", "Y", "ClusterID")
##############
# Create a plot that shows the relative location of families after Multidimensional Scaling
##############
p <- qplot(X, Y, data = loc)
p + facet_wrap( ~ ClusterID) + xlab("X1(MDS) [Euclidean distance]") + ylab("X2(MDS) [Euclidean distance]")
The data set necessary to rebuild this study can be obtained and assembled from:
1.) the code in the supporting information of the original publication on the scoring function,
2.) the PDBbind database, containing all the crystal structure information, and
3.) the cluster assignment published in the supporting information of the study summarized above (accepted under the same title at the Journal of Chemical Information and Modeling).