Machine Learning Journal - GeorgeIniatis/Blood_Brain_Barrier_Drug_Prediction GitHub Wiki

Experiment 1: Try to build Classification models

  • Classification models will make use of the Class as the label
  • Two different model categories:
    • Category 1: Models with just the Chemical Descriptors used as features
    • Category 2: Models with Chemical Descriptors, Side Effects and Indications used as features (does the addition of Side Effects and Indications to the Chemical Descriptors improve our predictive performance?)
  • Training sets:
    • For category 1 the whole dataset will be used, excluding the entries used in the Test set
    • For category 2 a subset of the dataset will be used, those entries that have Side Effects and Indications available, again excluding the entries used in the Test set
  • Test set:
    • Will be used to compare the models against each other
    • 20% subset of the dataset entries that have Chemical Descriptors, Side Effects and Indications available. This allows us to compare the performance of the two different categories of models using the same test set (see the split sketch after this list)
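
A minimal sketch of how the Experiment 1 split could be set up with scikit-learn. The CSV path, the "Class" column and the Side Effects/Indications availability flag are placeholders, not the actual dataset schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("dataset.csv")  # placeholder path

# Entries that have Side Effects and Indications available (placeholder flag column)
with_se_ind = dataset[dataset["has_side_effects_and_indications"] == 1]

# Hold out 20% of these entries as the shared test set, stratified on the class label
remaining_se_ind, test_set = train_test_split(
    with_se_ind, test_size=0.2, stratify=with_se_ind["Class"], random_state=42
)

# Category 1 training set: the whole dataset minus the test set entries
category_1_train = dataset.drop(index=test_set.index)

# Category 2 training set: only entries with Side Effects and Indications, minus the test set
category_2_train = remaining_se_ind
```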

Experiment 2: Try to build Regression models

  • Regression models will make use of the LogBB as the label
  • Training set:
    • A subset of the dataset will be used, those entries that have LogBB available, again excluding the entries used in the Test set
  • Test set:
    • Will be used to compare the models against each other
    • 20% subset of the training set (see the split sketch after this list)
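
A similar sketch for the Experiment 2 split, assuming a "LogBB" column and reusing the dataset frame from the sketch above; entries without a LogBB value are dropped before the 20% hold-out.

```python
from sklearn.model_selection import train_test_split

# Keep only entries with LogBB available (column name assumed)
with_logbb = dataset[dataset["LogBB"].notna()]

# 20% hold-out test set for the regression models
regression_train, regression_test = train_test_split(
    with_logbb, test_size=0.2, random_state=42
)
```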

Experiment 3: Try to find the most relevant Side Effects and Indications

  • Using RFECV (Recursive Feature Elimination with Cross-Validation); a sketch follows below
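
A hedged sketch of what the RFECV step could look like; X_side_effects_indications and y are placeholder names for the Side Effect/Indication feature matrix and the class labels, and Logistic Regression is only one possible base estimator.

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Recursively drop the least important features, scoring each subset with cross-validation
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000, class_weight="balanced"),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="f1",
)
selector.fit(X_side_effects_indications, y)

# Columns flagged as relevant by the selector
relevant_features = X_side_effects_indications.columns[selector.support_]
```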

Models

  • Classification:
    • Dummy Classifier
    • Logistic Regression
    • Support Vector Classifier
    • K-Nearest Neighbour Classifier
    • Random Forest Classifier
    • Decision Tree Classifier
    • Stochastic Gradient Descent Classifier
  • Regression:
    • Dummy Regressor
    • Linear Regression
    • Support Vector Regression
    • K-Nearest Neighbour Regressor
    • Random Forest Regressor
    • Decision Tree Regressor
    • Stochastic Gradient Descent Regressor
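
Not prescriptive, but one way to gather the scikit-learn implementations of the models listed above into dictionaries so the experiments can loop over them; hyperparameters are left at their defaults here and would be tuned later with cross-validation.

```python
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import (LinearRegression, LogisticRegression,
                                  SGDClassifier, SGDRegressor)
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

classification_models = {
    "Dummy Classifier": DummyClassifier(strategy="stratified"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "K-Nearest Neighbour Classifier": KNeighborsClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "SGD Classifier": SGDClassifier(),
}

regression_models = {
    "Dummy Regressor": DummyRegressor(),
    "Linear Regression": LinearRegression(),
    "Support Vector Regression": SVR(),
    "K-Nearest Neighbour Regressor": KNeighborsRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "Decision Tree Regressor": DecisionTreeRegressor(),
    "SGD Regressor": SGDRegressor(),
}
```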

Metrics

  • Will not rely on just one metric, as doing so can lead to badly mistaken conclusions about a model's performance
  • Classification Models
    • Sensitivity/Recall:
      • How many of the actual positives are labelled as positive by our model
      • tp / (tp + fn)
    • Precision:
      • How many of the positive predictions were actually positive
      • tp / (tp + fp)
    • F1 Score:
      • Harmonic mean of precision and recall
      • Other versions (e.g. F-beta scores) add more or less weight to precision or recall
    • Matthews correlation coefficient
    • Others that could be used:
      • ROC curve & AUC
      • PR curve (Better for class imbalance)
    • What do we care about most? False Positives or False Negatives?
  • Regression Models
    • Negated Mean Absolute Error
    • R2
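
A sketch of how several of these metrics could be reported side by side with scikit-learn; y_test, y_pred, y_scores and their regression counterparts are placeholders for the held-out labels, predictions and decision scores.

```python
from sklearn.metrics import (f1_score, matthews_corrcoef, mean_absolute_error,
                             precision_score, r2_score, recall_score,
                             roc_auc_score)

classification_report = {
    "Recall": recall_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred),
    "F1": f1_score(y_test, y_pred),
    "MCC": matthews_corrcoef(y_test, y_pred),
    "ROC AUC": roc_auc_score(y_test, y_scores),  # needs probabilities or decision scores
}

regression_report = {
    "Negated MAE": -mean_absolute_error(y_test_reg, y_pred_reg),
    "R2": r2_score(y_test_reg, y_pred_reg),
}
```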

Evaluation

  • Dummy models
  • Test set for each experiment
  • Permutation testing for model robustness
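
For the permutation testing point, scikit-learn provides permutation_test_score, which refits the model on label-shuffled copies of the training data; model, X_train and y_train are placeholders from the earlier sketches.

```python
from sklearn.model_selection import StratifiedKFold, permutation_test_score

score, permutation_scores, p_value = permutation_test_score(
    model, X_train, y_train,
    cv=StratifiedKFold(n_splits=5),
    scoring="f1",
    n_permutations=100,
)
# A small p_value suggests the score is unlikely to have arisen by chance
```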

Common Practices

  • Data will be scaled
  • Some data exploration will be performed
  • Cross validation will be used to find the best hyperparameters for our models
  • Multiple metrics will be reported for each of our models
  • The models will take the class imbalance into account
  • The test sets will be stratified, preserving the class imbalance, and will be used to reach appropriate conclusions (a combined sketch follows this list)
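
A combined sketch of these practices, assuming one of the classifiers above: scaling inside a pipeline (so the test set is never used to fit the scaler), class weighting for the imbalance, and stratified cross-validation to pick hyperparameters. The parameter grid is purely illustrative.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(class_weight="balanced")),
])

search = GridSearchCV(
    pipeline,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
    cv=StratifiedKFold(n_splits=5),
    scoring="f1",
)
search.fit(X_train, y_train)  # X_train / y_train are placeholders
best_model = search.best_estimator_
```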

Things to Investigate Further

  • Regularisation
  • Feature selection
  • Lasso Regression + Ridge Regression = Elastic Net
  • Pearson correlation
  • Chi square correlation
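
Rough sketches of the items above, as a starting point for the follow-up work; X_train, y_train and their regression counterparts are placeholders, and "descriptor" stands in for any single chemical descriptor column.

```python
from scipy.stats import pearsonr
from sklearn.feature_selection import chi2
from sklearn.linear_model import ElasticNet

# Elastic Net combines the L1 (Lasso) and L2 (Ridge) penalties
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train_reg, y_train_reg)

# Pearson correlation between one descriptor and LogBB
r, p = pearsonr(X_train_reg["descriptor"], y_train_reg)

# Chi-square scores between non-negative features (e.g. binary side effects) and the class
chi2_scores, chi2_p_values = chi2(X_train, y_train)
```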