Report 10 - GeorgeIniatis/Blood_Brain_Barrier_Drug_Prediction GitHub Wiki

Predicting drugs that can cross the blood-brain-barrier

George Iniatis

2329642

Proposal

Motivation

The brain is surrounded by a semi-permeable boundary, the blood-brain barrier, that prevents many pathogens from getting in. However, it can also stop many useful drugs from entering the brain. This is especially important when trying to deliver critical therapeutics, such as chemotherapy, to brain tumours. Accurately predicting whether a drug can easily cross the blood-brain barrier is therefore a valuable tool for developing and testing new drugs for a variety of diseases.

Aims

This project aims to gather publicly available data on drugs known to cross into the brain and those that cannot, and to combine them into a new dataset. Using this newly created dataset, the project will then build a machine learning system that uses a drug’s chemical structure to predict whether it can pass through the blood-brain barrier.

Progress

  • Read 5 academic papers discussing how other people have solved the same problem using a variety of strategies and methods.
  • Created the dataset
    • Used a very small subset of important chemical descriptors as well as the drug/compound side effects and indications
    • Chemical descriptors were retrieved from PubChem
    • Side effects and indications were retrieved from SIDER
    • Used the datasets already discovered in the background research and augmented them where necessary, calculating descriptors, finding side effects, etc.
    • Used the APIs offered by PubMed and Springer to add even more compounds and drugs to our dataset, mainly those that cannot pass into the brain, to correct the class imbalance discovered early on
    • The whole process is described in detail in the Dataset Creation Journal
  • Laid out experiments to be conducted
    • Experiment 1: Chemical Descriptors Classification vs Chemical Descriptors Regression
    • Experiment 2: Does the addition of Side Effects and Indications to the Chemical Descriptors improve our predictive performance?
    • More details can be found in the Machine Learning Journal
  • Created a Datalore notebook (JetBrains equivalent of Jupyter Notebooks) to do some data exploration, produce plots and build the ML models.
    • Split our dataset into subsets based on the experiments mentioned above
    • Produced PCA, t-SNE and UMAP plots
    • Built a very basic Logistic Regression model
    • Notebook Link
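As a minimal sketch of the baseline mentioned above (synthetic stand-in data, not the real dataset; scikit-learn pipeline choices are illustrative), a basic logistic regression classifier on chemical descriptors might look like:

```python
# Minimal baseline sketch: logistic regression on chemical descriptors.
# The data here is a synthetic stand-in for the real PubChem descriptors.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: 200 compounds, 8 numeric descriptors,
# binary label (1 = crosses the BBB, 0 = does not).
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Scaling matters when descriptors have very different ranges
# (e.g. molecular weight vs. LogP).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
```

The pipeline keeps scaling inside the model so the same preprocessing is applied at train and test time, which avoids leaking test-set statistics into training.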

Problems and risks

Problems

Dataset Problems (Copied directly from the Dataset Creation Journal)

  • Some SMILES strings include special characters (such as "/") that, even when URL-encoded, alter the SMILES itself
    • Solved by using POST requests to the PubChem API, as suggested by its documentation
  • Complexity issues: some algorithms took too long to run
    • Solved through code refactoring and by making use of binary searches where possible
  • Discovered a bug with one of the functions, get_pubchem_cid_and_smiles_using_name
    • Only affected the entries associated with the Gao et al. datasets
    • Some drug names can have multiple SMILES associated with them and the bug caused the function to only retrieve the first SMILES available
    • Fixed through some code refactoring: the titles of the compounds retrieved from PubChem are now also returned and compared with the drug name currently being searched; on a match, that specific compound's CID and SMILES are returned
    • Had to repopulate the whole dataset which took a couple of hours
  • While performing automated Google searches, a "429: Too Many Requests" error would appear after roughly 100 search results had been retrieved, most likely because the website detected that a bot was scraping data
    • Multiple queries were used to gather as much data as possible until that error was thrown
    • In the end we decided against using Google searches directly and made use of the APIs offered by PubMed and Springer for a more targeted approach
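The POST-request fix above can be sketched as follows. This is a hedged example, not the project's actual code: the SMILES string is a made-up illustration, and the URL assumes PubChem's standard PUG REST endpoint for SMILES-to-CID lookups. The request is only constructed here, not sent.

```python
# Sketch of the fix: send the SMILES in a POST body instead of the URL.
import urllib.parse
import urllib.request

smiles = "C(/C=C/Cl)N"  # hypothetical SMILES containing "/"

# Putting the SMILES in the URL path requires percent-encoding, which can
# mangle characters like "/"; a POST body carries it verbatim instead.
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/JSON"
data = urllib.parse.urlencode({"smiles": smiles}).encode()

req = urllib.request.Request(url, data=data)  # supplying data= makes it a POST
# urllib.request.urlopen(req) would perform the actual lookup (not run here)
```

With the form-encoded body, PubChem parses the SMILES exactly as written, which sidesteps the URL-encoding problem entirely.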

General Problems

  • First time working with Machine Learning in such depth
    • The Machine Learning course definitely helped
    • Watched multiple scikit-learn and ML tutorials to get up to speed

Risks

  • A large percentage of the dataset is made up of previous datasets created by other researchers, so our current dataset is only as good as those it builds upon
  • The matches returned from the APIs offered by PubMed and Springer had to be manually verified, and any human-validated data is bound to contain at least a few errors

Plan

Semester 2

  • Week 1-2: Developing all the models
    • Deliverable: A variety of models for our different experiments
  • Week 3: Comparing/Evaluating the models
    • Deliverable: Data and statistics gathered using multiple evaluation techniques to check the robustness of our models
  • Week 4-5: Produce a system that makes use of the best performing models
    • Deliverable: At the very least a notebook
  • Week 6-10: Writing Dissertation
    • Deliverable: First draft submitted to supervisor two weeks before the deadline
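One evaluation technique suitable for the Week 3 comparison step is stratified k-fold cross-validation, which is useful here because the dataset's class imbalance makes a single train/test split unreliable. A minimal sketch (synthetic stand-in data; model choice illustrative):

```python
# Illustrative robustness check: stratified 5-fold cross-validation.
# Stratification keeps the BBB+/BBB- class ratio the same in every fold,
# which matters for an imbalanced dataset like ours.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

mean_score, std_score = scores.mean(), scores.std()
```

Reporting the mean together with the standard deviation across folds gives a rough sense of how stable each model is, which is the point of the robustness comparison.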