Report 4 - GeorgeIniatis/Blood_Brain_Barrier_Drug_Prediction GitHub Wiki

  • Worked on the dataset
    • Discovered a minor bug in the get_pubchem_cid_and_smiles_using_name function that could introduce errors into the dataset
    • The bug only affected the entries associated with the Gao et al. datasets
    • Some drug names have multiple SMILES associated with them, and the bug caused the function to retrieve only the first SMILES available
    • To remove uncertainty from my dataset, I decided to remove drugs with multiple SMILES
    • Instead of just fixing the mistake and removing the affected entries, I decided to repopulate the whole dataset. This was easy given my existing functions, but it took a couple of hours
    • Retrieved indications from the SIDER database
    • Produced one-hot encodings for both side effects and indications
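
The ambiguity filter and the one-hot encoding steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `drop_ambiguous_names` and `one_hot_encode`, the input shapes and the sample values are all assumptions made for the example, and the real pipeline obtains its SMILES from PubChem rather than from a hard-coded dictionary.

```python
def drop_ambiguous_names(smiles_by_name):
    """Keep only names that map to exactly one distinct SMILES string.

    `smiles_by_name` is a hypothetical mapping: drug name -> list of SMILES
    candidates returned by a lookup. Names with zero or several distinct
    candidates are dropped.
    """
    unambiguous = {}
    for name, candidates in smiles_by_name.items():
        distinct = set(candidates)
        if len(distinct) == 1:
            unambiguous[name] = next(iter(distinct))
    return unambiguous


def one_hot_encode(terms_by_drug):
    """Return (sorted vocabulary, {drug: 0/1 vector}) for the given terms."""
    vocabulary = sorted({t for terms in terms_by_drug.values() for t in terms})
    vectors = {
        drug: [1 if term in terms else 0 for term in vocabulary]
        for drug, terms in terms_by_drug.items()
    }
    return vocabulary, vectors


# Illustrative placeholder values only; not taken from the real dataset.
lookups = {
    "drug_a": ["CCO"],
    "drug_b": ["CCN", "CCCN"],  # ambiguous: two different SMILES
    "drug_c": [],               # nothing found
}
kept = drop_ambiguous_names(lookups)

side_effects = {"drug_a": {"nausea", "headache"}, "drug_d": {"nausea"}}
vocabulary, vectors = one_hot_encode(side_effects)
```

Sorting the vocabulary keeps the column order stable between runs, so vectors produced at different times stay comparable.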
  • Performed Automated Google Searches
    • Drug-specific Google searches proved ineffective: they returned irrelevant results
    • General queries aimed at gathering as much data as possible proved ineffective as well: the results were too noisy
    • Decided to perform site-targeted queries to gather as much reliable data as possible without the need for manual verification
    • For each query roughly 100 URLs were retrieved and regular expressions were used to retrieve matches
    • Google seemed to detect that a bot was being used after roughly 100 URLs had been retrieved
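
A site-targeted query and the regular-expression matching step might look like the sketch below. It only builds the query string (using Google's `site:` search operator) and scans text that has already been downloaded; it deliberately performs no searching or scraping. The function names, the example domain, the pattern and the sample text are hypothetical and are not taken from the project.

```python
import re


def build_site_query(site, keywords):
    """Build a search query restricted to one site via the `site:` operator."""
    return f"site:{site} " + " ".join(keywords)


def find_matches(pattern, text):
    """Return every non-overlapping match of `pattern` in `text`, ignoring case."""
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


query = build_site_query("example.org", ["drug_a", "blood-brain barrier"])

# Hypothetical pattern: phrases stating whether something crosses the barrier.
PATTERN = r"(?:does not cross|crosses) the blood[- ]brain barrier"

# Placeholder text standing in for the content of one retrieved page.
page_text = (
    "Drug_a crosses the blood-brain barrier. "
    "Drug_b does not cross the blood brain barrier."
)
matches = find_matches(PATTERN, page_text)
```

Restricting each query to a single trusted site is what lets the matches be used without manual verification; the pattern itself would need tuning for each site's wording.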
  • Questions/Topics to discuss
    • Next steps. ML tutorials and material?