Report 4 - GeorgeIniatis/Blood_Brain_Barrier_Drug_Prediction GitHub Wiki
Worked on the dataset
Discovered a minor bug, involving the get_pubchem_cid_and_smiles_using_name function, that could introduce some errors in the dataset
Only affected the entries associated with the Gao et al. datasets
Some drug names can have multiple SMILES associated with them and the bug caused the function to only retrieve the first SMILES available
To avoid ambiguity in the dataset, I decided to remove drugs with multiple SMILES
Instead of just fixing the bug and removing the affected entries, I decided to repopulate the whole dataset. This was easy given my existing functions, although it took a couple of hours
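The removal rule described above can be sketched roughly as follows. This is only an illustration, not the repository's actual code: the function name `split_by_smiles_count` and the input mapping are hypothetical, and it assumes the name lookup has already returned every SMILES string found for each drug name. SMILES are compared as plain strings, so two different spellings of the same molecule would count as distinct.

```python
from typing import Dict, List, Tuple


def split_by_smiles_count(
    smiles_by_name: Dict[str, List[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Keep names that map to exactly one distinct SMILES string; drop the rest."""
    kept: Dict[str, str] = {}
    dropped: List[str] = []
    for name, smiles_list in smiles_by_name.items():
        distinct = sorted(set(smiles_list))
        if len(distinct) == 1:
            kept[name] = distinct[0]
        else:
            # Zero or several distinct SMILES: the name cannot be resolved safely.
            dropped.append(name)
    return kept, dropped


# Hypothetical lookup results for illustration, not real dataset entries.
lookup = {
    "drug_a": ["CCO"],
    "drug_b": ["CCO", "CC(=O)O"],
    "drug_c": [],
}
kept, dropped = split_by_smiles_count(lookup)
print(kept)
print(dropped)
```

Returning the dropped names alongside the kept ones makes it easy to log which entries were excluded and why.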
Retrieved indicators from the SIDER database
Produced one hot encodings for both side effects and indicators
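As a minimal sketch of the encoding step, assuming each drug is associated with a set of terms (side effects or indicators), one binary column can be produced per distinct term. The function name `one_hot_encode` and the example terms are hypothetical; the report does not say which library or layout was actually used.

```python
from typing import Dict, List, Set, Tuple


def one_hot_encode(
    terms_by_drug: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return the sorted term vocabulary and one 0/1 row per drug."""
    # The vocabulary is every distinct term seen across all drugs.
    vocabulary = sorted(set().union(*terms_by_drug.values()))
    rows = {
        drug: [1 if term in terms else 0 for term in vocabulary]
        for drug, terms in terms_by_drug.items()
    }
    return vocabulary, rows


# Hypothetical example data, not taken from SIDER.
side_effects = {
    "drug_a": {"nausea", "headache"},
    "drug_b": {"headache"},
    "drug_c": set(),
}
columns, encoded = one_hot_encode(side_effects)
print(columns)
print(encoded)
```

Sorting the vocabulary keeps the column order stable between runs, which matters if the encoded rows are later joined to other tables.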
Performed Automated Google Searches
Performing specific drug-targeted Google searches proved ineffective. The results were irrelevant
Performing different general queries to gather as much data as possible proved ineffective as well. The results were too noisy
Decided to perform site-targeted queries to gather as much reliable data as possible without the need for manual verification
For each query roughly 100 URLs were retrieved and regular expressions were used to extract matches
Google seemed to detect that a bot was being used after roughly 100 URLs were retrieved
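The site-targeted approach could look something like the sketch below. It only builds the query string (using Google's `site:` search operator) and applies a regular expression to text that has already been downloaded; it performs no network requests. The domain, the phrase, the regular expression, and both function names are assumptions made for illustration, since the report does not list the actual queries or patterns.

```python
import re
from typing import Iterable, List


def build_site_query(domain: str, phrase: str) -> str:
    """Build a query restricted to one site, with the phrase quoted for exact matching."""
    return f'site:{domain} "{phrase}"'


def extract_matches(pattern: str, texts: Iterable[str]) -> List[str]:
    """Collect every case-insensitive regex match found in the given page texts."""
    compiled = re.compile(pattern, re.IGNORECASE)
    matches: List[str] = []
    for text in texts:
        matches.extend(m.group(0) for m in compiled.finditer(text))
    return matches


# Hypothetical pattern and page texts, for illustration only.
PATTERN = r"(?:does not cross|crosses) the blood[- ]brain barrier"
pages = [
    "Drug A crosses the blood-brain barrier readily.",
    "No relevant statement here.",
    "Drug B does not cross the blood brain barrier.",
]
query = build_site_query("example.org", "blood-brain barrier")
found = extract_matches(PATTERN, pages)
print(query)
print(found)
```

Keeping query construction and match extraction separate from the download step means the regular expressions can be tested on saved pages without issuing new searches.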