Dataset Creation Journal

Plan

  • Use all available datasets and augment them where necessary, calculating descriptors, etc.
  • Use PubMed's E-Utilities API and Springer Nature's API to gather BBB- compounds and drugs that are hopefully not already in the dataset, in an effort to reduce the class imbalance
  • Use a subset of descriptors, as in Saber et al.
    • This essentially means keeping only the SMILES string, the BBB permeability and the logBB (when available) from the different datasets found
  • Use the PubChem API to get the Name, PubChem_CID, MW, TPSA, XLogP, NHD, NHA, NRB for all compounds (see the sketch after this list)
  • Use the SIDER database to get SIDER_CID, Side_Effects and Indications
    • Decided to only use the PT (Preferred Term) side effects and indications to keep everything simpler. The LLT (Lowest Level Term) side effects are taken from drug labels but can be overly detailed; multiple LLTs can be simplified into a single PT.
  • Recalculate BBB permeability as 1 for BBB+ and 0 for BBB- using the thresholds suggested by Li et al.
    • BBB+ if LogBB >= -1
    • BBB- if LogBB < -1
  • Remove any duplicates, unknown compounds and compounds without all chemical descriptors available
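
The descriptor retrieval and logBB thresholding planned above could look roughly like the sketch below. It assumes the PubChem PUG REST properties endpoint and uses hypothetical helper names; it is an illustration, not the project's actual code.

```python
import requests

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
# PUG REST property names corresponding to MW, TPSA, XLogP, NHD, NHA, NRB
PROPERTIES = ("MolecularWeight,TPSA,XLogP,"
              "HBondDonorCount,HBondAcceptorCount,RotatableBondCount")


def fetch_descriptors_by_cid(cid):
    """Hypothetical helper: fetch the plan's chemical descriptors for one PubChem CID."""
    url = f"{PUG_REST}/compound/cid/{cid}/property/{PROPERTIES}/JSON"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()["PropertyTable"]["Properties"][0]


def bbb_class_from_logbb(logbb):
    """Li et al. thresholds: BBB+ (1) if logBB >= -1, otherwise BBB- (0)."""
    return 1 if logbb >= -1 else 0


print(fetch_descriptors_by_cid(2244))  # aspirin
print(bbb_class_from_logbb(-0.5))      # 1 -> BBB+
```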

Process

  • Started with the Singh et al. dataset and added on top of it the Zhao et al., Gao et al. and Zhang et al. datasets
    • All columns were removed except those containing the SMILES string, drug Name, experimental logBB value and BBB permeability (Class)
    • When the experimental logBB was available, BBB permeability was recalculated based on the threshold mentioned above
    • When SMILES was available, it was used to retrieve the PubChem_CID, MW, TPSA, XLogP, NHD, NHA, NRB, Synonyms and the drug Name (essentially the first synonym) using the PubChem API
    • When SMILES wasn't available but the drug Name was, the Name was used to retrieve the PubChem_CID and SMILES, and from there all the mentioned descriptors and variables were retrieved
    • Once the synonyms were retrieved for a specific drug or compound, they were looked up in the SIDER dataset. If a synonym was found in the SIDER dataset, we retrieved the SIDER_CID and the associated Side_Effects and Indications (see the SIDER lookup sketch after this list)
  • Used PubMed's E-Utilities API to get abstracts from PubMed and academic papers from PubMed Central that matched multiple queries pointing to a negative brain permeability, in an effort to reduce the class imbalance discovered early on (see the E-Utilities sketch after this list).
    • The various paragraphs of abstracts and academic papers were extracted using XML parsing
    • The sentences were then extracted and multiple regular expressions were used to find matches
    • Matches were then loaded into Excel files and manually verified
    • The resulting Excel files produced from these searches can be found on the repo
    • PubMed API searches produced 15 manually verified compounds and drugs (14 BBB-, 1 BBB+) from 35 matches
    • PubMed Central API searches produced 91 manually verified compounds and drugs (91 BBB-) from 361 matches
  • Used Springer Nature's API to get abstracts, articles and journals and followed the same process
    • Springer Meta V2 API searches produced 42 manually verified compounds and drugs (41 BBB-, 1 BBB+) from 108 matches
    • Springer Open Access API searches produced 109 manually verified compounds and drugs (106 BBB-, 3 BBB+) from 491 matches
  • Duplicates, unknown compounds and compounds without all chemical descriptors available were removed
  • Compounds and drugs that returned multiple PubChem_CIDs and SMILES were removed
  • The DOI of each academic paper was provided as the source for compounds and drugs. When it wasn't available, a link to PubMed or PubMed Central was provided as the source instead
  • The dataset was sorted based on drug name
  • The dataset before removing any duplicates or unknown compounds contained 3748 compounds and drugs
  • The dataset after removing duplicates and unknown compounds currently stands at 2396 compounds and drugs
    • 1751 BBB+
    • 645 BBB-
    • 345 entries with side effects and indications
  • Noticed a diminishing-returns trend with each dataset added: only a small number of new compounds and drugs were discovered
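
A rough sketch of the SIDER synonym lookup described above, assuming the SIDER flat files (drug_names.tsv and meddra_all_se.tsv) with the column layout from their README; the file names, paths and column order are assumptions, and the real matching code lives in the repo. Indications could be handled the same way with meddra_all_indications.tsv.

```python
import pandas as pd

# Column layouts assumed from the SIDER README; verify against the files actually used.
drug_names = pd.read_csv("drug_names.tsv", sep="\t",
                         names=["sider_cid", "name"])
side_effects = pd.read_csv("meddra_all_se.tsv", sep="\t",
                           names=["sider_cid_flat", "sider_cid_stereo",
                                  "umls_label", "meddra_type",
                                  "umls_meddra", "side_effect"])


def sider_lookup(synonyms):
    """Hypothetical helper: return (SIDER_CID, PT side effects) for the first
    synonym found in SIDER, or (None, []) when no synonym matches."""
    names_lower = drug_names.assign(name_lower=drug_names["name"].str.lower())
    for synonym in synonyms:
        match = names_lower[names_lower["name_lower"] == synonym.lower()]
        if match.empty:
            continue
        sider_cid = match.iloc[0]["sider_cid"]
        # Keep only the Preferred Term (PT) rows, as decided in the plan
        pt_rows = side_effects[(side_effects["sider_cid_flat"] == sider_cid) &
                               (side_effects["meddra_type"] == "PT")]
        return sider_cid, sorted(pt_rows["side_effect"].unique())
    return None, []
```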
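A simplified sketch of the PubMed E-Utilities step, assuming the esearch/efetch endpoints with XML abstracts and a single illustrative regular expression; the actual queries and patterns used for the searches are in the repo.

```python
import re
import requests
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
# Illustrative pattern only; the real searches used several expressions
NEGATIVE_PATTERN = re.compile(
    r"(\S+) (?:does not|did not|cannot) (?:cross|penetrate) the blood.brain barrier",
    re.IGNORECASE)


def search_pubmed(query, retmax=20):
    """Return a list of PubMed IDs matching the query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    response = requests.get(f"{EUTILS}/esearch.fcgi", params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]


def fetch_abstract_sentences(pmids):
    """Fetch abstracts as XML and split their paragraphs into sentences."""
    params = {"db": "pubmed", "id": ",".join(pmids),
              "rettype": "abstract", "retmode": "xml"}
    response = requests.get(f"{EUTILS}/efetch.fcgi", params=params, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.text)
    for abstract_text in root.iter("AbstractText"):
        for sentence in re.split(r"(?<=[.!?])\s+", abstract_text.text or ""):
            yield sentence


pmids = search_pubmed('"does not cross the blood-brain barrier"')
matches = [m.group(1) for s in fetch_abstract_sentences(pmids)
           if (m := NEGATIVE_PATTERN.search(s))]
print(matches)  # candidate BBB- compound names, to be verified manually
```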

Unused Ideas

  • Tried to add more drugs and compounds by performing Automated Google Searches
    • Performing specific drug-targeted Google searches proved to be ineffective; we would get irrelevant results
    • Performing different general queries to gather as much data as possible proved ineffective as well; the results were too noisy
    • Had the idea to perform site-targeted queries (https://pubs.acs.org/, https://pubmed.ncbi.nlm.nih.gov/, https://www.ncbi.nlm.nih.gov/pmc/) to gather as much reliable data as possible while decreasing the effort needed to verify results, but a discussion with my supervisor pointed me to PubMed's API, which I ended up using instead of automated Google searches
    • For each query we would gather as many URLs as possible before getting a 429: Too Many Requests error, and regular expressions were used to retrieve matches. We were interested in the words before the query text

Problems Encountered

  • Some SMILES strings include special characters (e.g. '/') that, even when URL-encoded, alter the SMILES itself
    • Solved using POST requests to the PubChem API, as suggested by the documentation (see the sketch after this list)
  • Complexity issues: algorithms were taking too long to run
    • Solved through code refactoring, reformatting and making use of binary searches where possible
  • Discovered a bug with one of the functions, get_pubchem_cid_and_smiles_using_name
    • Only affected the entries associated with the Gao et al. datasets
    • Some drug names can have multiple SMILES associated with them and the bug caused the function to only retrieve the first SMILES available
    • Fixed through some code refactoring: the titles of the compounds retrieved from PubChem are also returned and compared with the drug name currently being searched; if a match is found, that specific compound's CID and SMILES are returned (see the sketch after this list)
    • Had to repopulate the whole dataset which took a couple of hours
  • While performing automated Google searches, after roughly 100 search results were retrieved, a 429: Too Many Requests error would pop up, possibly caused by the website detecting that a bot was being used to scrape data
    • Multiple queries were used to gather as much data as possible until that error was thrown
    • In the end we decided against directly using Google searches and made use of the APIs offered by PubMed and Springer for a more targeted approach
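
A minimal sketch of the POST-based workaround for SMILES containing '/' and similar characters, assuming the PUG REST properties-by-SMILES endpoint; sending the SMILES in the request body avoids the URL-encoding issue. This is an illustration rather than the project's exact code.

```python
import requests

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PROPERTIES = "CanonicalSMILES,MolecularWeight,TPSA,XLogP"


def properties_from_smiles(smiles):
    """Send the SMILES in the POST body so characters such as '/' are not
    mangled by URL encoding (the approach suggested by the PUG REST docs)."""
    url = f"{PUG_REST}/compound/smiles/property/{PROPERTIES}/JSON"
    response = requests.post(url, data={"smiles": smiles}, timeout=30)
    response.raise_for_status()
    return response.json()["PropertyTable"]["Properties"][0]


# Example with a stereochemistry-bearing SMILES (trans-2-butene)
print(properties_from_smiles("C/C=C/C"))
```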
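A hedged sketch of the fix to get_pubchem_cid_and_smiles_using_name, assuming PUG REST's name lookup can return the Title and CanonicalSMILES properties for every matching record; the real implementation is in the repo.

```python
from urllib.parse import quote

import requests

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def get_pubchem_cid_and_smiles_using_name(name):
    """Return (CID, SMILES) for the compound whose PubChem Title matches the
    drug name, instead of blindly taking the first record returned."""
    url = f"{PUG_REST}/compound/name/{quote(name)}/property/Title,CanonicalSMILES/JSON"
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        return None, None
    for record in response.json()["PropertyTable"]["Properties"]:
        # Compare the PubChem title with the name being searched
        if record["Title"].lower() == name.lower():
            return record["CID"], record["CanonicalSMILES"]
    return None, None


print(get_pubchem_cid_and_smiles_using_name("aspirin"))
```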