NCBI GEO Use Case - bicbioeng/DeepSeqDock GitHub Wiki
Step 1. Search for Data in NCBI
Website: NCBI Gene Expression Omnibus (GEO) DataSets
- Access the NCBI GEO DataSets portal.
- Perform a keyword search:
  - Use terms like ALS, Alzheimer, Monkeypox, or brain tumor.
  - Filter for species: Homo sapiens (human).
- Identify datasets with RNA count matrices:
  - Look for datasets that include RNA-seq count matrices or raw RNA-seq data.
- Select relevant datasets for download.
The image shows a search for "Alzheimer" in the NCBI GEO DataSets portal, filtered to Homo sapiens.
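The same search can also be expressed as a query against NCBI's public E-utilities ESearch endpoint. The sketch below only builds the query URL; it assumes the GEO DataSets database name `gds` and the `[Organism]` and `[DataSet Type]` field tags, and the helper name `build_geo_search_url` is made up for illustration. It is not part of the workflow described on this page.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(keyword, organism="Homo sapiens", retmax=20):
    """Build an ESearch URL for GEO DataSets matching a keyword and organism.

    Hypothetical helper: the field tags below are assumptions about how the
    web filters map to a text query.
    """
    term = (
        f'{keyword} AND "{organism}"[Organism] '
        'AND "expression profiling by high throughput sequencing"[DataSet Type]'
    )
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Prints a URL that can be opened in a browser or fetched with urllib.
    print(build_geo_search_url("Alzheimer"))
```

Swapping the keyword for ALS, Monkeypox, or "brain tumor" gives the equivalent query for the other example terms. Whether a returned series actually ships a count matrix still has to be checked on its GEO page.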
Step 2. Search for RNA-seq Count Matrices
- Once datasets are found, check their descriptions for mentions of RNA count matrices.
- Prioritize datasets that are already split into train, validation, and test sets, or that contain enough samples to be split manually.
- Click on “Download RNA-seq counts”.
Step 3. Download RNA-seq Count Data
- Navigate to the dataset of interest on NCBI.
- Right-click “Series RNA-seq raw counts matrix” (e.g., GSE255982_raw_counts_GRCh38.p13_NCBI.tsv.gz) and copy the link address. Files are typically provided in TSV or CSV format.
Copy the link address shown in the image above.
Paste the link into the Google Colab notebook below to generate the train, test, and validation datasets. Once the link is pasted, the notebook generates the datasets and downloads the files automatically.
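For readers who want to see what such a split involves, here is a minimal sketch that reads the sample columns from a gzipped TSV count matrix and divides them into train, valid, and test groups. It is not the notebook's code. It assumes the first column holds gene identifiers and every other column is one sample; the 70/15/15 fractions, the seed, and the function names are illustrative choices.

```python
import csv
import gzip
import io
import random


def read_sample_ids(gz_bytes):
    """Return the sample column names from a gzipped TSV count matrix.

    Assumes column 0 is the gene identifier and the rest are samples.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]


def split_samples(sample_ids, fractions=(0.7, 0.15), seed=0):
    """Shuffle sample IDs reproducibly and split them three ways.

    fractions gives the train and valid shares; test receives the remainder.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_valid = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "valid": ids[n_train:n_train + n_valid],
        "test": ids[n_train + n_valid:],
    }


if __name__ == "__main__":
    # Made-up matrix standing in for a downloaded *.tsv.gz file.
    samples = [f"S{i}" for i in range(10)]
    demo = "GeneID\t" + "\t".join(samples) + "\n1\t" + "\t".join(["0"] * 10) + "\n"
    parts = split_samples(read_sample_ids(gzip.compress(demo.encode())))
    print({name: len(ids) for name, ids in parts.items()})
```

Fixing the seed keeps the split reproducible across runs. A real pipeline would usually also stratify by condition so that each subset contains every class, which requires the metadata from Step 4.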
Step 4. Download SRA Run Selector for Metadata
- On the dataset's main page, find the SRA Run Selector button at the bottom of the page.
- Use it to export the metadata file, which contains details such as sample names, source names, and pathology.
After you click the SRA Run Selector button, the page shown below opens. Click Metadata to download the SRARunTable, which is used later to generate the meta.csv file.
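As a rough illustration of how a run table could be reduced to a meta.csv, the sketch below keeps a chosen set of columns from a comma-separated metadata export. The column names `Sample Name`, `source_name`, and `pathology` are assumptions based on the fields mentioned above; real exports vary by dataset, and the layout of the meta.csv this project expects is not specified here. Both function names are hypothetical.

```python
import csv
import io


def build_meta_rows(run_table_text, columns):
    """Keep only the requested columns from a comma-separated run table."""
    reader = csv.DictReader(io.StringIO(run_table_text))
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in run table: {missing}")
    return [{c: row[c] for c in columns} for row in reader]


def write_meta_csv(rows, columns):
    """Serialize the selected rows as CSV text (candidate meta.csv contents)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up rows; a real export has many more columns.
    demo = (
        "Run,Sample Name,source_name,pathology\n"
        "RUN1,SAMPLE1,brain,control\n"
        "RUN2,SAMPLE2,brain,disease\n"
    )
    wanted = ["Sample Name", "source_name", "pathology"]  # assumed column names
    print(write_meta_csv(build_meta_rows(demo, wanted), wanted))
```

Raising an error on missing columns makes it obvious when a dataset labels its pathology field differently, which is common across GEO series.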