1. SRR metadata DB construction - labbces/SpliceScape GitHub Wiki

Scripts

🟡 get_metadata.py

The Python script get_metadata.py retrieves sequence metadata from NCBI's Sequence Read Archive (SRA) and stores it in a SQLite database. Searches can be filtered by species and library layout (PAIRED or SINGLE), and the resulting lists of SRA identifiers can be exported to text files.
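To make the species and layout filters concrete, here is a minimal sketch of how such a search could be expressed as an NCBI E-utilities `esearch` request. It is not taken from get_metadata.py: the helper names `build_sra_term` and `build_esearch_url` are hypothetical, the `[Organism]` and `[Layout]` field tags are assumed to be the relevant SRA search fields, and the email is a placeholder. The sketch only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_term(species, layout):
    """Combine species and library layout into one Entrez search term.

    Hypothetical helper; the field tags are an assumption about the SRA index.
    """
    layout = layout.upper()
    if layout not in ("PAIRED", "SINGLE"):
        raise ValueError("layout must be PAIRED or SINGLE")
    return f'"{species}"[Organism] AND "{layout.lower()}"[Layout]'


def build_esearch_url(term, email, api_key, retmax=1000):
    """Return the esearch URL for the SRA database (no request is sent)."""
    params = {
        "db": "sra",
        "term": term,
        "retmax": retmax,   # plays the role of --max_n_ids
        "email": email,     # plays the role of --email
        "api_key": api_key, # plays the role of --api_key
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = build_sra_term("Setaria italica", "PAIRED")
url = build_esearch_url(term, "user@example.org", "YOUR_API_KEY", retmax=500)
print(term)
print(url)
```

Keeping the term construction separate from the request makes the filter easy to inspect before any network call is made.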

  • Requirements:
| Category | Requirements |
| --- | --- |
| Python | Python 3 |
| Standard Libraries | argparse, time, sqlite3, xml.etree.ElementTree, datetime, os |
| External Libraries | requests, beautifulsoup4, biopython |
| Internet Connection | Required to access the NCBI API and Google Scholar |
| Valid Email & API Key | Necessary to use the NCBI Entrez API (--email and --api_key) |
  • Arguments:
| Argument | Function | Required | Example |
| --- | --- | --- | --- |
| --mode | Sets the mode: "all" (for species search) or "srr" (for SRR list) | True | --mode all |
| -e, --email | Specifies the user email for NCBI Entrez API | True | -e [email protected] |
| -a, --api_key | Specifies the NCBI API key | True | -a ABCD1234... |
| --database | SQLite3 database file | True | --database metadata.db |
| -sp | Species name (required for mode "all") | Cond. | -sp Setaria viridis |
| -ll | Library layout: SINGLE or PAIRED (required for mode "all") | Cond. | -ll PAIRED |
| --srr_file | File with SRR accessions (required for mode "srr") | Cond. | --srr_file srr_ids.txt |
| --max_n_ids | Maximum number of identifiers to return | False | --max_n_ids 1000 |
| --verbose | Enables verbose output | False | --verbose |
| --keep_unavailable | Keep unavailable datasets in the database | False | --keep_unavailable True |
| --summary | Displays summary statistics | False | --summary |
| --srr_list_out | File to export list of filtered SRA accessions | False | --srr_list_out sra_accessions.txt |
| --srr_list_out_with_pmid | File to export list with associated PMIDs | False | --srr_list_out_with_pmid sra_accessions_with_pmid.txt |
| -v, --version | Displays the script version | False | -v |

ℹ️ Note: If --mode all is used, both -sp and -ll must be provided.
If --mode srr is used, --srr_file must be provided.
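The mode-dependent requirements in the note can be enforced with a small check after parsing. The following is a minimal sketch using only `argparse`, not the script's actual parser; it covers just the four arguments involved, and the function names `build_parser` and `parse_args` are hypothetical.

```python
import argparse


def build_parser():
    """Parser covering only the arguments whose requirement depends on --mode."""
    parser = argparse.ArgumentParser(description="Sketch of the mode-dependent check")
    parser.add_argument("--mode", choices=["all", "srr"], required=True)
    parser.add_argument("-sp", help="species name (needed for --mode all)")
    parser.add_argument("-ll", choices=["SINGLE", "PAIRED"],
                        help="library layout (needed for --mode all)")
    parser.add_argument("--srr_file", help="file with SRR accessions (needed for --mode srr)")
    return parser


def parse_args(argv):
    """Parse argv and reject combinations that the note rules out."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mode == "all" and (args.sp is None or args.ll is None):
        parser.error("--mode all requires both -sp and -ll")
    if args.mode == "srr" and args.srr_file is None:
        parser.error("--mode srr requires --srr_file")
    return args


args = parse_args(["--mode", "all", "-sp", "Setaria italica", "-ll", "PAIRED"])
print(args.mode, args.sp, args.ll)
```

`parser.error` prints the usage message and exits with status 2, so an incomplete command line fails before any network request is attempted.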

  • Run example:
python3 get_metadata.py --mode all -e [email protected] -a YOUR_API_KEY -sp "Setaria italica" -ll PAIRED --database metadata.db --verbose --max_n_ids 500
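Once a run like this has filled the database, an accession list of the kind written by --srr_list_out can in principle be produced with a single query. The sketch below illustrates the idea on an in-memory database. The table name `sra_metadata`, its columns, and the accessions are all placeholders invented for illustration; the real schema created by get_metadata.py is not documented on this page and may differ.

```python
import os
import sqlite3
import tempfile


def create_demo_db():
    """Build an in-memory database with a hypothetical metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sra_metadata "
        "(srr TEXT PRIMARY KEY, species TEXT, layout TEXT, pmid TEXT)"
    )
    # Placeholder rows, not real SRA records.
    conn.executemany(
        "INSERT INTO sra_metadata VALUES (?, ?, ?, ?)",
        [
            ("SRR0000002", "Setaria italica", "PAIRED", None),
            ("SRR0000001", "Setaria italica", "PAIRED", "00000000"),
            ("SRR0000003", "Setaria italica", "SINGLE", None),
        ],
    )
    return conn


def export_srr_list(conn, species, layout, out_path):
    """Write matching accessions to out_path, one per line, and return them."""
    rows = conn.execute(
        "SELECT srr FROM sra_metadata WHERE species = ? AND layout = ? ORDER BY srr",
        (species, layout),
    ).fetchall()
    accessions = [srr for (srr,) in rows]
    with open(out_path, "w") as handle:
        for srr in accessions:
            handle.write(srr + "\n")
    return accessions


conn = create_demo_db()
out_path = os.path.join(tempfile.mkdtemp(), "sra_accessions.txt")
print(export_srr_list(conn, "Setaria italica", "PAIRED", out_path))
```

Parameterized queries (`?` placeholders) keep species names containing quotes or spaces from breaking the SQL.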

Errors

NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2acd3c2b37f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
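This error means the hostname could not be resolved (a DNS failure), which typically happens on machines without outbound internet access or during a transient network problem. For the transient case, one possible mitigation is to retry the request with an increasing delay. The sketch below is an assumption about how that could be done, not part of get_metadata.py: `with_retries` and `flaky_fetch` are hypothetical names, and the helper catches `OSError`, from which the built-in `ConnectionError` (and, to my understanding, the exceptions raised by requests) derive.

```python
import time


def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Re-raises the last error once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Demonstration with a stand-in for a network call that fails twice.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Temporary failure in name resolution")
    return "ok"


# sleep is replaced with a no-op so the demonstration finishes immediately.
print(with_retries(flaky_fetch, sleep=lambda seconds: None))
```

Retrying will not help if the machine has no internet access at all (for example, a cluster compute node); in that case the script has to run on a host that can reach NCBI.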

🟢 metadata_plots.R

metadata_plots.R

  • Requirements:
Category Requirements
Packages ggplot2
  • Output:
ING PORT

🟢 get_metadata_PacBio.py

The Python script get_metadata_PacBio.py is a variant of get_metadata.py that retrieves only PacBio data. Like the original, it retrieves sequence metadata from NCBI's Sequence Read Archive (SRA) and stores it in a SQLite database; searches can be filtered by species and library layout (PAIRED or SINGLE), and lists of SRA identifiers can be exported to text files.
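One way to restrict results to PacBio is to inspect the platform element of each SRA experiment record. The sketch below shows the idea with `xml.etree.ElementTree`, which the script lists among its dependencies. It is an illustration, not the script's actual filter: the function name `pacbio_runs` is hypothetical, and the XML is a hand-written sample modeled on the SRA experiment package layout with placeholder accessions.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modeled on SRA experiment packages; accessions are placeholders.
SAMPLE_XML = """
<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <EXPERIMENT accession="SRX0000001">
      <PLATFORM><PACBIO_SMRT><INSTRUMENT_MODEL>Sequel II</INSTRUMENT_MODEL></PACBIO_SMRT></PLATFORM>
    </EXPERIMENT>
    <RUN_SET><RUN accession="SRR0000001"/></RUN_SET>
  </EXPERIMENT_PACKAGE>
  <EXPERIMENT_PACKAGE>
    <EXPERIMENT accession="SRX0000002">
      <PLATFORM><ILLUMINA><INSTRUMENT_MODEL>Illumina HiSeq 2500</INSTRUMENT_MODEL></ILLUMINA></PLATFORM>
    </EXPERIMENT>
    <RUN_SET><RUN accession="SRR0000002"/></RUN_SET>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>
"""


def pacbio_runs(xml_text):
    """Return run accessions from packages whose platform element is PACBIO_SMRT."""
    root = ET.fromstring(xml_text)
    runs = []
    for package in root.iter("EXPERIMENT_PACKAGE"):
        platform = package.find("EXPERIMENT/PLATFORM")
        # The platform is encoded as the tag of PLATFORM's first child element.
        if platform is None or len(platform) == 0:
            continue
        if platform[0].tag != "PACBIO_SMRT":
            continue
        runs.extend(run.get("accession") for run in package.iter("RUN"))
    return runs


print(pacbio_runs(SAMPLE_XML))  # → ['SRR0000001']
```

Filtering on the record itself, rather than only on the search term, guards against records that match the query text but were sequenced on another platform.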

  • Requirements:
| Category | Requirements |
| --- | --- |
| Python | Python 3 |
| Standard Libraries | argparse, time, sqlite3, xml.etree.ElementTree, datetime, os |
| External Libraries | requests, beautifulsoup4, biopython |
| Internet Connection | Required to access the NCBI API and Google Scholar |
| Valid Email | Necessary to use the NCBI Entrez API |
  • Arguments:
| Argument | Function | Required | Example |
| --- | --- | --- | --- |
| -v, --version | Displays the script version | False | -v |
| --verbose | Enables verbose output | False | --verbose |
| --summary | Displays summary statistics | False | --summary |
| -e, --email | Specifies the user email | True | -e [email protected] |
| -sp | Specifies the species name | True | -sp Setaria viridis |
| -ll | Specifies the library layout (SINGLE or PAIRED) | False | -ll PAIRED |
| --database | Specifies the SQLite3 database file | True | --database database.db |
| --max_n_ids | Sets the maximum number of identifiers to return | False | --max_n_ids 1000 |
| --srr_list_out | Outputs the list of SRA accessions filtered by species and layout | False | --srr_list_out sra_accessions.txt |
| --srr_list_out_with_pmid | Outputs the list of SRA accessions with associated PMIDs | False | --srr_list_out_with_pmid sra_accessions_with_pmid.txt |
  • Run example:
python3 "/home/LandscapeSplicingGrasses/SplicingLandscapeGrasses/metadata/get_metadata_PacBio.py" -e [email protected] -sp "Setaria italica" --database database.db --verbose --max_n_ids 100000

Errors

🟢 metadata_plots_PacBio.R

metadata_plots_PacBio.R

  • Requirements:
Category Requirements
Packages ggplot2
  • Output:
ING PORT