1. SRR metadata DB construction - labbces/SpliceScape GitHub Wiki
Scripts
The Python script get_metadata.py retrieves sequence metadata from NCBI's Sequence Read Archive (SRA) and stores it in a SQLite database. It accepts search filters such as species and library layout (PAIRED or SINGLE) and can export lists of SRA accessions to text files.
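As a rough illustration of the store-then-export flow described above, here is a minimal sketch using only `sqlite3`. The table name `sra_metadata`, its columns, and the helper names are hypothetical; the schema actually used by get_metadata.py may differ.

```python
import sqlite3

# Hypothetical schema: the real tables and columns in get_metadata.py may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sra_metadata (
    srr     TEXT PRIMARY KEY,
    species TEXT NOT NULL,
    layout  TEXT NOT NULL,
    pmid    TEXT
)
"""


def store_records(conn, records):
    """Insert (srr, species, layout, pmid) tuples, skipping accessions already stored."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO sra_metadata (srr, species, layout, pmid) "
        "VALUES (?, ?, ?, ?)",
        records,
    )
    conn.commit()


def filtered_accessions(conn, species, layout):
    """Return the sorted SRR accessions matching a species and library layout."""
    rows = conn.execute(
        "SELECT srr FROM sra_metadata WHERE species = ? AND layout = ? ORDER BY srr",
        (species, layout),
    )
    return [row[0] for row in rows]


def export_accessions(accessions, path):
    """Write one accession per line to a text file."""
    with open(path, "w") as handle:
        for accession in accessions:
            handle.write(accession + "\n")
```

With an in-memory database (`sqlite3.connect(":memory:")`) the helpers can be tried without touching disk. Any accession values passed in during such a trial are placeholders, not real SRA records.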
| Category | Requirements |
| --- | --- |
| Python | Python 3 |
| Standard Libraries | `argparse`, `time`, `sqlite3`, `xml.etree.ElementTree`, `datetime`, `os` |
| External Libraries | `requests`, `beautifulsoup4`, `biopython` |
| Internet Connection | Required to access the NCBI API and Google Scholar |
| Valid Email & API Key | Necessary to use the NCBI Entrez API (`--email` and `--api_key`) |
| Argument | Function | Required | Example |
| --- | --- | --- | --- |
| `--mode` | Sets the mode: "all" (for species search) or "srr" (for SRR list) | True | `--mode all` |
| `-e`, `--email` | Specifies the user email for the NCBI Entrez API | True | `-e YOUR_EMAIL` |
| `-a`, `--api_key` | Specifies the NCBI API key | True | `-a ABCD1234...` |
| `--database` | SQLite3 database file | True | `--database metadata.db` |
| `-sp` | Species name (required for mode "all") | Cond. | `-sp Setaria viridis` |
| `-ll` | Library layout: SINGLE or PAIRED (required for mode "all") | Cond. | `-ll PAIRED` |
| `--srr_file` | File with SRR accessions (required for mode "srr") | Cond. | `--srr_file srr_ids.txt` |
| `--max_n_ids` | Maximum number of identifiers to return | False | `--max_n_ids 1000` |
| `--verbose` | Enables verbose output | False | `--verbose` |
| `--keep_unavailable` | Keeps unavailable datasets in the database | False | `--keep_unavailable True` |
| `--summary` | Displays summary statistics | False | `--summary` |
| `--srr_list_out` | File to export the list of filtered SRA accessions | False | `--srr_list_out sra_accessions.txt` |
| `--srr_list_out_with_pmid` | File to export the list with associated PMIDs | False | `--srr_list_out_with_pmid sra_accessions_with_pmid.txt` |
| `-v`, `--version` | Displays the script version | False | `-v` |
ℹ️ Note: If `--mode all` is used, both `-sp` and `-ll` must be provided. If `--mode srr` is used, `--srr_file` must be provided.
python3 get_metadata.py --mode all -e YOUR_EMAIL -a YOUR_API_KEY -sp "Setaria italica" -ll PAIRED --database metadata.db --verbose --max_n_ids 500
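The mode-dependent requirements noted above (`-sp` and `-ll` for mode "all", `--srr_file` for mode "srr") could be enforced along the following lines. This is a sketch with `argparse`, not the script's actual parser: the function names are hypothetical, and the remaining options (`--email`, `--api_key`, `--database`, and so on) are left out for brevity.

```python
import argparse


def build_parser():
    """Build a reduced parser covering only the mode-dependent options."""
    parser = argparse.ArgumentParser(description="Sketch of conditional arguments")
    parser.add_argument("--mode", choices=["all", "srr"], required=True)
    parser.add_argument("-sp", help="species name (needed for mode 'all')")
    parser.add_argument("-ll", choices=["SINGLE", "PAIRED"],
                        help="library layout (needed for mode 'all')")
    parser.add_argument("--srr_file", help="file with SRR accessions (needed for mode 'srr')")
    return parser


def parse_args(argv=None):
    """Parse argv and reject combinations that violate the per-mode requirements."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mode == "all" and (args.sp is None or args.ll is None):
        parser.error("--mode all requires both -sp and -ll")
    if args.mode == "srr" and args.srr_file is None:
        parser.error("--mode srr requires --srr_file")
    return args
```

`parser.error` prints the usage message and exits with status 2, so an incomplete command line fails before any request is sent to NCBI.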
Errors
NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2acd3c2b37f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
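"Temporary failure in name resolution" means the hostname could not be resolved through DNS. This typically points to a machine without working internet access or to a transient network problem, not to a bug in the query itself. If the failure is transient, retrying with a growing delay is one common workaround. The helper below is a generic sketch and not part of get_metadata.py; the name `retry_with_backoff` and its parameters are hypothetical.

```python
import time


def retry_with_backoff(func, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a connection-type error, wait and retry with doubling delays."""
    for attempt in range(retries):
        try:
            return func()
        except OSError:
            # OSError covers the built-in ConnectionError and, because
            # requests.exceptions.RequestException derives from IOError,
            # the connection errors raised by requests as well.
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
```

A network call would be wrapped as `retry_with_backoff(lambda: requests.get(url, timeout=30))`, where `url` stands for whatever endpoint is being queried. On a node with no outbound network access at all, retries will not help and the script has to be run from a machine that can reach NCBI.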
metadata_plots.R
| Category | Requirements |
| --- | --- |
| Packages | `ggplot2` |
The Python script get_metadata_PacBio.py retrieves sequence metadata from NCBI's Sequence Read Archive (SRA) and stores it in a SQLite database. It accepts search filters such as species and library layout (PAIRED or SINGLE) and can export lists of SRA accessions to text files. It is a modified version of get_metadata.py that retrieves only PacBio data.
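One way to restrict an SRA search to PacBio runs is to add a platform filter to the Entrez search term. The sketch below only builds the term string. The field tags (`[Organism]`, `[Layout]`, `[Platform]`) and the platform value `pacbio smrt` are assumptions about how such a filter could be expressed, and the query actually built by get_metadata_PacBio.py may differ. `build_sra_term` is a hypothetical helper name.

```python
def build_sra_term(species, layout=None, platform=None):
    """Compose an Entrez search term for the SRA database from optional filters."""
    parts = [f'"{species}"[Organism]']
    if layout:
        # Layout is optional here, mirroring the optional -ll flag of this script.
        parts.append(f'"{layout.lower()}"[Layout]')
    if platform:
        parts.append(f'"{platform}"[Platform]')
    return " AND ".join(parts)
```

The resulting string could then be passed as the `term` argument of Biopython's `Entrez.esearch(db="sra", term=...)`.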
| Category | Requirements |
| --- | --- |
| Python | Python 3 |
| Standard Libraries | `argparse`, `time`, `sqlite3`, `xml.etree.ElementTree`, `datetime`, `os` |
| External Libraries | `requests`, `beautifulsoup4`, `biopython` |
| Internet Connection | Required to access the NCBI API and Google Scholar |
| Valid Email | Necessary to use the NCBI Entrez API |
| Argument | Function | Required | Example |
| --- | --- | --- | --- |
| `-v`, `--version` | Displays the script version | False | `-v` |
| `--verbose` | Enables verbose output | False | `--verbose` |
| `--summary` | Displays summary statistics | False | `--summary` |
| `-e`, `--email` | Specifies the user email | True | `-e YOUR_EMAIL` |
| `-sp` | Specifies the species name | True | `-sp Setaria viridis` |
| `-ll` | Specifies the library layout (SINGLE or PAIRED) | False | `-ll PAIRED` |
| `--database` | Specifies the SQLite3 database file | True | `--database database.db` |
| `--max_n_ids` | Sets the maximum number of identifiers to return | False | `--max_n_ids 1000` |
| `--srr_list_out` | Outputs the list of SRA accessions filtered by species and layout | False | `--srr_list_out sra_accessions.txt` |
| `--srr_list_out_with_pmid` | Outputs the list of SRA accessions with associated PMIDs | False | `--srr_list_out_with_pmid sra_accessions_with_pmid.txt` |
python3 "/home/LandscapeSplicingGrasses/SplicingLandscapeGrasses/metadata/get_metadata_PacBio.py" -e YOUR_EMAIL -sp "Setaria italica" --database database.db --verbose --max_n_ids 100000
Errors
metadata_plots_PacBio.R
| Category | Requirements |
| --- | --- |
| Packages | `ggplot2` |