Submitting a genome to NCBI - The-Bioinformatics-Group/Albiorix GitHub Wiki
Step 1 - Register on NCBI
Create an account on NCBI here
Step 2 - Register a BioProject
Follow the steps on the BioProject submission page
- Click New submission
- 1 - Submitter
- Enter your contact details - name, email address and postal address
- 2 - Project Type
- Specify the project data type (genome sequencing and assembly, metagenome, transcriptome, etc.)
- Specify the sample scope (monoisolate, multiisolate, multispecies, etc.)
- 3 - Target
- Specify the organism name and strain details, and give a brief description
- 4 - General Info
- Specify when you want the project information to be released - immediately or on a specified date
- Specify the project title and public description
- State the relevance of the project - agricultural, medical, model organism, etc.
- State whether the project is part of a larger initiative (uncommon)
- Give details of any grants related to the project
- 5 - BioSample
- If BioSamples have already been registered, they can be assigned to this BioProject if relevant
- 6 - Publications
- Add links to any publications relevant to the project
- 7 - Overview
- Check details prior to submission
Step 3 - Register a BioSample
Follow the steps on the BioSample submission page
- Click New submission
- 1 - Submitter
- Enter your contact details - name, email address and postal address
- 2 - General Info
- Specify when you want the project information to be released - immediately or on a specified date
- Specify whether you intend to upload batch/multiple BioSamples, or a single BioSample
- 3 - Sample Type
- Specify the sample type - microbe, model organism, metagenome, human, virus, etc.
- 4 - Attributes
- Specify the sample name, organism name, strain, etc., and other attributes related to collection, culturing, etc.
- 5 - BioProject
- If the relevant BioProject has already been registered, this BioSample can be assigned to it now
- 6 - Description
- Specify the project title and public description
- 7 - Overview
- Check details prior to submission
Step 4 - Preparing data for submission
Before anything else, BLAST the genes at the end of each contig, to ensure that the wraparound of the circular replicon hasn't cut any genes in half
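To decide which genes to BLAST, it can help to list the annotated genes that lie close to either end of each contig. The sketch below is one way to do that; the function name, the (name, start, end) tuple layout and the 500 bp margin are illustrative assumptions, not part of any Albiorix script.

```python
def genes_near_ends(genes, contig_length, margin=500):
    """Return names of genes lying within `margin` bp of either contig end.

    `genes` is an iterable of (name, start, end) tuples using 1-based,
    inclusive coordinates. The default margin is an arbitrary choice.
    """
    flagged = []
    for name, start, end in genes:
        low, high = min(start, end), max(start, end)
        # Flag the gene if it touches the first or last `margin` bases.
        if low <= margin or high > contig_length - margin:
            flagged.append(name)
    return flagged
```

The flagged genes are candidates for a manual BLAST check; the function does not itself detect a truncated gene.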
Fasta file preparation
- Separate the files, one contig/replicon per fasta, and suffix them as .fsa
- For chromosomes, use the following definition line format:
>SpeGenStr_Chrom [organism=] [strain=] [gcode=11]
- For plasmids, use the following definition line format:
>SpeGenStr_pStr-X [organism=] [strain=] [plasmid-name=pStr-X] [completeness=complete] [topology=circular] [gcode=11] [location=plasmid]
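The splitting and renaming described above can be scripted. The following is a minimal sketch, assuming a multi-FASTA input whose first header word is the sequence ID; the function names and parameters are hypothetical, and the organism and strain values must be supplied by you.

```python
from pathlib import Path


def build_defline(seq_id, organism, strain, plasmid_name=None, gcode=11):
    """Build a definition line in the chromosome or plasmid format shown above."""
    parts = [f">{seq_id}", f"[organism={organism}]", f"[strain={strain}]"]
    if plasmid_name is not None:
        parts += [
            f"[plasmid-name={plasmid_name}]",
            "[completeness=complete]",
            "[topology=circular]",
        ]
    parts.append(f"[gcode={gcode}]")
    if plasmid_name is not None:
        parts.append("[location=plasmid]")
    return " ".join(parts)


def split_fasta(text):
    """Split multi-FASTA text into {sequence_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # first word of the header is the ID
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


def write_fsa(out_dir, seq_id, defline, sequence, width=70):
    """Write one replicon to <out_dir>/<seq_id>.fsa with wrapped sequence lines."""
    path = Path(out_dir) / f"{seq_id}.fsa"
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    path.write_text("\n".join([defline] + wrapped) + "\n")
    return path
```

Each replicon then gets its own .fsa file, with the definition line built by build_defline and passed to write_fsa.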
Annotation file preparation
- Ensure you are using suitable locus tags, such as the format SpeGenStr; if needed, replace all instances using sed
- Based on the Prokaryotic Genome Annotation Guide, tidy up the GenBank file:
- If the product name contains a molecular weight in kDa, BLASTp the sequence and try to find an alternative name, e.g. subunit X
- Put the original name in a /note field
- Greek letters should be in lower-case, except for Delta "in the context of steroid/fatty acid metabolism nomenclature"
- Un-capitalise product names where possible (there are acceptable exceptions, e.g. abbreviations)
- This can be achieved using sed:
sed -i 's/Word/word/g' filename
- If the product name contains a species name, e.g. 'Agrobacterium tumefaciens protein', BLASTp the sequence and try to find an alternative name; if not, label as 'hypothetical protein'
- Put the original name in a /note field
- Other changes that should be made:
- Check for trailing underscores in product names, and remove these
- Put the original name in a /note field
- Check for names ending with square brackets, and replace these with regular brackets
- If two genes have the same name but different products, and the difference is trivial, rename one to match the other where possible
- For the sake of tidiness, consider removing '_1', '_2', etc. from the ends of gene names
- If the GenBank was produced using the GenBank_Consensus.py script, remove any /note fields stating 'From [filename].gbk' or 'Both records hypothetical'
- Similarly, remove any /inference fields stating 'Similar to *.fasta'
sed -i '/\.fasta/d' *.gbk
will achieve this (the dot is escaped so that it only matches a literal '.'), but run
grep "\.fasta" *.gbk
first to check which lines would be deleted
- Check the more obscure feature types, like 'ncRNA_class', to ensure they are what they claim to be
- When all alterations have been made, separate the GenBank file - one contig/replicon per file
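Some of the product-name checks above are mechanical and can be scripted, while others only need flagging for manual review. The sketch below assumes product names have already been pulled out of the GenBank file as plain strings; the function names and the species list are hypothetical, and the kDa pattern is a guess at how molecular weights usually appear.

```python
import re

# Assumed form of a molecular weight in a product name, e.g. "30 kDa" or "16.5kDa".
KDA_PATTERN = re.compile(r"\b\d+(\.\d+)?\s*kDa\b", re.IGNORECASE)


def tidy_product(name):
    """Strip trailing underscores and turn a trailing [..] into (..)."""
    tidied = name.rstrip("_ ")
    match = re.search(r"\[([^\[\]]*)\]$", tidied)
    if match:
        tidied = tidied[:match.start()] + "(" + match.group(1) + ")"
    return tidied


def needs_manual_review(name, species_names=()):
    """Return the reasons a product name should be checked by hand (e.g. BLASTp)."""
    reasons = []
    if KDA_PATTERN.search(name):
        reasons.append("molecular weight in kDa")
    for species in species_names:
        if species.lower() in name.lower():
            reasons.append(f"species name: {species}")
    return reasons
```

Whenever tidy_product changes a name, or a flagged name is replaced by hand, the original name still needs to go into a /note field as described above; this sketch does not edit the GenBank file itself.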
Conversion to .sqn format
- Once all changes are made, run the files through the GenBank_to_NCBI_tbl.py script, with the identifier GotUniMarDep
- Ensure that the header is
>Features SpeGenStr_Chrom
or
>Features SpeGenStr_pStr-X
- Once everything is checked, complete and download both a submission template (.sbt file) and genome structured comment (.asm file) from the submission portal
- Use tbl2asn to check your files for errors or inconsistencies, and fix any problems which are mentioned in the discrep or errorsummary.val files generated; if the steps above have been followed, the errors encountered should be minimal
- path\to\tbl2asn.exe -p path\to\directory -t template_file.sbt -M n -Z discrep -w structured_comment.asm
- Any errors noted as FATAL or ERROR must be corrected; those labelled as WARNING can be disregarded if necessary
- One common 'error' noted is when deprecated EC numbers have been used; a .ecn file will be created by tbl2asn listing the EC number changes made to each contig's resultant .sqn file, and these results can be used to correct your .tbl file if you intend to run tbl2asn again after correcting any errors
- Another common error noted in the errorsummary.val output file is WARNING: SEQ_FEAT.CollidingGeneNames; this occurs when multiple genes share the same name, often as a result of the removal of the suffixes '_1', '_2', etc. recommended above; this can be safely disregarded
- Once a final .sqn file has been generated for each contig, run the .sqn files through the consistency checker on the NCBI website, which will report any problematic sequence overlaps and possible pseudogenes
- A common issue here is overlapping genes; if you wish to look at these manually, BLASTx the sequence encompassing both genes, although the consistency checker also does this
- Any issues highlighted by this checker and not addressed will be brought up by NCBI staff when submitting the genome
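When there are many contigs, sorting the errorsummary.val entries by severity makes it easier to see what must be fixed before resubmitting. This is a rough sketch that assumes each line has the shape "count SEVERITY: code" (as in the WARNING: SEQ_FEAT.CollidingGeneNames example above); that layout and the function name are assumptions, so check them against your own output file.

```python
import re

# Assumed line shape: "<count> <SEVERITY>: <code>",
# e.g. "3 WARNING: SEQ_FEAT.CollidingGeneNames".
LINE_PATTERN = re.compile(r"^\s*(\d+)\s+([A-Za-z]+):\s+(\S+)")
BLOCKING = {"FATAL", "ERROR"}  # severities that must be corrected


def summarise_validation(text):
    """Return (blocking, other) lists of (count, severity, code) tuples."""
    blocking, other = [], []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed shape
        entry = (int(match.group(1)), match.group(2).upper(), match.group(3))
        if entry[1] in BLOCKING:
            blocking.append(entry)
        else:
            other.append(entry)
    return blocking, other
```

An empty blocking list suggests that only warnings remain, which can be reviewed and disregarded if appropriate.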
Step 5 - Submitting the assembly and annotation
Follow the steps on the WGS submission page
- Click New submission
- Submission type - Select either single genome or multiple/batch genome submission
- 1 - Submitter
- Enter your contact details - name, email address and postal address
- 2 - General info
- Give the BioProject and BioSample reference numbers relevant to the project
- Specify when you want the project information to be released - immediately or on a specified date
- If you didn't include the structured comment mentioned in the previous section, you can include genome assembly metadata here instead
- Answer a handful of yes/no questions
- Did your sample include the full genome?
- Is this the final version?
- Is it a de novo assembly?
- Is it an update of existing submission?
- Name the submission (e.g. by giving the species and strain)
- Enter any private comments to NCBI staff regarding the submission (internal use by NCBI only)
- 3 - Source
- Give the name and address of the PI from whom the bacteria and/or source DNA can be obtained
- Specify whether or not you would like NCBI to annotate the genome on your behalf (even if you select 'No', a version which they annotate appears to become available on the NCBI ftp site anyway)
- 4 - Files
- Choose from one of the following options regarding your assembly:
- Each chromosome is in a single sequence and there are no extra sequences
- One or more chromosomes are still in multiple pieces and/or some sequences are not assembled into chromosomes
- We are submitting just the AGP file(s) for a genome assembly; the components of the AGP file are already in GenBank
- Select the file type you will be uploading - .sqn or fasta
- Upload each of the .sqn/fasta files for your assembly
- Files larger than 2GB require the Aspera Connect plugin (which is now incompatible with Firefox)
- (Gaps)
- (If there are any Ns in the sequence, you will be asked for additional information regarding these)
- 5 - Assignment
- Answer whether any sequences belong to a plasmid, and whether the organism has only one chromosome
- Specify which of the uploaded sequences are chromosomes and which are plasmids
- Note whether each replicon is complete and/or circular
- 6 - Overview
- Check details prior to submission
NCBI will email any queries they have, or any corrections that need to be made; otherwise, you should receive the accession number(s) in a few days
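Because the Files step asks for extra information when a sequence contains Ns, it can be useful to know in advance where any such runs are. A minimal sketch, assuming the sequence is already loaded as a string; the function name is hypothetical.

```python
import re


def find_n_runs(sequence):
    """Return (start, end, length) for each run of N/n, 1-based and inclusive."""
    return [
        (match.start() + 1, match.end(), match.end() - match.start())
        for match in re.finditer(r"[Nn]+", sequence)
    ]
```

An empty list means the sequence contains no Ns.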
Step 6 - Submitting raw reads to SRA
Obtaining raw read data
- Raw read data is saved on Albiorix
- For example, SMRT data is saved at
/home/smrtanalysis/userdata/inputs_dropbox/path_to_sample/run1/.../Analysis_Results/
- Going any deeper into the file structure than inputs_dropbox requires sudo rights
- Copy the four required data files - one bas.h5 file and three bax.h5 files - from the relevant directory into a directory for which you have normal permissions
- Change the permissions for these files using
chmod 755 filenames
(chmod 755 *.h5 ought to work)
- Make a note of the full filenames, then tarball and zip the files together
- Download either a tab-delimited file or an Excel spreadsheet, fill in the required metadata for the submission, and save the file as a tab-delimited .txt/.tsv file (if you downloaded the Excel spreadsheet version, save only the second sheet)
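The permission change and tarball steps can also be done in one go from Python. This is a sketch only, assuming the .h5 files have already been copied into a directory you own; the function name and archive name are hypothetical.

```python
import os
import tarfile
from pathlib import Path


def bundle_h5_files(source_dir, archive_path):
    """Set 755 permissions on each .h5 file in source_dir and gzip-tar them.

    Returns the names of the files that were archived.
    """
    h5_files = sorted(Path(source_dir).glob("*.h5"))
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in h5_files:
            os.chmod(path, 0o755)  # same effect as chmod 755
            archive.add(path, arcname=path.name)  # store without the directory prefix
    return [path.name for path in h5_files]
```

For example, bundle_h5_files("reads_copy", "reads.tar.gz") returns the archived filenames, which is a convenient place to note them down for the metadata file.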
Submitting raw read data
Follow the steps on the Sequence Read Archive (SRA) submission page
- On the main SRA submission page, before clicking New Submission, click the FTP upload dropdown; this will give you instructions on how and where to upload your raw reads via FTP
- Click New submission
- 1 - Submitter
- Enter your contact details - name, email address and postal address
- 2 - General info
- Specify the BioProject reference number, and state whether you need to create a new BioSample for the submission
- Specify when you want the project information to be released - immediately or on a specified date
- 3 - SRA Metadata
- Upload the metadata file (tab-delimited file or Excel spreadsheet) completed in the previous section
- 4 - Files
- Files can be uploaded in one of two ways:
- Directly via HTTP or the Aspera Connect plug-in
- Via FTP, as mentioned at the beginning of this section
- 5 - Overview
- Check details prior to submission
Step 7 - Submitting methylation data
Running methylation analysis
- In SMRT Portal, run a job using the protocol RS_Modification_and_Motif_Analysis.1
- Select the relevant SMRT Cell dataset, and the final assembly reference for the sample
- Download modifications.csv.gz, modifications.gff.gz, motif_summary.csv and motifs.gff.gz, and unzip as required
- These can be downloaded either directly from SMRT Portal, or from Albiorix at /home/smrtanalysis/userdata/jobs/path_to_job/data/
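If several .gz files need unzipping, a short script avoids doing each one by hand. A minimal sketch, assuming the downloaded files sit together in one directory; the function name is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(directory):
    """Decompress every .gz file in `directory`, keeping the originals.

    Returns the paths of the decompressed files.
    """
    outputs = []
    for gz_path in sorted(Path(directory).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # modifications.csv.gz -> modifications.csv
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        outputs.append(out_path)
    return outputs
```

Files that are not gzipped, such as motif_summary.csv, are left untouched.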
Submitting analysis results
Follow the steps on the Supplementary Files submission page
- Click New submission
- 1 - Submitter
- Enter your contact details - name, email address and postal address
- 2 - Data Type
- Select the type of supplementary data being uploaded
- BioNano Maps
- Beta-lactamase gene
- PacBio methylation data (this is the option to which the rest of this tutorial refers)
- 3 - General Info
- Specify the BioProject and BioSample reference number(s), and the genome accession number(s)
- Specify when you want the project information to be released - immediately or on a specified date
- 4 - Files
- Upload the four methylation data files mentioned above
- If there are issues with uploading this data, the upload can be done from Albiorix instead
ssh -X -Y [email protected]
firefox &
- Log into NCBI in the new Albiorix Firefox window, and return to this step
- 5 - Overview
- Check details prior to submission