Submitting 16S rRNA gene sequence data to NCBI's Sequence Read Archive - MariaAlvBla/NCBI-Tutorial Wiki


This tutorial provides step-by-step instructions for depositing 16S rRNA gene amplicon sequence data with the National Center for Biotechnology Information (NCBI).


Table of contents


  1. Registering to NCBI
  2. Accessing the Sequence Read Archive (SRA)
  3. Submitting data to SRA through the Submission Wizard
    1. Aspects to consider before submitting data
    2. Submitting data
      1. Step 1. Submitter
      2. Step 2. General Information
      3. Step 3. Project (BioProject) information
      4. Step 4. BioSample type
      5. Step 5. BioSample attributes
        1. Possible Errors at this step
      6. Step 6. SRA Metadata
        1. Recommendations to avoid common errors when submitting SRA metadata
        2. Submitting in new BioSamples vs submitting to already existing ones
        3. Explanation of the elements of a public display at a single SRA Sample
      7. Step 7. Files
        1. Possible Errors or Warnings at this step
      8. Step 8. Review and Submit
  4. Accessing an unfinished submission
  5. Processing of the submission
    1. The Project is being reviewed by NCBI’s staff
    2. The Project has been accepted
      1. Public display and searchable elements of a BioSample
      2. Public display and searchable elements of an SRA Experiment
  6. Changing a submission
  7. Downloading data
    1. Downloading data corresponding to one accession number
    2. Downloading data corresponding to several accession numbers



Registering to NCBI


To register with NCBI, follow these steps:


  1. Access NCBI's homepage and click Log in.




  1. A menu with several login options will be displayed. You can choose whichever you prefer for setting up your account.




Accessing the Sequence Read Archive (SRA)


To access the Sequence Read Archive (SRA), follow these steps:


  1. While logged in, go to NCBI's homepage.


  1. Click Submit.




  1. The main page of the Submission Portal will be displayed. For the Datathon, type 16S rRNA in the search bar and click SRA.




  1. A webpage with information about the Sequence Read Archive (SRA) will be displayed. SRA specializes in high-throughput sequence data, including 16S rRNA gene sequence data. Click Submit.




Submitting data to SRA through the Submission Wizard


Aspects to consider before submitting data


  • If the data comes from a human study, donor consent is usually necessary.

  • Each upload must be kept under 5 TB. If you have more data, split the upload across multiple submissions.

  • Submissions can be linked to the same BioProject to ensure all data are searchable with a single accession code.

  • Each FASTQ file must be smaller than 100 GB. If compressed files are larger than 100 GB, split them before submission.
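
The size limits above can be checked locally before starting a submission. The following is a minimal sketch, not an NCBI tool: the function names are hypothetical, and it assumes that "GB" and "TB" mean decimal units (10^9 and 10^12 bytes).

```python
from pathlib import Path

# Limits quoted in the list above; decimal units are an assumption of this sketch.
MAX_FILE_BYTES = 100 * 10**9    # each file must be smaller than 100 GB
MAX_UPLOAD_BYTES = 5 * 10**12   # each upload must stay under 5 TB


def folder_sizes(folder):
    """Map each regular file directly inside `folder` to its size in bytes."""
    return {p.name: p.stat().st_size for p in Path(folder).iterdir() if p.is_file()}


def check_sizes(sizes, max_file=MAX_FILE_BYTES, max_total=MAX_UPLOAD_BYTES):
    """Return (oversized file names, total bytes, whether the total is under the limit)."""
    oversized = sorted(name for name, size in sizes.items() if size >= max_file)
    total = sum(sizes.values())
    return oversized, total, total < max_total
```

For example, `check_sizes(folder_sizes("reads"))` would report which files in a hypothetical `reads` folder need to be split and whether the folder fits in a single submission.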


Submitting data


This process requires several steps. To save your progress, click ***Continue***. You can review or make changes to your previous steps during submission by clicking on the preceding tabs.

At any point after having saved your progress, you can leave NCBI and continue the process of submission later. If, however, you click the Submit button at the last step, making changes will require additional steps.

You may get Error or Warning messages when saving your progress. An Error message describes a problem that must be corrected before you can move to the next step of your submission, and suggests a solution. A Warning message, on the other hand, alerts you to a possible mistake but does not block you from continuing your submission.


Step 1. Submitter




Here, the data archiver is asked for professional information. We recommend using your institutional e-mail address and entering the details of the institution you work for.


Step 2. General Information




The BioProject represents the research project from which the dataset originated. The information supplied in the BioSample provides context for your experimental data. Every metagenome, time point, tissue type, or treatment type must have its own BioSample, but biological and technical replicates are not unique BioSamples. If the data you are submitting are not linked to any previously submitted data, select No for the BioProject and BioSample categories in this step.

Note: most often, each sample will also be its own BioSample. If unsure, it is best to have each sample be a separate BioSample.




Depending on your answers at this step, the next steps will follow one of these pathways:




The default release date is Release immediately following processing, but you can select a specific release date for your data. If you do not know the exact date yet, you can change it even after finishing the submission by clicking the Manage tab in the Submission Portal.

Step 3. Project (BioProject) information




In the Public description, provide the information that best describes your research; it will become the description of your BioProject. If you have an abstract or summary of your research project, add it here. We also recommend adding, in the URL field, the DOI link to any of your publications related to this data.

Step 4. BioSample type


In this step, you will select a **Package** that best fits the nature of your BioSample. Based on the selected package, the Submission Portal will supply a customized **attribute table** for the [next step](#bioattributes) that best describes the context of your BioSamples.

Select the package MIMARKS Survey related. In the drop-down menu that appears, select the sample type that best describes your sample.




Step 5. BioSample attributes


This step provides contextual information about your samples.




For the Datathon, select Uploading a file using Excel format and use the following customized Excel table:

MIMARKS.survey.soil.5.0_Dathaton.xlsx


Please read the instructions included in the Excel file carefully before filling in the values. Remember that you can only upload the tab-delimited text version of the sheet **MIMARKS.survey.soil.5.0**. If working in Excel, export this sheet as a tab-delimited file.

The sample_name you give each sample in the attribute table will be used again in the SRA metadata table to link the sequence data and metadata. The sample names must be identical in both files for them to be linked.
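
Because the link between the two tables depends on exact matches of sample_name, it can help to compare the two exported tab-delimited files before uploading them. The sketch below is only an illustration with hypothetical function names. It assumes that comment lines start with `#` and that required columns may carry a leading `*` in the header; adjust both if your exported sheets differ.

```python
import csv
import io


def read_column(tsv_text, column):
    """Return the non-empty values of `column` from tab-delimited text.

    Assumptions: lines starting with '#' are comments, and a leading '*'
    on a header name only marks the column as required.
    """
    lines = [ln for ln in tsv_text.splitlines() if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    values = []
    for row in reader:
        cleaned = {
            k.lstrip("*").strip(): (v or "").strip()
            for k, v in row.items()
            if k is not None
        }
        if cleaned.get(column):
            values.append(cleaned[column])
    return values


def compare_sample_names(biosample_tsv, sra_tsv):
    """Return (names only in the BioSample table, names only in the SRA table)."""
    bio = set(read_column(biosample_tsv, "sample_name"))
    sra = set(read_column(sra_tsv, "sample_name"))
    return sorted(bio - sra), sorted(sra - bio)
```

Both returned lists should be empty when every sample in one table has a counterpart in the other.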


Possible Errors at this step

Error: Multiple BioSamples cannot have identical attributes


Problem

After filling out values for attributes provided in the template, your individual samples are not distinguishable by at least one or a combination of attributes.

Solution

Make sure the combined value of all attributes is unique for each biological sample. Note that sample name, sample title, and description are not included in this uniqueness check. If the problem arises because of biological replicates, add a replicate column to the sheet and record the replicate numbers to differentiate them.
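
The uniqueness rule described above can be approximated locally before uploading the attribute table. This is a rough sketch rather than the Portal's actual check: the function name is hypothetical, and the header spellings `sample_name`, `sample_title`, and `description` are assumptions that should be matched to your template.

```python
import csv
import io
from collections import defaultdict

# Columns left out of the comparison, following the text above.
# The exact header spellings are assumptions of this sketch.
IGNORED = {"sample_name", "sample_title", "description"}


def find_identical_samples(tsv_text, ignored=IGNORED):
    """Group sample names whose remaining attribute values are all identical."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        row = {
            k.lstrip("*").strip(): (v or "").strip()
            for k, v in row.items()
            if k is not None
        }
        # Two samples collide when every non-ignored attribute has the same value.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignored))
        groups[key].append(row.get("sample_name", ""))
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group lists samples that would need a distinguishing attribute, such as a replicate column.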


Error: Multiple BioSamples cannot have identical attributes


Problem

Less often, this error may arise if you are attempting to deposit sequences that have already been deposited to NCBI, and the Submission Portal is preventing you from creating duplicates.

Solution

If you want to include an existing BioSample in the new BioProject, go back to the General Info tab and select Yes to the question Did you already register BioSamples for this data set?. The SRA Submission Wizard will then skip the BioSample type and attributes steps.

If you are using the SRA_metadata_Datathon.xlsx, in the SRA metadata step, you need to change the name of the first column from sample_name to biosample_accession. Then you can add the existing BioSample's accession numbers (SAMN#) to link the new sequence files to the already existing BioSamples, and to include them in the new BioProject.
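
If the table has many rows, the column swap described above can be scripted on the tab-delimited export. The following is a minimal sketch with a hypothetical function name. It assumes that sample_name is the first column and that you have already looked up the SAMN# accession of every sample; the accessions in the usage note are placeholders.

```python
def link_to_existing_biosamples(tsv_text, accessions):
    """Rename the first column to biosample_accession and fill in SAMN# values.

    `accessions` maps each sample_name to its existing BioSample accession.
    A KeyError is raised if a sample has no accession in the mapping.
    """
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    if header[0] != "sample_name":
        raise ValueError("expected sample_name as the first column")
    header[0] = "biosample_accession"
    out = ["\t".join(header)]
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        cells = line.split("\t")
        cells[0] = accessions[cells[0]]
        out.append("\t".join(cells))
    return "\n".join(out) + "\n"
```

For example, calling it with `{"S1": "SAMN00000001"}` (a placeholder accession) replaces `S1` in the first column and leaves all other columns untouched.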

To find the accession numbers of BioSamples you have already registered, go to the Submission Portal and follow these steps:


  1. Click My submissions.




  1. Click objects in the BioSample section of the Project.



Step 6. SRA Metadata


The SRA metadata describes the technical aspects of each sequencing experiment: the sequencing libraries, preparation techniques, and the names of the data files.




For the Datathon, select Uploading a file using Excel format and use the following customized Excel table:

SRA_metadata_Datathon.xlsx


Please read the instructions included in the spreadsheet carefully before filling in the values. You can only upload the tab-delimited text file version of the spreadsheet **SRA data**. If working in Excel, export that spreadsheet as a tab-delimited file.

When the project is submitted, an SRA Experiment captures the unique combination of techniques used to sequence a particular sample (i.e., each combination of library + sequencing strategy + layout + instrument model represents a different Experiment). Note: most often, all samples within a project will be sequenced using the same combination of techniques, and will thus belong to a single Experiment. The most common exception is when two gene regions (e.g., 16S rRNA and ITS) are sequenced for the same project.


Recommendations to avoid common errors when submitting SRA metadata

  • Paired-end data files (forward/reverse) must be listed together in the same Run (in the spreadsheet, the same row) so that the two files are correctly processed as paired-end. All data files listed in a Run will be merged into a single SRA archive file. Therefore, files from different samples or experiments should not be grouped in the same Run.

  • File name(s) for the Experiments shouldn’t contain any sensitive information, because they will appear publicly on the Google and AWS clouds.

  • Avoid submitting duplicate files: the Portal does not accept them, and such files may be suppressed without warning.
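
Two of the recommendations above (paired files in the same row, no duplicate files) can be checked on the tab-delimited metadata file before upload. This is a hedged sketch with a hypothetical function name. The column names `library_layout`, `filename`, and `filename2`, and the layout value `paired`, are assumptions about the spreadsheet; adapt them to the headers you actually see.

```python
import csv
import io
from collections import Counter


def check_runs(tsv_text):
    """Return a list of human-readable problems found in an SRA metadata table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = []
    seen = Counter()
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(reader, start=2):
        files = [(row.get(col) or "").strip() for col in ("filename", "filename2")]
        files = [name for name in files if name]
        seen.update(files)
        layout = (row.get("library_layout") or "").strip().lower()
        if layout == "paired" and len(files) < 2:
            problems.append(f"row {line_no}: paired layout but only {len(files)} file(s) listed")
    problems += [f"{name} is listed {count} times" for name, count in seen.items() if count > 1]
    return problems
```

An empty list means that neither problem was detected; it does not guarantee that the Portal will accept the table.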


Submitting to new BioSamples vs submitting to already existing ones

When submitting new BioSamples, during the [BioSample attributes step](#bioattributes), a specific name for each sample is assigned in the **sample_name** column of the [**MIMARKS.survey.soil.5.0_Dathaton.xlsx** file](#bioattributes). In the SRA Metadata step, in the [**SRA_metadata_Dathaton.xlsx** spreadsheet](#metadata), the **sample_name** must match that given to the new BioSample, to correctly link the sequence data to the metadata.

Explanation of the elements of a public display for each Sample in the SRA



Step 7. Files


In this step, you will upload the files listed in the SRA Metadata Excel file. Files can be compressed using gzip or bzip2 and may be submitted in a tar archive, but archiving and/or compressing your files is not required. Uploading zip files is not permitted. If you are uploading a tar archive, list each file name within the archive, not the archive's name.
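
When a tar archive is used, the names to list in the metadata table are those of the files inside it. A small sketch using Python's standard `tarfile` module can print them; the function name is hypothetical, and the names are returned exactly as stored in the archive, including any folder prefix.

```python
import tarfile


def tar_member_names(archive_path):
    """Return the names of the regular files stored in a tar archive."""
    # Mode "r" lets tarfile detect gzip or bzip2 compression automatically.
    with tarfile.open(archive_path, "r") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())
```

For example, `tar_member_names("reads.tar")` would list the contents of a hypothetical archive called `reads.tar`.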




We recommend using the Web browser upload via HTTP or Aspera Connect plugin option unless you have more than 10 GB of data or more than 300 files to upload at once.

We recommend you select Autofinish submission once the files have been successfully uploaded. Take into consideration that depending on the size and number of files, uploading may take from several minutes to a few hours.

Don’t forget to click Continue to save your progress. Otherwise, you will have to upload the files again.

Possible Errors or Warnings at this step

Warning: You uploaded one or more extra files that are not in your Metadata table


Problem

You have uploaded files not listed in your SRA Metadata template.

Solution

If you do not intend to include these files in your SRA submission, click Continue. All files not included in the SRA Metadata will be ignored. If you intend to include these files in your SRA submission, return to the SRA Metadata step and update their names.


Error: Some files are missing. Upload missing files or fix metadata table


Problem

The Submission Portal cannot find all the files listed in the SRA Metadata table in your submission folder.

Solution

Upload the files reported as missing. Also, check that the filenames listed in your metadata table, including the file extensions (.fq, .fastq, .sff, etc.), exactly match those of the files you want to upload. If they do not match, go back to the SRA Metadata tab, delete your metadata file, and upload a new one with the correct filenames. Click Continue.
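
Both the extra-files Warning and the missing-files Error come down to a mismatch between the filenames in the metadata table and the files actually uploaded, so the comparison can be rehearsed locally. The sketch below uses hypothetical function names and assumes that every column holding a file name has a header starting with `filename`.

```python
import csv
import io


def listed_filenames(tsv_text):
    """Collect every non-empty value from columns whose header starts with 'filename'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    names = set()
    for row in reader:
        for column, value in row.items():
            if column and column.startswith("filename") and value and value.strip():
                names.add(value.strip())
    return names


def compare_with_upload(tsv_text, uploaded_names):
    """Return (listed but not uploaded, uploaded but not listed)."""
    listed = listed_filenames(tsv_text)
    uploaded = set(uploaded_names)
    return sorted(listed - uploaded), sorted(uploaded - listed)
```

The comparison is exact, so a file named `S1_R2.fq.gz` on disk does not match `S1_R2.fastq.gz` in the table.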


Error: File is corrupted. Please re-upload the file...


Problem

This Error occurs either because the files are corrupt on your side or because they became corrupted during transfer.

Solution

Re-upload the files that were reported as corrupt: click the Fix button and follow the instructions. The filenames must stay the same. Before re-uploading, check the integrity of the files on your side. If the gzip utility reports an error, find and upload an uncorrupted version of the file before proceeding. If the file is intact, you can re-upload it.
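
One way to check the integrity of a gzip-compressed file on your side is to decompress it completely and see whether an error is raised. The sketch below does this with Python's standard `gzip` module; the function name is hypothetical, and it only tests that the compressed stream is readable, not that the FASTQ content itself is valid.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in chunks so large files do not fill memory.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Raised for bad headers, failed checksums, or truncated files.
        return False
```

A file for which this returns False should be regenerated or copied again from its source before re-uploading.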

Step 8. Review and Submit


During this step, you can review your submission's summary and make sure that everything is correct. You can still return to and change any step of your submission at this stage by clicking on the corresponding tabs at the top.




Click Submit when you are sure everything is correct. After submitting, future changes to the BioProject are limited or can only be achieved by contacting NCBI's service desk.

If, on the other hand, you want to delete the whole submission, click Delete submission. This is the last chance to delete the submission without emailing NCBI’s service desk.


Accessing an unfinished submission


To access an unfinished submission, follow these steps:


  1. While logged in, go to NCBI's homepage.


  1. Click Submit.


  1. In the Submission Portal, click My submissions.




  1. Find the submission with the Unfinished status whose title or submission ID (SUB#) matches the one you are looking for.




Processing the submission


The Project is being reviewed by NCBI’s staff


Once submitted, your submission will be queued for processing, and you will likely get feedback within 24 hours. If submitting through the SRA Wizard, you will receive feedback from the Wizard first.

If your submission was successfully registered, you will receive the following email.




The project number you have been given (PRJ#) is permanent and unique, but it will not be visible to other users until NCBI's staff has fully processed it. We kindly ask you to provide the project number for the Datathon's database.

The Project has been accepted


After the submitted data has been processed, you will receive the following email.




Once the Project has been accepted, when someone searches for your project, the following information will be displayed.




Public display and searchable elements of a BioSample




The BioSample accession (SAMN#) identifies a specific BioSample. Clicking Retrieve all samples from this project lets you see all the other BioSamples associated with the BioProject.

Public display and searchable elements of an SRA Experiment




The marked elements are:

  • Experiment (SRX#): identifier of instrument and library information of a specific sample (SRS#).

  • Study (SRP#): identifier of a study within a BioProject.

  • Sample (SRS#): identifier of a sample of sequence data.

  • Run (SRR#): identifier of the data file(s) derived from sequencing a library described by the associated Experiment.
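
Since each kind of record has its own prefix, the type of an accession can be recognized from its first letters. The sketch below covers only the prefixes mentioned in this tutorial; the function name is hypothetical, and other prefixes exist that it deliberately does not handle.

```python
import re

# Prefixes taken from the identifiers described in this tutorial.
PREFIXES = {
    "PRJNA": "BioProject",
    "SAMN": "BioSample",
    "SRP": "SRA Study",
    "SRS": "SRA Sample",
    "SRX": "SRA Experiment",
    "SRR": "SRA Run",
    "SUB": "Submission (temporary, non-public)",
}

_PATTERN = re.compile(r"^(PRJNA|SAMN|SRP|SRS|SRX|SRR|SUB)(\d+)$")


def classify_accession(accession):
    """Return the record type for an accession, or None if it is not recognized."""
    match = _PATTERN.match(accession.strip().upper())
    return PREFIXES[match.group(1)] if match else None
```

This can be handy for sorting a mixed list of identifiers before searching for them or downloading the corresponding data.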



Changing a submission


To change a submission, follow these steps:


  1. While logged in, go to NCBI's homepage.


  1. Click Submit.


  1. In the Submission Portal, click Manage data.




  1. Select the BioProject (PRJNA#) you want to update. You can also filter by BioSamples in the BioSample tab or by Experiments in the SRA tab. Note that the data shown by the Data Manager under these other filtering options cannot be edited.



The BioProject’s managing page allows you to:

  • Edit fields that were filled in during the submission.

  • Add information that was not provided during the submission.

  • Edit most fields of the SRA Metadata. You must first check the boxes of the Experiments you want to modify.




If you want to add more data to an existing BioProject or BioSample, create a new SRA submission and enter the accession number of the BioProject (PRJNA#) or the BioSample (SAMN#) when asked. This ensures that the new data are linked to the existing BioProject.

If you want to change the attributes of, or withdraw, a BioProject or BioSample that has already been submitted (whether or not it has been accepted), contact [email protected] or [email protected], respectively, for assistance in updating your BioProject or BioSample submission.

A Submission represents a discrete act of depositing data (a transaction). The submission has a temporary non-public ID as a SUB#. You cannot add more data to a completed submission. To update a submission, contact [email protected].

After a Run is fully loaded, its files cannot be replaced and its filenames cannot be changed. You will have to submit new files in a separate submission using the existing BioProject and BioSample accessions, and request withdrawal of the Run containing the old files.


Downloading data


Downloading data corresponding to a project


As we have seen, NCBI supports the inclusion of exhaustive metadata when uploading data. To download all sequence data corresponding to a specific project or accession number, we recommend using the portal of the European Bioinformatics Institute, part of the European Molecular Biology Laboratory (EMBL-EBI). To do so, follow these steps:


  1. Access EMBL-EBI's homepage.


  1. Write the accession code of the BioProject, BioSample, or SRA you are interested in, and click Search.




  1. Scroll down and select the entry corresponding to the accession number you were looking for.


  1. A shortened version of the metadata will be shown, together with the files belonging to this accession number. To reach the download option for each file, scroll to the right in the section that lists the files.




  1. Click Download All if you want to download all the Runs (SRR#) under this accession number. If you want to download only specific Runs, select the corresponding check-boxes and click Download selected files.
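
The same file list can also be requested programmatically instead of through the web page. The sketch below is not part of the steps above: it assumes the ENA Portal API `filereport` endpoint and its `read_run` result with the `run_accession` and `fastq_ftp` fields, which you should confirm in the current ENA documentation. The function names are hypothetical, and the accession in the usage note is a placeholder.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API file report service.
FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build the URL of a tab-delimited file report for one accession."""
    query = {
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    }
    return f"{FILEREPORT}?{urlencode(query)}"


def fastq_links(report_tsv):
    """Map each run accession to its FASTQ links (assumed to be ';'-separated)."""
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    return {
        row["run_accession"]: [link for link in (row.get("fastq_ftp") or "").split(";") if link]
        for row in reader
    }
```

Opening `filereport_url("PRJNA000000")` in a browser, or fetching it with `urllib.request.urlopen`, should return a table that `fastq_links` can parse.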



Downloading data corresponding to several accession numbers


To download large amounts of SRA data, we recommend using the [**SRA Toolkit**](https://github.com/ncbi/sra-tools/wiki).
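
For several accession numbers, the Toolkit commands can be generated and run in a loop. The following is a minimal sketch that assumes the SRA Toolkit is installed and that its `prefetch` and `fasterq-dump` programs are on your PATH. The function names and the output folder are hypothetical, and the accessions in the usage note are placeholders.

```python
import shutil
import subprocess


def toolkit_commands(accessions, outdir="fastq"):
    """Build one prefetch and one fasterq-dump command per Run accession."""
    commands = []
    for accession in accessions:
        # prefetch downloads the Run; fasterq-dump converts it to FASTQ files.
        commands.append(["prefetch", accession])
        commands.append(["fasterq-dump", "--split-files", "--outdir", outdir, accession])
    return commands


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    if shutil.which("prefetch") is None or shutil.which("fasterq-dump") is None:
        raise RuntimeError("SRA Toolkit programs were not found on PATH")
    for command in commands:
        subprocess.run(command, check=True)
```

For example, `run_all(toolkit_commands(["SRR0000001", "SRR0000002"]))` would download both Runs into a folder called `fastq`.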