Submitting 16S rRNA and ITS gene sequence data to NCBI's Sequence Read Archive - MariaAlvBla/NCBI-Tutorial GitHub Wiki
This tutorial provides step-by-step instructions for depositing 16S rRNA and ITS region amplicon sequence data at the National Center for Biotechnology Information (NCBI).
- Registering to NCBI
- Accessing the Sequence Read Archive (SRA)
- Submitting data to SRA through the Submission Wizard
- Accessing an unfinished submission
- Processing of the submission
- Changing a submission
- Downloading data
To register with NCBI, follow these steps:
- Access NCBI's homepage.
- Click Log in.

- A menu with several login options will be displayed. You can choose whichever you prefer for setting up your account.

To access the Sequence Read Archive (SRA), follow these steps:
- While logged in, access NCBI's homepage.
- Click Submit.

- The main page of the Submission Portal will be displayed. For the Datathon, write 16S rRNA in the search bar and click SRA.
**You should also write 16S rRNA if you are depositing ITS sequences. If you write ITS in the search bar, it will lead you to the GenBank portal, which is not suitable for submitting unassembled sequence reads.**

- A webpage about the Sequence Read Archive (SRA) will be displayed. Click Submit.

- Donor consent is usually necessary if the data comes from a human study.
- Each upload must be kept under 5 TB. If you have more, split the upload across multiple submissions.
- Submissions can be linked to the same BioProject to ensure all data are searchable with a single accession code.
- Every fastq file should be less than 100 GB in size. If compressed files are larger than 100 GB, please split them before submission.
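The size limits above can be checked locally before starting a submission. The following is a minimal sketch, assuming the limits are meant in decimal units (1 GB = 10^9 bytes); the function names and the `*.fastq*` pattern are illustrative and not part of any NCBI tool.

```python
from pathlib import Path

# Limits quoted in the notes above, interpreted as decimal units (an assumption).
MAX_FILE_BYTES = 100 * 10**9    # each file should be under 100 GB
MAX_UPLOAD_BYTES = 5 * 10**12   # each upload should be under 5 TB


def check_sizes(sizes):
    """Check a {file name: size in bytes} mapping against both limits.

    Returns (oversized_files, total_bytes, total_is_ok).
    """
    oversized = sorted(name for name, size in sizes.items() if size >= MAX_FILE_BYTES)
    total = sum(sizes.values())
    return oversized, total, total < MAX_UPLOAD_BYTES


def sizes_in_folder(folder, pattern="*.fastq*"):
    """Collect the sizes of the files in `folder` whose names match `pattern`."""
    return {p.name: p.stat().st_size for p in Path(folder).glob(pattern) if p.is_file()}
```

For example, `check_sizes(sizes_in_folder("reads"))` would report any file in a hypothetical `reads` folder that needs to be split before upload.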
This process requires several steps. To save your progress, click Continue. You can review or make changes to your previous steps during submission by clicking on the preceding tabs.
At any point after having saved your progress, you can leave NCBI and continue the process of submission later. If, however, you click the Submit button at the last step, making changes will require additional steps.
When saving your progress, you may get Error or Warning messages. Error messages describe the problem and suggest a solution; you must correct the problem before you can move to the next step of your submission. Warning messages, on the other hand, alert you to a possible mistake and do not block you from continuing your submission.

Here, the data archiver will be asked to include professional information. We recommend using your institutional e-mail and writing the information of the institution you work for.

The BioProject represents the research project from which the dataset originated. The information supplied in the BioSample provides context to your experimental data. Every metagenome, time point, tissue type, or treatment type must have its own BioSample, but technical replicates are not unique BioSamples. If the data you submit is not linked to any previously submitted data, select No for the BioProject and BioSample categories in this step.
Note: most often, each sample will also be its own BioSample. If the samples are technical replicates, that is, several sequencing runs of the same DNA, they belong to the same BioSample. If unsure, it is best to have each sample be a separate BioSample.

Depending on your answers at this step, the next steps would follow one of these pathways:

The default release date is Release immediately following processing, but you can select a specific date for releasing your data. If you don’t know the exact date, you can change it after finishing the submission by clicking on the Manage tab at the Submission Portal.

In the Public description, provide the information that best describes your research; it will become the description of your BioProject. If you have an abstract or summary of your research project, add it here. We also recommend that, in the URL field, you add the DOI link to any of your publications related to this data.
In this step, you would usually select the Package that best fits the nature of your BioSample. According to your chosen Package, the Submission Portal would supply you with a specific attribute table for the next step. For the Datathon, we have created a customized table to homogenize the spelling of the data entered in the tables. Our table was created from the table NCBI provides when the specific settings shown in the following picture are selected. First, select the package MIMARKS Survey related; then, in the drop-down menu, select the sample type soil.

If your data doesn't come from soil and you consider that our custom table lacks some categories necessary to describe your type of data, you can select the BioSample type that best fits your data. Then compare the columns of NCBI's table with those of the Datathon's custom table, and add the columns that are mandatory for the Datathon but missing from NCBI's table. The mandatory columns are marked in green and orange in our custom table.
When you do this, please write only the "accepted values" specified at the Key tab of the custom table in these mandatory columns. Use the exact spelling. This way, it's possible to maintain a homogenized format for all the data shared in the Datathon.
This step provides contextual information about your samples.

For the Datathon, select Uploading a file using Excel format and use the following customized Excel table:
MIMARKS.survey.soil.5.0_Dathaton.xlsm *This table contains Excel macros, which must be enabled manually. To do so, before opening the table, open the file's Properties and, in the General tab, tick the Unblock box in the Security section.
Please read the instructions included in the Excel file carefully before filling in the table. Note that to upload the table to NCBI, you must first delete the sheet named "Key" and then save the file in .xlsx format. When exporting the Excel sheet, it is not necessary to delete the rows with instructions.
The sample_name you give each sample in the attribute table will be again used at the SRA metadata table to link the sequence data and metadata. The sample names must be identical in both Excel files to be linked together.
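Because the link between the two tables depends on exact name matches, it can help to compare the two `sample_name` columns before uploading. The sketch below assumes the names have already been read from each spreadsheet into plain Python lists; `compare_sample_names` is an illustrative helper, not an NCBI function.

```python
def compare_sample_names(biosample_names, sra_names):
    """Return the sample names that appear in only one of the two tables.

    The comparison is exact (case- and whitespace-sensitive), because the
    names must be identical in both files to be linked.
    """
    biosample_set, sra_set = set(biosample_names), set(sra_names)
    return {
        "only_in_biosample_table": sorted(biosample_set - sra_set),
        "only_in_sra_table": sorted(sra_set - biosample_set),
    }
```

Two empty lists in the result mean that every sample in one table has a counterpart in the other.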
If you want to include your data in MiCoDA's database, please send the version of MIMARKS.survey.soil.5.0_Dathaton.xlsm that you submitted to NCBI to [email protected]. Before sending it, rename the file to include the last names of the first three authors of the data, in the following manner: "Last name author 1_Last name author 2_Last name author 3_MIMARKS.survey.soil.5.0_Dathaton.xlsm". In the email, include the full name of each author and their contact emails.
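The renaming convention above can be sketched as a small helper. This is only an illustration of the pattern described; the function name and the example author names are hypothetical.

```python
def micoda_filename(last_names, original_name):
    """Prefix `original_name` with the last names of the first three authors.

    Only the first three names are used, joined with underscores as in the
    pattern "Last name author 1_Last name author 2_Last name author 3_<file>".
    """
    return "_".join(list(last_names[:3]) + [original_name])
```

For instance, `micoda_filename(["Garcia", "Lopez", "Smith"], "MIMARKS.survey.soil.5.0_Dathaton.xlsm")` yields a name that starts with `Garcia_Lopez_Smith_`.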
Error: Multiple BioSamples cannot have identical attributes
Problem
After filling out values for attributes provided in the template, your individual BioSamples are not distinguishable by at least one or a combination of attributes.
Solution
Make sure the combined value of all attributes is unique for each biological sample. Note that sample name, sample title, and description are not included in this uniqueness check. If necessary, you can add new columns that allow you to differentiate the samples. If this problem arises because of biological replicates, add a replicate column to the sheet and record the replicate numbers to differentiate them. If, on the other hand, technical replicates are involved, they go in a single BioSample, and the replicates are placed in the same row of the table used in the SRA metadata step.
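The uniqueness rule described in this solution can be reproduced locally before uploading. The sketch below assumes each table row has been loaded as a dictionary and that the excluded columns are headed `sample_name`, `sample_title`, and `description`; those header names and the helper itself are assumptions for illustration, not NCBI's actual validation code.

```python
from collections import defaultdict

# Columns left out of the uniqueness check, as described above.
# The exact header spellings are assumptions about the template.
IGNORED_COLUMNS = {"sample_name", "sample_title", "description"}


def find_indistinguishable(rows):
    """Group sample names whose remaining attributes are all identical.

    `rows` is a list of dicts, one per BioSample. Returns a list of groups
    (lists of sample names) that share exactly the same attribute values.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k not in IGNORED_COLUMNS))
        groups[key].append(row["sample_name"])
    return [names for names in groups.values() if len(names) > 1]
```

An empty result means every sample is distinguishable; adding a `replicate` column with different values for biological replicates is one way to get there.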
Error: These samples have the same Sample Names and identical attributes
Problem
Less often, this Error may arise if you attempt to deposit sequences that have already been deposited to NCBI, and the Submission Portal prevents you from creating duplicates.
Solution
If you want to deposit new sequences to previously deposited BioSamples, go back to the General Info tab and select Yes to the question Did you already register BioSamples for this data set?. The SRA Submission Wizard will then skip the BioSample type and attributes steps. In the SRA metadata step, you need to use the table SRA_metadata_Datathon_EN_previous biosamples.xlsm. In this table, the accession codes of the BioSamples (SAMN#) should be added to the biosample_accession column to link the new sequence files to the pre-existing BioSamples.
*This table contains Excel macros, which must be enabled manually. To do so, before opening the table, open the file's Properties and, in the General tab, tick the Unblock box in the Security section.
To find the accession numbers of BioSamples you already registered, go to the Submission Portal and follow these steps:
- Click My submissions.

- Click objects in the BioSample section of the Project.

The SRA metadata describes the technical aspects of each sequencing experiment: the sequencing libraries, preparation techniques, and the names of the data files.

For the Datathon, select Uploading a file using Excel format and use the following customized Excel table:
SRA_metadata_Datathon.xlsm *This table contains Excel macros, which must be enabled manually. To do so, before opening the table, open the file's Properties and, in the General tab, tick the Unblock box in the Security section.
Please read the instructions included in the spreadsheet carefully before filling in the values. Only the tab-delimited text version of the SRA data sheet can be uploaded. If you are working in Excel, export that sheet as a tab-delimited file. If this doesn't work, try uploading the Excel document after deleting every sheet other than SRA data and saving the file as .xlsx.
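If the Excel export misbehaves, a tab-delimited file can also be written with Python's standard `csv` module. This is a minimal sketch that assumes the header and rows have already been extracted from the sheet; the column names in the usage note are illustrative only.

```python
import csv
import io


def rows_to_tsv(header, rows):
    """Serialize a header and data rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

The returned string can then be saved to a text file, for example with `open("sra_metadata.tsv", "w", newline="").write(rows_to_tsv(["sample_name", "filename"], [["s1", "s1_R1.fastq.gz"]]))`.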
When submitting the project, the SRA Experiment captures the unique combination of techniques used to sequence a particular sample (i.e., each combination of library + sequencing strategy + layout + instrument model represents a different experiment). If two of your sequences have exactly the same values in these columns, it is a clear indication that they are technical replicates and should be in the same row.
Note: most often, all samples within a project will be sequenced using the same combination of techniques and will thus belong to a single Experiment. The most common exception is when two gene regions (e.g., 16S rRNA and ITS) are sequenced for the same project.
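To spot rows that describe the same Experiment, the rows can be grouped by the combination of columns mentioned above. The sketch assumes the relevant headers are `library_ID`, `library_strategy`, `library_layout`, and `instrument_model`; check them against the spreadsheet you are actually using, since these names and the helper are assumptions.

```python
from collections import defaultdict

# Columns that together define an Experiment; header names are assumptions.
EXPERIMENT_COLUMNS = ("library_ID", "library_strategy", "library_layout", "instrument_model")


def rows_sharing_experiment(rows):
    """Return groups of row indices whose Experiment-defining columns are identical."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row.get(column) for column in EXPERIMENT_COLUMNS)
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]
```

Any group returned here is a candidate set of technical replicates that may belong in a single row.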
If you want to include your data in MiCoDA's database, please send the version of SRA_metadata_Datathon.xlsx that you submitted to NCBI to [email protected]. Before sending it, rename the file to include the last names of the first three authors of the data, in the following manner: "Last name author 1_Last name author 2_Last name author 3_SRA_metadata_Datathon.xlsm". In the email, include the full name of each author and their contact emails.
- Paired-end data files (forward/reverse) must be listed together in the same Run (in the case of the spreadsheet, in the same row) for the two files to be correctly processed as paired-end. All data files listed in a Run will be merged into a single sra archive file. Therefore, files from different samples or experiments should not be grouped in the same Run.
- File name(s) for the Experiments shouldn’t contain sensitive information because they will appear publicly on the Google and AWS clouds.
- Avoid submitting duplicated files because the Portal does not accept this, and such files may be suppressed without warning.
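A quick way to catch accidental duplicates is to count how often each file name is listed across all Runs. The sketch below assumes the spreadsheet rows have been loaded into a mapping from a row label to the file names on that row; it checks names only, not file contents, and the helper name is illustrative.

```python
from collections import Counter


def files_listed_more_than_once(runs):
    """Return the file names that appear more than once across all Runs.

    `runs` maps a Run label (one spreadsheet row) to its list of file names,
    e.g. the forward and reverse files of a paired-end Run.
    """
    counts = Counter(name for files in runs.values() for name in files)
    return sorted(name for name, count in counts.items() if count > 1)
```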
When submitting new BioSamples, during the BioSample attributes step, a specific name for each sample is assigned in the sample_name column of the MIMARKS.survey.soil.5.0_Dathaton.xlsm. In the SRA Metadata step, in the SRA_metadata_Dathaton.xlsm spreadsheet, the sample_name must match that given to the new BioSample to link the sequence data to the metadata correctly.
If, on the other hand, you want to deposit sequences to pre-existing BioSamples, in the SRA metadata step, you must use the table SRA_metadata_Datathon_EN_previous biosamples.xlsm. In this table, the accession codes of the BioSamples (SAMN#) should be added to the biosample_accession column to link the new sequence files to the pre-existing BioSamples, and thus include them in the new BioProject.
*All of these tables contain Excel macros, which must be enabled manually. To do so, before opening a table, open the file's Properties and, in the General tab, tick the Unblock box in the Security section.

In this step, you will upload the files listed in the SRA Metadata Excel file. Files can be compressed using gzip or bzip2 and may be submitted in a tar archive, but archiving or compressing your files is not required. Uploading zip files is not permitted. If you are uploading a tar archive, list each file name within the archive in the metadata table, not the archive's name.
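The rule about tar archives can be illustrated with Python's standard `tarfile` module, which can list the member names that would go into the metadata table. This is a sketch under the assumption that member names should be listed exactly as stored in the archive; `names_for_metadata` is a hypothetical helper.

```python
import tarfile


def names_for_metadata(path):
    """Return the file names to list in the SRA metadata for the file at `path`.

    For a tar archive, the names of the files inside it are returned;
    for any other accepted file, the path itself is returned.
    """
    if path.lower().endswith(".zip"):
        raise ValueError("zip files are not permitted; use gzip, bzip2, or tar")
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            return [member.name for member in archive.getmembers() if member.isfile()]
    return [path]
```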

We recommend you use the Web browser upload via HTTP or Aspera Connect plugin option to upload the files unless you have more than 10 GB of data or more than 300 files to upload at once.
We recommend you select Autofinish submission once the files have been successfully uploaded. Take into consideration that depending on the size and number of files, uploading may take from several minutes to a few hours.
Don’t forget to press Continue to save your progress. Otherwise, you will have to upload the files again.
Warning: You uploaded one or more extra files that are not in your Metadata table
Problem
You have uploaded files not listed in your SRA Metadata template.
Solution
If you do not intend to include these files in your SRA submission, click Continue. All files not included in the SRA Metadata will be ignored. If you want these files in your SRA submission, return to the SRA Metadata step and update their names.
Error: Some files are missing. Upload missing files or fix metadata table
Problem
The program does not find all files listed in the SRA Metadata table in your submission folder.
Solution
Upload files that are reported missing. Also, check that the filenames are listed in your metadata table, and make sure that the file extensions (.fq, .fastq, .sff, etc.) match those of the files you want to upload. In the latter case, go back to the SRA Metadata tab, delete your metadata file and upload a new one with the correct filenames. Click Continue.
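Both the extra-files warning and the missing-files error come down to comparing two sets of names, which can be done locally before uploading. The sketch below assumes you have the list of file names from the metadata table and the list of files you intend to upload; the helper name is illustrative.

```python
def compare_upload(metadata_files, uploaded_files):
    """Compare the file names in the metadata table with the uploaded files.

    Names are compared exactly, including extensions such as .fq or .fastq.
    """
    listed, uploaded = set(metadata_files), set(uploaded_files)
    return {
        "missing": sorted(listed - uploaded),  # listed but not uploaded: blocks the submission
        "extra": sorted(uploaded - listed),    # uploaded but not listed: ignored if you continue
    }
```

A file uploaded as `s1.fq` but listed as `s1.fastq` shows up in both lists, which is a hint that only the extension in the metadata table needs fixing.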
Error: File is corrupted. Please re-upload the file...
Problem
This Error occurs either because you have corrupt files on your side or the files became corrupted during transfer.
Solution
Re-upload the files that were reported corrupt. For this, click the Fix button and follow the instructions. The filenames must be the same. Before re-uploading, check the files for integrity on your side. If the gzip utility reported an error, find and upload an uncorrupted version of this file before proceeding. If the file is OK, you can re-upload it.
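If the gzip command-line utility is not at hand, a similar integrity check can be sketched with Python's standard `gzip` module by reading the whole file and watching for decompression errors. This is an illustrative alternative to the check mentioned above, not the check NCBI runs.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at `path` decompresses completely without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so large files do not need to fit in memory.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Covers invalid headers, CRC mismatches, truncated and damaged streams.
        return False
    return True
```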
During this step, you can review your submission summary and ensure everything is correct. You can still return to and change any step of your submission at this stage by clicking on the corresponding tabs at the top.

Click Submit when you are sure everything is correct. After submitting, future changes to the BioProject are limited or can only be achieved by contacting NCBI's service desk.
If, on the other hand, you want to delete the full submission, click Delete submission. This is the last chance to delete the submission without emailing NCBI’s service desk.
To access an unfinished submission, follow these steps:
- While logged in, go to NCBI's homepage.
- Click Submit.
- In the Submission Portal, click on My submissions.

- Find the submission with the Unfinished Status that has the submission title or the submission ID (SUB#) you are looking for.

Once submitted, your submission will be queued for automatic processing, and you will likely get feedback within 24 hours.
If your submission is successfully registered, you will receive the following email.

The project number you have been given (PRJ#) is permanent and unique, but it will not be visible to other users until NCBI's staff has fully processed it. We kindly ask you to provide the project number for the Datathon's database.
After the submitted data has been processed, you will receive the following email.

Once the BioProject has been accepted, the following information will be displayed when someone searches for your project.


The BioSample accession (SAMN#) is the identifier of a specific BioSample. Clicking on Retrieve all samples from this project lets you see all the other BioSamples associated with the BioProject.

The marked elements are:
- Experiment (SRX#): identifier of instrument and library information of a specific sample (SRS#).
- Study (SRP#): identifier of a study within a BioProject.
- Sample (SRS#): identifier of a sample of sequence data.
- Run (SRR#): identifier of the data file(s) derived from sequencing a library described by the associated Experiment.
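The accession prefixes used throughout this tutorial can be told apart with a small lookup. The sketch below covers only the prefixes mentioned in this document (PRJNA, SAMN, SRP, SRS, SRX, SRR, SUB) and assumes each is followed by digits only; it is not a complete description of NCBI accession formats.

```python
import re

# Prefixes mentioned in this tutorial and the record type each one identifies.
ACCESSION_KINDS = {
    "PRJNA": "BioProject",
    "SAMN": "BioSample",
    "SRP": "Study",
    "SRS": "Sample",
    "SRX": "Experiment",
    "SRR": "Run",
    "SUB": "Submission",
}
_ACCESSION_PATTERN = re.compile(r"^(PRJNA|SAMN|SRP|SRS|SRX|SRR|SUB)(\d+)$")


def accession_kind(accession):
    """Return the record type for an accession, or None if the prefix is not recognized."""
    match = _ACCESSION_PATTERN.match(accession.strip().upper())
    return ACCESSION_KINDS[match.group(1)] if match else None
```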
To change certain elements of a submission that has already been submitted, follow these steps:
- While logged in, go to NCBI's homepage.
- Click Submit.
- In the Submission Portal, click on Manage data.

- Select the BioProject (PRJNA#) you want to update. You can filter it by BioSamples at the BioSample tab or Experiments at the SRA tab. The data shown by the Data Manager can't be edited with the last two filtering options.

The BioProject’s managing page allows you to:
- Edit fields that were written during the submission.
- Add information that was not written during the submission.
- Edit most fields of the SRA Metadata. You must check the boxes for the Experiments you want to modify first.

If you want to add more data to an existing BioProject or BioSample, create a new SRA submission and enter the accession number of the BioProject (PRJNA#) or the BioSample (SAMN#) at step 2. This will ensure that the new data is linked to the existing element.
If you want to change the attributes of, or withdraw, a BioProject or BioSample that has already been submitted (even if it has not yet been accepted), you must contact [email protected] or [email protected] for assistance with your BioProject or BioSample submission, respectively.
Once a Run is fully loaded, its files cannot be replaced and its filenames cannot be changed. You will have to submit the new files in a separate submission using the existing BioProject and BioSample accessions and request withdrawal of the Run containing the old files.
As we have seen, NCBI supports the inclusion of detailed metadata when uploading data. However, downloading large amounts of sequence data from this portal can be somewhat tricky.
To download all sequence data corresponding to a specific project or accession number, we recommend using the portal of the European Bioinformatics Institute, part of the European Molecular Biology Laboratory (EMBL-EBI). To do so, follow these steps:
- Access EMBL-EBI's homepage
- Write the accession code of the BioProject, BioSample, or SRA you are interested in, and click Search.

- Scroll down and select the entry corresponding to the accession number you were looking for.
- A shortened version of the metadata will be shown, as well as the files belonging to this accession number. To access the file's download option, scroll to the right at the sections with the list of files.

- Click Download All if you want to download all the Runs (SRR#) under this accession number. If you want to download only specific Runs, select the corresponding check-boxes and click Download selected files.

To download large amounts of SRA data, we recommend you use the SRA Toolkit.
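As a rough sketch of what that looks like, the helper below builds the two SRA Toolkit commands commonly used to fetch a Run and convert it to FASTQ, `prefetch` and `fasterq-dump`. It only assembles the command lines and does not run them; it assumes the SRA Toolkit is installed and on your PATH, and the accession and output folder are placeholders.

```python
import shlex


def sra_toolkit_commands(run_accession, outdir="fastq"):
    """Build the commands to download one Run and convert it to FASTQ files."""
    return [
        # Download the Run's .sra file into the local SRA Toolkit cache.
        ["prefetch", run_accession],
        # Convert it to FASTQ, writing paired reads to separate files.
        ["fasterq-dump", run_accession, "--outdir", outdir, "--split-files"],
    ]


def as_shell(commands):
    """Render the commands as shell lines that can be pasted into a terminal."""
    return "\n".join(shlex.join(command) for command in commands)
```

Calling `print(as_shell(sra_toolkit_commands("SRR0000001")))` prints the two lines for a placeholder accession; looping over a list of Run accessions extends this to a whole project.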