Submitting data on ENA - jsgounot/metagenomic-pipelines GitHub Wiki

Overview

This page is a guide to submitting data to ENA (which is synchronized with SRA). Data submission to ENA is usually done for reads, but you might need to submit your bins or MAGs as well. The first part of this guide covers reads; the second covers MAGs (and assumes you have already gone through read submission). You first need an ENA WebIn account. For CSB5 users, please ask for the team's identifiers.

Submissions made through Webin are represented using a number of different metadata objects. Before submitting data to ENA, it is important to familiarise yourself with the ENA metadata model and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.

For example, a publication is typically associated with a study (project), the sequenced source material is represented using samples, and sequencing experiment details are captured by the experiment object.

Note that data files are also submitted by associating them with metadata objects. Sequence read data is associated with run objects while other data files are associated with analysis objects. The full metadata model with relationships between the different types of objects is illustrated below.

ENA general guideline

Metadata Model

Study: A study (project) groups together data submitted to the archive and controls its release date. A study accession is typically used when citing data submitted to ENA. Note that all associated data and other objects are made public when the study is released.
Sample: A sample contains information about the sequenced source material. Samples are associated with checklists, which define the fields used to annotate the samples. Samples are always associated with a taxon.
Experiment: An experiment contains information about a sequencing experiment including library and instrument details.
Run: A run is part of an experiment and refers to data files containing sequence reads.
Analysis: An analysis contains secondary analysis results derived from sequence reads (e.g. a genome assembly).
Submission: A submission contains submission actions to be performed by the archive. A submission can add more objects to the archive, update already submitted objects, or make objects publicly available.
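
To make the relationships between these objects concrete, here is a minimal Python sketch. It is only an illustration of the model described above, not an ENA API: the class and field names are my own, and real objects carry many more attributes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Run:
    files: List[str]  # data files containing sequence reads

@dataclass
class Experiment:
    sample: str       # alias or accession of the sequenced sample
    instrument: str   # library and instrument details live here
    runs: List[Run] = field(default_factory=list)

@dataclass
class Study:
    title: str
    experiments: List[Experiment] = field(default_factory=list)

    def samples(self):
        """Samples are tied to a study only through its experiments."""
        return sorted({exp.sample for exp in self.experiments})
```

The samples() method reflects the point that a sample has no direct link to a study: the link only exists once an experiment points to both.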

Main portal

Submitting reads

Major steps

Register your study
Register your samples
Upload and submit your data (fastq or other)

Register your study

This is done through the WebIn portal and is pretty straightforward.

Register your samples

On the WebIn portal, ENA offers to generate a template for you to fill in. The easiest option is the minimum template (ERC000011; this link is only a description of the template, go to the WebIn submission step to download it). You will also need one taxonomic ID for each of your samples; use the tree viewer to find the most appropriate one. Read the specific guideline for this step, then fill out the template and submit it.

Note that just by registering your samples these will not be affiliated with a study or any data. The association of samples with a study happens in subsequent steps when you submit sequence data and point to your sample(s) from the experiment object(s).

Upload data (via ascp)

Before anything

You must upload data files into your private Webin file upload area at EMBL-EBI before you can submit the files through the Webin submission service. Here, upload means what it says: copying your files (most likely reads) to ENA's temporary space, while submit is the process of linking your file(s) to a sample and a project.

The most user-friendly approach is the Webin File Uploader, but it asks you to transfer each file manually. This tutorial shows you how to send fastq batches using aspera, which is less user-friendly but much more convenient if your data are on a server and/or if you have a lot of files to send at once.

ENA official documentation for aspera.

You will also need sequencing metadata for each of your read files, some of which are restricted to specific values. These include the sequencing platform and instrument, the insert size, and the library source, selection and strategy. Once saved, the reads are linked to the project and sample you submitted before.

Note that uploaded data do not stay forever on the FTP server: files left there too long (about 2 months) are removed. Fastq files are also removed automatically from the FTP area once the ENA submission is completed.

The data upload areas are provided as a temporary place in which data are held while in transit. As such, they are neither intended nor suitable for any longer-term storage of data. Such storage is provided in ENA itself. Once in ENA, data can be released immediately following submission or can be held confidential prior to analysis and literature publication if required.

We expect any given data file to remain in a data upload area for no longer than 2 months before the instruction is given by the user to submit the file. While we attempt to remind users of this policy at the 2 months time point we reserve the right to routinely delete any data files that persist in them for more than 2 months.

We place no absolute limit within the 2-month period on the total volume of user data that may exist in a data upload area at any one time and are keen to accommodate the largest submissions where possible.

Preparing data

It might be a good idea to work on symlinks rather than on the fastq files themselves, to avoid mistakes such as erasing the original files.

ln -s /full/path/of/your/fastqs/*.fastq.gz /temporary/directory/

Official file preparation guideline.

Create your md5 files (bash)

You should have the command md5sum.

for fname in /full/path/of/your/fastqs/*.fastq.gz
do
if [ ! -f "$fname.md5" ]
then
	echo "$fname"
	# md5sum prints "<hash>  <file>"; the array keeps the hash as its first element
	md5=($(md5sum "$fname"))
	echo "$md5" > "$fname.md5"
fi
done

This might take a while since md5sum needs to read each file entirely.
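
If you prefer to stay in Python, the same step can be sketched as below. This is a minimal sketch and not part of the original workflow: md5_of_file and write_md5_files are hypothetical helper names, and the directory is a placeholder. Like the bash loop, it stores only the hash in <file>.md5 and skips files that already have one.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_md5_files(directory, pattern="*.fastq.gz"):
    """Write <file>.md5 next to each matching file; return the files created."""
    created = []
    for fastq in sorted(Path(directory).glob(pattern)):
        md5_path = fastq.with_name(fastq.name + ".md5")
        if md5_path.exists():
            continue  # same behaviour as the bash loop: do not recompute
        md5_path.write_text(md5_of_file(fastq) + "\n")
        created.append(md5_path)
    return created
```

A call such as write_md5_files("/full/path/of/your/fastqs") would then replace the loop above.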

Creating files list

Create a text file containing the filename or full path of every fastq file and md5 file that you wish to upload:

ls -d /full/path/of/your/fastqs/*.fastq.gz > fileslist.txt
ls -d /full/path/of/your/md5/*.md5 >> fileslist.txt
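
The list can also be built in Python, which makes it easy to check that every fastq file has its md5 companion before uploading. This is a sketch under the assumption that each .md5 file sits next to its fastq file (as produced by the loop above); build_files_list is a hypothetical helper name.

```python
from pathlib import Path

def build_files_list(directory, out_path="fileslist.txt"):
    """List every *.fastq.gz file and its .md5 companion, one absolute path per line."""
    directory = Path(directory)
    fastqs = sorted(directory.glob("*.fastq.gz"))
    md5s = sorted(directory.glob("*.fastq.gz.md5"))
    # Refuse to continue if a fastq file has no md5 file next to it
    missing = [f.name for f in fastqs if not f.with_name(f.name + ".md5").exists()]
    if missing:
        raise FileNotFoundError(f"No .md5 file for: {', '.join(missing)}")
    # absolute() does not follow symlinks, so symlinked fastq files keep their names
    lines = [str(p.absolute()) for p in fastqs + md5s]
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines
```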

Installing ascp

Aspera ascp is a commercial file transfer protocol that may provide better transfer speeds than FTP. Download the aspera CLI here.

Upload data over ascp

Go to the folder where the files are. On some servers, you might need to start a screen session to keep a stable connection (this should not be needed on AWS or Ronin). Save the script below as ena_submit.exp and, if needed, install expect through apt.

The script is based on this. It was written for ascp version 3.* and does not work with version 4+.

#!/usr/bin/expect
 
set fofn [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass [lindex $argv 2]

set files [open $fofn]
set subs [read $files]

set direxist 0
set timeout -1
 
foreach line [split $subs \n] {
  if { "" != $line } {
    # Upload one file to the root of the Webin upload area
    spawn ascp -QT -l200M $line $dropbox@webin.ebi.ac.uk:.
    expect "Password:"
    send "$pass\r"
    expect eof
    wait
    sleep 5
  }
}

You can then run the script like this using your Webin identifiers.

expect path/to/ena_submit.exp fileslist.txt Webin-ID Webin-Password

Wait until the upload is complete.

Check your data

You can check that the files were uploaded correctly by connecting to the FTP server with your WebIn identifiers, using webin.ebi.ac.uk as the host name (with FileZilla, for example).
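
With many files, the check can be scripted instead of done by eye. The sketch below compares fileslist.txt with a remote listing. Both helper names are hypothetical, and the use of FTP over TLS (ftplib.FTP_TLS) is an assumption on my side: check the ENA documentation for the protocol the server currently accepts.

```python
import ftplib
from pathlib import Path

def list_remote_files(user, password, host="webin.ebi.ac.uk"):
    """Return file names in the upload area (assumes the server accepts FTP over TLS)."""
    with ftplib.FTP_TLS(host) as ftp:
        ftp.login(user, password)
        ftp.prot_p()  # also encrypt the data channel
        return ftp.nlst()

def missing_uploads(fileslist_path, remote_names):
    """Return names listed in fileslist.txt that are absent from the remote listing."""
    remote = set(remote_names)
    local = [Path(line.strip()).name
             for line in Path(fileslist_path).read_text().splitlines()
             if line.strip()]
    return [name for name in local if name not in remote]
```

Calling missing_uploads("fileslist.txt", list_remote_files("Webin-ID", "Webin-Password")) should return an empty list once everything is uploaded.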

Submit your reads

There are different ways to submit your reads. The easy way is to submit a table using the templates provided on the portal. Since this step requires linking all metadata together, it will need some tuning on your side.

You might need to download sample information available on the WebIn portal in Samples Report / Download all results (under the search box). You can also check the uploaded files which are not submitted yet and available on the WebIn system under Unsubmitted Files Report.
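
As an illustration of that tuning, the sketch below builds the file-related columns of such a table for paired reads. Several things here are assumptions: the column names (copy the exact header from the template you download), the <prefix>_1.fastq.gz / <prefix>_2.fastq.gz naming, and the sample_ids mapping, which you would build from the samples report. Constant columns such as platform or library strategy are left out and must be added to match the template.

```python
import csv
from pathlib import Path

# Assumed column names: copy the exact header from the template you downloaded.
COLUMNS = ["sample", "study", "forward_file_name", "forward_file_md5",
           "reverse_file_name", "reverse_file_md5"]

def read_md5(fastq):
    """Read the hash stored in '<fastq>.md5' (first whitespace-separated token)."""
    return Path(str(fastq) + ".md5").read_text().split()[0]

def build_rows(fastq_dir, sample_ids, study_id):
    """Build one row per read pair; sample_ids maps a file prefix to a sample accession."""
    rows = []
    for forward in sorted(Path(fastq_dir).glob("*_1.fastq.gz")):
        prefix = forward.name[: -len("_1.fastq.gz")]
        reverse = forward.with_name(prefix + "_2.fastq.gz")
        rows.append({
            "sample": sample_ids[prefix],
            "study": study_id,
            "forward_file_name": forward.name,
            "forward_file_md5": read_md5(forward),
            "reverse_file_name": reverse.name,
            "reverse_file_md5": read_md5(reverse),
        })
    return rows

def write_table(rows, out_path):
    """Write the rows as a tab-separated file with a header line."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```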

ENA will send a warning email (to the WebIn account's maintainers) if files are corrupt. In that case, re-upload the corrupted files and correct the md5 values (see this).

Once your files are processed (usually within one to a couple of days), they are directly available on the project ID webpage (if public). Congratulations, you did it!

Submitting assemblies (bins, MAGs or other)

Official documentation.

Submitting bins or MAGs requires the webin-cli tool. Note that ENA's definition of a MAG differs a bit from the usual one. In their dictionary, a MAG is a dereplicated genome from a metagenomic study which shows the highest level of quality and potential similarity with known genomes of a species (see here). You most likely want to upload binned metagenomes instead; you can find the official guide here. The procedure described below is for binned metagenomes.

Binned samples

The procedure is as follows:

Create your binned sample (with correct taxonomy) and submit the list
Prepare the fasta file with the linked metadata and submit the list

Samples creation

ENA requires you to create a sample for each of your bins, which acts as the link between your original sample (containing the reads) and your assembly. This sample contains the taxonomic information of your bin, the ENA ID of your original sample and several descriptive values (completeness, contamination, sample origin, ...). The complete checklist for a classic binned metagenome sample can be found here.

Note that the derived sample must also reference the environmental sample (the one you create for your reads) in its description like this: This sample represents a metagenomic bin from the metagenomic sample ERSXXXXX.

For the taxonomic ID, you most likely need to provide an environmental organism-level taxonomy, unless you were able to isolate the organism linked to your MAG. The identification must be as granular as possible, up to genus level, meaning that you should not specify the species. In practice, you register your sample like this: uncultured mag_genus_name sp. While the ENA guideline suggests you should be able to use something like uncultured family-rank-level sp. when the genus is missing, in my experience this is not accepted as a valid name by NCBI (while the genus-level form is). Therefore, if you don't have the genus name, you will most likely use uncultured bacterium. If you're not able to find your ID, you can check directly on NCBI.

To automatically check whether you submit a correct taxonomy, you can query the ENA API for taxonomic input. Here is an example of how to do it with python:

import requests

def search_ena_api(name):
    name = name.replace(' ', '%20')
    url = f'https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/{name}'
    return requests.get(url, headers={}, cookies={}, auth=()).json()
    
search_ena_api('uncultured Mitsuokella')

This will also provide you the taxonomic ID.

You can try to automate the process for all your bins like this:

'''
You have to rerun this cell multiple times to deal with the remote disconnection
'''

import time

ranks = ['division', 'phylum', 'class', 'order', 'family', 'genus'][::-1]
lru_cache = {}

def search_value(value):
    if not isinstance(value, str): 
        return None, None

    if '_' in value:
        value = value.split('_')[0]
      
    request = f'uncultured {value} sp.'
    
    if value in lru_cache:
        return request, lru_cache[value]    
    
    try:
        res = search_ena_api(request) or None
    except requests.exceptions.SSLError:
        time.sleep(2)
        res = search_ena_api(request) or None        
        
    lru_cache[value] = res
    return request, res 

# ---------------------------

association = {}

for idx, row in gtdb.iterrows():
    row = row.to_dict()
    name = row['name']
    
    print (name, end="\r")
    
    for rank in ranks:
        value = row[rank]
        request, res = search_value(value)
        if res is None: continue
            
        clean_res = [subres for subres in res if subres['displayName'] == request]
        if not clean_res: continue

        # Keep the most specific rank that matches exactly, then stop
        association[name] = clean_res[0]
        break
        
    if name not in association:
        association[name] = {
            'taxId': 77133,
            'scientificName': 'uncultured bacterium'
        }

print ({name: association[name] for name in list(association)[:5]})

In this case gtdb is a pandas dataframe with one column per rank. On Jupyter, put lru_cache into another cell and rerun the code until completion (the server sometimes returns an SSLError). For bins with no match, uncultured bacterium is used. As indicated before, this script only returned genus-level taxonomy when I used it.

Generation of the samples xml file

You then need to use all this information to generate the samples xml file. ENA is very strict during field validation (exact sentences, number formatting, ...). Here is a python template which should format all fields correctly:

import xml.etree.ElementTree as ET

def format_comcon(value):
    value = float(value)
    if not 0 <= value <= 100:
        raise Exception(f'Value not inside [0-100]? Value: {value}')
    if value == 100: return '100'
    return f'{value:.2f}'

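def add_attribute(sattr, tag, value, units=None):
    attr = ET.SubElement(sattr, 'SAMPLE_ATTRIBUTE')
    ET.SubElement(attr, 'TAG').text = tag
    ET.SubElement(attr, 'VALUE').text = value
    # Optional unit (e.g. '%' or 'DD'), written as a UNITS element
    if units is not None:
        ET.SubElement(attr, 'UNITS').text = units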

def xml_pretty_print(current, parent=None, index=-1, depth=0):
    for i, node in enumerate(current):
        xml_pretty_print(node, current, i, depth + 1)
    if parent is not None:
        if index == 0:
            parent.text = '\n' + ('\t' * depth)
        else:
            parent[index - 1].tail = '\n' + ('\t' * depth)
        if index == len(parent) - 1:
            current.tail = '\n' + ('\t' * (depth - 1))

# ----------------------------

quality_value = {
    'medium': 'Many fragments with little to no review of assembly other than reporting of standard assembly statistics',
    'high': 'Multiple fragments where gaps span repetitive regions. Presence of the 23S, 16S and 5S rRNA genes and at least 18 tRNAs'
}

root = ET.Element("SAMPLE_SET")

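for sample in samples_data:
    # assuming all sample information is in a nested dict samples_data,
    # keyed by sample; frep (used below) maps the bin name prefix to the
    # ENA accession of the original sample
    sample_data = samples_data[sample]
    name = sample_data['name']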
    
    sampleid = name.split('_')[0]
    sname = str(association[name]['scientificName'])
    sample = ET.SubElement(root, 'SAMPLE', alias=name)
    
    ET.SubElement(sample, 'TITLE').text = f'Human gut {name} {sname}'
    
    sname = ET.SubElement(sample, 'SAMPLE_NAME')
    ET.SubElement(sname, 'TAXON_ID').text = str(association[name]['taxId'])
    ET.SubElement(sname, 'SCIENTIFIC_NAME').text = str(association[name]['scientificName'])
    
    sample_derived = frep[name.split('_')[0]]
    description = f'This sample represents a metagenomic bin from the metagenomic sample {sample_derived}'
    ET.SubElement(sample, 'DESCRIPTION').text = description
    
    sattr = ET.SubElement(sample, 'SAMPLE_ATTRIBUTES')
    add_attribute(sattr, 'project name', 'Singapore Platinum Metagenome Project')
    add_attribute(sattr, 'sequencing method', 'Illumina HiSeq 4000;Oxford Nanopore MinION')
    add_attribute(sattr, 'assembly software', 'OPERA-MS;0.9.0')
    
    # Read the values of the current sample, not of the outer dict
    quality = quality_value[sample_data['CheckMStatus'].lower()]
    completeness = format_comcon(sample_data['Completness'])
    contamination = format_comcon(sample_data['Contamination'])
    
    add_attribute(sattr, 'completeness score', f'{completeness}', '%')
    add_attribute(sattr, 'completeness software', 'CheckM;1.04')
    add_attribute(sattr, 'contamination score', f'{contamination}', '%')
    add_attribute(sattr, 'binning software', 'metabat2;2.12.1')
    add_attribute(sattr, 'assembly quality', quality)
    add_attribute(sattr, 'investigation type', 'metagenome-assembled genome')
    add_attribute(sattr, 'binning parameters', 'coverage and kmer')
    add_attribute(sattr, 'taxonomic identity marker', 'multi markers approach')
    add_attribute(sattr, 'taxonomic classification', 'GTDBTk;1.4.1')
    add_attribute(sattr, 'isolation_source', 'human gut')
    add_attribute(sattr, 'collection date', '2008')
    add_attribute(sattr, 'geographic location (country and/or sea)', 'Singapore')
    add_attribute(sattr, 'geographic location (latitude)', '1.290270', 'DD')
    add_attribute(sattr, 'geographic location (longitude)', '103.851959', 'DD')
    add_attribute(sattr, 'broad-scale environmental context', 'digestive tract environment')
    add_attribute(sattr, 'local environmental context', 'human gut')
    add_attribute(sattr, 'environmental medium', 'human stool')
    add_attribute(sattr, 'sample derived from', sample_derived)
    add_attribute(sattr, 'metagenomic source', 'human gut metagenome')
    add_attribute(sattr, 'ENA-CHECKLIST', 'ERC000050')
    
# ----------------------------

xml_pretty_print(root)
tree = ET.ElementTree(root)
tree.write("samples.xml",)

Then create a generic submission xml file:

<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION>
   <ACTIONS>
      <ACTION>
         <ADD/>
      </ACTION>
   </ACTIONS>
</SUBMISSION>

And you're ready for the test submission (assuming the submission file above was saved as submission.xml):

curl -u username:password -F "SUBMISSION=@submission.xml" -F "SAMPLE=@samples.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"

If the results are good, you can do the real submission (again assuming the submission file is saved as submission.xml):

curl -u username:password -F "SUBMISSION=@submission.xml" -F "SAMPLE=@samples.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" >> samples_ena_submission.res.xml

Important: You need to keep the receipt, it contains the ENA samples ID.
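
To get the sample IDs out of the receipt, something like the sketch below should work. It assumes the receipt is a RECEIPT element with a success attribute and one SAMPLE element per sample carrying alias and accession attributes; check this against your own receipt. parse_receipt is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def parse_receipt(receipt_xml):
    """Return (success, {alias: accession}) from a receipt given as an XML string."""
    root = ET.fromstring(receipt_xml)
    success = root.get("success") == "true"
    accessions = {node.get("alias"): node.get("accession")
                  for node in root.iter("SAMPLE")}
    return success, accessions
```

The aliases are the bin names set in samples.xml, so the returned dict links each bin to the sample ID needed in its manifest. Note that the curl command above appends to the result file: if it holds more than one receipt, split it before parsing.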

You can also use the WebIn portal as described in the doc.

Submit manifest files

This is the last step. You first need to create a manifest file for each MAG following the guideline. Do not forget the fasta field, which holds the path of the fasta file (the file must be compressed). You also need to check that your sequences are valid for ENA (assembly names and sequence patterns) as described in the guideline. Once this is done, you can either validate and submit each sequence individually (see below) or try a bulk submission (I did not try this).
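
With thousands of bins you will want to generate the manifests with a script. Here is a minimal sketch, assuming a manifest is a plain text file with one field name and its value per line, and using the mags/{sample}.fa.manifest.txt naming of the pipeline below. write_manifest is a hypothetical helper; the field names and allowed values you pass in must be taken from the guideline.

```python
from pathlib import Path

def write_manifest(fasta_path, fields):
    """Write '<fasta name without .gz>.manifest.txt' next to the fasta file."""
    fasta_path = Path(fasta_path)
    if fasta_path.suffix != ".gz":
        raise ValueError(f"FASTA file must be compressed: {fasta_path}")
    manifest_name = fasta_path.name[: -len(".gz")] + ".manifest.txt"
    manifest_path = fasta_path.with_name(manifest_name)
    # One 'NAME<TAB>value' line per field, then the fasta path itself
    lines = [f"{name}\t{value}" for name, value in fields.items()]
    lines.append(f"FASTA\t{fasta_path}")
    manifest_path.write_text("\n".join(lines) + "\n")
    return manifest_path
```

The fields dict would typically carry the study ID, the bin sample ID taken from the receipt, and the assembly name, using the exact field names listed in the guideline.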

You first need to validate and then submit your sequence using the Webin-CLI tool based on Java. This process may take a while (around a day for my 4.5K sequences with 8 CPUs on Ronin). Here is a snakemake pipeline which could be used:

reports = glob_wildcards('mags/{sample}.fa.manifest.txt')

rule all:
	input:
		expand('mags/{sample}.fa.manifest.txt.log.validate.txt', sample=reports.sample)

rule validate:
	input:
		'mags/{sample}.fa.manifest.txt'
	output:
		'mags/{sample}.fa.manifest.txt.log.validate.txt'
	shell:
		'java -jar webin-cli-5.0.0.jar -username USERNAME -password PASSWORD -context genome -manifest {input} -validate > {output}'

rule submit:
	input:
		'mags/{sample}.fa.manifest.txt'
	output:
		'mags/{sample}.fa.manifest.txt.log.submit.txt'
	shell:
		'java -jar webin-cli-5.0.0.jar -username USERNAME -password PASSWORD -context genome -manifest {input} -submit > {output}'

Here, all your manifest files sit inside one directory and are named like this: mags/{sample}.fa.manifest.txt. Replace validate with submit in the input of the all rule if you want to submit. Sometimes one MAG fails because of a server issue, so I recommend using the snakemake --keep-going option and relaunching the pipeline once it has finished, to process the samples which failed. Once uploaded, your sequences are still displayed as private, and you should wait one or two days for the samples and assemblies to be automatically changed to public and become available.

You're finally done. You should receive a long email from ENA with all your submission IDs. Once updated on the ENA website, you should see your MAGs in the Related ENA records sections as Analysis.
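
Before relaunching, it can help to see which manifests still have no log. The sketch below lists them, assuming the file naming of the pipeline above; pending_manifests is a hypothetical helper name. It treats a missing or empty log as not done, which should match what is left behind when a job fails, but it does not read the log content.

```python
from pathlib import Path

def pending_manifests(directory="mags", step="submit"):
    """Return manifests whose '<manifest>.log.<step>.txt' file is missing or empty."""
    pending = []
    for manifest in sorted(Path(directory).glob("*.fa.manifest.txt")):
        log = manifest.with_name(f"{manifest.name}.log.{step}.txt")
        if not log.exists() or log.stat().st_size == 0:
            pending.append(manifest)
    return pending
```

Use step="validate" to check the validation logs instead.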