EGA data upload - McGranahanLab/Guidebook GitHub Wiki

Introduction

At the time of writing (early 2024), the information provided by the EGA is incomplete or imprecise. The process may have changed since then, so check the official website first.

Getting the EGA ID:

To obtain an EGA ID, the whole project needs to be finalised and validated, including the file upload. It is no longer possible to get an ID for an empty project and upload the files later.

Uploading files:

  1. Ask cluster staff about the appropriate node. On the CS cluster, files should be uploaded from the node large (ssh gamble; ssh large), run directly rather than through qsub.
  2. Connect to the EGA Inbox via sftp: sftp [email protected]
  3. When prompted, log in with the credentials stored in /nemo/lab/swantonc/working/_SHARE/EGA_UPLOAD/submission_login.json
  4. Run cd to-encrypt: files placed in this directory are encrypted automatically.
  5. Use put to upload files, e.g. put /path/to/fastq/files/*.fastq.gz
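
For many files, typing the put commands by hand is error-prone. The following is a minimal sketch, not part of the official EGA tooling, that writes the commands from steps 4 and 5 into a text file. The function name build_sftp_batch and the example paths are hypothetical; only the to-encrypt directory and the put command come from the steps above.

```python
def build_sftp_batch(files, remote_dir="to-encrypt"):
    """Return sftp commands that upload `files` into `remote_dir`.

    `files` is an iterable of local paths (hypothetical examples below).
    """
    # Step 4: move into the directory where uploads are encrypted automatically.
    lines = [f"cd {remote_dir}"]
    for path in files:
        # Step 5: one put per file. Double quotes protect paths with spaces;
        # paths that themselves contain double quotes are not handled here.
        lines.append(f'put "{path}"')
    # End the sftp session once all uploads have been issued.
    lines.append("bye")
    return "\n".join(lines) + "\n"


# Hypothetical file names, for illustration only.
example = build_sftp_batch([
    "/path/to/fastq/files/sample1_R1.fastq.gz",
    "/path/to/fastq/files/sample1_R2.fastq.gz",
])
print(example)
```

The resulting text can be saved to a file and passed to sftp. Note that sftp -b runs in batch mode, which disables interactive password prompts, so it only works with non-interactive authentication. With password login, the commands would have to be supplied another way, for example by pasting them into the interactive session.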

Creating the project:

  1. Log in to the submitter portal (https://submission.ega-archive.org/) with the same credentials.
  2. Create a new submission and fill out the first sections: Info, Studies, Samples, Experiment, Runs.
  3. Once you have linked the files (by creating runs), their status will change from 'Files in Inbox' to 'Files in Processing' and finally to 'Files Ingested'. The status 'Files Ingested' means that the files are correctly linked to the metadata. Ingested files cannot be re-linked, so you should delete them from the SFTP inbox.
  4. For large submissions (above 12 TB) you need to create runs in batches of 10-15 TB each (the website says the limit is 12 TB, but the Helpdesk told me that up to 15 TB should be OK). So: upload files, create runs for these files, delete them from the inbox after ingestion, and repeat.
  5. Once you have all your files associated with a run, create the dataset, add all the registered runs and complete the submission.
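
The batching in step 4 can be planned before uploading. The sketch below greedily groups files into batches whose total size stays within a limit. The function name plan_batches, the decimal definition of a terabyte, and the 12 TB default are assumptions for illustration; adjust the limit to whatever the EGA currently allows.

```python
TB = 10**12  # assumption: decimal terabytes


def plan_batches(file_sizes, limit_bytes=12 * TB):
    """Group (name, size_in_bytes) pairs into consecutive batches.

    Files are taken in the given order; a new batch is started whenever
    adding the next file would push the current batch over `limit_bytes`.
    """
    batches = []
    current, current_size = [], 0
    for name, size in file_sizes:
        if size > limit_bytes:
            raise ValueError(f"{name} alone exceeds the batch limit")
        if current and current_size + size > limit_bytes:
            # Close the current batch: upload it, create runs, wait for
            # ingestion, delete from the inbox, then continue with the next.
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch corresponds to one round of the cycle: upload, create runs, wait for 'Files Ingested', delete from the inbox. The greedy grouping keeps the input order and does not try to minimise the number of batches.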

Modifying release date:

To change the release date of the dataset:

  1. Go to My Submission.
  2. Search for your project by its EGA number (e.g. EGA50000000301) and click on Edit (a small orange icon that only appears when you point at the record containing the dataset).
  3. Click on Dataset and modify the title or description in order to reopen the submission. Add "Release date updated" or a similar statement to the description.
  4. Once the submission has the status "open", the finalise button will appear, and you will be able to finalise the submission and choose a different expected release date.