Collaborator data transfer

:truck: Data transfer instructions for collaborators

Hi there :wave:, thanks so much for collaborating with the Gibbons Lab! We are excited to work with you and hope this document will help both you and us navigate the common hurdles in getting started. In short, we will go through the following topics:

  1. ISB requirements for data sharing
  2. How to transfer large amounts of data to or from us
  3. Metadata and Protocols

ISB requirements for data sharing

The Institute for Systems Biology (ISB) requires that all unpublished data and material transfers are covered by a material transfer agreement (MTA). If you are receiving data from us, our legal department will initiate this and we will pass the agreement to your administrative department. If we are receiving data from you, this will have to be initiated by your institution and the MTA will be received by the ISB legal department.

How to transfer data

We provide an AWS S3 bucket where you can upload or download data for collaborations. You will not need an AWS account, and you will not have to pay any data transfer costs. This setup works for any amount of data, be it a few hundred MBs or several TBs :nerd_face:

You will receive a file called credentials from us, which sets up a private access profile for you to use. This file should be treated as private and not be shared :detective:

To get started you can do the following:

If you don't have an existing AWS CLI setup

Install the AWS CLI. The easiest way is to use conda if you already have it:

conda install -c conda-forge awscli

Alternatively, you can use any of the other supported installation methods.
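
For example, if you are on 64-bit Linux and don't use conda, the official bundled installer from AWS should work as well (this sketch uses the documented AWS CLI v2 installer URL; pick the matching installer for your platform):

# download and install AWS CLI v2 for Linux x86_64, then check the installed version
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version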

Create a folder called .aws in your home directory (including the dot) and copy the credentials file into it:

mkdir $HOME/.aws
cp /path/to/the/credentials $HOME/.aws
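
The AWS CLI expects this file to live at $HOME/.aws/credentials, so keep the name exactly credentials (no file extension). To double-check that it is in place, the following should print a section headed [gibbons_collab]:

cat $HOME/.aws/credentials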

If you do have an existing AWS CLI installation

Open the file $HOME/.aws/credentials and append the contents of the credentials file you received from us to the end. Afterwards it should look something like this:

[default]
aws_access_key_id = [random letters and strings]
aws_secret_access_key = [random letters and strings]

# there may be other profiles here

[gibbons_collab]
region = us-east-2
aws_access_key_id = [random letters and strings]
aws_secret_access_key = [random letters and strings]
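
If you use version 2 of the AWS CLI, you can also list the configured profiles to confirm the new one was picked up (this assumes the profile in the file we sent is named gibbons_collab, as shown above):

aws configure list-profiles

The output should include a line reading gibbons_collab.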

Check your setup

Check if everything worked:

aws s3 --profile gibbons_collab ls s3://gibbons-data-transfer

This should give you something like:

...
2024-04-23 09:05:03          0 if_you_see_this_it_worked
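
If you would rather not type --profile gibbons_collab for every command, you can export the standard AWS_PROFILE environment variable for your current shell session; this is regular AWS CLI behavior, not specific to our bucket:

export AWS_PROFILE=gibbons_collab
aws s3 ls s3://gibbons-data-transfer

Commands in that shell will then use the gibbons_collab profile by default.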

Copy files

You can now copy files to the bucket. Choose a descriptive project name to use in place of MY_PROJECT and copy the files:

aws s3 --profile gibbons_collab cp --recursive /path/to/MY_PROJECT s3://gibbons-data-transfer/MY_PROJECT
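
For very large or interrupted uploads, aws s3 sync can be a handy alternative to cp --recursive: it only transfers files that are missing or have changed in the destination, and --dryrun lets you preview what would be copied (both are standard AWS CLI options):

# preview first, then run again without --dryrun to actually copy
aws s3 --profile gibbons_collab sync /path/to/MY_PROJECT s3://gibbons-data-transfer/MY_PROJECT --dryrun
aws s3 --profile gibbons_collab sync /path/to/MY_PROJECT s3://gibbons-data-transfer/MY_PROJECT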

Or copy files from the bucket:

aws s3 --profile gibbons_collab cp --recursive s3://gibbons-data-transfer/MY_PROJECT/ /local/path/to/MY_PROJECT
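
To see what is in a project folder (or to verify that an upload finished), you can list its contents; --human-readable and --summarize are standard options of aws s3 ls:

aws s3 --profile gibbons_collab ls --recursive --human-readable --summarize s3://gibbons-data-transfer/MY_PROJECT/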

Make sure to only download or upload files covered by the MTA.

What should I upload?

Not everything here is set in stone, but this is what has worked for us. If you receive data from us, you will always get what is listed below.

Raw data

This depends on the project, but we usually receive sequencing data as FASTA or FASTQ files. At a minimum, we need either

  1. a consistent naming scheme, for instance:
    • SRA format: SAMPLE_1.fastq.gz for forward and SAMPLE_2.fastq.gz for reverse.
    • Illumina format: SAMPLE_S0X_L0Y_R1_001.fastq.gz for forward and SAMPLE_S0X_L0Y_R2_001.fastq.gz for reverse
  2. or a manifest file that maps raw file names to samples (see the sketch below).
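
A minimal manifest might look like the following (file names and columns are purely illustrative; any unambiguous tabular mapping from files to samples is fine):

sample_id    forward_file              reverse_file
S001         S001_run1_R1.fastq.gz     S001_run1_R2.fastq.gz
S002         S002_run1_R1.fastq.gz     S002_run1_R2.fastq.gz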

Metadata

To make sense of the samples we usually also require a metadata file that maps individual samples to additional data (for instance BMI, age, anonymized patient ID, etc.). Only include anonymized information here.

This can be a tab-separated (*.tsv) or comma-separated (*.csv) file, or, if nothing else is available, an Excel sheet (however, no relevant information should be encoded in the formatting, e.g. cell colors, fonts, etc.).
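
Purely as an illustration (your variables and column names will differ), a minimal metadata table could look like:

sample_id    patient_id    age    bmi
S001         P01           54     27.1
S002         P02           61     31.5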

Protocols

For raw data uploads, we would love to get a short summary of the wet-lab protocols used to generate these data. This is important for preprocessing and later analysis.

For processed data we would love a short summary of the wet-lab part, but we will definitely require a summary of the computational processing steps (software used plus the arguments for each step). If you have a pipeline definition in Snakemake or Nextflow with a conda environment file or docker/singularity images we will :heart: it!
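
For reference, a conda environment file along the lines of the sketch below (tools and versions are placeholders, not a recommendation) already tells us most of what we need to recreate your processing environment:

# environment.yml - illustrative example only
name: my_project
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastp=0.23
  - bowtie2=2.5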