Preparing_your_data - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki

Ideal target RNA-Seq data sets for the Trinity training workshop

To get the most out of the Trinity training workshop, it's best to have:

  • RNA-Seq data sets most relevant to your research interests (and biological questions)
  • At least 2 sample types (experimental conditions, tissue types, etc.)
  • Read lengths at least 75 bases per read
  • All paired-end or single-end reads (not mixed, to keep it simple)
  • At least 2 biological replicates for each sample (for differential expression analysis)

Homework:

  • If you brought your own data, prepare the subsets below
  • If you need to find publicly available data to use, search SRA for samples of interest that meet the criteria above.

Below are instructions for capturing data for use with the workshop, either using a sampling of your own data, or using publicly available samples of interest.

Aim to put your data into a ~/mydata directory, instead of the ~/workspace area. This keeps your own target data sets and analyses separate from the workshop materials.
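For example, setting up that directory is a one-liner (directory name as stated above):

```shell
# Create the personal data directory, kept separate from ~/workspace:
mkdir -p "$HOME/mydata"
cd "$HOME/mydata"
```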

Running Trinity using your own data

To ensure that we have sufficient computational resources and time available to generate an assembly that will be useful for training in this workshop, let's limit the number of reads that we assemble to 100k total reads per biological replicate.

If you have many replicates and would end up accumulating more than 5M PE reads in total, then divide a 5M-read budget evenly among all replicates, so that the total number of reads to assemble is no greater than 5M.
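The per-replicate budget is simply the total budget divided by the replicate count. A quick sketch with hypothetical numbers (8 replicates; adjust num_reps to your experiment):

```shell
# Hypothetical example: split a 5M-read budget across 8 replicates.
total_budget=5000000
num_reps=8
reads_per_rep=$(( total_budget / num_reps ))
echo "${reads_per_rep} reads per replicate"   # prints "625000 reads per replicate"
```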

How to extract a subset of reads

If starting from gzipped fastq files, we would want to run the following per sample, extracting the top (n * 4) lines, where n is the number of reads we want to keep from the fastq file (here 100k, so 400k lines), with 4 lines per fastq sequence record. An example command is as follows:

%  zcat file.fastq.gz | head -n 400000  | gzip > file.100k.fastq.gz

Ideally, the above would be done on the computer that has the full data stored on it. Once you have the subsetted files (i.e., file.100k.fastq.gz), you can then upload them to the cloud, as described below.
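To apply the same subsetting across several replicates, one option is to wrap the command above in a small helper function; the replicate file names in the usage comments are hypothetical:

```shell
# Keep the first n reads of a gzipped fastq (4 lines per record).
# Output is named like the input with the read count inserted,
# e.g. repA_1.fastq.gz -> repA_1.100000.fastq.gz
subset_fastq () {
    local fq=$1 n=$2
    local out="${fq%.fastq.gz}.${n}.fastq.gz"
    zcat "$fq" | head -n $(( n * 4 )) | gzip > "$out"
}

# Hypothetical usage, subsetting both mates of one replicate:
# subset_fastq repA_1.fastq.gz 100000
# subset_fastq repA_2.fastq.gz 100000
```

For paired-end data, be sure to run it on both the _1 and _2 files with the same n, so the mates stay in sync.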

Upload your data to your cloud instance:

Using your sftp_url, you can upload your data to the cloud; the FileZilla client is perhaps the easiest way to do this. In the student_AWS_instances_table, your sftp_url value is formatted like so: IP_address:port_number. Separate these two values and enter them into FileZilla as described at the FileZilla link provided.

Alternatively, if you want to use a command-line based tool, you can use 'sftp' directly:

%  sftp -P ${port_number} training@${IP_ADDRESS}

When prompted for the password, enter our super-secret password.
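If you prefer a fully scripted upload, sftp can also read its transfer commands from a batch file via -b. The port and IP below are placeholders; substitute the values from your own sftp_url:

```shell
# Placeholder connection details -- replace with your own sftp_url values.
port_number=2222
IP_ADDRESS=203.0.113.10

# List the transfer commands in a batch file, one per line:
printf 'put file.100k.fastq.gz\nbye\n' > upload.batch

# The actual transfer only works during class time, so the command is
# echoed here rather than executed:
echo "sftp -P ${port_number} -b upload.batch training@${IP_ADDRESS}"
```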

This step of uploading your data can only be done during class time. Instances are not available outside of class time for most of the workshop.

Alternatively: downloading RNA-Seq data of interest from SRA

Find a data set of interest in SRA

At SRA, try searching for and identifying RNA-Seq data sets of interest, as per the criteria mentioned at top.

For example, try search string: "rna-seq AND salamander"

Extract a sampling of the data

The command below streams an 'interleaved' fastq from fastq-dump, where read 1 of each pair is followed immediately by read 2. The paste step collapses each pair (8 fastq lines) onto a single line, so head -n100000 keeps the top 100k read pairs, which are then split back out into the two paired output files.

Below, we retrieve RNA-Seq data for SRA accession SRR390728 and save only the top 100k read pairs into the paired output files:

% sra_acc=SRR390728  # assign SRA accession variable
  
  fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files -Z  $sra_acc | \
  paste - - - - - - - - | \
  head -n100000 | \
  tee >(cut -f 1-4 | tr '\t' '\n' | gzip > ${sra_acc}_1.fastq.gz) | \
  cut -f 5-8 | tr '\t' '\n' | gzip -c > ${sra_acc}_2.fastq.gz

The above is adapted from: https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen

To sample a different SRA data set, just replace the accession with the one of interest.
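After the dump completes, it's worth confirming how many reads landed in each mate file and that the two counts match. A self-contained demo of the counting arithmetic, using a tiny synthetic file rather than the real SRA output:

```shell
# Build a 2-record fastq (4 lines per record) just to demonstrate the count:
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n' | gzip > demo_1.fastq.gz

# Reads = line count / 4; run the same check on the real _1/_2 output files.
n_reads=$(( $(zcat demo_1.fastq.gz | wc -l) / 4 ))
echo "${n_reads} reads"   # prints "2 reads"
```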