Finding the FastQ Files

Your raw RNA-Seq data can look very different when you get it back, depending mostly on who performed the sequencing and on what type of machine. I only know enough about Illumina machines to write a guide about them, so I won't be discussing any other brands of machines here (fortunately, Illumina is the most commonly used right now).

Let's start with the group that (hopefully) most readers of this guide will fall into: users of the University of Miami sequencing core facility (the one led by Bill Hulme). Our core facility takes the information generated by the sequencing machines and automatically processes it using a program known as CASAVA. This processing step resolves the bases read by the machine for each of the hundreds of millions of reads it generated and de-multiplexes the samples (so multiple samples that were pooled together in the same lane get their reads put in separate fastq files). All you have to do with this type of output data is find the fastq files buried deep within the hard drive they give you. Instructions on how to do that are below.

If you used a different sequencing service (such as a company or a core facility at another university), your output data may or may not look like what is described here. If the files you have end with .fastq (or .txt) when uncompressed, open them up with a text editor (best done on whatever computing platform you are going to use to process the data) and see if they conform to the guidelines for a fastq file described here: https://en.wikipedia.org/wiki/FASTQ_format. If your files conform to that format, great! Head on to the section about aligning the reads to the genome.
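If you would rather not check the files by eye, the same check can be roughed out in a few lines of Python. This is only a sketch: it assumes the simple four-lines-per-read layout that Illumina pipelines normally write (a header starting with @, the sequence, a line starting with +, and a quality string the same length as the sequence). The function names `open_text` and `looks_like_fastq` are made up for this example.

```python
import gzip
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def looks_like_fastq(path, max_records=1000):
    """Return True if the first records follow the four-line FASTQ layout.

    Assumption: every read takes exactly four lines. FASTQ files with
    sequences wrapped over several lines will be reported as False.
    """
    with open_text(path) as handle:
        # Only look at the start of the file; these files can be huge.
        lines = [line.rstrip("\r\n") for line in islice(handle, 4 * max_records)]

    # An empty file, or one that stops partway through a record, fails.
    if not lines or len(lines) % 4 != 0:
        return False

    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@"):
            return False
        if not separator.startswith("+"):
            return False
        if len(sequence) != len(quality):
            return False
    return True
```

Calling `looks_like_fastq("my_reads.txt")` (with your own file name in place of the hypothetical `my_reads.txt`) returns True or False. A True result only means the beginning of the file has the right shape, not that the whole file is intact.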

If you have any other type of files, ask us for help. This isn't uncharted territory; there are simply too many possibilities for what format your data could be in (and what needs to be done to it) for us to write a comprehensive guide on the subject.

Now, let's find those FastQ Files!

When you plug in the hard drive that contains your RNA-Seq results, you will most likely be greeted with several folders, one of which will be named after either you or your PI. When you enter that folder, you will see one or more folders with names like 150123_SN674_0303_BC5CFWACXX. Each of these folders contains the results of an RNA-Seq run on a single machine; the name starts with the date of the run in YYMMDD form (here, January 23, 2015).

Open one of these run folders and then open the 'Unaligned' folder inside it. Within the 'Unaligned' folder, you should see a folder called 'Project_XX', where XX is either your initials or your PI's. Within that folder should be a folder for each of your samples that were run on this machine. The folder for each sample will have one or more fastq files in it.

There are several things to keep in mind here:

  • There may be more than one fastq file for each sample (we will handle that when we align them to the genome).
  • The fastq files may be compressed or bundled into an archive, with a name ending in .gz, .tar, or something similar.
  • If you multiplexed your samples, ran all of them on all of your lanes, and more than one machine was used for your project, you may have folders for the same sample under more than one run folder. This is less common, so just let us know if you think it might apply to you.
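If you have many samples, clicking through every folder gets tedious. The folder layout described above can be walked with a short Python script instead. This is a sketch rather than an official tool: it assumes the run folder contains 'Unaligned', then a 'Project_' folder, then one folder per sample, and it only picks up files ending in .fastq or .fastq.gz. The function name `find_fastq_files` is made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def find_fastq_files(run_folder):
    """Map each sample folder name to the sorted fastq files inside it.

    Assumed layout: <run_folder>/Unaligned/Project_*/<sample folder>/<files>
    """
    files_by_sample = defaultdict(list)
    unaligned = Path(run_folder) / "Unaligned"

    # One level for the project folder, one for the sample folder,
    # one for the files themselves.
    for path in unaligned.glob("Project_*/*/*"):
        if path.is_file() and path.name.endswith((".fastq", ".fastq.gz")):
            files_by_sample[path.parent.name].append(path)

    # Sort so the output is the same every time the script is run.
    return {
        sample: sorted(paths)
        for sample, paths in sorted(files_by_sample.items())
    }
```

For example, `find_fastq_files(r"E:\mdanzi\150123_SN674_0303_BC5CFWACXX")` would return a dictionary with one entry per sample folder, each holding the list of fastq files found there. If a sample has several fastq files, they all end up in the same list. If your project spans more than one run folder, call the function once per run folder.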

So the location of your fastq files should be something like this: E:\mdanzi\150123_SN674_0303_BC5CFWACXX\Unaligned\Project_MD\Sample_Control1
And the names of the fastq files will probably be something like this: Control1_GAGTGG_L002_R1_001.fastq.gz
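
The pieces of a file name like that can be pulled apart with a regular expression, which is handy when you need to know which lane or read a file belongs to. The sketch below assumes the usual CASAVA-style pattern of sample name, index (barcode) sequence, lane, read number, and file-chunk number, separated by underscores. That pattern is an assumption based on the example name above, so check it against your own files. The names `FASTQ_NAME` and `parse_fastq_name` are made up for this example.

```python
import re

# Assumed pattern: <sample>_<index>_L<lane>_R<read>_<chunk>.fastq(.gz)
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<index>[ACGTN-]+|NoIndex)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d)_(?P<chunk>\d{3})"
    r"\.fastq(?:\.gz)?$"
)


def parse_fastq_name(filename):
    """Split a fastq file name into its parts, or return None if it does not fit."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return {
        "sample": match["sample"],
        "index": match["index"],
        "lane": int(match["lane"]),
        "read": int(match["read"]),
        "chunk": int(match["chunk"]),
    }


print(parse_fastq_name("Control1_GAGTGG_L002_R1_001.fastq.gz"))
# {'sample': 'Control1', 'index': 'GAGTGG', 'lane': 2, 'read': 1, 'chunk': 1}
```

A name that does not fit the assumed pattern gives None, which is a hint that your files were named by a different pipeline.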

Now that you have found your fastq files, head over to the page 'Loading files onto the server' and put them on Pegasus!