Filetypes and File Compression - Green-Biome-Institute/AWS GitHub Wiki

When it comes to the actual sequencing data, there are several main filetypes that you will be dealing with:

FASTQ

First is the .fastq filetype (also written .fq), which stores data by repeating the following 4 lines for each sequencing read:

@SEQ_ID 
GATCGATCGATCGATGATCGATCGATCGATCGATCGATCATCGATCGATCG
+
!!''**(())++%%!!''**(())++%%!!''**(()++FCC556%%!!''

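These 4 lines represent:

  1. A sequence identifier preceded by @
  2. The raw sequence
  3. A + separator, optionally followed by the sequence identifier again or further notes / description
  4. Quality values for the raw sequencing data, encoded as one character per base (so this line is the same length as line 2)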
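As a minimal sketch of how this four-line layout can be read, the Python below groups lines into records and turns the quality characters into numbers. It assumes every record occupies exactly four lines and that the qualities use the Phred+33 encoding (ASCII code minus 33), which is common but is an assumption here. The names FastqRecord, parse_fastq and phred_scores are made up for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str       # line 1, without the leading "@"
    sequence: str     # line 2
    description: str  # line 3, without the leading "+" (often empty)
    quality: str      # line 4, one character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes no wrapped lines)."""
    it = (line.strip() for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, "")
        separator = next(it, "")
        quality = next(it, "")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, separator[1:], quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integers (Phred+33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# The example record shown above, as a list of lines.
example = [
    "@SEQ_ID",
    "GATCGATCGATCGATGATCGATCGATCGATCGATCGATCATCGATCGATCG",
    "+",
    "!!''**(())++%%!!''**(())++%%!!''**(()++FCC556%%!!''",
]

record = next(parse_fastq(example))
scores = phred_scores(record.quality)
print(record.seq_id, len(record.sequence), scores[:4])
```

For a real file you would pass an open file handle instead of the example list, since a file handle also yields one line at a time.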

FASTQ and FASTA files are probably the most common file formats that you will work with.

FASTA

Next is another of the most common file formats (and the simplest), .fasta (which also uses .fa, .fnt, .fna (nucleotide), .faa (amino acid), and .fas). This file contains only an identifier for each sequence after a > symbol, followed by the sequence itself on the following line (long sequences may be wrapped across several lines).

>SEQ_ID 
GATCGATCGATCGATGATCGATCGATCGATCGATCGATCATCGATCGATCG

Many analysis programs accept FASTA files as input. One example where FASTA files are necessary is BLAST: before doing a BLAST query, you’ll need to convert your FASTQ files into FASTA (this is shown in the online GitHub workshop).
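To give a feel for what that conversion involves, here is a small Python sketch that keeps the identifier and sequence of each FASTQ record and drops the + line and the quality line. It is not necessarily the method used in the workshop, the function name fastq_to_fasta is made up for illustration, and it assumes unwrapped four-line records small enough to hold in memory.

```python
from typing import Iterable, List


def fastq_to_fasta(fastq_lines: Iterable[str]) -> List[str]:
    """Return FASTA lines (">id" then sequence) for four-line FASTQ records."""
    lines = [line.strip() for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-blank lines")
    fasta = []
    for start in range(0, len(lines), 4):
        header, sequence = lines[start], lines[start + 1]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at the start of {header!r}")
        fasta.append(">" + header[1:])  # swap the "@" marker for ">"
        fasta.append(sequence)          # quality information is discarded
    return fasta


fastq = ["@SEQ_ID", "GATCGATC", "+", "!!''**(("]
print("\n".join(fastq_to_fasta(fastq)))
```

Note that the conversion only goes one way: once the quality line is dropped, a FASTA file cannot be turned back into the original FASTQ.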

FAST5

The FAST5 format is one way Oxford Nanopore Technologies (ONT) stores data. It is based on HDF5 (Hierarchical Data Format version 5), which is a binary format, so you’ll need to use HDF5 tools to work with it (https://support.hdfgroup.org/products/hdf5_tools/). Since ONT also produces FASTQ or FASTA files, you probably won’t work with this filetype, but it is good to be aware of it.

SAM/BAM

SAM and BAM stand for Sequence Alignment/Map and Binary Alignment/Map, and both store sequence alignment information. BAM files are a compressed binary format that is smaller and more efficient for your computer to work with, but they are not human-readable. SAM files are their direct equivalent (they contain the same information) in plain ASCII text, meaning they can be read. In order to look at a BAM file, you would need to convert it back into SAM format, for example with samtools view.

These files will only be relevant downstream of the analysis talked about in this document. For more information, you can check out the documentation: https://samtools.github.io/hts-specs/SAMv1.pdf
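To show what "human-readable" means for SAM, the sketch below splits one alignment line into the 11 mandatory tab-separated fields named in the SAM specification. The alignment line itself is invented for illustration (it is not real data), and the helper name parse_sam_line is hypothetical.

```python
from typing import Dict, Optional

# The 11 mandatory columns of a SAM alignment line, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one SAM line; return None for header lines (which start with @)."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, columns))
    for name in INT_FIELDS:
        record[name] = int(columns[SAM_FIELDS.index(name)])
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, if any
    return record


# Invented example: a read called "read1" aligned at position 100 of "chr1".
line = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tGATCGATC\t!!''**((\tNM:i:0"
alignment = parse_sam_line(line)
print(alignment["RNAME"], alignment["POS"], alignment["CIGAR"])
```

In practice you would use an established tool or library for this rather than your own parser; the point is only that each SAM line is ordinary text with one column per field.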

File Compression

Something you’ll note is that most sequencing data files have multiple file extensions, for example, “data.fastq.gz”. This means the file is “compressed,” so its contents cannot be read directly as text until you uncompress it. Data compression is used to store large data files in a smaller format, reducing the storage capacity needed for each file. If you would like to look at/read the data, there are a series of tools for uncompressing files depending on the compression format used, which are described in the command line and AWS workshops. However, most sequencing data analysis software accepts these sequencing files in their compressed format, so there is usually no need to uncompress them. Your data may actually be more protected in its compressed format, since formats such as gzip store a checksum that lets tools detect a corrupted file, so it is generally best to keep it compressed unless you have a need for it to be uncompressed.
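As a small illustration of reading compressed data without uncompressing it on disk, the Python sketch below writes a tiny made-up FASTQ file as data.fastq.gz in a temporary folder, compares its size with the uncompressed text, and reads the first record straight from the compressed file using the standard gzip module. The record content is invented, and because it is repeated 1000 times it compresses far better than real sequencing data would.

```python
import gzip
import tempfile
from pathlib import Path

# One made-up FASTQ record, repeated to give the compressor something to do.
record = "@SEQ_ID\nGATCGATC\n+\n!!''**((\n"
text = record * 1000

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "data.fastq.gz"

    # "wt" = write text; gzip compresses the data as it is written.
    with gzip.open(path, "wt") as handle:
        handle.write(text)
    compressed_size = path.stat().st_size

    # "rt" = read text; lines are decompressed on the fly, and the file
    # on disk stays compressed.
    with gzip.open(path, "rt") as handle:
        first_record = [next(handle).rstrip("\n") for _ in range(4)]

uncompressed_size = len(text.encode("ascii"))
print(f"{uncompressed_size} bytes of text stored in {compressed_size} bytes")
print(first_record)
```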

Common file compression types, along with tools to work with them, are:

  - .gz ; use gunzip to uncompress a .gz file and gzip to recompress it back into a .gz file.
  - .bz2 ; use bunzip2 to uncompress a .bz2 file and bzip2 to recompress it back into a .bz2 file.
  - .zip ; use unzip to uncompress a .zip file and zip to recompress it back into a .zip file.
  - .tar and .tar.gz ; use tar xvzf [file] to extract a .tar.gz file, or tar xvf [file] to extract a plain .tar file (a .tar file is an archive that bundles files together without compressing them, so it does not need the z option).
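If you write your own scripts, the same idea of "pick the tool by the file extension" can be sketched in Python with the standard gzip and bz2 modules. The helper name open_text and the file names below are made up for illustration, and only .gz and .bz2 are handled here; anything else is treated as plain text.

```python
import bz2
import gzip
import tempfile
from pathlib import Path

# Map the final file extension to the function that opens that format.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path, mode="rt"):
    """Open a plain, .gz, or .bz2 file in text mode based on its extension."""
    path = Path(path)
    opener = OPENERS.get(path.suffix, open)  # fall back to the built-in open
    return opener(path, mode)


text = "@SEQ_ID\nGATCGATC\n+\n!!''**((\n"  # one made-up FASTQ record
results = {}

with tempfile.TemporaryDirectory() as folder:
    for name in ("reads.fastq", "reads.fastq.gz", "reads.fastq.bz2"):
        path = Path(folder) / name
        with open_text(path, "wt") as handle:  # compresses if needed
            handle.write(text)
        with open_text(path) as handle:        # uncompresses if needed
            results[name] = handle.read()

for name, content in results.items():
    print(name, content == text)
```

This way the rest of a script can treat compressed and uncompressed inputs the same, which is also how many analysis programs behave.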