Filetypes and File Compression - Green-Biome-Institute/AWS GitHub Wiki
For the sequencing data itself, there are several main filetypes that you will be dealing with:
FASTQ
First is the .fastq filetype (also uses .fq), which stores data by repeating the following 4 lines for each sequencing read:
@SEQ_ID
GATCGATCGATCGATGATCGATCGATCGATCGATCGATCATCGATCGATCG
+
!!''**(())++%%!!''**(())++%%!!''**(()++FCC556%%!!''
These 4 lines represent:
- A sequence identifier preceded by @
- The raw sequence
- A + separator, optionally followed by further notes / description
- Quality values for the raw sequencing data, one character per base
FASTQ and FASTA files are probably the most common file formats that you will work with.
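The 4-line layout above can be sketched in Python as follows. This is a minimal illustration, not a replacement for an established parser: the names FastqRecord, read_fastq, and phred_scores are made up for this example, it assumes every record is exactly 4 lines (no wrapped sequences), and it assumes the quality characters use the Phred+33 encoding.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str    # identifier line without the leading "@"
    sequence: str  # raw sequence
    note: str      # optional text after "+" (often empty)
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every 4 lines of a FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\r\n"), sequence, plus[1:], quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integers (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical one-record example held in memory instead of a file.
example = io.StringIO("@read1 sample\nGATC\n+\n!+5I\n")
for record in read_fastq(example):
    print(record.seq_id, record.sequence, phred_scores(record.quality))
```

Each quality character maps to one base, which is why the sketch rejects records whose sequence and quality lines differ in length.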
FASTA
Next is another of the most common file formats (and the simplest), .fasta (which also uses .fa, .fnt, .fna (nucleotide), .faa (amino acid), and .fas). This file contains only an identifier for each read after a > symbol and then the sequence for that read on the following line (or lines).
>SEQ_ID
GATCGATCGATCGATGATCGATCGATCGATCGATCGATCATCGATCGATCG
Many analysis tools accept FASTA files as input. One example where FASTA files are necessary is BLAST: before doing a BLAST query, you’ll need to convert your FASTQ files into FASTA (this is shown in the online GitHub workshop).
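To make the relationship between the two formats concrete, here is a rough Python sketch of a FASTQ to FASTA conversion. It is not the method used in the workshop, only an illustration under assumptions: fastq_to_fasta is a hypothetical helper name, and the input is assumed to be strict 4-line FASTQ. The conversion keeps the identifier and sequence and drops the + line and the quality line.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq: TextIO, fasta: TextIO) -> int:
    """Write each FASTQ record as a 2-line FASTA record; return the record count."""
    count = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between or after records
        if not header.startswith("@"):
            raise ValueError("expected an identifier line starting with '@'")
        sequence = fastq.readline().rstrip("\r\n")
        fastq.readline()  # "+" separator line, not needed in FASTA
        fastq.readline()  # quality line, not needed in FASTA
        fasta.write(">" + header[1:].rstrip("\r\n") + "\n")
        fasta.write(sequence + "\n")
        count += 1
    return count


# Hypothetical two-read example held in memory instead of files.
source = io.StringIO("@r1\nGATC\n+\n!+5I\n@r2\nTTAA\n+\nIIII\n")
target = io.StringIO()
print(fastq_to_fasta(source, target), "records converted")
print(target.getvalue(), end="")
```

Because the quality values are discarded, the conversion cannot be reversed; keep the original FASTQ files.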
FAST5
The FAST5 format is one way Oxford Nanopore Technologies (ONT) stores data, using the HDF5 format (Hierarchical Data Format version 5). It is a binary format, so you’ll need to use HDF5 tools to work with it (https://support.hdfgroup.org/products/hdf5_tools/). Since ONT also produces FASTQ or FASTA files, you probably won’t work with this filetype, but it is good to be aware of it.
SAM/BAM
SAM and BAM stand for Sequence Alignment/Map and Binary Alignment/Map, and both store sequence alignment information. BAM files are a compressed binary format that is smaller and more efficient for your computer to work with; however, they are not human-readable. SAM files are their direct equivalent (they contain the same information) in plain-text (ASCII) format, meaning they can be read. In order to look at the contents of a BAM file, you would need to convert it back into SAM format.
These files will only be relevant downstream of the analysis talked about in this document. For more information, you can check out the documentation: https://samtools.github.io/hts-specs/SAMv1.pdf
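As a rough illustration of what "human-readable" means here, each alignment line in a SAM file is tab-separated, with 11 mandatory fields defined in the specification linked above. The Python sketch below splits one such line; SamAlignment and parse_sam_line are hypothetical names for this example, the read and reference names are invented, and optional tag fields after the 11th column are ignored.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment description
    rnext: str  # reference name of the mate / next read
    pnext: int  # position of the mate / next read
    tlen: int   # observed template length
    seq: str    # read sequence
    qual: str   # base qualities (same encoding idea as FASTQ)


def parse_sam_line(line: str) -> SamAlignment:
    """Split one SAM alignment line into its 11 mandatory fields."""
    if line.startswith("@"):
        raise ValueError("this is a header line, not an alignment")
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_unmapped(alignment: SamAlignment) -> bool:
    return bool(alignment.flag & 0x4)   # flag bit 0x4: read is unmapped


def is_reverse_strand(alignment: SamAlignment) -> bool:
    return bool(alignment.flag & 0x10)  # flag bit 0x10: reverse-complemented


# Invented example line: a 4-base read aligned to the reverse strand of "chr1".
aln = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tGATC\t!+5I\n")
print(aln.rname, aln.pos, is_reverse_strand(aln))
```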
File Compression
Something you’ll note is that sequencing data files commonly have multiple file extensions, for example, “data.fastq.gz”. The final extension means the file is “compressed,” so its contents are not directly readable until you uncompress it. Data compression stores large data files in a smaller format, reducing the storage capacity needed for each file. If you would like to look at/read the data, there are tools for uncompressing files depending on the compression format used, which are described in the command line and AWS workshops. However, most sequencing data analysis software accepts these files in their compressed format, so there often isn’t a need to uncompress them. Your data may actually be more protected in its compressed format (for example, gzip files include a checksum that lets tools detect corruption), so it is generally best to keep it compressed unless you have a need for it to be uncompressed.
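As an illustration of working with a compressed file without uncompressing it on disk, the Python sketch below reads a .gz or .bz2 file directly using the standard gzip and bz2 modules. The helper names open_text and count_fastq_reads are made up for this example, the format is guessed from the file extension only, and the read count assumes strict 4-line FASTQ records.

```python
import bz2
import gzip
from typing import TextIO


def open_text(path: str) -> TextIO:
    """Open a plain, .gz, or .bz2 file for reading as text, based on its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_fastq_reads(path: str) -> int:
    """Count reads in a FASTQ file, assuming 4 lines per read."""
    with open_text(path) as handle:
        lines = sum(1 for line in handle if line.strip())
    return lines // 4
```

For example, count_fastq_reads("data.fastq.gz") would stream through the compressed file and leave it unchanged on disk.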
Common file compression types along with tools to work with them are:
.gz; use gunzip to uncompress a .gz file and gzip to recompress it back into a .gz file.
.bz2; use bunzip2 to uncompress a .bz2 file and bzip2 to recompress it back into a .bz2 file.
.zip; use unzip to uncompress a .zip file and zip to recompress it back into a .zip file.
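.tar and .tar.gz; use tar xvzf [file] to uncompress a .tar.gz file and tar xvf [file] to unpack a .tar file. (A plain .tar file is an archive that bundles files together without compressing them.)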
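If you are scripting rather than typing commands, the same operations can be approximated with Python's standard library. The sketch below is only an illustration: gzip_file, gunzip_file, and list_tar are hypothetical helper names, and unlike the gzip and gunzip commands (which replace the input file by default) these helpers write to a separate output path and leave the input in place.

```python
import gzip
import shutil
import tarfile
from typing import List


def gzip_file(src: str, dst: str) -> None:
    """Compress src into a gzip file at dst (similar in spirit to gzip)."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def gunzip_file(src: str, dst: str) -> None:
    """Uncompress the gzip file src into dst (similar in spirit to gunzip)."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def list_tar(path: str) -> List[str]:
    """List member names of a .tar or .tar.gz archive without extracting it."""
    # Mode "r:*" lets tarfile detect whether the archive is compressed.
    with tarfile.open(path, "r:*") as archive:
        return archive.getnames()
```

Listing an archive before extracting it is a cheap way to check what it contains and where the files would land.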