Formats_fast5 - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Data [1/5].
Sequencing calls on a MinION platform are based on the detection of the electric signal recorded through the nanopores of the flowcell as the DNA/RNA fragments pass through them. Signal measurements are called "events".
The nature of the event depends on the nature of the nucleobases of the fragment entering the pore at a given time. Thus, the change of signal through time reflects the changes in nucleotide composition as the fragment passes through.
The electric signal information is stored in fast5 files (a variant of the HDF5 format), one fast5 file per sequenced molecule. To obtain a nucleotide sequence, the raw signal must be "basecalled".
Basecalling relies on algorithms based on HMMs (Hidden Markov Models) or RNNs (Recurrent Neural Networks). It is done by specialized software, which can be run locally (e.g., Albacore, which basecalls directly from the raw data) or on the ONT cloud (e.g., Metrichor, which basecalls after an event-detection step). The basecaller then produces one file per read, sorted into a pass or fail category depending on whether basecalling succeeded, and including information about the type of read (template, complement, or 2D consensus).
Note: if the data is not basecalled, the extraction will fail.
We will practice extraction on RNA sequencing data, specifically ONT 2D RNA data.
Remember to move to the appropriate $DIR so that you keep your files organised.
For info, this is a subset of a D. melanogaster whole-transcriptome RNA experiment (R9 chemistry):
wget https://drive.switch.ch/index.php/s/184F1Pc5LO3lczS/download -O subsetf5.tar.gz
The approximate size of the archive should be 280M. Check with:
ls -lh <file_name>
You can extract the archive you just downloaded using the tar utility with the -xzvf options:
- x for: eXtract
- z for: gZip (the archive is gzip-compressed)
- v for: Verbose
- f for: File (the next argument is the archive file name)
The reads will be extracted to a folder.
tar -xzvf <file>.tar.gz
The approximate size of the extracted folder should be 544M. Check with:
du -sh <folder_name>
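Before extracting an archive it can be useful to list its members, and to choose where the files end up. The sketch below illustrates this on a tiny throwaway archive built in a temporary directory; the names demo_dir, reads and demo.tar.gz are placeholders invented for the example, not part of the workshop data.

```shell
# Build a tiny demo archive (placeholder names, not the workshop data).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/reads"
printf 'dummy\n' > "$demo_dir/reads/read1.fast5"
printf 'dummy\n' > "$demo_dir/reads/read2.fast5"
tar -czf "$demo_dir/demo.tar.gz" -C "$demo_dir" reads

# -t lists the archive members without extracting anything.
tar -tzf "$demo_dir/demo.tar.gz"

# -C extracts into the given directory instead of the current one.
mkdir -p "$demo_dir/extracted"
tar -xzf "$demo_dir/demo.tar.gz" -C "$demo_dir/extracted"

# Count the extracted files and report the folder size.
n_files=$(find "$demo_dir/extracted" -type f -name '*.fast5' | wc -l | tr -d ' ')
echo "extracted fast5 files: $n_files"
du -sh "$demo_dir/extracted"
```

Listing first is a cheap way to check that an archive unpacks into its own folder rather than scattering files into the current directory.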
We will use poretools to extract the reads from basecalled fast5 to fasta/q format. Poretools is a toolkit for working with sequencing data from MinION. The usage is detailed in the documentation.
You can access local "man-like" documentation with:
poretools --help
We will practice extraction to fastq using the subcommand fastq. From the manual:
Extract FASTQ sequences from a set of FAST5 files
Poretools, like most command-line programs, writes its results to stdout (standard output), which an interactive shell displays on the screen. To save the content to a file (our fastq file!), we need to use redirection (the > symbol):
poretools fastq <path/to/fast5_folder/>*.fast5 > <poretools_out.fastq>
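To see what the redirection does, here is a small sketch that does not need poretools. The function fake_extractor is a hypothetical stand-in that writes one FASTQ record to stdout and a progress message to stderr, which is how command-line tools commonly behave; the file names are placeholders.

```shell
# Hypothetical stand-in for a tool that prints results to stdout
# and diagnostic messages to stderr.
fake_extractor() {
    printf '@read1\nACGT\n+\nIIII\n'
    echo "processed 1 read" >&2
}

out_dir=$(mktemp -d)

# ">" sends stdout to a file (overwriting it); "2>" does the same for stderr.
fake_extractor > "$out_dir/reads.fastq" 2> "$out_dir/run.log"

# ">>" appends instead of overwriting.
fake_extractor >> "$out_dir/reads.fastq" 2>> "$out_dir/run.log"

n_lines=$(wc -l < "$out_dir/reads.fastq" | tr -d ' ')
echo "lines in reads.fastq: $n_lines"
```

Keeping stderr in a separate log means that warnings or progress messages do not end up inside the FASTQ file.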
The approximate size of the output should be 5.6M. You can view the file content with:
head -<n> <poretools_out.fastq>
# n = first n lines you wish to see
If all went well, this will look like a standard fastq file.
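A quick way to go beyond eyeballing the first lines is to check the four-line record structure of the whole file. The sketch below assumes an unwrapped FASTQ (exactly four lines per record) and runs on a small demo file created on the spot; point the fq variable at your own output file to use it on real data.

```shell
# Small demo file (two made-up records) standing in for the poretools output.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCATT\n+\nIIIIII\n' > "$fq"

# Each record has four lines: header, sequence, separator, qualities.
n_lines=$(wc -l < "$fq" | tr -d ' ')
n_reads=$((n_lines / 4))

# Count records that break the expected layout.
bad=$(awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { bad++ }   # header must start with @
    NR % 4 == 2 { seq_len = length($0) }               # remember sequence length
    NR % 4 == 3 && substr($0, 1, 1) != "+" { bad++ }   # separator must start with +
    NR % 4 == 0 && length($0) != seq_len { bad++ }     # one quality value per base
    END { print bad + 0 }
' "$fq")

echo "reads: $n_reads, layout problems: $bad"
```

For the demo file this prints "reads: 2, layout problems: 0". A non-zero problem count, or a line count that is not a multiple of four, suggests a truncated or corrupted file.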
In real life, the output file will be much bigger. It is good practice to compress your data after processing, in order to save disk space.
You can still:
- view compressed files in a terminal with zcat
- go back to the uncompressed data using gzip -d <file.gz>
If you wish to test this:
gzip -9 <poretools_out>.fastq
zcat <poretools_out>.fastq.gz | head
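Two details are worth knowing when testing this. By default gzip replaces the original file with the compressed one, and gzip -t checks the integrity of a compressed file without unpacking it. The sketch below shows both on a throwaway demo file (placeholder names); gzip -dc is used in place of zcat because zcat behaves differently on some systems, such as macOS.

```shell
work=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n' > "$work/demo.fastq"

# Compress at the highest level; demo.fastq is replaced by demo.fastq.gz.
gzip -9 "$work/demo.fastq"

# -t tests the integrity of the compressed file without writing anything.
gzip -t "$work/demo.fastq.gz" && echo "archive OK"

# -dc decompresses to stdout, leaving the .gz file untouched.
n_reads=$(( $(gzip -dc "$work/demo.fastq.gz" | wc -l) / 4 ))
echo "reads: $n_reads"
```

Counting reads straight from the compressed stream avoids writing the uncompressed file back to disk.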
Cool, now your data are "usable" for most software.
We provide a backup in case anything went wrong:
# backup ONT RNA 2D reads, fastq format
wget https://drive.switch.ch/index.php/s/rSJQIzwEonObKa8/download -O ONT_RNA_2D.fastq.gz
There is no need to extract the 1D² reads yourself (it would be boring and repetitive, as it uses exactly the same toolkit and commands); we already provide two fastq subsets to download.
For info, these are subsets of a M. musculus RNA experiment (R9.5 chemistry).
Note that basecalling 1D² data with Albacore generates two overlapping sets of files: one from the 1D basecalling script and one from the 1D² script. As with the 2D protocol, some reads appear in both the 1D and 1D² directories (namely, the 1D reads that generated 1D² reads).
# ONT RNA 1D² reads
wget https://drive.switch.ch/index.php/s/NUReEZpS9PvYKyt/download -O ONT_RNA_1D2_ncorr.fastq.gz
wget https://drive.switch.ch/index.php/s/af5fz2MwzeGpHBF/download -O ONT_RNA_1D2.fastq.gz
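If you want to see how much two read sets overlap, one option is to compare read identifiers. The sketch below does this for two small made-up FASTQ files and assumes that a read keeps the same identifier (the first word of the header line) in both files; whether this holds for the 1D and 1D² output of a given basecaller version is something to check on your own data. All file and function names are placeholders.

```shell
work=$(mktemp -d)
printf '@readA x\nACGT\n+\nIIII\n@readB x\nGGCA\n+\nIIII\n' > "$work/set1.fastq"
printf '@readB y\nGGCA\n+\nIIII\n@readC y\nTTAA\n+\nIIII\n' > "$work/set2.fastq"

# Print the sorted, unique identifiers of a FASTQ file:
# first word of each header line, without the leading "@".
read_ids() {
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1" | sort -u
}

read_ids "$work/set1.fastq" > "$work/ids1.txt"
read_ids "$work/set2.fastq" > "$work/ids2.txt"

# comm -12 keeps only the lines present in both sorted inputs.
comm -12 "$work/ids1.txt" "$work/ids2.txt" > "$work/shared.txt"
n_shared=$(wc -l < "$work/shared.txt" | tr -d ' ')
echo "shared reads: $n_shared"
```

For gzip-compressed files, the uncompressed stream from gzip -dc can be piped into the same awk command instead of passing a file name.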
These reads are a subset of one of the very few 1D² datasets made accessible online so far. Go thank David - Research Fellow (bioinformatics), Malaghan Institute of Medical Research - for sharing this dataset and making it possible to use it in this tutorial!