MinION HDF: fast5 - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

How raw data is generated

Sequencing on a MinION platform relies on detecting the electric signal recorded through the nanopores of the flowcell as a DNA/RNA fragment passes through them. Signal measurements are called events. The nature of an event depends on the nucleobases of the fragment occupying the pore at a given time. Thus, the change of signal over time reflects the changes in nucleotide composition as the fragment passes through.

Content of fast5 files

The electric signal information is stored in fast5 files (a type of HDF), one fast5 file per sequenced molecule. Basecalling is then achieved using algorithms based on HMMs (Hidden Markov Models) or RNNs (Recurrent Neural Networks). This is done by specialized software, which can be run locally (e.g., using Albacore for local basecalling directly from raw data) or on the ONT cloud (e.g., using Metrichor for basecalling through a step of event detection). The basecaller then produces one file per read, sorted into a pass or fail category depending on whether basecalling was successful, and including information about the nature of the read (template/complement/consensus 2D).

Poretools: extract fastq from fast5

We will practice extraction on RNA sequencing data.

Get the data:

cd $minion_rna
wget https://drive.switch.ch/index.php/s/184F1Pc5LO3lczS/download -O subsetf5.tar.gz

You can extract the .tar.gz file you just downloaded with the following command (the reads will be extracted to ./subsetf5/):

tar -xzvf subsetf5.tar.gz
# x for: eXtract
# z for: gZip (decompress the archive through gzip)
# v for: Verbose
# f for: File (the name of the archive file follows)
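
Before unpacking an archive, it can help to list its contents and, after unpacking, to count the files you got. The sketch below illustrates this on a toy archive built in a temporary directory; the names toy.tar.gz, read1.fast5 and read2.fast5 are made up for the illustration and only stand in for the real subsetf5.tar.gz.

```shell
# Work in a throwaway directory so nothing in your project is touched
workdir=$(mktemp -d)
cd "$workdir"

# Build a toy archive (hypothetical file names) standing in for subsetf5.tar.gz
mkdir subsetf5
touch subsetf5/read1.fast5 subsetf5/read2.fast5
tar -czf toy.tar.gz subsetf5      # c for: Create
rm -r subsetf5

# t for: lisT the archive contents without extracting anything
tar -tzf toy.tar.gz

# Extract, then count the fast5 files that were unpacked
tar -xzf toy.tar.gz
n_fast5=$(find subsetf5 -name '*.fast5' | wc -l | tr -d ' ')
echo "extracted fast5 files: $n_fast5"
```

On the workshop data, the same two checks would be tar -tzf subsetf5.tar.gz | head and find subsetf5 -name '*.fast5' | wc -l.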

You can extract the reads from fast5 to fastq format using poretools, a toolkit for working with sequencing data from MinION. The usage is detailed in the documentation.

Good software generally comes with good documentation. To access the (not always comprehensive) command-line documentation, invoke the command with the -h / --help flag:

poretools --help

If not otherwise specified, when a command is executed in an interactive shell, its output is written to stdout (standard output), which by default is the text terminal. To write the output to a file, use redirection with the > (greater-than) symbol:

poretools fastq <path/to/fast5/>*.fast5 > <poretools_out>.fastq
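
To make the redirection mechanics concrete, here is a minimal sketch that does not require poretools. The function fake_tool is a hypothetical stand-in for any command-line tool: it prints one FASTQ record to stdout and a log message to stderr. The file names reads.fastq and run.log are likewise made up for the illustration.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for a tool: FASTQ record on stdout, log message on stderr
fake_tool() {
    printf '@read1\nACGT\n+\nIIII\n'
    echo "processed 1 read" >&2
}

# >  creates the file, or overwrites it if it already exists
# 2> does the same for stderr, keeping log messages out of the FASTQ
fake_tool > reads.fastq 2> run.log

# >> and 2>> append to the existing files instead of overwriting them
fake_tool >> reads.fastq 2>> run.log

# A FASTQ record spans 4 lines, so two records give 8 lines
n_lines=$(wc -l < reads.fastq | tr -d ' ')
echo "lines in reads.fastq: $n_lines"
```

Keeping stdout and stderr in separate files matters here because any message written into the FASTQ would break the 4-lines-per-read layout that downstream tools expect.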

A good practice is to compress your data after processing, in order to save storage space on disk. You can still view compressed files in a terminal with zcat, or go back to the uncompressed data using gzip -d <file.gz>:

gzip -9 <poretools_out>.fastq
zcat <poretools_out>.fastq.gz | head
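
The full round trip (compress, inspect, decompress, verify) can be sketched on a toy file as follows. The file names toy.fastq and toy.orig.fastq are made up for the illustration. The sketch uses gzip -dc, which is equivalent to zcat on Linux and also works on systems where zcat expects a different file suffix.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# Toy FASTQ with a single record, plus a copy kept for comparison
printf '@read1\nACGT\n+\nIIII\n' > toy.fastq
cp toy.fastq toy.orig.fastq

gzip -9 toy.fastq        # replaces toy.fastq with toy.fastq.gz
gzip -t toy.fastq.gz     # t for: Test the integrity of the compressed file

# Count reads without decompressing to disk (4 lines per FASTQ record)
n_reads=$(( $(gzip -dc toy.fastq.gz | wc -l) / 4 ))
echo "reads in archive: $n_reads"

gzip -d toy.fastq.gz     # replaces toy.fastq.gz with toy.fastq
cmp toy.fastq toy.orig.fastq && echo "round trip OK"
```

Note that gzip removes the input file once it has finished, in both directions, so only one of the compressed and uncompressed versions exists at any time unless you keep a copy yourself.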

Next

Go to next tutorial Extraction of PacBio reads.

Finish this section, go to Checkpoint.

Go back to Table of content.
