PacBio HDF: h5 - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

How raw data is generated

In PacBio sequencing, bases are called from base-level incorporation events: each nucleotide type carries its own fluorescent dye, and the instrument records which dye emits light, and when, as nucleotides are incorporated over time.

The optical raw data for each SMRT Cell is stored in one bas.h5 file (a type of HDF file) and three associated bax.h5 files, each containing a consecutive part of the nucleotide incorporation movie.

Content of bax.h5 and bas.h5 files

The bax.h5 files contain the basecalling information, along with metadata about the sequencing run and instrument settings. The bas.h5 file mainly serves to link the bax.h5 files together, and also contains run metadata. You can find extensive documentation about the bas.h5 archive layout here.
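To see this layout for yourself, one option is to list the groups and datasets inside a file. The sketch below assumes the HDF5 command-line tools (which provide h5ls) are installed; the function name list_h5_layout is only illustrative and is not part of the workshop material.

```shell
# Minimal sketch: print the internal tree of an HDF5 file such as a bax.h5.
# Assumption: h5ls (from the HDF5 command-line tools) is available on PATH.
list_h5_layout() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    if ! command -v h5ls >/dev/null 2>&1; then
        echo "h5ls not found; install the HDF5 command-line tools" >&2
        return 1
    fi
    # -r walks the file recursively and prints every group and dataset
    h5ls -r "$file"
}
```

For example, list_h5_layout s1_p0.1.bax.h5 would print one line per group or dataset in that file, once it has been downloaded and decompressed.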

This read format is no longer used to store basecalling information on newer PacBio instruments such as the Sequel, where data are stored in the standard BAM format. However, all data produced by an RS II instrument are still in the bas.h5 and bax.h5 format, which is what you are going to work with.

OPTIONAL

We will practice extraction on RNA sequencing data. This step is NOT compulsory, and the download will take some time on your machine, but you are welcome to try it. Don't worry, skipping it won't affect the final outcome of this practical: we have already pre-processed the data and will provide a link to the fastq files for computing the assemblies.

Download the data

mkdir -p "$pacbio_rna/test" && cd "$pacbio_rna/test"
wget https://drive.switch.ch/index.php/s/GWPawZQIl18xsKt/download -O s1_p0.1.bax.h5.gz
wget https://drive.switch.ch/index.php/s/E2b6AUXtFv5UfVV/download -O s1_p0.2.bax.h5.gz
wget https://drive.switch.ch/index.php/s/CHAlwWD23vcjI6T/download -O s1_p0.3.bax.h5.gz
wget https://drive.switch.ch/index.php/s/jogtBHHmlJJMiYa/download -O s1_p0.bas.h5.gz
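The extraction step works on .h5 files, so the downloaded archives presumably need to be decompressed first. The following is a minimal sketch, assuming the downloads really are gzip archives as their .gz suffix suggests; the function name unpack_h5 is hypothetical.

```shell
# Minimal sketch: check and decompress every *.h5.gz archive in a directory.
# Assumption: the downloaded files are gzip-compressed, as the .gz suffix suggests.
unpack_h5() {
    dir="${1:-.}"
    for archive in "$dir"/*.h5.gz; do
        # If nothing matches, the glob stays literal; skip it
        [ -e "$archive" ] || continue
        # gzip -t tests archive integrity without writing any output file
        if ! gzip -t "$archive"; then
            echo "corrupt or incomplete download: $archive" >&2
            return 1
        fi
        # Replace the archive with its decompressed content
        gunzip -f "$archive"
    done
}

# Usage (after the downloads have finished):
# unpack_h5 "$pacbio_rna/test"
```

Testing each archive before unpacking gives an early warning if a download was interrupted, which is easier to diagnose here than as a later extraction error.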

Pbh5tools: extract fastq from bas.h5

You can extract this dataset using pbh5tools, a set of Python scripts for extracting FASTA/FASTQ reads from h5 files. The usage is detailed in the documentation. You will need the bash5tools.py script to extract the reads in fastq format.

bash5tools.py <file.bas.h5> --outFilePrefix <prefix> --readType subreads --outType fastq
gzip -9 <prefix>.fastq
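As a quick sanity check on the result, you could count the reads and bases in the compressed fastq. This is only a sketch: it assumes each record spans exactly four lines (no wrapped sequence lines), and the function name fastq_stats is illustrative.

```shell
# Minimal sketch: count reads and bases in a gzip-compressed fastq file.
# Assumption: every record is exactly 4 lines, so line 2 of each record is the sequence.
fastq_stats() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 { n++; bases += length($0) }
        END { printf "%d reads, %d bases\n", n, bases }'
}

# Usage, with the same placeholder prefix as above:
# fastq_stats <prefix>.fastq.gz
```

A read count of zero, or a count far from what you expect for the run, suggests the extraction did not complete properly.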

Next

Finish this section, go to Checkpoint.

Go back to tutorial Extraction of MinION reads.

Go back to Table of content.
