PacBio RSII format: HDF

Section: Data [2/5].

Generation of raw data

In PacBio, the basecalling is done according to the corresponding base-level incorporation events, calculated based on the type of fluorescent dye (unique per nucleotide), over time.

Content of bax.h5 and bas.h5 files

Optical raw data per RSII SMRTcell is stored in:

three associated bax.h5, each one containing a consecutive part of the nucleotide incorporation movie
a bas.h5, now optional
a meta-file xml, including sequencing run log

Such h5 files are a type of HDF). In the bax.h5, you can find the basecalling information, alongside with metadata about the sequencing run and instrument settings. The bas.h5 file is basically there to link the bax.h5 files. You can find here extensive documentation about PacBio .h5 archive layout .

You can see how a "real" run looks like here.

For a given movie (1 flowcell) in the run:

[   ] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.1.bax.h5      2014-11-29 13:05  7.8G
[   ] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.2.bax.h5      2014-11-29 13:06  7.9G
[   ] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.3.bax.h5      2014-11-29 13:08  9.2G
[TXT] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.metadata.xml  2014-11-29 13:14  4.0K

! Practical note: if you don't want/can't to dive in the "Optional" part(s), you can directly jump at the end of this page and get the fastq reads you need for later on. If you want to continue, be aware that we will practice extraction on RNA sequencing data (RSII).

The following steps are NOT compulsory, because they will take some time, especially for data download. Also, sorry but .h5 PacBio data won't be possible to be subset, nor rename (since the bax/bas files are linked to each other), thus you'll get a "real life" experience here ;)

You are very welcome to try extraction if you have time or just curious (or let this go during a break), especially the Bioconda part, very fashionable lately.

Extraction: pbh5tools

Get data

wget https://drive.switch.ch/index.php/s/uurG2IlbzgqgBeH/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.1.bax.h5.gz
wget https://drive.switch.ch/index.php/s/0PtGcftWhosEUam/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.2.bax.h5.gz
wget https://drive.switch.ch/index.php/s/LHy0dsz7YPQMRLt/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.3.bax.h5.gz
wget https://drive.switch.ch/index.php/s/NOvA8JPvPPcZSHe/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.bas.h5.gz

Extract data

First, extract the data from the zipped format (reminder: gzip -d).

We will use a tool in pbh5tools for extraction. This is a set of python scripts to extract fasta/q from h5 reads. Usage is detailed in the documentation. For extraction, we will need bash5tools.py utility.

The software is already installed in the docker image you have.

To access usage and local documentation use flag -h. From manual:

The bash5tools.py provides mechanisms to extract basecall information from bas.h5 files.

To extract all the subreads:

bash5tools.py --outFilePrefix <subreads_prefix> --outType fastq --readType subreads <input.bas.h5>

Not in our dataset!

You'll have to have run the SMRT pipeline to get at this point

To extract only the CCS reads (! if any):

bash5tools.py --outFilePref <ccs_prefix> --outType fastq --readType ccs --minLength 100 <input.bas.h5>

Extraction: bax2bam (via PacBio bioconda)

A workaround when you have those *.bax files is to convert them to bam, then fasta/q. This is very easily done using PacBio's bioconda packages. You'll learn more about PacBio's BAM in the next section

Same data as above!

Bioconda recipe: bax2bam

You need to:

install conda (3.7):
- get the installer:
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
follow instructions:
- launch it! bash Miniconda3-latest-Linux-x86_64.sh -f
- press "enter" until end of License terms
- enter "yes" (to accept the License terms)
- enter a non-existing location for installation (you can keep the suggested one - for example: /home/training/miniconda3/)
- enter "yes" (to prepend the install location to PATH in the .bashrc)
- open a new terminal -> tadah!
- check with: which conda
  - this should point to your installation location (following the example: /home/training/miniconda3/bin)
  - if not, prepend the location to EVERY conda command here below! (e.g. conda -> /home/training/miniconda3/bin/conda)
setup the necessary channels:
- conda config --add channels defaults
- conda config --add channels bioconda
- conda config --add channels conda-forge
install your favourite PacBiotool, in this specific case bax2bam:
- conda install bax2bam
  - enter "yes" (when asked to proceed to installation)
- check with: which bax2bam
  - this should point to your installation location (following the example: /home/training/miniconda3/bin)
  - if not, prepend the location to EVERY conda command here below! (e.g. bax2bam -> /home/training/miniconda3/bin/bax2bam)

To access usage and local documentation use flag -h (bax2bam -h).

Extract data

From manual, this utility allows you to extract:

bax2bam converts the legacy PacBio basecall format (bax.h5) into the BAM basecall format.

extract subsections:

Output read types (mutually exclusive): --subread Output subreads (default) --hqregion Output HQ regions --polymeraseread Output full polymerase read --ccs Output CCS sequences (requires ccs.h5 input)

and generate a fully-compatible PacBio bam file:

Pulse feature options: Configure pulse features in the output BAM. Supported features include: Pulse Feature: BAM tag: Default: DeletionQV dq Y DeletionTag dt Y InsertionQV iq Y IPD ip Y PulseWidth pw Y* MergeQV mq Y SubstitutionQV sq Y SubstitutionTag st N

Usage:

bax2bam <file_bax.h5> -o <bax2bam_prefix>

Extraction(obsolete?): pls2fasta via blasr

Same data as above!

Get pls2fasta

In some publications, you can still see pls2fasta as extraction method. This utility lays in the blasr/utils github.

This utility is not embedded in the current PacBio's bioconda blasr package, but you can find it in:

older forks of Blasr github
in older versions of Blasr in your cluster (if you're a Vital-IT user)

However, pls2fasta is especially useful for data which do not contain any referenced bas file in the run (newer versions of data delivery), if you just want the fasta/q out with no intermediate passage via bam format.

To access usage and local documentation use flag -h. From manual, this utility allows you to:

pls2fasta Converts plx.h5/bax.h5/fofn files to fasta or fastq files. Although fasta files are provided with every run, they are not trimmed nor split into subreads. This program takes additional annotation information, such as the subread coordinates and high quality regions and uses them to create fasta sequences that are substrings of all bases called. Most of the time you will want to trim low quality reads, so you should specify -trimByRegion.

Usage then is simply:

pls2fasta <input.bax.h5> <output.fastq> -fastq

You're done!

Cool, now your data are "usable" for most software.

In case you didn't have time to try, or anything went wrong, we provide a backup of PacBio subreads and CCS reads:

# extracted fastq
# subreads
wget https://drive.switch.ch/index.php/s/sdGx759jCo7BpA0/download -O RNAPBsub_fq.tar.gz
# ccs
wget https://drive.switch.ch/index.php/s/cGsicVmlpcnT6WB/download -O RNAPBccs_fq.tar.gz

Format: PacBio Sequel

Back

Table of content

Formats_RSII - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

PacBio RSII format: HDF

Generation of raw data

Content of bax.h5 and bas.h5 files

Extraction: pbh5tools

Get data

Extract data

Not in our dataset!

Extraction: bax2bam (via PacBio bioconda)

Same data as above!

Bioconda recipe: bax2bam

Extract data

Extraction(obsolete?): pls2fasta via blasr

Same data as above!

Get pls2fasta

You're done!

Next

Back

⚠️ GitHub.com Fallback ⚠️

Formats_RSII - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

PacBio RSII format: HDF

Generation of raw data

Content of bax.h5 and bas.h5 files

Extraction: pbh5tools

Get data

Extract data

Not in our dataset!

Extraction: bax2bam (via PacBio bioconda)

Same data as above!

Bioconda recipe: bax2bam

Extract data

Extraction(obsolete?): pls2fasta via blasr

Same data as above!

Get pls2fasta

You're done!

Next

Back

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️