Formats_RSII - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Data [2/5].
In PacBio, the basecalling is done according to the corresponding base-level incorporation events, calculated based on the type of fluorescent dye (unique per nucleotide), over time.
Optical raw data per RSII SMRTcell is stored in:
- three associated
bax.h5
, each one containing a consecutive part of the nucleotide incorporation movie - a
bas.h5
, now optional - a meta-file
xml
, including sequencing run log
Such h5
files are a type of HDF). In the bax.h5
, you can find the basecalling information, alongside with metadata about the sequencing run and instrument settings. The bas.h5
file is basically there to link the bax.h5
files. You can find here extensive documentation about PacBio .h5
archive layout .
You can see how a "real" run looks like here.
For a given movie (1 flowcell) in the run:
[ ] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.1.bax.h5 2014-11-29 13:05 7.8G
[ ] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.2.bax.h5 2014-11-29 13:06 7.9G
[ ] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.3.bax.h5 2014-11-29 13:08 9.2G
[TXT] m141129_125831_42161_c100698142550000001823143403261592_s1_p0.metadata.xml 2014-11-29 13:14 4.0K
! Practical note: if you don't want/can't to dive in the "Optional" part(s), you can directly jump at the end of this page and get the fastq reads you need for later on. If you want to continue, be aware that we will practice extraction on RNA sequencing data (RSII).
The following steps are NOT compulsory, because they will take some time, especially for data download. Also, sorry but .h5
PacBio data won't be possible to be subset, nor rename (since the bax/bas files are linked to each other), thus you'll get a "real life" experience here ;)
You are very welcome to try extraction if you have time or just curious (or let this go during a break), especially the Bioconda part, very fashionable lately.
wget https://drive.switch.ch/index.php/s/uurG2IlbzgqgBeH/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.1.bax.h5.gz
wget https://drive.switch.ch/index.php/s/0PtGcftWhosEUam/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.2.bax.h5.gz
wget https://drive.switch.ch/index.php/s/LHy0dsz7YPQMRLt/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.3.bax.h5.gz
wget https://drive.switch.ch/index.php/s/NOvA8JPvPPcZSHe/download -O m160615_134828_42182_c101000182550000001823232709161602_s1_p0.bas.h5.gz
First, extract the data from the zipped format (reminder: gzip -d
).
We will use a tool in pbh5tools for extraction. This is a set of python scripts to extract fasta/q from h5 reads. Usage is detailed in the documentation. For extraction, we will need bash5tools.py
utility.
The software is already installed in the docker image you have.
To access usage and local documentation use flag -h
. From manual:
The bash5tools.py provides mechanisms to extract basecall information from bas.h5 files.
To extract all the subreads:
bash5tools.py --outFilePrefix <subreads_prefix> --outType fastq --readType subreads <input.bas.h5>
You'll have to have run the SMRT pipeline to get at this point
To extract only the CCS reads (! if any):
bash5tools.py --outFilePref <ccs_prefix> --outType fastq --readType ccs --minLength 100 <input.bas.h5>
A workaround when you have those *.bax files is to convert them to bam, then fasta/q. This is very easily done using PacBio's bioconda packages. You'll learn more about PacBio's BAM in the next section
You need to:
-
install
conda
(3.7):- get the installer:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
-
follow instructions:
- launch it!
bash Miniconda3-latest-Linux-x86_64.sh -f
- press "enter" until end of License terms
- enter "yes" (to accept the License terms)
- enter a non-existing location for installation (you can keep the suggested one - for example:
/home/training/miniconda3/
) - enter "yes" (to prepend the install location to PATH in the .bashrc)
- open a new terminal -> tadah!
- check with:
which conda
- this should point to your installation location (following the example:
/home/training/miniconda3/bin
) - if not, prepend the location to EVERY conda command here below! (e.g.
conda
->/home/training/miniconda3/bin/conda
)
- this should point to your installation location (following the example:
- launch it!
-
setup the necessary channels:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
-
install your favourite PacBiotool, in this specific case
bax2bam
:-
conda install bax2bam
- enter "yes" (when asked to proceed to installation)
- check with:
which bax2bam
- this should point to your installation location (following the example:
/home/training/miniconda3/bin
) - if not, prepend the location to EVERY conda command here below! (e.g.
bax2bam
->/home/training/miniconda3/bin/bax2bam
)
- this should point to your installation location (following the example:
-
To access usage and local documentation use flag -h
(bax2bam -h
).
From manual, this utility allows you to extract:
bax2bam converts the legacy PacBio basecall format (bax.h5) into the BAM basecall format.
extract subsections:
Output read types (mutually exclusive): --subread Output subreads (default) --hqregion Output HQ regions --polymeraseread Output full polymerase read --ccs Output CCS sequences (requires ccs.h5 input)
and generate a fully-compatible PacBio bam file:
Pulse feature options: Configure pulse features in the output BAM. Supported features include: Pulse Feature: BAM tag: Default: DeletionQV dq Y DeletionTag dt Y InsertionQV iq Y IPD ip Y PulseWidth pw Y* MergeQV mq Y SubstitutionQV sq Y SubstitutionTag st N
Usage:
bax2bam <file_bax.h5> -o <bax2bam_prefix>
In some publications, you can still see pls2fasta
as extraction method. This utility lays in the blasr/utils github.
This utility is not embedded in the current PacBio's bioconda blasr package, but you can find it in:
- older forks of Blasr github
- in older versions of Blasr in your cluster (if you're a Vital-IT user)
However, pls2fasta
is especially useful for data which do not contain any referenced bas file in the run (newer versions of data delivery), if you just want the fasta/q out with no intermediate passage via bam format.
To access usage and local documentation use flag -h
. From manual, this utility allows you to:
pls2fasta Converts plx.h5/bax.h5/fofn files to fasta or fastq files. Although fasta files are provided with every run, they are not trimmed nor split into subreads. This program takes additional annotation information, such as the subread coordinates and high quality regions and uses them to create fasta sequences that are substrings of all bases called. Most of the time you will want to trim low quality reads, so you should specify -trimByRegion.
Usage then is simply:
pls2fasta <input.bax.h5> <output.fastq> -fastq
Cool, now your data are "usable" for most software.
In case you didn't have time to try, or anything went wrong, we provide a backup of PacBio subreads and CCS reads:
# extracted fastq
# subreads
wget https://drive.switch.ch/index.php/s/sdGx759jCo7BpA0/download -O RNAPBsub_fq.tar.gz
# ccs
wget https://drive.switch.ch/index.php/s/cGsicVmlpcnT6WB/download -O RNAPBccs_fq.tar.gz