Formats_Sequel - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Data [3/5].
The new standard for PacBio is to store the sequences in a "classical" BAM file, the format normally used for alignments. This BAM, however, carries PacBio-specific tags and header entries, which you can look up in detail on the dedicated website.
Data in Sequel format don't need extraction (lucky you!), so you can use them directly in most downstream PacBio-specific software.
Here you can see what a "real" run looks like.
For a given movie, three files are reported.
- m54006_170729_232022.subreads.bam (2017-07-30 09:28, 13G)
- m54006_170729_232022.subreads.bam.pbi (2017-07-30 09:28, 22M)
- m54006_170729_232022.subreadset.xml (2017-07-30 09:26, 13K)
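As a quick sanity check after downloading a run, the sketch below looks for the three files of one movie and reports any that are missing. The three suffixes are taken from the listing above; the function name `check_movie_files` is made up for this example.

```bash
# Check that the three per-movie files from the listing are present.
# check_movie_files is a hypothetical helper, not a PacBio tool.
check_movie_files() {
    movie=$1            # movie name, e.g. m54006_170729_232022
    dir=${2:-.}         # directory holding the files (default: current directory)
    rc=0
    for suffix in subreads.bam subreads.bam.pbi subreadset.xml; do
        if [ -f "$dir/$movie.$suffix" ]; then
            echo "found   $movie.$suffix"
        else
            echo "missing $movie.$suffix"
            rc=1
        fi
    done
    return $rc          # non-zero when at least one file is missing
}
```

Usage: `check_movie_files m54006_170729_232022 /path/to/run`.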
The .bam.pbi file is simply an index: a table of information about each read (and its alignment, if any) that some PacBio downstream software requires. For your information, you can generate such an index with the pbindex utility. As usual, the xml file contains the sequencing run metadata.
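A minimal sketch of how the index could be (re)generated only when it is missing. It assumes pbindex is installed and on your PATH (for example through Bioconda); `ensure_pbi` is a hypothetical helper name.

```bash
# Create the .pbi index for a PacBio BAM only when it does not exist yet.
# Assumes pbindex is on PATH; ensure_pbi is a made-up name for this sketch.
ensure_pbi() {
    bam=$1
    if [ -f "$bam.pbi" ]; then
        echo "index present: $bam.pbi"
    elif command -v pbindex >/dev/null 2>&1; then
        pbindex "$bam"      # writes $bam.pbi next to the BAM
    else
        echo "pbindex not found on PATH; cannot index $bam" >&2
        return 1
    fi
}
```

Usage: `ensure_pbi m54006_170729_232022.subreads.bam`.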
Some programs, though, need the input in fasta/q format instead of BAM. For future reference, see the PacBio BAM manipulation manual.
You can also convert your "old" basecalled RSII files into Sequel-like BAM (for pipeline compatibility): convert the basecalled bax.h5 files with [bax2bam](https://github.com/PacificBiosciences/PacBioFileFormats/wiki/BAM-recipes), then align the result with pbalign. Both utilities are also available through Bioconda, so check them out ;)
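The two steps could be chained roughly as follows. This is a sketch under assumptions: an RSII movie is taken to consist of three bax.h5 parts, the movie name and reference file are placeholders, and `run` and `convert_rsii_movie` are hypothetical helpers. With `DRY_RUN` set, the commands are only printed, which lets you inspect them before running anything.

```bash
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Convert one RSII movie to BAM, then align it to a reference.
# Assumes bax2bam and pbalign are installed (e.g. through Bioconda).
convert_rsii_movie() {
    movie=$1        # movie prefix shared by the three bax.h5 files (placeholder)
    reference=$2    # reference FASTA (placeholder)
    # bax2bam -o takes an output prefix and writes <prefix>.subreads.bam
    run bax2bam -o "$movie" "$movie.1.bax.h5" "$movie.2.bax.h5" "$movie.3.bax.h5" &&
    # pbalign takes input, reference and output file, in that order
    run pbalign "$movie.subreads.bam" "$reference" "$movie.aligned.bam"
}
```

Usage: `DRY_RUN=1 convert_rsii_movie my_movie reference.fasta` prints the two commands without running them.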
If you have time and want to try it, you can also convert this format using the PacBio utilities distributed through Bioconda (as on the previous page). As an exercise, try converting a BAM file into fastq. For this you need a PacBio BAM file and its corresponding index file:
# a tiny subset of the Avian dataset
wget https://drive.switch.ch/index.php/s/rmVRnGXbfmuzTfx/download -O PBbam.tar.gz
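Once the download has finished, the archive needs unpacking. The sketch below assumes PBbam.tar.gz is a gzip-compressed tarball, as its name suggests, and lists the BAM and index files it finds after extraction; `unpack_archive` is a hypothetical helper.

```bash
# Extract a .tar.gz archive and list the BAM and .pbi files it contains.
# unpack_archive is a made-up name; the archive layout is not known in advance.
unpack_archive() {
    archive=$1
    dest=${2:-.}    # extraction directory (default: current directory)
    [ -f "$archive" ] || { echo "$archive not found" >&2; return 1; }
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest"
    find "$dest" -type f \( -name '*.bam' -o -name '*.bam.pbi' \) | sort
}
```

Usage: `unpack_archive PBbam.tar.gz PBbam`. You should see one BAM and its .pbi; if the index is absent, the conversion further down will fail.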
(If you didn't do it before.) You need to:
- install conda (3.7):
  - get the installer:
      wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - follow the instructions:
    - launch it:
        bash Miniconda3-latest-Linux-x86_64.sh -f
    - press "enter" until the end of the License terms
    - enter "yes" (to accept the License terms)
    - enter a non-existing location for the installation (you can keep the suggested one, for example /home/training/miniconda3/)
    - enter "yes" (to prepend the install location to PATH in the .bashrc)
    - open a new terminal -> tadah!
    - check with:
        which conda
      - this should point to your installation location (following the example: /home/training/miniconda3/bin)
      - if not, prepend the location to EVERY conda command here below (e.g. conda -> /home/training/miniconda3/bin/conda)
- set up the necessary channels:
    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
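To confirm that the three channels were registered, you can inspect the output of `conda config --show channels`. The sketch below reads that output on standard input and prints the name of any channel that is absent; `missing_channels` is a hypothetical helper.

```bash
# Print every expected channel that does not appear in conda's channel list.
# Expects the output of `conda config --show channels` on stdin.
missing_channels() {
    listed=$(cat)
    for ch in defaults bioconda conda-forge; do
        # channel entries look like "  - bioconda"
        echo "$listed" | grep -qx "[[:space:]]*-[[:space:]]*$ch" || echo "$ch"
    done
}
```

Usage: `conda config --show channels | missing_channels`. No output means all three channels are configured.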
! NEW STUFF
- install your favourite PacBio tool, in this specific case bam2fastx (the package provides the bam2fasta and bam2fastq executables):
  - run:
      conda install bam2fastx
  - enter "yes" (when asked to proceed with the installation)
  - check with:
      which bam2fastq
    - this should point to your installation location (following the example: /home/training/miniconda3/bin)
    - if not, prepend the location to EVERY bam2fastq command here below (e.g. bam2fastq -> /home/training/miniconda3/bin/bam2fastq)
- To access the usage and local documentation, use the -h flag (bam2fastq -h).
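The `which` checks in the steps above can be wrapped in a small helper that falls back to the install location when a tool is not on your PATH. This is only a sketch: `PREFIX` defaults to the example location used on this page, and `resolve_tool` is a hypothetical name.

```bash
# Install prefix; defaults to the example location, override it with your own.
PREFIX=${PREFIX:-/home/training/miniconda3}

# Print the full path of a tool, looking on PATH first and in $PREFIX/bin second.
resolve_tool() {
    tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        command -v "$tool"
    elif [ -x "$PREFIX/bin/$tool" ]; then
        echo "$PREFIX/bin/$tool"
    else
        echo "$tool not found on PATH or in $PREFIX/bin" >&2
        return 1
    fi
}
```

Usage: `CONDA=$(resolve_tool conda) && "$CONDA" --version`, and likewise `resolve_tool bam2fastq` after installing the bam2fastx package.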
For example, conversion to fastq (gzipped by default) is simply:
bam2fastq -o <output> <file.pb.bam>
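A slightly more defensive version of the same conversion might look like this. It is a sketch that assumes `-o` is treated as an output prefix and that the result is written to `<prefix>.fastq.gz`; `bam_to_fastq` and `count_fastq_reads` are hypothetical helpers.

```bash
# Count the records in a gzipped FASTQ, assuming four lines per record.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Convert a PacBio BAM to FASTQ and report how many reads were written.
# Refuses to start when the .pbi index is not next to the BAM.
bam_to_fastq() {
    bam=$1
    prefix=$2
    [ -f "$bam.pbi" ] || { echo "missing index: $bam.pbi" >&2; return 1; }
    # assumed output name: $prefix.fastq.gz
    bam2fastq -o "$prefix" "$bam" && count_fastq_reads "$prefix.fastq.gz"
}
```

Usage: `bam_to_fastq <file.pb.bam> <output>`. Comparing the printed read count with what you expect from the BAM is a cheap way to spot a truncated conversion.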
Note: when downloading real data, make sure to also download the index, otherwise the command will fail. Little question: how do you generate the index? Find out ;)
If you had issues converting, or just didn't have time for that, here is a subset of the original dataset, converted to fastq:
# pacbio FASTQ from BAM
wget https://drive.switch.ch/index.php/s/LzXCP94TanTXaF0/download -O PBbam_subset.fastq.gz