PacBio: the Iso Seq pipeline - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

Introduction

The Iso-Seq pipeline is a method for obtaining a high-confidence transcriptome from PacBio RNA-seq data. Here, we will just give you a flavor of what Iso-Seq data look like, and describe the workflow we went through to produce the data you'll be working with.

If you ever need a full tutorial, here is a really good one - directly from PacBio.

Get a high-confidence transcript set

To get the high-confidence transcript set, we started with an experimental workflow that included size selection (BluePippin Size Selection System), resulting in libraries for three size fractions (1-2 kb, 2-3 kb and 3-7 kb). The Iso-Seq pipeline was run independently on the three size fractions, using the merged raw reads from the two SMRT Cells per fraction as input.

Size selection is recommended to get the most out of your libraries: it allows a broader range of transcripts to be detected more accurately, but it also depletes very short transcripts (in the shortest fraction) and very long transcripts (in the longest fraction). Consider this before starting your library preparation.

We then ran the Iso-Seq pipeline on the three fractions separately and merged the resulting files to obtain the final isoform set. Running the fractions separately could reduce the number of full-length isoforms in the output. However, we chose this approach because it is the advised way to obtain a better polished consensus at the end. Two potential pitfalls of this approach are that (1) the quality of the consensus obtained by ICE itself could be reduced, and (2) the runtime of the whole Iso-Seq pipeline could be longer.
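
The merging step described above can be sketched as a plain concatenation of the per-fraction FASTQ files. This is only a minimal sketch, not the commands we actually ran: the file names, the helper functions `merge_fractions` and `count_fastq_records`, and the toy records are all made up for illustration, and it assumes that every FASTQ record spans exactly four lines.

```shell
# Hypothetical helper: count records in a FASTQ file,
# assuming the plain 4-lines-per-record layout.
count_fastq_records() {
    awk 'END { print NR / 4 }' "$1"
}

# Hypothetical helper: concatenate per-fraction FASTQ files into one file.
# Usage: merge_fractions OUTPUT INPUT...
merge_fractions() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Toy demonstration: one made-up record per size fraction.
workdir=$(mktemp -d)
for fraction in 1-2kb 2-3kb 3-7kb; do
    printf '@isoform_%s\nACGT\n+\nIIII\n' "$fraction" > "$workdir/$fraction.fastq"
done

merge_fractions "$workdir/all_fractions.fastq" \
    "$workdir/1-2kb.fastq" "$workdir/2-3kb.fastq" "$workdir/3-7kb.fastq"

merged_count=$(count_fastq_records "$workdir/all_fractions.fastq")
echo "records after merging: $merged_count"

rm -r "$workdir"
```

On real data, the toy files would be replaced by the polished output of each fraction. If isoform names can repeat across fractions (worth checking on your own files), consider prefixing each name with its fraction label before merging.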

But, what's the Iso-Seq pipeline?

The Iso-Seq pipeline is a set of software tools developed by PacBio that takes raw reads (in bax.h5/bas.h5 format) and generates a high-confidence transcriptome.

Here's an overview of it:

[Figure: overview of the Iso-Seq pipeline]

In the first step, the subreads from each insert are combined into Reads Of Insert (ROI), the single highest-quality consensus per insert, and saved in FASTQ format.
In the second step ("Classify"), the ROIs are classified as full-length non-chimeric (FL) or non-full-length (nFL), depending on whether both the 5' and 3' primers are present at the read ends.
In the third step ("Cluster"), the FL ROIs are fed into ICE (Iterative Clustering and Error correction) to generate clusters of isoforms, while nFL reads are assigned to the clusters afterwards.
In the fourth step, the clusters are polished using Quiver: the final consensus for each transcript isoform is generated and saved in FASTQ format.
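
To make the FL/nFL distinction from the "Classify" step concrete, here is a toy sketch of the decision rule: a read counts as FL only if it starts with the 5' primer and ends with the 3' primer. The function `classify_read` and the primer sequences are invented for illustration and are not the real Iso-Seq primers or code. The real step has to cope with sequencing errors and also screens for chimeras, so exact string matching is a deliberate simplification.

```shell
# Hypothetical helper: label a read as FL or nFL by exact primer matching.
# Usage: classify_read SEQUENCE FIVE_PRIME_PRIMER THREE_PRIME_PRIMER
classify_read() {
    seq=$1
    five=$2
    three=$3
    case $seq in
        # Both primers present: 5' primer at the start, 3' primer at the end.
        "$five"*"$three") echo FL ;;
        # Anything else (one or both primers missing) is non-full-length.
        *) echo nFL ;;
    esac
}

# Made-up primers and reads, for illustration only.
five_primer=AAAC
three_primer=GGGT

label_both=$(classify_read "AAACTTTTGGGT" "$five_primer" "$three_primer")
label_missing=$(classify_read "AAACTTTT" "$five_primer" "$three_primer")

echo "read with both primers:  $label_both"
echo "read missing 3' primer:  $label_missing"
```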

What is the output of the Iso-Seq pipeline?

So, let's take a look at what a set of high-quality isoforms from PacBio looks like. We generated for you a subset of the file we got from the Iso-Seq pipeline:

cd $pacbio_rna
wget https://drive.switch.ch/index.php/s/fhH7oLIQ95EGuPt/download -O pacbioIS_subset.tar.gz
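
Once the archive is downloaded, a quick sanity check is to list its contents and summarize the isoform lengths. The following is a minimal sketch under a few assumptions: `fastq_length_stats` is a hypothetical helper (not part of any PacBio tool), it expects FASTQ records of exactly four lines, and since we do not assume any file names inside the archive, the demonstration runs on a made-up toy file.

```shell
# Hypothetical helper: summarize sequence lengths in a
# 4-lines-per-record FASTQ file (sequence is line 2 of each record).
fastq_length_stats() {
    awk 'NR % 4 == 2 {
             len = length($0)
             total += len
             if (n == 0 || len < min) min = len
             if (len > max) max = len
             n++
         }
         END {
             if (n == 0) { print "records=0"; exit }
             printf "records=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
         }' "$1"
}

# List and unpack the archive, but only if it has been downloaded.
archive=pacbioIS_subset.tar.gz
if [ -f "$archive" ]; then
    tar -tzf "$archive"   # show what the archive contains
    tar -xzf "$archive"   # unpack into the current directory
fi

# Toy demonstration with two made-up records.
toy=$(mktemp)
printf '@toy1\nACGT\n+\nIIII\n@toy2\nACGTACGT\n+\nIIIIIIII\n' > "$toy"
stats=$(fastq_length_stats "$toy")
echo "$stats"
rm "$toy"
```

After unpacking, the same helper can be pointed at whichever FASTQ file the archive contains, for example `fastq_length_stats <file>.fastq`.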

Go to tutorial Error correcting MinION 2D reads with Canu.

Go to Checkpoint.

Go back to Table of contents.