BCR ABL Fusion Protein - PacificBiosciences/DevNet GitHub Wiki

This page describes the analysis results for the BCR-ABL project.

Download the example: http://pacb.com/devnet/files/how-tos/bcr-abl/1.0/bcr-abl-1.0.zip

Summary

Using PacBio 855 base pair CCS long reads, we observe amino acid variants in the classic BCR-ABL fusion protein of chronic myelogenous leukemia. Because each long read comes from a single molecule, we can observe the co-occurrence of amino acid variants at different points in the same protein-coding mRNA (haplotypes). This is important because current strategies rely on the identity of these variants to guide drug treatment (ARIAD Pharmaceuticals, Shah Lab UCSF). PacBio long reads allow us to observe the mutational landscape of these variants.

A focused view of these variants in two patient samples shows many different observed variants:

Methods

We isolate the BCR-ABL fusion product from peripheral blood samples of two patients using a two-step amplification process. The forward primer is in BCR and the reverse primer is in ABL, yielding an 800 bp product of the BCR-ABL fusion. We then do a standard PacBio SMRTbell prep and run RS sequencing. We concentrate on PacBio CCS reads, which combine multiple sequencing passes into a single read estimate.

We take CCS reads and align them against a canonical BCR-ABL reference sequence. Using codonMutAnalysis.py on these alignments, we characterize codon differences between the reference and the observed read at the following amino acid positions of interest: 253, 255, 315, 319, and 355. For amino acid positional summaries, we give the fraction of counts in which a variant is observed at that position and the fraction of counts in which the same variant is observed not at that position. The latter quantity serves as the background rate in a simple null model. For read summaries, we give the number of reads showing each pattern of amino acid variants.

Example calling sequence: python codonMutAnalysis.py BCRABL.ccsreads.cmp.h5 > codon_151628.txt
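The codon comparison described above can be illustrated with a small sketch. This is not the code in codonMutAnalysis.py; the function names, the gapped-string alignment representation, and the assumption that the reference begins in frame at amino acid `first_aa` are all hypothetical choices made for illustration.

```python
from collections import Counter

def codon_at(ref_aln, read_aln, aa_pos, first_aa=1):
    """Return (ref_codon, read_codon) for amino acid position aa_pos.

    ref_aln and read_aln are equal-length aligned strings with '-' for gaps.
    Assumption: the reference starts in frame at amino acid first_aa.
    """
    start = 3 * (aa_pos - first_aa)  # 0-based reference offset of the codon
    ref_codon, read_codon = [], []
    ref_idx = 0
    for r, q in zip(ref_aln, read_aln):
        if r == "-":
            # Insertion in the read: skipped here; a real tool needs a policy.
            continue
        if start <= ref_idx < start + 3:
            ref_codon.append(r)
            read_codon.append(q)
        ref_idx += 1
    return "".join(ref_codon), "".join(read_codon)

def tally_codons(alignments, aa_pos, first_aa=1):
    """Count observed read codons at aa_pos, ignoring codons with deletions."""
    counts = Counter()
    for ref_aln, read_aln in alignments:
        _, read_codon = codon_at(ref_aln, read_aln, aa_pos, first_aa)
        if len(read_codon) == 3 and "-" not in read_codon:
            counts[read_codon] += 1
    return counts

# Toy alignments (made up for illustration, not patient data).
toy = [
    ("atgactgag", "atgactgag"),    # matches the reference
    ("atgactgag", "atgattgag"),    # act -> att at amino acid 2
    ("atgac-tgag", "atgacctgag"),  # insertion in the read, skipped
    ("atgactgag", "atga-tgag"),    # deletion inside the codon, not counted
]
print(tally_codons(toy, aa_pos=2))
```

Skipping inserted bases and dropping codons that contain a deletion is only one possible policy; the actual script may treat indels differently.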

For a more global characterization, we align all CCS reads for a run and cluster them using simple agglomerative clustering on binary data vectors that encode the alignment.
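As a rough illustration of agglomerative clustering on binary vectors, the sketch below merges the closest pair of clusters until a target number remains. The encoding (1 for a match to the reference at a position, 0 otherwise), the Hamming distance, the average linkage, and the stopping rule are all assumptions; the page does not say which choices were actually used.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise Hamming distance between members of two clusters.
        total = sum(hamming(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (a < b by construction).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy vectors (made up): two reads near all-zero, two near all-one.
reads = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
print(agglomerate(reads, n_clusters=2))
```

This brute-force version recomputes every pairwise linkage on each merge, so it is only practical for small inputs; for thousands of reads a library implementation of hierarchical clustering would be the more realistic choice.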

Results and Discussion

Patient sample 1 has 2011 CCS reads with a median accuracy of 95.91% and a median read length of 845. Patient sample 2 has 1461 CCS reads with a median accuracy of 96.13% and a median read length of 855. Averaging the two samples' medians, the sequencing yielded long reads of about 850 bp with 96.02% accuracy.

For the positional analysis we give the following data:

one : an analysis flag
poi : amino acid position of interest
coi : codon of interest
Qcodon : the Query observed codon
fracQNotP : fraction of times the variant (codon of interest->observed codon) was observed NOT at the position of interest
countQNotP : the numerator of fracQNotP
SQNotP : the denominator of fracQNotP
fracQAtP : fraction of times the variant (codon of interest->observed codon) was observed at the position of interest
countQAtP : the numerator of fracQAtP
SQAtP : the denominator of fracQAtP
Poisson : a simple p-value of seeing the variant at the position given the rate it was observed NOT at the position.
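One plausible reading of the Poisson column is sketched below: take the off-position rate fracQNotP as the background, scale it by SQAtP to get an expected count, and report the Poisson probability of seeing more than countQAtP events. This definition is an assumption rather than something the page states, and the function names are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X > k) for X ~ Poisson(lam), summed term by term in log space."""
    if lam <= 0:
        return 0.0
    total = 0.0
    i = k + 1
    while True:
        # log pmf(i) = -lam + i*log(lam) - log(i!)
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mode the terms only shrink; stop once they are negligible.
        if i > lam and term <= 1e-17 * total:
            break
        i += 1
    return total

def position_p_value(count_not_p, s_not_p, count_at_p, s_at_p):
    """Assumed null model: background rate from off-position observations."""
    frac_not_p = count_not_p / s_not_p  # fracQNotP
    lam = frac_not_p * s_at_p           # expected count at the position
    return poisson_sf(count_at_p, lam)

# Counts taken from the sample 2 row "255 gag aag" in the table on this page.
p = position_p_value(27, 16613, 16, 733)
print(f"{p:.2e}")
```

If this reading is right, the printed value should land near the tabulated 1.79e-14 for that row. Very strong signals underflow to 0.0, much like the 0.00e+00 entries in the tables.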

Codon positional analysis shows for sample 1:

one poi coi Qcodon fracQNotP countQNotP SQNotP fracQAtP countQAtP SQAtP Poisson
1 315 act act 0.99718 1769 1774 0.50920 332 652 1.00e+00
1 315 act att 0.00056 1 1774 0.47853 312 652 0.00e+00
1 255 gag gag 0.99391 10122 10184 0.58213 241 414 1.00e+00
1 255 gag gtg 0.00029 3 10184 0.38164 158 414 0.00e+00
1 253 tac tac 0.99561 6578 6607 0.98018 445 454 6.18e-01
1 355 gag gag 0.97717 9845 10075 0.99234 518 522 3.52e-01
1 319 acc acc 0.99300 3688 3714 0.99705 675 677 4.48e-01

Codon read analysis summary (this differs slightly from the plot in the Summary because the two were produced by different code; the Summary plot is visually cleaner, whereas this plot carries more detailed information):

Codon positional analysis shows for sample 2:

one poi coi Qcodon fracQNotP countQNotP SQNotP fracQAtP countQAtP SQAtP Poisson
1 315 act act 0.99336 3141 3162 0.03226 34 1054 1.00e+00
1 315 act gct 0.00032 1 3162 0.48956 516 1054 0.00e+00
1 315 act tct 0.00032 1 3162 0.02562 27 1054 1.04e-43
1 315 act att 0.00095 3 3162 0.11954 126 1054 1.23e-214
1 315 act gtt 0.00032 1 3162 0.03510 37 1054 1.02e-63
1 315 act ttt 0.00032 1 3162 0.27704 292 1054 0.00e+00
1 255 gag aag 0.00163 27 16613 0.02183 16 733 1.79e-14
1 255 gag gag 0.95955 15941 16613 0.96453 707 733 4.35e-01
1 253 tac cac 0.00026 3 11397 0.07152 54 755 1.65e-112
1 253 tac tac 0.99746 11368 11397 0.91523 691 755 9.88e-01
1 355 gag gag 0.99363 16546 16652 0.14761 102 691 1.00e+00
1 355 gag ggg 0.00066 11 16652 0.84226 582 691 0.00e+00
1 319 acc acc 0.99034 6050 6109 0.90633 1016 1121 9.98e-01
1 319 acc gcc 0.00016 1 6109 0.08475 95 1121 1.71e-221

Codon read analysis summary:

The Shah lab was looking to confirm the following variants, all of which were observed in the data.

Sample 1 :
T315I = 29%
E255V = 15%

Sample 2 :
315F+Y253H = 0.2% (4 counts)
T315A = 17%
T315A+Y253H+T319A+E355G = 0.3% (3 counts)
T315A+T319A+E355G = 2%
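Labels such as T315I or T315A+T319A+E355G can be derived from per-read codons, as in the sketch below. The reference codons and the partial codon table are taken from the codon tables on this page; the function names, the per-read dictionary layout, and the toy reads are hypothetical, and this is not the code that produced the percentages above.

```python
from collections import Counter

# Reference codons at the positions of interest, as listed in the tables.
REF_CODONS = {253: "tac", 255: "gag", 315: "act", 319: "acc", 355: "gag"}

# Partial codon table: only codons that appear in the tables on this page.
CODON_TO_AA = {
    "act": "T", "acc": "T", "att": "I", "gct": "A", "gcc": "A",
    "tct": "S", "gtt": "V", "gtg": "V", "ttt": "F", "gag": "E",
    "aag": "K", "ggg": "G", "tac": "Y", "cac": "H",
}

def variant_label(pos, ref_codon, read_codon):
    """Return a label like 'T315I', or None for a synonymous or identical codon."""
    ref_aa = CODON_TO_AA[ref_codon]
    read_aa = CODON_TO_AA[read_codon]
    if ref_aa == read_aa:
        return None
    return f"{ref_aa}{pos}{read_aa}"

def haplotype_label(read_codons, ref_codons=REF_CODONS):
    """Join the amino acid changes on one read, ordered by position."""
    labels = []
    for pos in sorted(ref_codons):
        label = variant_label(pos, ref_codons[pos], read_codons[pos])
        if label:
            labels.append(label)
    return "+".join(labels) or "wild type"

def haplotype_percentages(reads, ref_codons=REF_CODONS):
    """Percentage of reads carrying each haplotype label."""
    counts = Counter(haplotype_label(r, ref_codons) for r in reads)
    total = sum(counts.values())
    return {h: 100.0 * n / total for h, n in counts.items()}

# Toy reads (made up): three wild type, one carrying act -> att at 315.
toy_reads = [dict(REF_CODONS)] * 3 + [{**REF_CODONS, 315: "att"}]
print(haplotype_percentages(toy_reads))
```

This sketch orders the changes within a label by position, so a combined haplotype would print as Y253H+T315A rather than in the order used in the list above.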

The result of clustering all reads from the patient sample 1:

We appear to observe many stable isoforms (different clusters) in the sample.

Looking at an alignment of a read against the genome, we can clearly see deleted exons (read on top):

This is an example of larger variation than amino-acid variants.
