Find Barcodes in CCS Reads - PacificBiosciences/DevNet GitHub Wiki

For officially supported barcoding analysis in SMRT Analysis 1.4, please refer to the instructions found at the PacificBioSciences barcoding wiki.

GOAL: Provide R+D code for identifying barcodes in PacBio® CCS sequencing reads along with a working example.

This code, along with proper flanking-insert barcoded sample products and PacBio CCS sequencing data, will label each read with the most-likely barcode, give the barcode score, and show where in the read the barcode occurred (so you might trim it out if desired in downstream analysis).

Downloads

PacBioBarcodeCCSDist.zip: contains the barcode sequences, Python scripts and examples

Workflow Overview

This software is executed using the calling sequence:

PacBioBarcodeIDCCS.py "CCSReads.fasta" "barcodeIdentities.fasta" "outputDirectory"

where

"CCSReads.fasta" is the file of CCS reads from the PacBio sequencer
"barcodeIdentities.fasta" are the identities of the barcodes to look for
"outputDirectory" is where all the results are stored

This software assumes flanking-insert barcodes. This means that the sample has been prepped such that every PacBio CCS read has the form (in either forward or reverse-complement): (BarcodeLEFT + sequenceOfInterest + BarcodeRIGHT) See our barcoding technical report for sample prep methods to achieve this. The barcode identites are given in a fasta file with a simple naming convention to show pairing: fasta ids "F+barcodeid" and "R+barcodeid" indicate the LEFT and RIGHT barcodes respectively. See below for a worked example.

The result of processing is a file: "outputDirectory/all.RESULT.reduceMax". This gives for all reads that had identifiable barcode pairs: the fasta id of the read, the id of the barcode pair, the score of the most-likely barcode hit, and the location in the reads where the paired barcodes were identified. See below for a worked example. NOTE: this software assumes barcoded-flanked inserts and CCS reads. Methods for barcode identification in raw PacBio reads (not CCS) or for non-flanking barcode topologies are not covered here explicity.

Installation

This software is distributed as a simple collection of Python scripts in the bin/ directory. The code depends on the HMMer sequencing modeling package, http://hmmer.janelia.org/ Version 2 of HMMer is need. The newer 3.0 version is NOT compatible. Debian maintains a stable hmmer package (2.3.2) and the source is available: http://hmmer.janelia.org/software/archive After HMMer installation, where hmmbuild, hmmsearch of the HMMer package are available on the command line, the code can be used.

Complete Working Example

Here we run through a toy example. I am using the barcodes: barcode96searchpair8.fa . This is a subset of barcode96searchpair.fa Here is the first barcode pair in that file:

>F0 gcgctctgtgtgcagc >R0 agagtactacatatga

This indicates that all properly barcoded CCS reads containing this pair should have the form: (gcgctctgtgtgcagc + sequenceOfInterest + agagtactacatatga) I constructed several test sequences that use this barcode and others as well as noisy indels: testseqs.fasta

>fwExact0 gcgctctgtgtgcagcACGTTGCAagagtactacatatga
>rcExact1 gcgatctatgcacacgTGCAACGTtagtgtcgactcatga
>fwMut2 tatctatcgtacacgcACGTTGCAatgtatctcgctgca
>rcMut3 gactctgctcgagtcTGCAACGTtcagatgtcagtgtgat
>noise aaccggttaaccggttACGTTGCAaaccggttaaccggtt

Exectute barcode identifcation: PacBioBarcodeIDCCS.py testseqs.fasta barcode96searchpair8.fa barcodeOutput This takes a bit less than 8 seconds on my machine. Full PacBio chips of CCS reads take a couple minutes of processing (depending on CCS chip yield). Now all of your results are in the barcodeOutput/ directory. I've also included a pre-computed version in barcodeOutputKEEP The main result is in barcodeOutput/all.RESULT.reduceMax:

fastaid barcode shortid scoreAll scoreRight leftExtent rightExtent fwExact0 F0 s0 38.800000 19.400000 1:16 25:40 rcExact1 [revcomp] F1 s1 38.800000 19.400000 1:16 25:40 fwMut2 F2 s2 29.600000 14.100000 1:16 25:39 rcMut3 [revcomp] F3 s3 32.300000 16.800000 1:17 26:40

The first line indicates that the first sequence, "fwExact0", contains the barcode pair F0 with a total barcode score of 38.8 bits with the LEFT barcode occupying bases 1:16 (one-based, inclusive indexing) and the RIGHT barcode occupying bases 25:40. We can dive into the results to see the alignments. Looking at the first result line, we see that the hit occured in the forward sense. The LEFT barcode alignment is contained in barcodeOutput/fw/bps.F0.LEFT.hmm.hmmsearch Searching for the "s0" shortid shows:

s0: domain 1 of 1, from 1 to 16: score 19.4, E = 7.3e-06 ->GCGCTCTGTGTGCAGC<- GCGCTCTGTGTGCAGC s0 1 GCGCTCTGTGTGCAGC 16

The RIGHT is in bps.R0.RIGHT.hmm.hmmsearch:

s0: domain 1 of 1, from 25 to 40: score 19.4, E = 7.2e-06 ->AGAGTACTACATATGA<- AGAGTACTACATATGA s0 25 AGAGTACTACATATGA 40

Both LEFT and RIGHT are perfect. The last result in the table occurs on the reverse-complement sense ("[revcomp]"). Note that all results are reported on the reverse-complemented sequence. The results are located in barcodeOutput/rc. bps.F3.LEFT.hmm.hmmsearch:

s3: domain 1 of 1, from 1 to 17: score 16.8, E = 4.3e-05 ->ATCACACTG.CATCTGA<- ATCACACTG CATCTGA s3 1 ATCACACTGaCATCTGA 17

bps.R3.RIGHT.hmm.hmmsearch:

s3: domain 1 of 1, from 26 to 40: score 15.5, E = 0.00011 ->GACTCGACGCAGAGTC<- GACTCGA GCAGAGTC s3 26 GACTCGA-GCAGAGTC 40

This shows a one-base insertion and a one-base deletion as designed.

Conclusion

We have provided python code to identify barcodes in PacBio CCS reads and have run through a complete example.