Iso Seq dataset - PacificBiosciences/DevNet GitHub Wiki

The dataset linked from this page contains the polished results of transcriptome sequencing for the human MCF7 breast cancer cell line using PacBio(R) SMRT(R) Sequencing. The libraries were prepared using the full-length cDNA protocol [1]. Size selection was performed by agarose gel cutting into 1-2 kb, 2-3 kb, and > 3 kb fractions. Sequencing was done using P4-C2 chemistry and 2-hour movies.

To obtain a non-redundant, high-quality, full-length set of transcripts, we applied an isoform-level clustering algorithm followed by consensus calling using Quiver. High-quality consensus sequences were then mapped back to the human genome (hg19) and redundant transcripts were collapsed to create the polished dataset below. Additional processing was done to identify fusion gene candidates. For a schematic of the bioinformatics process, see [2].

A public UCSC browser track containing the GFF files described below is available at http://tinyurl.com/l6fg74f

The entire polished dataset is available at http://datasets.pacb.com.s3.amazonaws.com/2013/IsoSeqHumanMCF7Transcriptome/list.html

DESCRIPTION OF FILES

IsoSeq_MCF7_polished.unimapped.fasta - Polished fasta sequences, non-chimeric only.

IsoSeq_MCF7_polished.unimapped.gff - Alignment of the above to hg19.

IsoSeq_MCF7_polished.fusion.fasta - Polished fasta sequences of the fusion gene candidates.

IsoSeq_MCF7_polished.fusion.gff - Alignment of the above to hg19. Each fusion candidate is named using the format + followed by the suffix _1, _2, to allow proper loading into the UCSC browser track.

IsoSeq_MCF7_polished.fusion.details.xlsx - An Excel sheet describing the fusion candidates and their literature support.
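
As a rough illustration of working with the files described above, the following Python sketch reads FASTA records from a text handle and groups names that differ only by a trailing _1, _2 suffix, as the fusion GFF entries are said to. It is a minimal sketch, not part of the original pipeline: the function names `read_fasta` and `group_fusion_segments` are hypothetical, the sample records are invented for illustration, and the grouping assumes the suffix is the last part of the name.

```python
import io
import re
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumption: segment names end in an underscore followed by digits.
_SUFFIX = re.compile(r"_(\d+)$")


def group_fusion_segments(names):
    """Group names that differ only by a trailing _1, _2, ... suffix."""
    groups = defaultdict(list)
    for name in names:
        groups[_SUFFIX.sub("", name)].append(name)
    return dict(groups)


# Invented records for illustration only; not taken from the dataset.
example = io.StringIO(">toy_a\nACGT\nAC\n>toy_b\nGGG\n")
records = list(read_fasta(example))
lengths = {name: len(seq) for name, seq in records}
groups = group_fusion_segments(["cand1_1", "cand1_2", "cand2_1"])
```

To apply this to a downloaded file, pass an open file handle (for example `open("IsoSeq_MCF7_polished.unimapped.fasta")`) to `read_fasta` in place of the in-memory example.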

REFERENCES

[1] http://www.smrtcommunity.com/Share/Protocol?id=a1q70000000HqSvAAK&strRecordTypeName=Protocol

[2] https://www.dropbox.com/s/lpb0ov3xxkg8hef/IsoSeq_bioinformatics_draft_schematic.pdf
