Drosophila sequence and assembly - PacificBiosciences/DevNet GitHub Wiki

Instrument: PacBio RS II
Chemistry: C3
Enzyme: P5

Summary

The data released here is associated with the data release announcement on the PacBio blog. In collaboration with Dr. Casey Bergman at the University of Manchester and Drs. Susan Celniker and Roger Hoskins of the Berkeley Drosophila Genome Project (BDGP) at Lawrence Berkeley National Laboratory, we have sequenced adult males from a subline of the ISO1 (y; cn, bw, sp) strain of D. melanogaster. This is the same stock used in the official BDGP reference assemblies since the first genome sequence release in 2000. The DNA was size-selected for >15 kb elution using BluePippin™ (Sage Science). In total, ~15 Gb (105.8X) of sequence was generated from a 20 kb library using P5-C3 sequencing chemistry on the PacBio® RS II:

Total number of bases: 15,208,567,933 bp
Total number of reads: 1,514,730
Average read length: 10,040 bp
Half of sequenced bases in reads greater than: 14,214 bp
PacBio RS II instrument time for sequencing: 6 days
Number of SMRT Cells: 42
Number of Instrument Runs: 6
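
As a rough illustration of how these figures relate, the Python sketch below computes the same kind of summary from a list of read lengths. The "half of sequenced bases in reads greater than" value is the read-length N50. The helper name `read_length_stats` is hypothetical and is not part of any PacBio tool. The check at the end uses only the totals reported above.

```python
def read_length_stats(lengths):
    """Return (total_bases, read_count, mean_length, n50) for read lengths."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    count = len(lengths)
    mean = total / count
    # N50: walk reads from longest to shortest until the running sum
    # covers at least half of all sequenced bases.
    running = 0
    n50 = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return total, count, mean, n50


# Consistency check using only the totals reported in this release.
TOTAL_BASES = 15_208_567_933
TOTAL_READS = 1_514_730
mean_from_totals = TOTAL_BASES / TOTAL_READS  # rounds to 10,040 bp
implied_genome_size = TOTAL_BASES / 105.8     # genome size implied by 105.8X
```

Dividing the total bases by the stated 105.8X coverage implies a genome size of about 143.7 Mb for the coverage calculation.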

Download Raw Dataset

Preliminary analyses and step-by-step instructions for downloading, mapping, and visualizing the raw data are described on the Bergman lab blog. The raw data can be downloaded in 6 tarballs from the PacBio AWS site:

https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro1_24NOV2013_398.tgz
https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro2_25NOV2013_399.tgz
https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro3_26NOV2013_400.tgz
https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro4_28NOV2013_401.tgz
https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro5_29NOV2013_402.tgz
https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro6_1DEC2013_403.tgz
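
A minimal sketch for fetching the six tarballs with the Python standard library follows. The function names `raw_tarball_urls` and `download_all` are hypothetical, and the sketch assumes the URLs listed above are still being served, which has not been verified here.

```python
import os
import urllib.request

BASE_URL = "https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/"
TARBALLS = (
    "Dro1_24NOV2013_398.tgz",
    "Dro2_25NOV2013_399.tgz",
    "Dro3_26NOV2013_400.tgz",
    "Dro4_28NOV2013_401.tgz",
    "Dro5_29NOV2013_402.tgz",
    "Dro6_1DEC2013_403.tgz",
)


def raw_tarball_urls():
    """Return the full URL of each raw-data tarball."""
    return [BASE_URL + name for name in TARBALLS]


def download_all(dest_dir="."):
    """Download every tarball into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in raw_tarball_urls():
        dest = os.path.join(dest_dir, url.rsplit("/", 1)[1])
        # Skip files that already exist. A partial download from an
        # interrupted run would also be skipped, so delete it first.
        if not os.path.exists(dest):
            urllib.request.urlretrieve(url, dest)
        paths.append(dest)
    return paths
```

Calling `download_all("dmel_raw")` would place the files in a `dmel_raw` directory. The tarballs are large, so check available disk space before starting.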

Alternatively, you can download the raw data from the NCBI Sequence Read Archive (SRA) under accession SRX499318.

Please cite the following publication if you use this raw dataset in your research:

Kim KE, Peluso P, Babayan P, Yeadon PJ, Yu C, Fisher WW, Chin CS, Rapicavoli NA, Rank DR, Li J, Catcheside DE, Celniker SE, Phillippy AM, Bergman CM, Landolin JM. Long-read, whole-genome shotgun sequence data for five model organisms. Sci Data. 2014 1:140045. http://www.nature.com/articles/sdata201445

Download Assemblies

FALCON Assembly

You can download the preassembled reads as well as the final diploid assembly contigs file from the FALCON diploid assembler here:

https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/reads/dmel_FALCON_diploid_assembly.tgz

Please cite the following publication if you use the FALCON datasets in your research:

Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O'Malley R, Figueroa-Balderas R, Morales-Cruz A, Cramer GR, Delledonne M, Luo C, Ecker JR, Cantu D, Rank DR, Schatz MC. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods. 2016 13:1050-1054. http://www.nature.com/nmeth/journal/v13/n12/full/nmeth.4035.html

PBcR-CA Assembly

Preassembled reads, plus unpolished and polished assemblies of the 25X longest preassembled reads generated using PBcR and Celera Assembler 8.1, can be downloaded from the University of Maryland Center for Bioinformatics and Computational Biology website or directly via the following URLs:

ftp://cbcb.umd.edu/pub/data/sergek/dros_corrected.fastq.bz2
http://cbcb.umd.edu/software/pbcr/dmel_cons_asm.tar.gz
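
To illustrate what "the 25X longest preassembled reads" means, the sketch below keeps the longest reads until their combined length reaches a target coverage of the genome. This is an illustration of the idea only, not the PBcR implementation. The function name `select_longest_reads` is hypothetical, and the genome size is an input the caller must supply.

```python
def select_longest_reads(lengths, genome_size, target_coverage):
    """Pick the longest reads until they total target_coverage * genome_size."""
    target_bases = genome_size * target_coverage
    selected = []
    total = 0
    for length in sorted(lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    # If the reads never reach the target, every read is returned.
    return selected
```

For example, with a toy genome of 10 bp and a 6X target, reads of length 40 and 30 are enough, so the shorter reads are left out.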

Please cite the following publication if you use the PBcR-CA datasets in your research:

Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED, Phillippy AM. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012 30:693-700. http://www.nature.com/nbt/journal/v30/n7/full/nbt.2280.html

MHAP-CA Assembly

Preassembled reads, unpolished assemblies, and polished assemblies of the complete 90X dataset generated using MHAP and Celera Assembler 8.2 can be downloaded from the University of Maryland Center for Bioinformatics and Computational Biology website or directly via the following URLs:

http://gembox.cbcb.umd.edu/mhap/data/dmel.polished.fastq.gz
http://gembox.cbcb.umd.edu/mhap/asm/dmel.ctg.fasta.gz
http://gembox.cbcb.umd.edu/mhap/asm/dmel.all.fasta.gz
http://gembox.cbcb.umd.edu/mhap/asm/dmel.quiver.ctg.fasta.gz
http://gembox.cbcb.umd.edu/mhap/asm/dmel.quiver.all.fasta.gz
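
Once one of the gzip-compressed FASTA files has been downloaded, a quick sanity check is to list the contig lengths. The sketch below does this with the standard library only. The function name `fasta_lengths` is hypothetical, and the sketch assumes a plain FASTA layout with `>` header lines followed by sequence lines.

```python
import gzip


def fasta_lengths(path):
    """Yield (name, length) for each record in a gzip-compressed FASTA file."""
    name, length = None, 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if name is not None:
                    yield name, length
                fields = line[1:].split()
                name, length = (fields[0] if fields else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        yield name, length
```

For instance, `sum(n for _, n in fasta_lengths("dmel.quiver.ctg.fasta.gz"))` would give the total assembled bases in that file, assuming it has been saved under that name in the working directory.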

The Quiver polished assembly of MHAP-CA contigs can also be directly downloaded from NCBI:

https://www.ncbi.nlm.nih.gov/nuccore/JSAE00000000.1/

Please cite the following publication if you use the MHAP-CA datasets in your research:

Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol. 2015 33:623-630. http://www.nature.com/nbt/journal/v33/n6/full/nbt.3238.html