Arabidopsis Sequence and Assembly (_Arabidopsis thaliana_ Ler-0) - PacificBiosciences/DevNet GitHub Wiki

This is the first Arabidopsis thaliana (Ler-0) dataset and de novo genome assembly generated with the Sequel System, using two SMRT Cells and 12 hours of runtime. Only three years ago, we released our first genome assembly for Arabidopsis, produced on the PacBio RS II using P4-C2 chemistry, 85 SMRT Cells and 255 hours of runtime. Four months later, we released a second Arabidopsis dataset using the improved P5-C3 chemistry, which reduced the number of SMRT Cells to 46 and the runtime to 138 hours.

We produced this Sequel dataset using our latest chemistry enhancements, which significantly reduce the amount of DNA required. Prior to these chemistry improvements, the amount of DNA needed to run many large genome projects on the Sequel System was prohibitive. These modifications enable the use of loading concentrations equivalent to PacBio RS II levels.

Details of the Library Protocol, Data Generation, and Assembly Process

Purified Arabidopsis (Ler-0) genomic DNA was sheared to an average size of 32 kb and converted to SMRTbell templates, followed by a 20 kb size selection performed on a BluePippin system (Sage Science). Each SMRT Cell was loaded at an on-plate concentration of 144 pM of library and run for 6 hours on the Sequel System using the modified chemistry. Collectively, the two SMRT Cells produced 10.8 Gb of data, contained in 1.1 million reads, with half of the data in reads greater than 16,400 bp in length. The data were assembled with HGAP4 in SMRT Link.
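As a rough illustration of how the Sequel run compares with the two earlier RS II datasets, the sketch below recomputes hours per SMRT Cell and the fold reductions from the cell counts and runtimes quoted on this page. The dictionary layout and the `fold_reduction` helper are assumptions made for the example, not part of any PacBio tool.

```python
# Run figures quoted on this page: (SMRT Cells, total runtime in hours).
runs = {
    "RS II, P4-C2": (85, 255),
    "RS II, P5-C3": (46, 138),
    "Sequel": (2, 12),
}


def fold_reduction(baseline, current):
    """Return how many times larger `baseline` is than `current`."""
    return baseline / current


sequel_cells, sequel_hours = runs["Sequel"]

for name, (cells, hours) in runs.items():
    print(
        f"{name}: {hours / cells:.0f} h per cell, "
        f"{fold_reduction(cells, sequel_cells):.1f}x the Sequel cell count, "
        f"{fold_reduction(hours, sequel_hours):.2f}x the Sequel runtime"
    )
```

By this arithmetic, the first RS II dataset used 42.5 times as many SMRT Cells and 21.25 times as much runtime as the Sequel run.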

Results of Sequel System Arabidopsis genome assembly

Sequencing metrics

  • Raw Data: combined, 2 SMRT Cells
  • Number of Reads: 1,135,065
  • Number of Bases: 10.8 Gb
  • Average Read Length: 9,474 bp
  • Read N50: 15,377 bp
  • Mapped Read Length N50: 16,411 bp
  • Mapped Subread Length N50: 14,852 bp
  • Mapped Read Length Max: 53,610 bp
  • Mapped Concordance (Mode): 0.88
  • Mapped Concordance (Mean): 0.84
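For readers unfamiliar with the N50 statistic used in these metrics, here is a minimal sketch of the usual definition: the length L such that sequences of length L or longer contain at least half of all bases. The function name `n50` and the toy read lengths are illustrative assumptions; this is not the SMRT Link implementation. The last lines check that the reported read count and average read length are consistent with the reported 10.8 Gb total.

```python
def n50(lengths):
    """Return the length L such that sequences >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy example (hypothetical read lengths, total 20 bases).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))  # 6 + 5 = 11 >= 10, so the N50 is 5

# Consistency check using the figures reported in the list above.
number_of_reads = 1_135_065
average_read_length = 9_474  # bp
total_gb = number_of_reads * average_read_length / 1e9
print(f"{total_gb:.2f} Gb")  # about 10.75 Gb, which rounds to the reported 10.8 Gb
```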

Assembly

Data were processed using HGAP4 with a development version of SMRT Link v3.2, using a seed read cutoff of 6,000 bp for preassembled reads. The assembly was polished using the Arrow consensus caller. Post-assembly processing filtered out any contig in which more than 10% of bases were left unpolished due to low coverage.
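The post-assembly filter described above can be sketched as follows. This is a minimal illustration that assumes the number of unpolished bases per contig is already known; the `Contig` class, its field names, and the example contigs are hypothetical and do not reflect the actual SMRT Link data model or the script used for this assembly.

```python
from dataclasses import dataclass

MAX_UNPOLISHED_FRACTION = 0.10  # threshold stated in the text


@dataclass
class Contig:
    name: str
    length: int            # total bases in the contig
    unpolished_bases: int  # bases left unpolished due to low coverage


def keep_contig(contig, max_fraction=MAX_UNPOLISHED_FRACTION):
    """Keep a contig unless more than `max_fraction` of its bases are unpolished."""
    return contig.unpolished_bases / contig.length <= max_fraction


# Hypothetical contigs, for illustration only.
contigs = [
    Contig("contig_a", 1_000_000, 5_000),  # 0.5% unpolished: kept
    Contig("contig_b", 20_000, 2_000),     # exactly 10%: kept (not greater than 10%)
    Contig("contig_c", 15_000, 4_500),     # 30% unpolished: filtered out
]

kept = [c.name for c in contigs if keep_contig(c)]
print(kept)  # ['contig_a', 'contig_b']
```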

  • Assembly size: 122.9 Mb
  • Polished contigs: 238
  • Contig N50: 10.4 Mb
  • Max contig length: 15.0 Mb
  • BUSCO complete single-copy genes: 97.7%
  • BUSCO fragmented genes: 0.6%
  • BUSCO missing genes: 1.7%
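To put the assembly figures in context, the sketch below estimates the raw sequencing depth and the mean contig length from the numbers reported on this page. It assumes the assembly size is a reasonable proxy for genome size, which is an approximation; the variable names are illustrative only.

```python
# Figures reported on this page.
total_bases = 10.8e9       # 10.8 Gb of raw data
assembly_size = 122.9e6    # 122.9 Mb assembly, used here as a proxy for genome size
polished_contigs = 238

# Approximate raw coverage: sequenced bases divided by assembly size.
coverage = total_bases / assembly_size

# Mean contig length, for comparison with the much larger contig N50 of 10.4 Mb.
mean_contig_length = assembly_size / polished_contigs

print(f"Approximate raw coverage: {coverage:.1f}x")
print(f"Mean contig length: {mean_contig_length / 1e6:.2f} Mb")
```

The mean contig length (about 0.5 Mb) is far below the contig N50, which indicates that most of the assembled bases sit in a small number of long contigs while many of the 238 contigs are short.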

The raw and assembled data are publicly available for download.

References:

  1. Kim, K. E. et al. (2014) Long-read, whole-genome shotgun sequence data for five model organisms. Scientific Data. 1, 140045.