0 Checking and Processing Reads - coopermkr/sdepressaAssembly GitHub Wiki

The golden (and hopefully intuitive) rule of genome assembly is that the quality of your starting material is king. This applies all the way back to the tissue collected for DNA extraction and will continue to be a theme throughout the assembly process. At this point, we’ve moved past the wet lab portion of the pipeline and have the following data available for our assembly process:

  • Illumina whole genome shotgun reads from non-reference individuals
  • PacBio SMRT reads from the reference individual
  • Illumina Hi-C paired-end reads from the reference
  • Nanopore long reads from the reference

In this step, we want to accomplish two goals:

  1. Trim out adaptor sequences from reads where necessary
  2. Check our reads for quality

First we use Cutadapt (version: ) to trim adaptor sequences specific to our sequencing platform:

# Reference documentation: https://cutadapt.readthedocs.io/en/stable/
# start writing a script

cutadapt -a adaptor1 -a adaptor2 -o output.fastq input.fastq
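The command above treats the input as a single file of reads. If the Illumina reads are paired-end, Cutadapt can trim both mates in one call so that the two files stay in sync. The following is a minimal sketch of that, not the pipeline's actual script: the function name `trim_pair`, the variables `ADAPTER_FWD` and `ADAPTER_REV`, and the file names `reads_R1.fastq`, `reads_R2.fastq`, and `trimmed` are all hypothetical placeholders, and the adaptor values must be replaced with the real sequences for your library prep.

```shell
#!/bin/sh
# Placeholder adaptor sequences (assumption): replace with the real
# sequences for your sequencing kit.
ADAPTER_FWD=${ADAPTER_FWD:-adaptor1}
ADAPTER_REV=${ADAPTER_REV:-adaptor2}

# Trim a pair of FASTQ files and write the results into an output directory.
# -a is the 3' adaptor for read 1, -A is the 3' adaptor for read 2,
# -o and -p are the output files for read 1 and read 2.
trim_pair() {
    r1=$1
    r2=$2
    outdir=$3
    mkdir -p "$outdir"
    cutadapt -a "$ADAPTER_FWD" -A "$ADAPTER_REV" \
        -o "$outdir/$(basename "$r1")" \
        -p "$outdir/$(basename "$r2")" \
        "$r1" "$r2"
}

# Hypothetical file names; runs only if cutadapt and both inputs are present.
if command -v cutadapt >/dev/null 2>&1 \
    && [ -f reads_R1.fastq ] && [ -f reads_R2.fastq ]; then
    trim_pair reads_R1.fastq reads_R2.fastq trimmed
fi
```

Writing the trimmed files under the same base names in a separate directory keeps the originals untouched, so trimming can be repeated with different adaptors if the first attempt looks wrong.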

Then we use FastQC to assess the quality of our reads:

file=FILENAME.fastq

fastqc "data/$file"
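With several read sets to check, it can be convenient to run FastQC over every FASTQ file in a directory and collect the reports in one place. The sketch below is one way to do that under some assumptions: the function name `run_fastqc_all` and the report directory `qc/fastqc` are hypothetical, and the reads are assumed to sit in `data/` with a `.fastq` extension as in the command above.

```shell
#!/bin/sh
# Run FastQC on every .fastq file in a directory.
# FastQC expects the output directory to exist, so create it first.
run_fastqc_all() {
    indir=$1
    outdir=$2
    mkdir -p "$outdir"
    for f in "$indir"/*.fastq; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        fastqc --outdir "$outdir" "$f"
    done
}

# Hypothetical report location; runs only if fastqc and data/ are present.
if command -v fastqc >/dev/null 2>&1 && [ -d data ]; then
    run_fastqc_all data qc/fastqc
fi
```

Running it once on the raw reads and once on the trimmed reads, with different output directories, makes it easy to compare the reports before and after Cutadapt.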

Finally, we use NanoPlot to create graphs of the quality and length of our long reads.

Note that the -p parameter sets the prefix of the output file names; we set it to either pac or nano depending on the type of long reads in the input.

file=FILENAME.fastq

conda activate nanoplot

NanoPlot --verbose --store --fastq "data/$file" -p pac --N50 --title pacbioReads --plots hex dot kde
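Because the same command is run for both the PacBio and the Nanopore reads with only the prefix and title changing, the call can be wrapped in a small function. This is a sketch under assumptions rather than the pipeline's own script: the function name `nanoplot_reads` is hypothetical, the title `nanoporeReads` is an assumed counterpart to `pacbioReads`, and the NanoPlot conda environment is assumed to be active already.

```shell
#!/bin/sh
# Run NanoPlot on one long-read FASTQ file.
# $1 = FASTQ file, $2 = read type ("pac" or "nano"), used as the output prefix.
nanoplot_reads() {
    reads=$1
    read_type=$2
    case $read_type in
        pac)  title=pacbioReads ;;
        nano) title=nanoporeReads ;;   # assumed title, not from the original
        *)
            echo "unknown read type: $read_type (expected pac or nano)" >&2
            return 1
            ;;
    esac
    NanoPlot --verbose --store --fastq "$reads" -p "$read_type" --N50 \
        --title "$title" --plots hex dot kde
}

# Placeholder file name as in the text; runs only if NanoPlot and the file exist.
file=FILENAME.fastq
if command -v NanoPlot >/dev/null 2>&1 && [ -f "data/$file" ]; then
    nanoplot_reads "data/$file" pac
fi
```

Rejecting any read type other than pac or nano guards against a typo silently producing output files with an unexpected prefix.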