0 Checking and Processing Reads - coopermkr/sdepressaAssembly GitHub Wiki
The golden (and hopefully intuitive) rule of genome assembly is that the quality of your starting material is king. This applies all the way back to the tissue collected for DNA extraction and will continue to be a theme throughout the assembly process. At this point, we’ve moved past the wet lab portion of the pipeline and have the following data available for our assembly process:
- Illumina whole genome shotgun reads from non-reference individuals
- PacBio SMRT reads from the reference individual
- Illumina Hi-C paired-end reads from the reference individual
- Nanopore long reads from the reference individual
In this step, we want to accomplish two goals:
- Trim out adaptor sequences from reads where necessary
- Check our reads for quality
First we use Cutadapt (version: ) to trim adaptor sequences specific to our sequencing platform:
# Reference documentation: https://cutadapt.readthedocs.io/en/stable/
# Start writing a script
cutadapt -a adaptor1 -a adaptor2 -o output.fastq input.fastq
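If any of the reads that need trimming are paired-end, Cutadapt can process both mates in a single call using `-A` for the read 2 adaptor and `-p` for the read 2 output. The following is a minimal sketch under assumptions: `trim_pair`, the `DRY_RUN` switch, the `trimmed` output directory, and the sample file names are all hypothetical, and `ADAPTER_FWD` / `ADAPTER_REV` are placeholders rather than real adaptor sequences.

```shell
#!/usr/bin/env bash
# Sketch: paired-end adaptor trimming with Cutadapt.
# All file names and adaptor values here are placeholders.
set -euo pipefail

trim_pair() {
    # $1 = read 1 FASTQ, $2 = read 2 FASTQ, $3 = output directory
    local r1="$1" r2="$2" outdir="$3"
    # -a / -A: 3' adaptors for read 1 / read 2
    # -o / -p: trimmed output for read 1 / read 2
    local cmd=(cutadapt
        -a "${ADAPTER_FWD:-ADAPTER_FWD}"
        -A "${ADAPTER_REV:-ADAPTER_REV}"
        -o "$outdir/$(basename "$r1")"
        -p "$outdir/$(basename "$r2")"
        "$r1" "$r2")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be checked first
        echo "${cmd[*]}"
    else
        mkdir -p "$outdir"
        "${cmd[@]}"
    fi
}

trim_pair data/sample_R1.fastq data/sample_R2.fastq trimmed
```

By default the function only prints the command it would run; setting `DRY_RUN=0` executes it. Trimming both mates in one call keeps the two output files in sync, which downstream paired-end tools generally expect.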
Then we use FastQC to assess the quality of our reads:
file=FILENAME.fastq
# Quote the path so file names containing spaces are handled correctly
fastqc "data/$file"
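With several read sets to check, it can be convenient to run FastQC over every FASTQ file in `data/` at once. This is a sketch, not the script used for this assembly: `run_fastqc`, the `DRY_RUN` switch, and the `qc/fastqc` output directory are hypothetical names, and we assume the reads end in `.fastq` or `.fastq.gz`.

```shell
#!/usr/bin/env bash
# Sketch: run FastQC on every FASTQ file in data/.
set -euo pipefail

run_fastqc() {
    # $1 = output directory, remaining arguments = FASTQ files
    local outdir="$1"
    shift
    local cmd=(fastqc --outdir "$outdir" "$@")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be checked first
        echo "${cmd[*]}"
    else
        # FastQC does not create the output directory itself
        mkdir -p "$outdir"
        "${cmd[@]}"
    fi
}

# nullglob makes unmatched patterns expand to nothing
shopt -s nullglob
files=(data/*.fastq data/*.fastq.gz)

if [ "${#files[@]}" -gt 0 ]; then
    run_fastqc qc/fastqc "${files[@]}"
else
    echo "no FASTQ files found in data/" >&2
fi
```

Writing the reports to a separate directory keeps the HTML and zip outputs out of `data/`. As written the script only prints the command; set `DRY_RUN=0` to run it.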
Finally, we use NanoPlot to plot the quality and length distributions of our long reads.
Note that we set the -p parameter (the output file prefix) to either pac or nano, depending on which set of long reads we are processing.
file=FILENAME.fastq
conda activate nanoplot
NanoPlot --verbose --store --fastq "data/$file" -p pac --N50 --title pacbioReads --plots hex dot kde
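Since the same command is run once per long-read data set with only the prefix, input file, and title changing, the two runs can be wrapped in one small function. This is a sketch under assumptions: `plot_long_reads`, the `DRY_RUN` switch, the file names `PACBIO.fastq` and `NANOPORE.fastq`, and the title `nanoporeReads` are hypothetical. The NanoPlot flags are copied unchanged from the command above.

```shell
#!/usr/bin/env bash
# Sketch: run NanoPlot once per long-read data set.
set -euo pipefail

plot_long_reads() {
    # $1 = output prefix (pac or nano), $2 = FASTQ file, $3 = plot title
    local prefix="$1" fastq="$2" title="$3"
    local cmd=(NanoPlot --verbose --store
        --fastq "$fastq"
        -p "$prefix"
        --N50
        --title "$title"
        --plots hex dot kde)
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be checked first
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Placeholder file names; substitute the real read files
plot_long_reads pac data/PACBIO.fastq pacbioReads
plot_long_reads nano data/NANOPORE.fastq nanoporeReads
```

Using a distinct prefix per data set keeps the PacBio and Nanopore reports from overwriting each other when they are written to the same directory. Set `DRY_RUN=0` to run the commands instead of printing them, after activating the `nanoplot` conda environment as shown above.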