Analysis 1: Pre Trim Fast QC - cecilia-andersson/Genome_Analysis_Project GitHub Wiki

Date: 2023-03-29

Methods

1. Copied the soft link to the data into the folder.
2. Ran FastQC with default parameters on the paired-end reads; the run took about 5 minutes.
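The FastQC step above can be sketched as a small Python wrapper. This is only an illustration, not the exact command used here: the file names and output folder in the usage note are hypothetical, and the thread count is an assumed value. `--outdir` and `--threads` are standard FastQC options; everything else is left at the defaults.

```python
import shutil
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, out_dir, threads=2):
    """Return the argument list for a FastQC run with otherwise default settings."""
    return [
        "fastqc",
        "--outdir", str(out_dir),
        "--threads", str(threads),
        *[str(f) for f in fastq_files],
    ]


def run_fastqc(fastq_files, out_dir, threads=2):
    """Run FastQC if it is on PATH; return False when the executable is missing."""
    if shutil.which("fastqc") is None:
        return False
    # FastQC expects the output directory to exist already.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(fastq_files, out_dir, threads), check=True)
    return True
```

For a paired-end sample, a call might look like `run_fastqc(["sample_1.fastq.gz", "sample_2.fastq.gz"], "fastqc_out")`, where both file names are placeholders for the linked forward and reverse read files.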

Results

  • DNA Quality:
    [Screenshot: DNA from site D1]
    [Screenshot: DNA from site D3]

  • RNA (untrimmed) Quality:
    [Screenshot: RNA from site D1]
    [Screenshot: RNA from site D3]


Discussion

For visualization, I've attached a quick summary of the FastQC outputs for the forward reads at both sites, for both DNA and RNA. The reverse reads have similar quality scores for both RNA and DNA at both sites, and the full reports can be found in the analysis tab. For the trimmed DNA data, the Per Base Sequence Quality plot shows good average quality scores (above 28) across the entire length of the reads. The untrimmed RNA data, on the other hand, shows some quality degradation toward the ends of the reads.

This is expected, however, because noise due to phasing accumulates with every cycle in which the flow cell is washed and flushed with new bases. Phasing occurs when a strand in a cluster fails to advance in a given cycle (for example, because the previous base isn't fully washed), so it keeps fluorescing the color of the previous base while the other strands in the cluster fluoresce the color of the newly added base. The strands in which this occurs will be out of step with the rest of the cluster by one base (or more, if it happens multiple times in the same strand), which creates noise and makes the correct base more difficult to 'call'. The trimming step in Analysis 2 will reduce some of this effect.
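To make the "above 28" reading of the Per Base Sequence Quality plot concrete, here is a minimal sketch that computes the mean quality at each read position from raw quality strings. It assumes Phred+33 encoding, and the three quality strings are made-up toy data, not reads from this project. FastQC's plot also shows medians and quartiles, which this sketch ignores.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_quality_per_position(quality_strings):
    """Average the Phred score at each read position across all reads."""
    n_positions = max(len(q) for q in quality_strings)
    totals = [0] * n_positions
    counts = [0] * n_positions
    for q in quality_strings:
        for i, score in enumerate(phred33_scores(q)):
            totals[i] += score
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def positions_below(means, threshold=28):
    """Return the 0-based positions whose mean quality is under the threshold."""
    return [i for i, m in enumerate(means) if m < threshold]


# Toy quality strings (hypothetical): 'I' = Q40, '5' = Q20, '+' = Q10.
toy_qualities = ["IIIIIIII", "IIIIII5+", "IIIIIII5"]
toy_means = mean_quality_per_position(toy_qualities)
low_positions = positions_below(toy_means)
```

In the toy data only the last position falls below 28, mirroring the kind of end-of-read degradation seen in the untrimmed RNA reads.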

The RNA data is also flagged for adapter content and for "overrepresented sequences," which upon inspection are mostly adapter sequences. Though keeping adapters can have an impact on analyses down the line, I researched when trimming adapters is appropriate and decided not to do much trimming. The purpose of the RNA-seq data in this study is to perform expression analyses, in which case it would be detrimental to erroneously remove sequences that are genuinely overrepresented and not just adapters. Additionally, BWA-MEM performs local alignment and soft-clips read ends that do not match the reference, which masks much of the remaining adapter sequence during mapping. (Sources: https://dnatech.genomecenter.ucdavis.edu/faqs/when-should-i-trim-my-illumina-reads-and-how-should-i-do-it/#:~:text=In%20case%20you%20are%20sequencing,pseudo%2Daligners%20should%20be%20used. and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671312/)
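As a rough illustration of what the adapter flag measures, the sketch below counts the fraction of reads containing an adapter prefix. The prefix `AGATCGGAAGAGC` is the common Illumina universal adapter start; whether this library used that adapter is an assumption, and the reads are invented toy data. FastQC's own adapter module reports cumulative content per position, so this is a simplification.

```python
# Common Illumina universal adapter prefix. That this library used it
# is an assumption made for illustration only.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(reads, adapter=ADAPTER_PREFIX):
    """Return the fraction of reads that contain the adapter sequence."""
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if adapter in read)
    return hits / len(reads)


# Toy reads (hypothetical): two of the four contain the adapter prefix.
toy_reads = [
    "ACGTAGATCGGAAGAGCTT",
    "ACGTACGTACGT",
    "AGATCGGAAGAGCACAC",
    "TTTTGGGGCCCC",
]
toy_fraction = adapter_fraction(toy_reads)
```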
  • What is the structure of a FASTQ file?
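  • A FASTQ file stores each read as a four-line record: a header line beginning with @ that carries the sequence ID, the raw sequence (A, C, G, T, or N), a separator line beginning with +, and a quality string with one character per base indicating the confidence of each called base.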
  • How is the quality of the data stored in the FASTQ files? How are paired reads identified?
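  • The quality of the data is stored as Phred+33 encoded quality scores: the ASCII code of each quality character minus 33 gives the Phred score Q, where Q = -10·log10(P) and P is the estimated probability that the base call is wrong. Paired reads are stored in two FASTQ files, usually noted as R1 (forward) and R2 (reverse), with mates in the same order in both files. The reverse read has the same read ID in its header as the forward read; typically only the read-number field differs.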
  • What can generate the issues you observe in your data? Can these cause any problems during subsequent analyses?
  • With Illumina reads, sequencing issues can be generated by phasing (as discussed above), signal decay (also toward the ends of reads), and physical problems like overclustering and instrument errors.
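The answers above on FASTQ structure and read pairing can be sketched in a few lines of Python. This is a minimal parser for illustration only: it assumes strict four-line records and Phred+33 encoding, and the two toy records and their header format are invented, not taken from the project data.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, sep, qual = record
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def read_id(header):
    """Strip the comment field and any trailing /1 or /2 from a header."""
    name = header.split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def is_pair(header_r1, header_r2):
    """True when two headers refer to the same sequenced fragment."""
    return read_id(header_r1) == read_id(header_r2)


# Toy records (hypothetical) for a forward and a reverse read.
r1_lines = ["@read1 1:N:0:ACGT\n", "ACGT\n", "+\n", "II5+\n"]
r2_lines = ["@read1 2:N:0:ACGT\n", "TTGA\n", "+\n", "IIII\n"]

r1_header, r1_seq, r1_qual = next(read_fastq(r1_lines))
r2_header, r2_seq, r2_qual = next(read_fastq(r2_lines))
r1_scores = [ord(ch) - 33 for ch in r1_qual]  # Phred+33 decoding
paired = is_pair(r1_header, r2_header)
```

Comparing only the ID portion of the header is a design choice here, since the read-number field (1 or 2) differs between mates.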