CREU 2022 Raw data quality and Genome assembly - DR-genomics/Genomics-pipelines Wiki

Revised from

1. Assessing data quality with FastQC

  1. Log in to the Amazon instance and go to your home directory. Type ls to list its contents.
  2. Change directory to CREU_RawData_Alignment: cd CREU_RawData_Alignment
  3. List the files with details: ls -lh You should see a list of raw paired-end sequencing files ending with the .fastq extension.
  4. Let's look at the contents of a file: less CHN-Yunnan_R1.fastq (press q to quit)
  5. To count the number of reads, count the header lines, which in these files contain @VH00177: grep -c "@VH00177" CHN-Yunnan_R1.fastq
  6. To assess the read quality, use the tool FastQC: fastqc CHN-Yunnan_R1.fastq You should see output similar to the following.
Started analysis of CHN-Yunnan_R1.fastq
Approx 5% complete for CHN-Yunnan_R1.fastq
Approx 10% complete for CHN-Yunnan_R1.fastq
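
Step 5 counts reads by matching a header prefix specific to these files. As a minimal sketch of an alternative that does not depend on that prefix, the read count can be taken as the line count divided by four, assuming a well-formed FASTQ file with exactly four lines per read. The demo file and its two reads are made up for illustration.

```shell
# Build a tiny two-read FASTQ file (hypothetical data, for illustration only).
demo_fastq=$(mktemp)
printf '%s\n' \
  '@VH00177:1:AAA:1:1101:1000:1000 1:N:0:1' 'ACGT' '+' 'IIII' \
  '@VH00177:1:AAA:1:1101:1000:1001 1:N:0:1' 'TTGA' '+' '@III' \
  > "$demo_fastq"

# Each FASTQ record spans 4 lines, so reads = total lines / 4.
total_lines=$(wc -l < "$demo_fastq")
read_count=$(( total_lines / 4 ))

# Counting every line that starts with "@" can overcount, because a
# quality string may also begin with "@" (as in the second read here).
at_count=$(grep -c '^@' "$demo_fastq")

echo "reads by line count: $read_count"
echo "lines starting with @: $at_count"
rm -f "$demo_fastq"
```

On the real data, the same arithmetic would be applied to CHN-Yunnan_R1.fastq instead of the demo file.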

To run fastqc on multiple files in the same directory, you can use a for loop, as below.

Open a new text file using nano. In the newly created file, paste the script below.

### make an output directory for fastqc output within the dir CREU_RawData_Alignment
mkdir -p ~/CREU_RawData_Alignment/fastqc_output

### use a variable to identify the output dir
output=~/CREU_RawData_Alignment/fastqc_output

### run the loop
for file in ~/CREU_RawData_Alignment/*.fastq
do
    fastqc -o "${output}" "${file}"
done

Save the script by pressing Ctrl+X, then typing Y and pressing Enter to confirm the file name. To run the script, type bash followed by the name of the script file.

Wait until it completes for all fastq files. For each input file, FastQC writes an .html report and a .zip archive to the output directory; the .html file can be viewed in Firefox or any other web browser.
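
Before downloading anything, it can help to confirm that every input file produced a report. The following is a minimal sketch, assuming FastQC's naming pattern of the input name followed by _fastqc.html; the temporary directory and the sampleA file names are made up so the example runs on its own.

```shell
# Demo setup: a temporary directory with made-up inputs and one report.
workdir=$(mktemp -d)
mkdir "$workdir/fastqc_output"
touch "$workdir/sampleA_R1.fastq" "$workdir/sampleA_R2.fastq"
touch "$workdir/fastqc_output/sampleA_R1_fastqc.html"   # R2 report deliberately missing

# For each input, build the expected report name and test whether it exists.
missing=0
for file in "$workdir"/*.fastq
do
    name=$(basename "$file" .fastq)
    report="$workdir/fastqc_output/${name}_fastqc.html"
    if [ ! -f "$report" ]; then
        echo "No report for ${name}.fastq"
        missing=$(( missing + 1 ))
    fi
done
echo "Missing reports: $missing"
rm -rf "$workdir"
```

For the real run, the two paths would point at ~/CREU_RawData_Alignment and its fastqc_output subdirectory.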

To look at the results, download the html files to your local desktop using the command scp. Open a new terminal window on your local machine and run:

cd /path/to/desktop
mkdir fastqc_output
cd fastqc_output
### Command: scp (secure copy) <space> username@address_of_amazon_instance:path/of/html/files <space> . (dot, which means copy the files into the current directory)
### Replace username and address_of_amazon_instance with your own login details
scp username@address_of_amazon_instance:~/CREU_RawData_Alignment/fastqc_output/*.html .

Enter your password when prompted, and the html files will be downloaded. Now open the directory fastqc_output on your desktop and open CHN-Yunnan_R1_fastqc.html.

2. Quality and adapter trimming/filtering with BBtools: bbduk

In BBDuk, "BB" stands for the author's name, Brian Bushnell, and "Duk" stands for Decontamination Using Kmers. BBDuk can perform quality trimming and filtering, adapter trimming, contaminant filtering via kmer matching, sequence masking, GC filtering, length filtering, and more.

Running bbduk

### Adapter trimming (do this first). Requires input of the adapter sequence, OR (here) a fasta file of the adapter(s)
bbduk.sh in=CHN-Yunnan_R1.fastq in2=CHN-Yunnan_R2.fastq out=CHN-Yunnan_clean_R1.fastq out2=CHN-Yunnan_clean_R2.fastq ref=adapters.fa qtrim=rl trimq=20

This is fine for one pair of reads, but what if you want to trim/filter an entire directory of reads?

Use a FOR Loop!

for file in `ls -1 *R1.fastq | sed 's/R1.fastq//'`
do
    bbduk.sh -Xmx1g in=${file}R1.fastq in2=${file}R2.fastq out=${file}clean_R1.fastq out2=${file}clean_R2.fastq ref=adapters.fa qtrim=rl trimq=10
done

### Move the cleaned reads to a new dir
mkdir clean_output
mv *clean* clean_output
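
The loop above derives a sample prefix by stripping R1.fastq from each forward-read file name and then rebuilds the paired and output names from that prefix. A dry run that only prints those names is a cheap way to check the pairing before any tool is started. The sketch below uses shell parameter expansion in place of ls and sed, and the sampleA and sampleB file names are made up for illustration.

```shell
# Demo setup: empty files with made-up sample names.
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_R1.fastq" "$demo_dir/sampleA_R2.fastq" \
      "$demo_dir/sampleB_R1.fastq" "$demo_dir/sampleB_R2.fastq"

# Print the file names each iteration would use, without running any tool.
plan=$(
    cd "$demo_dir" || exit 1
    for r1 in *R1.fastq
    do
        prefix=${r1%R1.fastq}    # strip the trailing "R1.fastq", leaving e.g. "sampleA_"
        echo "in=${prefix}R1.fastq in2=${prefix}R2.fastq out=${prefix}clean_R1.fastq out2=${prefix}clean_R2.fastq"
    done
)
echo "$plan"
rm -rf "$demo_dir"
```

If the printed names look right, the echo line can be swapped for the real command.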

Now, you can run FastQC again on the cleaned reads to compare with unprocessed reads.

cd clean_output
fastqc CHN-Yunnan_clean_R1.fastq 

3. Alignment to a reference genome using bbmap.

BBMap is a global aligner for DNA and RNA sequencing reads.

for i in `ls -1 *R1.fastq | sed 's/R1.fastq//'`
do
    bbmap.sh t=2 ref=../JS_chr23.fa in=${i}R1.fastq in2=${i}R2.fastq out=${i}mapped.sam nodisk
done

4. Coverage

To generate coverage information, use the program pileup.sh from BBTools. It takes a sam or bam file (sorted or unsorted) as input and calculates the coverage.

pileup.sh in=CHN-Yunnan_mapped.sam


You can use samtools to get comparable mapping statistics.

samtools flagstat CHN-Yunnan_mapped.sam

Both report the percentage of reads mapped to the reference genome, along with other details such as the percentage of properly paired reads. The BBTools coverage program additionally reports the average coverage, and samtools flagstat reports singletons.
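
To make the "percentage of reads mapped" figure concrete, here is a minimal sketch that derives it directly from a SAM file by checking the unmapped bit (value 4) in the FLAG column. It is meant to illustrate the idea, not to replace either tool, and the four records in the demo file are made up.

```shell
# Build a tiny SAM file with four made-up records (tab-separated fields).
demo_sam=$(mktemp)
{
    printf '@SQ\tSN:chr23\tLN:1000\n'
    printf 'r1\t99\tchr23\t10\t60\t4M\t=\t50\t44\tACGT\tIIII\n'
    printf 'r1\t147\tchr23\t50\t60\t4M\t=\t10\t-44\tTTGA\tIIII\n'
    printf 'r2\t77\t*\t0\t0\t*\t*\t0\t0\tGGCA\tIIII\n'
    printf 'r2\t141\t*\t0\t0\t*\t*\t0\t0\tCCAT\tIIII\n'
} > "$demo_sam"

# Skip header lines; a record is unmapped when FLAG bit 4 is set.
# POSIX awk has no bitwise operators, so the bit is tested arithmetically.
percent_mapped=$(awk -F '\t' '
    /^@/ { next }
    { total++; if (int($2 / 4) % 2 == 0) mapped++ }
    END { printf "%.2f", 100 * mapped / total }
' "$demo_sam")

echo "Percent mapped: $percent_mapped"
rm -f "$demo_sam"
```

Here two of the four records have the unmapped bit clear, so half of the records count as mapped.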
