CREU 2022 Raw data quality and Genome assembly - DR-genomics/Genomics-pipelines Wiki

Revised from https://github.com/barrettlab/2021-Genomics-bootcamp/wiki/II.-Illumina-data-quality

1. Assessing data quality with FastQC

  1. Login to amazon instance and go to your home directory. Type ls
  2. Change directory to CREU_RawData_Alignment: cd CREU_RawData_Alignment
  3. List files using ls command. ls -lh You should see a list of raw paired end sequencing files ending with .fastq extension.
  4. Lets see the contents of a file: less CHN-Yunnan_R1.fastq
  5. To count the number of reads: grep -c "@VH00177" CHN-Yunnan_R1.fastq
  6. To assess the read quality, use the tool FastQC. fastqc CHN-Yunnan_R1.fastq You should see something similar to below.
Started analysis of CHN-Yunnan_R1.fastq
Approx 5% complete for CHN-Yunnan_R1.fastq
Approx 10% Started analysis of CHN-Yunnan_R1.fastq
........

To run fastqc for multiple files in the same directory, you can use "for loop" as below.

Open a new text file using nano for_loop_fastqc.sh In the newly created text file, copy paste the below script

### make an output directory for fastqc output within the dir CREU_RawData_Alignment 
mkdir ~/CREU_RawData_Alignment/fastqc_output

### use an alias to identify the output dir
output=~/CREU_RawData_Alignment/fastqc_output

### run the loop
for file in ~/CREU_RawData_Alignment/*.fastq
do
fastqc -o ${output} ${file}
done

Save the script using Ctrl X followed by typing Y To run the script, type bash for_loop_fastqc.sh

Wait until it completes for all fastq files. It will create a few files that you can now view, e.g. with firefox or a web browser (the .html file).

Download the html to your local desktop using the command scp, to have a look at the results. Open a new terminal window,

cd /path/to/desktop
mkdir fastqc_output
cd fastqc_output
### Command scp (secure copy) <space> [email protected]_amazon_instance:path/of/html/files <space> .(dot, which means copy the files in the current directory)
scp [email protected]:~/CREU_RawData_Alignment/fastqc_output/*.html .

Provide password. The html files will be downloaded. Now open the directory fastqc_output in your Desktop and click CHN-Yunnan_R1_fastqc.html.

2. Quality and adapter trimming/filtering with BBtools: bbduk

BBDuk = "BB" - Author name Brian Bushnell and “Duk” stands for Decontamination Using Kmers. It is capable of quality-trimming and filtering, adapter-trimming, contaminant-filtering via kmer matching, sequence masking, GC-filtering, length filtering, etc.

Running bbduk

### Adapter trimming (do this first). Requires input of the adapter sequence, OR (here) a fasta file of the adapter(s)
bbduk.sh in=CHN-Yunnan_R1.fastq in2=CHN-Yunnan_R2.fastq  out=CHN-Yunnan_clean_R1.fastq out2=CHN-Yunnan_clean_R2.fastq ref=adapters.fa qtrim=rl trimq=20

This is fine for one pair of reads, but what if you want to trim/filter an entire directory of reads?

Use a FOR Loop!

for file in `ls -1 *R1.fastq | sed 's/R1.fastq//'`
do 
   bbduk.sh -Xmx1g in=$file\R1.fastq in2=$file\R2.fastq out=$file\clean_R1.fastq out2=$file\clean_R2.fastq ref=adapters.fa qtrim=rl trimq=10
done

### Move the cleaned reads to a new dir
mkdir clean_output
mv *clean* clean_output

Now, you can run FastQC again on the cleaned reads to compare with unprocessed reads.

cd clean_output
fastqc CHN-Yunnan_clean_R1.fastq 

3. Alignment to a reference genome using bbmap.

BBmap - global aligner for DNA and RNA sequencing reads. https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/

for i in `ls -1 *R1.fastq | sed 's/R1.fastq//'`
do 
   bbmap.sh t=2 ref=../JS_chr23.fa in=$i\R1.fastq in2=$i\R2.fastq out=$i\mapped.sam nodisk 
done

4. Coverage

To generate coverage information, use the program pileup.sh from bbtools. It takes sam or bam (sorted or unsorted) as input and calculate the coverage

pileup.sh CHN-Yunnan_mapped.sam

or

You can use samtools to do the same.

samtools flagstat CHN-Yunnan_mapped.sam

Both will report the percentage of reads mapped to the reference genome in addition to other details such as percent of proper pairs mapped, avg coverage, singletons, etc.

⚠️ **GitHub.com Fallback** ⚠️