CREU 2022 Raw data quality and Genome assembly - DR-genomics/Genomics-pipelines GitHub Wiki
- Login to amazon instance and go to your home directory. Type
ls
- Change directory to CREU_RawData_Alignment:
cd CREU_RawData_Alignment
- List files using ls command.
ls -lh
You should see a list of raw paired end sequencing files ending with .fastq extension. - Lets see the contents of a file:
less CHN-Yunnan_R1.fastq
- To count the number of reads:
grep -c "@VH00177" CHN-Yunnan_R1.fastq
- To assess the read quality, use the tool FastQC.
fastqc CHN-Yunnan_R1.fastq
You should see something similar to below.
Started analysis of CHN-Yunnan_R1.fastq
Approx 5% complete for CHN-Yunnan_R1.fastq
Approx 10% Started analysis of CHN-Yunnan_R1.fastq
........
Open a new text file using nano for_loop_fastqc.sh
In the newly created text file, copy paste the below script
### make an output directory for fastqc output within the dir CREU_RawData_Alignment
mkdir ~/CREU_RawData_Alignment/fastqc_output
### use an alias to identify the output dir
output=~/CREU_RawData_Alignment/fastqc_output
### run the loop
for file in ~/CREU_RawData_Alignment/*.fastq
do
fastqc -o ${output} ${file}
done
Save the script using Ctrl X
followed by typing Y
To run the script, type bash for_loop_fastqc.sh
Wait until it completes for all fastq files. It will create a few files that you can now view, e.g. with firefox or a web browser (the .html file).
Download the html to your local desktop using the command scp, to have a look at the results. Open a new terminal window,
cd /path/to/desktop
mkdir fastqc_output
cd fastqc_output
### Command scp (secure copy) <space> your_username@the_amazon_instance:path/of/html/files <space> .(dot, which means copy the files in the current directory)
scp [email protected]:~/CREU_RawData_Alignment/fastqc_output/*.html .
Provide password. The html files will be downloaded. Now open the directory fastqc_output in your Desktop and click CHN-Yunnan_R1_fastqc.html.
BBDuk = "BB" - Author name Brian Bushnell and “Duk” stands for Decontamination Using Kmers. It is capable of quality-trimming and filtering, adapter-trimming, contaminant-filtering via kmer matching, sequence masking, GC-filtering, length filtering, etc.
Running bbduk
### Adapter trimming (do this first). Requires input of the adapter sequence, OR (here) a fasta file of the adapter(s)
bbduk.sh in=CHN-Yunnan_R1.fastq in2=CHN-Yunnan_R2.fastq out=CHN-Yunnan_clean_R1.fastq out2=CHN-Yunnan_clean_R2.fastq ref=adapters.fa qtrim=rl trimq=20
This is fine for one pair of reads, but what if you want to trim/filter an entire directory of reads?
for file in `ls -1 *R1.fastq | sed 's/R1.fastq//'`
do
bbduk.sh -Xmx1g in=$file\R1.fastq in2=$file\R2.fastq out=$file\clean_R1.fastq out2=$file\clean_R2.fastq ref=adapters.fa qtrim=rl trimq=10
done
### Move the cleaned reads to a new dir
mkdir clean_output
mv *clean* clean_output
Now, you can run FastQC again on the cleaned reads to compare with unprocessed reads.
cd clean_output
fastqc CHN-Yunnan_clean_R1.fastq
BBmap - global aligner for DNA and RNA sequencing reads. https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/
for i in `ls -1 *R1.fastq | sed 's/R1.fastq//'`
do
bbmap.sh t=2 ref=../JS_chr23.fa in=$i\R1.fastq in2=$i\R2.fastq out=$i\mapped.sam nodisk
done
To generate coverage information, use the program pileup.sh from bbtools. It takes sam or bam (sorted or unsorted) as input and calculate the coverage
pileup.sh CHN-Yunnan_mapped.sam
or
You can use samtools to do the same.
samtools flagstat CHN-Yunnan_mapped.sam
Both will report the percentage of reads mapped to the reference genome in addition to other details such as percent of proper pairs mapped, avg coverage, singletons, etc.