QualityControl - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

1. Log in to the server (Bastion Host)

2. Enable conda

The easiest way is to source my pre-installed conda:

source /home/fangchao/conda_init.sh

3. Activate the virtual environment

By activating this virtual environment, you get the following tools and packages without installing them yourself:

conda activate QC

4. Clone our practice repository if you haven't done so

git clone https://github.com/BGIGPD/BestPractices4Pathogenomics.git

Enter the subfolder:

cd BestPractices4Pathogenomics/PRJNA636748

5. Download our selected public dataset

SRA datasets stored on NCBI can be downloaded with their officially released tool, the SRA Toolkit.

First, we need to configure it:

vdb-config -i

Change the cache location to /home/fangchao/sra-cache, then verify the configuration:

vdb-config

Then you can download the SRA runs and convert them to FASTQ:

fastq-dump -O SRA SRR27676316 SRR22019447 SRR22019449 SRR16298563 SRR16298722 SRR17006792 SRR17712415 SRR16298506 SRR14736722 SRR19051760 SRR24152024 SRR19051863 SRR16298805 SRR14707730 SRR23318860
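fastq-dump writes one FASTQ file per run accession into the SRA/ directory. Since every FASTQ record spans exactly four lines, the read count of a file is its line count divided by four. A minimal sketch of this sanity check (tiny.fastq below is a synthetic illustration; run the same check on SRA/*.fastq):

```shell
# Each FASTQ record is exactly 4 lines, so reads = lines / 4.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > tiny.fastq
reads=$(( $(wc -l < tiny.fastq) / 4 ))
echo "tiny.fastq: $reads reads"
```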

Take a look at a downloaded file to get familiar with the FASTQ format:

 $ head -n 4 SRA/SRR22019447.fastq
@SRR22019447.1 1 length=186
CAGTTTACTCACACCTTTTGCTCGTTGCTCCTGGCCTTGAAGCCCCTTTTCTCTACCTTTATTCTTTAGTCTACTTCTTGCAGAGCATAAACAAGTTTATACTCTGCAAGAAGTAGACTAAATCATAAAGATAGAGAAAAGGGGCTTCAAGGTCAGGAGCAACGAGCAAAAGGTGTGAGTAAACTG
+SRR22019447.1 1 length=186
CCCCCFEFF9@,;CC,6FA9FC<FG<CCAFG<CECCGD@F8,,CF,:6EDF<FFG,C<,,,,,CC,C<,6C9@<@EFE,,B,C<F,B,FFFG,CCCC86@9CFGGGF<C9EF8EG,CE9,;,,66C@,6CF9,,6,,FEC8@7B@GGGCEA6,;,@6CB<FC,+:@@:FFGFGGCGE<,F9,,9@@

Where:

  1. The 1st line begins with a '@' followed by the sequence ID.
  2. The 2nd line contains the raw sequence of bases A, T, C, or G; sometimes 'N' for an unknown base.
  3. The 3rd line begins with a '+' character and is an optional line for notes or comments; it can be ignored in most cases.
  4. The 4th line encodes the Phred quality score of each base in the 2nd line.
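The 4th-line encoding can be decoded by hand: each quality character's ASCII code minus 33 gives the Phred score Q, where Q = -10·log10(p) and p is the probability the base call is wrong. A small sketch for two characters that appear in the example read above:

```shell
# Phred+33: quality = ASCII code of the character minus 33.
# 'C' (ASCII 67) -> Q34 (error prob ~4e-4); ',' (ASCII 44) -> Q11 (~8e-2).
for char in C ,; do
  q=$(( $(printf '%d' "'$char") - 33 ))
  echo "'$char' -> Q$q"
done
```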

Phred Score Visualization

6. Quality control

Run fastp to discard low-quality reads and reads containing too many Ns.

mkdir -p fastp
fastp -i SRA/SRR22019447.fastq -o fastp/SRR22019447.fastq

The output:

 $ fastp -i SRA/SRR22019447.fastq -o fastp/SRR22019447.fastq
Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 21674
total bases: 5715494
Q20 bases: 5563899(97.3476%)
Q30 bases: 5293797(92.6219%)

Read1 after filtering:
total reads: 21607
total bases: 5706155
Q20 bases: 5560525(97.4478%)
Q30 bases: 5291800(92.7385%)

Filtering result:
reads passed filter: 21607
reads failed due to low quality: 67
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate (may be overestimated since this is SE data): 2.93439%

JSON report: fastp.json
HTML report: fastp.html

fastp -i SRA/SRR22019447.fastq -o fastp/SRR22019447.fastq
fastp v0.23.4, time used: 2 seconds
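The JSON report is handy for scripting. Here is a sketch of pulling the pass-filter count out of it with grep; the field name follows fastp's JSON report layout (verify against your own fastp.json), and demo_fastp.json below is a mocked-up fragment:

```shell
# Mock fragment of a fastp JSON report (the real one is fastp.json)
cat > demo_fastp.json <<'EOF'
{"filtering_result":{"passed_filter_reads":21607,"low_quality_reads":67}}
EOF
# Extract the pass-filter read count without a JSON parser
passed=$(grep -o '"passed_filter_reads":[0-9]*' demo_fastp.json | cut -d: -f2)
echo "reads passed filter: $passed"
```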

7. Run kraken2

mkdir -p kraken2

kraken2 --db /home/fangchao/database/viral fastp/SRR22019447.fastq --output kraken2/SRR22019447.out --report kraken2/SRR22019447.rpt

The output:

Loading database information... done.
70530 sequences (19.24 Mbp) processed in 1.760s (2404.4 Kseq/m, 655.78 Mbp/m).
  11083 sequences classified (15.71%)
  59447 sequences unclassified (84.29%)
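The .rpt file is tab-separated: percentage of reads, reads in the clade, reads assigned directly, rank code (U, R, D, ... S), NCBI taxid, and name. A sketch of pulling the unclassified count with awk, using a mocked-up two-line report (run the same awk on kraken2/SRR22019447.rpt):

```shell
# Mock two-line kraken2 report: pct, clade reads, direct reads, rank, taxid, name
printf ' 84.29\t59447\t59447\tU\t0\tunclassified\n 15.71\t11083\t50\tR\t1\troot\n' > demo.rpt
# Rank code "U" marks the unclassified bin
u=$(awk -F'\t' '$4 == "U" {print $2}' demo.rpt)
echo "unclassified reads: $u"
```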

Extended reading

1. How to install conda and create a virtual environment myself?

Search for conda and you will find their website:

https://docs.anaconda.com/miniconda/

Choose the installer that fits your OS and download it. I recommend Linux.

Then check the Getting started with conda page to practice the basic usage of conda.

How do you create the virtual environment we used today? Like this:

conda create -n QC -c bioconda fastp fastqc kraken2 sra-tools git megahit

2. How to build a kraken2 custom database

2.1 Download SARS-CoV-2 reference genomes from NCBI Datasets

Refer to Install NCBI Datasets command-line tools

conda install -c conda-forge ncbi-datasets-cli

Download a SARS-CoV-2 genome data package

datasets download virus genome taxon SARS-CoV-2 --filename sars_cov_2.zip

2.2 Build a custom database for kraken2

Refer to https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases

kraken2-build --download-taxonomy --db viral
kraken2-build --download-library viral --db viral

If the download of the viral library above fails, you can manually fetch a prebuilt database from benlangmead's aws-indexes:

wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20240605.tar.gz

2.3 Rename the custom sequences so that kraken2 can recognize them

perl -e ' my %NAME; my $pseudotaxid_start = 4000000; my $count = 0;
  # Map each accession to its Pangolin lineage and a unique pseudo-taxid
  open I, "< ncbi_dataset/data/data_report.jsonl";
  while(<I>){
    if($_ =~ /"accession":"([\w.]+)".*"pangolinClassification":"(\S+)","taxId"/){
        $count ++;
        $NAME{$1}{pangolinClassification} = $2;
        $NAME{$1}{taxid} = $pseudotaxid_start + $count;
    }
  }
  close I;
  # Rewrite each FASTA header to embed the pseudo-taxid in the
  # kraken:taxid format that kraken2-build recognizes
  open F, "< ncbi_dataset/data/genomic.fna";
  while(<F>){
    if($_ =~ /^>(\S+)/){
        my $id = $1;
        my $taxid = $NAME{$id}{taxid};
        print ">$id|kraken:taxid|$taxid $NAME{$id}{pangolinClassification}\n";
    }else{
        print $_;
    }
  }
  close F;
' > ncbi_dataset/data/rename_genomic.fna
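After renaming, every header should carry the kraken:taxid tag that kraken2-build scans for. A quick self-contained check of the header format (the accession OQ551261.1 and lineage XBB.1.5 below are made-up illustrations):

```shell
# A renamed header embeds the pseudo-taxid as ...|kraken:taxid|NNNN
printf '>OQ551261.1|kraken:taxid|4000001 XBB.1.5\nACGT\n' > demo.fna
taxid=$(grep -o 'kraken:taxid|[0-9]*' demo.fna | cut -d'|' -f2)
echo "embedded taxid: $taxid"
```

On the real file, `grep -c 'kraken:taxid' ncbi_dataset/data/rename_genomic.fna` should equal the number of sequences.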

grep ">" ncbi_dataset/data/rename_genomic.fna | awk -F '\\|| ' '{print $3"\t|\t"$4"\t|\t\t|\tpangolinClassification\t|"}' > taxonomy/added.name.dmp

grep ">" ncbi_dataset/data/rename_genomic.fna | awk -F '\\|| ' '{print $3"\t|\t2697049\t|\tsubspecies\t|\t\t|\t\t9\t\t|\t1\t|\t1\t|\t1\t|\t\t\t|\t1\t|\t|\t|\t|\t|\t|"}' > taxonomy/added.node.dmp

2.4 Add the custom sequences to the database

kraken2-build --add-to-library ncbi_dataset/data/rename_genomic.fna --db viral

2.5 Finally, build the database

kraken2-build --build --db viral