QualityControl - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
1. Log in to the server (Bastion Host)
2. Enable conda
The easiest way is to source my installed conda:
source /home/fangchao/conda_init.sh
3. Activate virtual environment
By activating this virtual environment, you get the following tools and packages without installing them yourself:
conda activate QC
4. Clone our practice repository if you haven't done so already
git clone https://github.com/BGIGPD/BestPractices4Pathogenomics.git
Enter the subfolder:
cd BestPractices4Pathogenomics/PRJNA636748
5. Download our selected public dataset
SRA datasets stored at NCBI can be downloaded with their officially released tool, the SRA Toolkit.
We need to configure it first:
vdb-config -i
Change cache to /home/fangchao/sra-cache
Then you can download the SRA files:
fastq-dump -O SRA SRR27676316 SRR22019447 SRR22019449 SRR16298563 SRR16298722 SRR17006792 SRR17712415 SRR16298506 SRR14736722 SRR19051760 SRR24152024 SRR19051863 SRR16298805 SRR14707730 SRR23318860
Take a look at a downloaded file to get familiar with the FASTQ format:
$ head -n 4 SRA/SRR22019447.fastq
@SRR22019447.1 1 length=186
CAGTTTACTCACACCTTTTGCTCGTTGCTCCTGGCCTTGAAGCCCCTTTTCTCTACCTTTATTCTTTAGTCTACTTCTTGCAGAGCATAAACAAGTTTATACTCTGCAAGAAGTAGACTAAATCATAAAGATAGAGAAAAGGGGCTTCAAGGTCAGGAGCAACGAGCAAAAGGTGTGAGTAAACTG
+SRR22019447.1 1 length=186
CCCCCFEFF9@,;CC,6FA9FC<FG<CCAFG<CECCGD@F8,,CF,:6EDF<FFG,C<,,,,,CC,C<,6C9@<@EFE,,B,C<F,B,FFFG,CCCC86@9CFGGGF<C9EF8EG,CE9,;,,66C@,6CF9,,6,,FEC8@7B@GGGCEA6,;,@6CB<FC,+:@@:FFGFGGCGE<,F9,,9@@
Where:
- The 1st line begins with '@' and contains the sequence ID
- The 2nd line contains the raw sequence of bases A, T, C or G, sometimes with 'N' for an unknown base.
- The 3rd line begins with a '+' and is an optional line for notes or comments; it can be ignored in most cases.
- The 4th line encodes the Phred score quality values for the sequence in the 2nd line.
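The quality characters on the 4th line decode to numeric Phred scores: each score is the character's ASCII code minus 33 (the Phred+33 offset used for modern Illumina data). A quick sketch decoding the first ten characters of the example quality string above:

```shell
# Decode Phred+33 quality characters into numeric scores.
# 'C' (ASCII 67) -> Q34, 'F' (ASCII 70) -> Q37, '9' (ASCII 57) -> Q24.
echo 'CCCCCFEFF9' | awk '
BEGIN { for (j = 33; j < 127; j++) ord[sprintf("%c", j)] = j }  # ASCII lookup table
{
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    printf "%s=%d ", c, ord[c] - 33
  }
  print ""
}'
# prints: C=34 C=34 C=34 C=34 C=34 F=37 E=36 F=37 F=37 9=24
```

A Q30 base corresponds to a 1-in-1000 error probability, which is why quality-control tools such as fastp report Q20 and Q30 fractions.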
6. Quality control
Run fastp to discard reads with low quality and too many Ns.
mkdir -p fastp
fastp -i SRA/SRR22019447.fastq -o fastp/SRR22019447.fastq
the output:
$ fastp -i SRA/SRR22019447.fastq -o fastp/SRR22019447.fastq
Detecting adapter sequence for read1...
No adapter detected for read1
Read1 before filtering:
total reads: 21674
total bases: 5715494
Q20 bases: 5563899(97.3476%)
Q30 bases: 5293797(92.6219%)
Read1 after filtering:
total reads: 21607
total bases: 5706155
Q20 bases: 5560525(97.4478%)
Q30 bases: 5291800(92.7385%)
Filtering result:
reads passed filter: 21607
reads failed due to low quality: 67
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0
Duplication rate (may be overestimated since this is SE data): 2.93439%
JSON report: fastp.json
HTML report: fastp.html
fastp -i SRA/SRR22019447.fastq -o fastp/SRR22019447.fastq
fastp v0.23.4, time used: 2 seconds
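As a sanity check on the numbers above: 21,607 of 21,674 reads passed, so only 67 low-quality reads were dropped. The pass rate can be computed directly:

```shell
# Pass rate from the fastp log above: reads passed / reads in.
awk 'BEGIN { printf "%.2f%%\n", 100 * 21607 / 21674 }'
# prints: 99.69%
```

A pass rate this high is expected for a dataset that was already mostly above Q20 before filtering.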
7. Run kraken2
mkdir -p kraken2
kraken2 --db /home/fangchao/database/viral fastp/SRR22019447.fastq --output kraken2/SRR22019447.out --report kraken2/SRR22019447.rpt
the output:
Loading database information... done.
70530 sequences (19.24 Mbp) processed in 1.760s (2404.4 Kseq/m, 655.78 Mbp/m).
11083 sequences classified (15.71%)
59447 sequences unclassified (84.29%)
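The report file is tab-separated with six columns: percentage of reads in the clade, reads in the clade, reads assigned directly, rank code, taxid, and name. A sketch for ranking clades by abundance; the mini-report below is a hypothetical stand-in for kraken2/SRR22019447.rpt (the unclassified/classified counts mirror the run above, while the root and Viruses rows are made up for illustration):

```shell
# Build a hypothetical mini-report with the same 6 tab-separated columns:
# percent, clade reads, direct reads, rank code, taxid, name
{
  printf ' 84.29\t59447\t59447\tU\t0\tunclassified\n'
  printf ' 15.71\t11083\t0\tR\t1\troot\n'
  printf ' 15.71\t11083\t0\tD\t10239\tViruses\n'
} > example.rpt

# Sort clades by read percentage, largest first
sort -k1,1nr example.rpt
```

On the real data you would run the same sort on kraken2/SRR22019447.rpt and pipe it through head to see the dominant taxa.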
Extended reading
1. How do I install conda and create a virtual environment myself?
Search for conda and you will find their website:
https://docs.anaconda.com/miniconda/
Choose the installer that fits your OS and download it. I recommend Linux.
Then check the Getting started with conda page to practice the basic usage of conda.
How do you create the virtual environment we used today?
Do it like this:
conda create -n QC -c bioconda fastqc kraken2 sra-tools git megahit
2. How to build a kraken2 custom database
2.1 Download the covid-19 reference genomes from NCBI Datasets
Refer to Install NCBI Datasets command-line tools
conda install -c conda-forge ncbi-datasets-cli
Download a SARS-CoV-2 genome data package
datasets download virus genome taxon SARS-CoV-2 --filename sars_cov_2.zip
2.2 Build a custom database for kraken2
Refer to https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases
kraken2-build --download-taxonomy --db viral
kraken2-build --download-library viral --db viral
If the download of the viral library above fails, you can manually download it from benlangmead aws-indexes:
wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20240605.tar.gz
2.3 Rename the custom sequences so that kraken2 can recognize them
perl -e ' my %NAME; my $pseudotaxid_start = 4000000; my $count = 0;
# map each accession to its pangolin lineage and a fresh pseudo-taxid
open I, "< ncbi_dataset/data/data_report.jsonl";
while(<I>){
  if($_ =~/"accession":"([\w.]+)".*"pangolinClassification":"(\S+)","taxId"/){
    $count ++;
    $NAME{$1}{pangolinClassification} = $2;
    $NAME{$1}{taxid} = $pseudotaxid_start + $count;
  }
}
close I;
# rewrite each FASTA header as >accession|kraken:taxid|taxid lineage
open F, "< ncbi_dataset/data/genomic.fna";
while(<F>){
  if($_ =~/^>(\S+)/){
    $id = $1;
    $taxid = $NAME{$id}{taxid};
    print ">$id|kraken:taxid|$taxid $NAME{$id}{pangolinClassification}\n";
  }else{
    print $_;
  }
}
close F;
' > ncbi_dataset/data/rename_genomic.fna
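Before building, it is worth checking that the rewritten headers carry the |kraken:taxid|NNN tag that kraken2-build looks for. A sketch using a hypothetical two-record FASTA (the accessions and lineage names are made up); on the real output you would grep ncbi_dataset/data/rename_genomic.fna the same way:

```shell
# Hypothetical renamed FASTA in the format produced by the script above
{
  printf '>OQ000001.1|kraken:taxid|4000001 BA.5\nACGT\n'
  printf '>OQ000002.1|kraken:taxid|4000002 XBB.1.5\nACGT\n'
} > rename_check.fna

# Every sequence header should contain the kraken:taxid tag
grep -c 'kraken:taxid' rename_check.fna
# prints: 2
```

If the count does not match the number of '>' headers, some accessions were missing from data_report.jsonl and got no pseudo-taxid.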
grep ">" ncbi_dataset/data/rename_genomic.fna | awk -F '\\|| ' '{print $3"\t|\t"$4"\t|\t\t|\tpangolinClassification\t|"}' > taxonomy/added.name.dmp
grep ">" ncbi_dataset/data/rename_genomic.fna | awk -F '\\|| ' '
{print $3"\t|\t2697049\t|\tsubspecies\t|\t\t|\t\t9\t\t|\t1\t|\t1\t|\t1\t|\t\t\t|\t1\t|\t|\t|\t|\t|\t|"}' > taxonomy/added.node.dmp
2.4 Add the custom sequences to the database
kraken2-build --add-to-library ncbi_dataset/data/rename_genomic.fna --db viral
2.5 Finally, build the database
kraken2-build --build --db viral