VirusGenomeConsensus - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
Consensus Sequence Generation for Virus Genome using Alignment-based Methods
1.Overview
In virology, a consensus sequence refers to a sequence determined by comparing a set of related sequences, representing the most commonly occurring nucleotides or amino acids at specific positions. This method is particularly important in virus research because viruses have high mutation rates, and their genomic sequences can vary significantly between different strains.
2.Objectives
Generate a consensus sequence for the SARS-CoV-2 genome
3.Softwares and Datas
Softwares
- bowtie2 v2.5.4
- samtools v1.21
- freebayes v1.3.8
- sra-tools v3.1.1
- ncbi-datasets-cli v16.33.0
- unzip v6.0
- bcftools v1.21
Datas
- SARS-CoV-2 ref genome NC_045512.2
- SARS-CoV-2 sequencing sequence SRR30872537
4.Steps
0.1. Software installation
First, we'll create a new Conda environment named virus_con
that will contain all the necessary bioinformatics tools.Navigate into the newly created environment by activating it.
conda create --name virus_con bowtie2 samtools freebayes sra-tools bcftools
conda activate virus_con
0.2. Download Viral Genome Data
Use the ncbi-datasets-cli
tool to download the viral genome dataset of interest. In this example, we're downloading the genome with accession number NC_045512.2.
datasets download virus genome accession NC_045512.2
Once the dataset is downloaded, unzip the file to access the data. Copy the genomic sequence data to a new file named NC_045512.2.fna for easier access and manipulation.
unzip ncbi_dataset.zip
mkdir data
cp ncbi_dataset/data/genomic.fna ./data/NC_045512.2.fna
0.3. Download Sequencing Data using SRA Tools
You can use fastq-dump
from the SRA Tools package to download the raw sequencing data from NCBI's Sequence Read Archive (SRA). The following command downloads the data for the accession number SRR30872537 and saves it in the data
directory.
fastq-dump -O data SRR30872537
If the automated download fails due to network problems, you can use wget
to manually download the .fastq.gz
files from the European Bioinformatics Institute (EBI)'s FTP server.
cd data
wget https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR308/037/SRR30872537/SRR30872537_1.fastq.gz
wget https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR308/037/SRR30872537/SRR30872537_2.fastq.gz
1. Build the Bowtie2 Index
First, you need to build an index of the reference genome for Bowtie2.
mkdir database
bowtie2-build NC_045512.2.fna ./database/NC_045512.2_idx
This command creates an index named NC_045512.2_idx
that Bowtie2 will use for the alignment process.
2. Align Reads with Bowtie2
Next, align your paired-end reads to the reference genome using Bowtie2.
bowtie2 -x ./database/NC_045512.2_idx -1 SRR30872537_1.fastq.gz -2 SRR30872537_2.fastq.gz --very-sensitive-local --rg-id "SRR30872537" --rg "SM:SRR30872537" -S aligned.sam
This command aligns the reads and outputs an uncompressed SAM file named aligned.sam
.
3. Convert SAM to BAM
Use Samtools to convert the SAM file to a sorted BAM file.
samtools view -bS aligned.sam > aligned.bam
samtools sort aligned.bam -o sorted_aligned.bam
The first command converts the SAM file to a BAM file, and the second command sorts the BAM file by coordinate order.
4. Index the BAM File
After sorting, index the BAM file for rapid random access.
samtools index sorted_aligned.bam
5. Call Variants with FreeBayes
Use FreeBayes to call variants from the sorted and indexed BAM file. -m 20: Use only reads with a mapping quality score of 20 or higher for variant calling. -p 1: Assume the sample is haploid (each chromosome has only one copy).
freebayes -f NC_045512.2.fna -m 20 -p 1 sorted_aligned.bam > variants.vcf
This command generates a VCF file named variants.vcf
containing the called variants.
bgzip
6: Compress the VCF file with First, you need to compress your VCF file using bgzip
. This tool is part of the htslib
library, which is also installed with bcftools
.
bgzip -c variants.vcf > variants.vcf.gz
The -c
option tells bgzip
to write the compressed output to standard output, which can be redirected to a file.
tabix
7: Index the compressed VCF file with After compressing the file, you should index it with tabix
to allow bcftools
to quickly access specific regions of the file.
tabix variants.vcf.gz
bcftools
8: Generate the consensus sequence with Now that your VCF file is compressed and indexed, you can use bcftools
to generate the consensus sequence.
bcftools consensus -f NC_045512.2.fna -o consensus.fasta variants.vcf.gz
This command tells bcftools
to read the compressed VCF file (variants.vcf.gz
) and use the reference sequence file (NC_045512.2.fna
) to generate the consensus sequence, which is output to consensus.fasta
.