VirusGenomeConsensus - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Consensus Sequence Generation for Virus Genome using Alignment-based Methods

1.Overview

In virology, a consensus sequence refers to a sequence determined by comparing a set of related sequences, representing the most commonly occurring nucleotides or amino acids at specific positions. This method is particularly important in virus research because viruses have high mutation rates, and their genomic sequences can vary significantly between different strains.

2.Objectives

Generate a consensus sequence for the SARS-CoV-2 genome

3.Softwares and Datas

Softwares

bowtie2 v2.5.4
samtools v1.21
freebayes v1.3.8
sra-tools v3.1.1
ncbi-datasets-cli v16.33.0
unzip v6.0
bcftools v1.21

Datas

SARS-CoV-2 ref genome NC_045512.2
SARS-CoV-2 sequencing sequence SRR30872537

4.Steps

0.1. Software installation

First, we'll create a new Conda environment named virus_con that will contain all the necessary bioinformatics tools.Navigate into the newly created environment by activating it.

conda create --name virus_con bowtie2 samtools freebayes sra-tools bcftools

conda activate virus_con

0.2. Download Viral Genome Data

Use the ncbi-datasets-cli tool to download the viral genome dataset of interest. In this example, we're downloading the genome with accession number NC_045512.2.

datasets download virus genome accession NC_045512.2

Once the dataset is downloaded, unzip the file to access the data. Copy the genomic sequence data to a new file named NC_045512.2.fna for easier access and manipulation.

unzip ncbi_dataset.zip
mkdir data
cp ncbi_dataset/data/genomic.fna ./data/NC_045512.2.fna

0.3. Download Sequencing Data using SRA Tools

SRR30872537

You can use fastq-dump from the SRA Tools package to download the raw sequencing data from NCBI's Sequence Read Archive (SRA). The following command downloads the data for the accession number SRR30872537 and saves it in the data directory.

fastq-dump -O data SRR30872537

If the automated download fails due to network problems, you can use wget to manually download the .fastq.gz files from the European Bioinformatics Institute (EBI)'s FTP server.

cd data
wget https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR308/037/SRR30872537/SRR30872537_1.fastq.gz
wget https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR308/037/SRR30872537/SRR30872537_2.fastq.gz

1. Build the Bowtie2 Index

First, you need to build an index of the reference genome for Bowtie2.

mkdir database
bowtie2-build NC_045512.2.fna ./database/NC_045512.2_idx

This command creates an index named NC_045512.2_idx that Bowtie2 will use for the alignment process.

2. Align Reads with Bowtie2

Next, align your paired-end reads to the reference genome using Bowtie2.

bowtie2 -x ./database/NC_045512.2_idx -1 SRR30872537_1.fastq.gz -2 SRR30872537_2.fastq.gz --very-sensitive-local  --rg-id "SRR30872537" --rg "SM:SRR30872537" -S aligned.sam

This command aligns the reads and outputs an uncompressed SAM file named aligned.sam.

3. Convert SAM to BAM

Use Samtools to convert the SAM file to a sorted BAM file.

samtools view -bS aligned.sam > aligned.bam
samtools sort aligned.bam -o sorted_aligned.bam

The first command converts the SAM file to a BAM file, and the second command sorts the BAM file by coordinate order.

4. Index the BAM File

After sorting, index the BAM file for rapid random access.

samtools index sorted_aligned.bam

5. Call Variants with FreeBayes

Use FreeBayes to call variants from the sorted and indexed BAM file. -m 20: Use only reads with a mapping quality score of 20 or higher for variant calling. -p 1: Assume the sample is haploid (each chromosome has only one copy).

freebayes -f NC_045512.2.fna -m 20 -p 1 sorted_aligned.bam > variants.vcf

This command generates a VCF file named variants.vcf containing the called variants.

6: Compress the VCF file with `bgzip`

First, you need to compress your VCF file using bgzip. This tool is part of the htslib library, which is also installed with bcftools.

bgzip -c variants.vcf > variants.vcf.gz

The -c option tells bgzip to write the compressed output to standard output, which can be redirected to a file.

7: Index the compressed VCF file with `tabix`

After compressing the file, you should index it with tabix to allow bcftools to quickly access specific regions of the file.

tabix  variants.vcf.gz

8: Generate the consensus sequence with `bcftools`

Now that your VCF file is compressed and indexed, you can use bcftools to generate the consensus sequence.

bcftools consensus -f NC_045512.2.fna -o consensus.fasta variants.vcf.gz

This command tells bcftools to read the compressed VCF file (variants.vcf.gz) and use the reference sequence file (NC_045512.2.fna) to generate the consensus sequence, which is output to consensus.fasta.