NGS IV: Exome - bcfgothenburg/HT23 GitHub Wiki

Course: HT23 Analysis-of-next-generation-sequencing-data (SC00204)

The purpose of this exercise for you to become familiar with the workflow used to analyze targeted sequencing data and identify disease causing mutations.

The Data

Our data involves exome sequencing data from an affected individual with Alpers syndrome (which is a progressive neurodegenerative disorder) and one control sample from a healthy individual. The data was taken from Sofou K, et al., "Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome". Mol Genet Genomic Med. 2015.

Sample	Description
AS_2	Affected
AS_ctrl	Unaffected

The server

Connect to the server using MobaXterm (PC users) or your local Terminal (MACS users):
```
ssh -Y your_account@remote_server 
```

Load the following modules:

annovar/20200608
bcftools/1.17
bwa-mem2/2.2.1
delly/1.1.6
fastqc/0.12.1
gatk/4.4.0.0
igv/2.16.1
multiqc/1.14
R/4.3.1
samtools/1.17
picard/3.0.0
trimgalore/0.6.10

Create a directory called Exome
Go to this directory
Create a soft link of the entire /home/ftp_courses/NGS/Exome/Fastq
Create another soft link of the entire /home/ftp_courses/NGS/Exome/db

NOTE: if you need to remove a soft link, use:
```
unlink <link_name>
```

Remember that you can create bash scripts so your jobs use the resources of the cluster rather than only the login node. Although it is not compulsory for this practical, it would be a good practice. Besides if you submit a job and you get disconnected from the cluster (for whatever reason) the job will keep on running/queued!

QC, filtering and mapping

Due to limit in time, we will be working with data from chr11 (so some results may look bad).

Run FastQC on your samples and evaluate the quality of the data

NOTE: Remember that we can use a loop to run a command on multiple samples. Here an example on how to index all bam files in a directory
```
# running in the command line
for i in *bam; do samtools index $i; done

# running as a job
for i in *bam; do qsub runSamtools.sh $i;done
```
Run TrimGalore! as you see fit and check that your approach improved your data quality. Hint: You can use multiqc to gather your results

Q. How does the quality look like, before and after trimming? How many reads were trimmed for both the control and the patient sample? Have a look under the FastQC folder for the raw fastq files, and under the TrimGalore for the filtered fastq files.

Map the filtered reads towards chr11.fa using bwa-mem2
- Save the alignment as: YOUR_SAMPLE.bwa.bam
- The reference genome is under the db directory

Run AddOrReplaceReadGroups from picard using the following parameters (you can more if you want!):

-I             -> YOUR_SAMPLE.bwa.bam 
-O             -> YOUR_SAMPLE.rg.sort.bwa.bam 
-SM            -> AS_2 or AS_ctrl 
-LB            -> TruSeq 
-PL            -> Illumina 
-PU            -> CGG 
-SO            -> coordinate
--CREATE_INDEX -> true

Generate the flagstat and idxstats for each alignment file

Q. What is the mapping rate in the sample (% of mapped reads)? Is it really sequencing data from chr11?

On-target coverage

Since the data was from a targeted approach, let's inspect the on-target coverage plot. GATK's DepthOfCoverage can help us in this task. The tool is labeled as being in BETA version however, this tool has been part of GATK since the beginning and works just fine.

Run the following code for each alignment file:

gatk DepthOfCoverage \
   -R db/chr11.fa \
   -I YOUR_SAMPLE.rg.sort.bwa.bam  \
   -O YOUR_SAMPLE_NAME.metrics \
   -L chr11.exon.bed \
   --output-format TABLE \
   --summary-coverage-threshold 10 \
   --summary-coverage-threshold 30 \
   --summary-coverage-threshold 50

Several files are generated:

Have a look at *.sample_summary files

Q. What is the mean coverage per sample? Do you think it is enough sequencing?

Inspect the *.sample_interval_summary files.
- Each row shows the coverage of each targeted region (exons) at different depths (last three columns)

Q. For each sample, which exon (if any) is covered at least 300 times in average?
What percentage of the exons are fully covered at least 50 times?

The *.sample_cumulative_coverage_proportions files have two rows (really long rows!). Showing the first five columns (this values are for the entire dataset, not only chr21):
```
        gte_0   gte_1   gte_2   gte_3
AS_ctrl 1.00    0.96    0.94    0.93
```
- Each row is the coverage information per sample (only one in our case)
- Each column shows the aggregated proportion of the regions with >= of the corresponding coverage. In other words,:
  - gte_0 - 1.00 tells us that 100% of the regions are covered cero or more times;
  - gte_1 - 0.96 tells us that 96% of the regions are covered one or more times;
  - gte_2 - 0.94 tells us that 94% of the regions are covered two or more times;
  - gte_3 - 0.93 tells us that 93% of the regions are covered three or more times;

To better understand this data, let's make a plot with this information.

First we transpose our file (changing columns to rows) and then, we convert it to percentage
Remember you can use your own code or directly use R, this is just one way of doing it:

tail -1 SAMPLE_NAME.sample_cumulative_coverage_proportions | \
   sed 's/\t/\n/g' | \
   tail -n +2 | \
   awk 'OFS="\t" {print $1*100,num++}' > SAMPLE_NAME.proportions

The first lines should look like:
```
100     0
29      1
15      2
10      3
```

Open R in the terminal by typing R

Type in the following code
Remember to change the name of the file you want to use:

# reading the data
data <- read.table("SAMPLE_NAME.proportions")

# setting name of figure
png("SAMPLE_NAME.coverage.png")

# scatter plot
plot (data$V2,data$V1,type="l",log="x",ylim=c(0,100),
      ylab ="% On-target",xlab="Depth", main="On-target coverage SAMPLE_NAME")

# adding lines at different coveraged
abline(v=c(10,20,30,50),col="gray",lty=2)

# adding lines at different % of on-target
abline(h=c(50,70,80),col=c(51,"gray",51),lty=2)

# adding X and % within the graph
text(c(8,17,26,43),c(100,100,100,100),c("10x","20x","30x","50x"),col="gray")
text(c(1,1,1),c(47,67,77),c("50%","70%","80%"),col=c(51,"gray",51))

# closing graphical device
dev.off()

Type q() to exit the R session

Q. Looking at the plots, do we need to be worried about under-sequencing?

PCR duplicates

When dealing with variant calling, an important step is to identify possible PCR duplicates, this to remove any bias when we are doing the actual calling.

Run MarkDuplicates from picard using (at least) these arguments:

-I             -> YOUR_SAMPLE.bwa.sort.fix.bam 
-O             -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam
-M             -> YOUR_SAMPLE.metrics 
--CREATE_INDEX -> TRUE

Generate flagstat and idxstats metrics for this new alignment file
Run multiqc to include these statistics

Q. How many duplicates do the samples have? Which sample has more sequencing reads?

Base Quality Score Recalibration (BQSR)

This pre-processing step detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call. Most of the short variant calling tools rely on the base quality score, so it is important to correct these systematic technical errors.

First, we will produce a recalibration file from our input data using the BaseRecalibrator tool. To speed up the processing time, let's generate a file containing only SNPs from chr11. (You can use your own code if you prefer):

# keeping the header of the VCF file
grep ^#  db/dbsnp_138.b37.vcf | \
sed 's/ID=11/ID=chr11/' | \
grep -P  "contig=<ID=[1-9,A-Z]" -v | \
grep -P  "<ID=GL" -v | \
sed 's/135006516/57311868/' > title  

# extracting SNPs in chr11
grep ^11 -w  db/dbsnp_138.b37.vcf | \
awk '$2 < 57311868 {print $0}' | \
sed 's/^11/chr11/'  > 11

# merging title and data, formatting to add "chr", removing unwanted rows
cat title 11 > dbsnp_138.b37.chr11.vcf

Index the file using gatk IndexFeatureFile and:
```
-I  -> dbsnp_138.b37.chr11.vcf 
```

Run the gatk BaseRecalibrator for both samples, with :

-R            -> chr11_reference_file 
-I            -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam 
-O            -> YOUR_SAMPLE.table 
--known-sites -> THE_VCF_FILE_YOU_JUST_CREATED

Q. What are the --known_sites?

Now let's correct the scores of the sample.

Run gatk ApplyBQSR for both alignments, with:

-R                -> chr11_reference_file 
-I                -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam 
-O                -> YOUR_SAMPLE.sort.fix.mkdup.recal.bam  
--bqsr-recal-file -> THE_TABLE_FILE_YOU_JUST_CREATED

Variant discovery (GATK)

The Genome Analysis Toolkit or GATK is a software package developed to analyze next-generation resequencing data, focusing on variant discovery and genotyping. For the variant calling we will use the HaplotypeCaller, which is an SNP/indel caller that uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples.

Run gatk HaplotypeCaller to do the variant calling, with these parameters:

-R -> chr11_reference_file 
-I -> YOUR_SAMPLE.bwa.sort.fix.mkdup.recal.bam 
-O -> YOUR_SAMPLE.gatk.raw.vcf

Inspect the VCF files you just created and get familiar with their information.

Q. How do you differentiate between a homozygous mutations from an heterozygous?
How many SNPs do you have in each sample?
Are there any insertions/deletions?

Let's convert the VCF files into a tab delimited files.

Use gatk VariantsToTable with:

-O  -> YOUR_SAMPLE.gatk.raw.table 
-V  -> YOUR_SAMPLE.gatk.raw.vcf 
-F  -> CHROM  
-F  -> POS 
-F  -> ID 
-F  -> REF 
-F  -> ALT 
-F  -> QUAL 
-F  -> FILTER 
-GF -> AC 
-GF -> GT 
-GF -> AD 
-GF -> DP

We will use this file later. Have a look and understand the output.

SNP-callers emit a number of false-positive mutations, as any prediction program. Some of these false-positives can be detected and rejected by various statistical tests. Filtering or flagging the variant calls in the VCF file is a useful step to detect such false-positives.

Run gatk VariantFiltration with:

-O                  -> YOUR_SAMPLE.gatk.raw.flag.vcf 
-V                  -> YOUR_SAMPLE.gatk.raw.vcf 
--filter-name       -> "LowDP" 
--filter-expression -> "DP < 10.0"

These are hard filters that are applied when a variant callset is too small or when training sets are not available. Try to understand the filtering done above. You may find some info at the beginning of the VCF file.

Q. How many variants have passed our filtering criteria for both files?

Variant annotation (Annovar)

Now we need to annotate our variants in terms of genomic context, aminoacid change and effect, previous mutation knowledge, etc. so we can evaluate and filter mutations that are irrelevant.

ANNOVAR is a tool that annotates genetic variants. The overall procedure is to create an input file from the VCF file and then annotate the variant positions with information from several databases.

Reformat both alignment files with convert2annovar.pl using:

-format  -> vcf4 
-outfile -> YOUR_FILE.gatk.raw.flag.annovar
YOUR_FILE.gatk.raw.flag.vcf

Add the annotation to the alignments using table_annovar.pl:

  -buildver  -> hg19 
  -protocol  -> dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316 
  -operation -> f,g,f,f,f 
  -outfile   -> YOUR_FILE.gatk.raw.flag.annovar.table 
  --nastring -> - 
  --remove 
  YOUR_FILE.gatk.raw.flag.annovar 
  /home/ftp_courses/NGS/Exome/db/humandb

Q. What kind of information are we adding to our mutations? Have a look at the file with .hg19_multianno.txt extension. Here you can find some information.

Merge the *hg19_multianno.txt file with the *gatk.raw.flag.table , and save it in a file with extension .tsv. Hint: you can use paste
- The file should look like:

This file now has statistics and annotations for the mutations. If you have a look here is a description of some columns:

Column	Description
AC	Allele count
GT	Genotype
AD	Allele depth
DP	Depth
SIFT	D = damaging, T = tolerated
HDIV	D = probably damaging, P = possibly damaging, B = benign
HVAR	D = probably damaging, P = possibly damaging, B = benign
LRT	D = deleterious, N = Neutral , U = Unknown
MutationTaster	A=disease_causing_automatic, D=disease_causing, N=polymorphism, P=polymorphism_automatic
phyloP100way	the higher the value the more conserved is the aminoacid in other organisms

This is how you read the AAChange column:

PGM1:NM_001172818:exon2:c.A316G:p.I106V,PGM1:NM_002633:exon2:c.A262G:p.I88V

PGM1 -> Gene name
	NM_001172818 => Transcript ID_1
		exon2 => exon where the mutation is found
			c.A316G => in the DNA, at position 316 the reference base A is changed to an G
			p.I106V => in the protein, at positon 106 the reference aminoacid I is changed to a V
PGM1 => Gene name
	NM_002633 => Transcript ID_2
		exon2 => exon where the mutation is found	
			A262G => in the DNA, at position 262 the reference base A is changed to a G
			p.I88V => in the protein, at position 88 the reference aminoacid I is changed to a V

Variant filtering

With all this information, it is possible to narrow down the list of mutations to a handful of SNPs. Think of a filtering strategy and filter your SNPs. Don't forget to look at both samples (Affected and Unaffected).

Q. Which are your candidate genes? What was your filtering strategy?

Once you have your list, visual inspection of your candidates helps in assessing them further. Have a look at your candidates in IGV, you may want to load both of your VCF files as well as the BAM files.

Automating visualization

When you have multiple files and positions to investigate in IGV you can save time by creating a bat script. The following script take snapshots of the coordinate chr11:1017135 across AS_2 and Control using both their bams and mutation files.

Copy this script to a file called bashIGV.bat, make sure that the paths to the bam and vcf files are correct.

new 
genome hg19 
load PATH_TO_FILE/AS_2.bwa.sort.fix.mkdup.recal.bam 
load PATH_TO_FILE/AS_2.gatk.raw.flag.vcf
load PATH_TO_FILE/AS_ctrl.bwa.sort.fix.mkdup.recal.bam 
load PATH_TO_FILE/AS_ctrl.gatk.raw.flag.vcf
snapshotDirectory . 
squish 
goto chr11:1017135 
snapshot 1017135_Exomedata.png 
exit

Run the script with the following command (don't forget to make it executable!):
```
igv.sh -b bashIGV.bat
```
Examine the resulting png 1017135_Exomedata.png
Change the script to zoom into other mutations of interest, don't forget to change the output name of the png file to the corresponding position.

Q. How are the mutations distributed? Are they all in exons?
How do you differentiate between heterozygous and homozygous mutations?
Some mutations are slightly faded, what does this mean?

Q. Could you filter out some of your candidates with your visual inspection? Which are you final candidate genes?

Structural variants (OPTIONAL)

Now let's analyse WGS data to see if we can identify any structural variants. When analyzing tumor samples you normally have two datasets; one normal (germline) sample and one tumor sample (somatic). To save time we will only work with two tumor samples from chromosome 17, these are breast cancer cell lines: HCC1143 and HCC1954.

Create a soft link to the entire folder /home/ftp_courses/NGS/Exome/SV

Before running any variant calling, we need to mark duplicates.

Run MarkDuplicates for each alignment file, with the following parameters:

-I -> YOUR_SAMPLE.tumor.bam 
-O -> YOUR_SAMPLE.tumor.mkdup.bam 
-M -> YOUR_SAMPLE.tumor.metrics 
--CREATE_INDEX -> true

Run the variant calling with delly call to find any structural variants:

delly call \
-g Homo_sapiens_assembly19.fasta_WITHIN_/db
-o ALL.tumor.bcf
list_of_alignments_separated_by_space

The default output is in bcf format.

Convert the file to vcf format using bcftools:

bcftools view ALL.tumor.bcf > ALL.tumor.vcf

Take a look at the vcf file.

Q. How many structural variant did DELLY find?

Filter your vcf file, keeping only PASS quality variants.

Q. How many variants are left after filtering for quality PASS?

Remove all variants annotates as IMPRECISE

Q. How many variant are left after filtering based on PRECISE. Which variants are unique to sample HCC1143? and for HCC1954?

From the filtered vcf file, choose two regions (one with a deletion (DEL) and one with an inversion (INV)) to visualize in IGV.

Load the bam and vcf files in IGV, and zoom in to the regions you decided to look at.

Q. Visually, do you agree with DELLY? Is the variant classified correct as DEL and INV? And what about the zygousity, is the variant correctly classified as homozygous or heterozygous (look at GT field in vcf)?

Home: Analysis of next generation sequencing data (SC00024)

Developed by Marcela Dávila, 2017. Modified by Marcela Davila, Sanna Abrahamsson and Vanja Börjesson, 2021. Updated by Marcela Dávila, 2023.