NGS IV: Exome - bcfgothenburg/HT23 GitHub Wiki

Course: HT23 Analysis-of-next-generation-sequencing-data (SC00204)


The purpose of this exercise for you to become familiar with the workflow used to analyze targeted sequencing data and identify disease causing mutations.



The Data

Our data involves exome sequencing data from an affected individual with Alpers syndrome (which is a progressive neurodegenerative disorder) and one control sample from a healthy individual. The data was taken from Sofou K, et al., "Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome". Mol Genet Genomic Med. 2015.

Sample Description
AS_2 Affected
AS_ctrl Unaffected

The server

  • Connect to the server using MobaXterm (PC users) or your local Terminal (MACS users):

    ssh -Y your_account@remote_server 
  • Load the following modules:

    annovar/20200608
    bcftools/1.17
    bwa-mem2/2.2.1
    delly/1.1.6
    fastqc/0.12.1
    gatk/4.4.0.0
    igv/2.16.1
    multiqc/1.14
    R/4.3.1
    samtools/1.17
    picard/3.0.0
    trimgalore/0.6.10
  • Create a directory called Exome

  • Go to this directory

  • Create a soft link of the entire /home/ftp_courses/NGS/Exome/Fastq

  • Create another soft link of the entire /home/ftp_courses/NGS/Exome/db

    NOTE: if you need to remove a soft link, use:

    unlink <link_name>

Remember that you can create bash scripts so your jobs use the resources of the cluster rather than only the login node. Although it is not compulsory for this practical, it would be a good practice. Besides if you submit a job and you get disconnected from the cluster (for whatever reason) the job will keep on running/queued!

QC, filtering and mapping

Due to limit in time, we will be working with data from chr11 (so some results may look bad).

  • Run FastQC on your samples and evaluate the quality of the data

    NOTE: Remember that we can use a loop to run a command on multiple samples. Here an example on how to index all bam files in a directory

    # running in the command line
    for i in *bam; do samtools index $i; done
    
    # running as a job
    for i in *bam; do qsub runSamtools.sh $i;done
  • Run TrimGalore! as you see fit and check that your approach improved your data quality. Hint: You can use multiqc to gather your results

Q. How does the quality look like, before and after trimming? How many reads were trimmed for both the control and the patient sample? Have a look under the FastQC folder for the raw fastq files, and under the TrimGalore for the filtered fastq files.

  • Map the filtered reads towards chr11.fa using bwa-mem2

    • Save the alignment as: YOUR_SAMPLE.bwa.bam
    • The reference genome is under the db directory
  • Run AddOrReplaceReadGroups from picard using the following parameters (you can more if you want!):

    -I             -> YOUR_SAMPLE.bwa.bam 
    -O             -> YOUR_SAMPLE.rg.sort.bwa.bam 
    -SM            -> AS_2 or AS_ctrl 
    -LB            -> TruSeq 
    -PL            -> Illumina 
    -PU            -> CGG 
    -SO            -> coordinate
    --CREATE_INDEX -> true
  • Generate the flagstat and idxstats for each alignment file

Q. What is the mapping rate in the sample (% of mapped reads)? Is it really sequencing data from chr11?

On-target coverage

Since the data was from a targeted approach, let's inspect the on-target coverage plot. GATK's DepthOfCoverage can help us in this task. The tool is labeled as being in BETA version however, this tool has been part of GATK since the beginning and works just fine.

  • Run the following code for each alignment file:
    gatk DepthOfCoverage \
       -R db/chr11.fa \
       -I YOUR_SAMPLE.rg.sort.bwa.bam  \
       -O YOUR_SAMPLE_NAME.metrics \
       -L chr11.exon.bed \
       --output-format TABLE \
       --summary-coverage-threshold 10 \
       --summary-coverage-threshold 30 \
       --summary-coverage-threshold 50    

Several files are generated:

  • Have a look at *.sample_summary files

Q. What is the mean coverage per sample? Do you think it is enough sequencing?

  • Inspect the *.sample_interval_summary files.
    • Each row shows the coverage of each targeted region (exons) at different depths (last three columns)

Q. For each sample, which exon (if any) is covered at least 300 times in average?
What percentage of the exons are fully covered at least 50 times?

  • The *.sample_cumulative_coverage_proportions files have two rows (really long rows!). Showing the first five columns (this values are for the entire dataset, not only chr21):

            gte_0   gte_1   gte_2   gte_3
    AS_ctrl 1.00    0.96    0.94    0.93
    • Each row is the coverage information per sample (only one in our case)
    • Each column shows the aggregated proportion of the regions with >= of the corresponding coverage. In other words,:
      • gte_0 - 1.00 tells us that 100% of the regions are covered cero or more times;
      • gte_1 - 0.96 tells us that 96% of the regions are covered one or more times;
      • gte_2 - 0.94 tells us that 94% of the regions are covered two or more times;
      • gte_3 - 0.93 tells us that 93% of the regions are covered three or more times;

To better understand this data, let's make a plot with this information.

  • First we transpose our file (changing columns to rows) and then, we convert it to percentage
    Remember you can use your own code or directly use R, this is just one way of doing it:

    tail -1 SAMPLE_NAME.sample_cumulative_coverage_proportions | \
       sed 's/\t/\n/g' | \
       tail -n +2 | \
       awk 'OFS="\t" {print $1*100,num++}' > SAMPLE_NAME.proportions
    • The first lines should look like:

      100     0
      29      1
      15      2
      10      3
  • Open R in the terminal by typing R

  • Type in the following code
    Remember to change the name of the file you want to use:

    # reading the data
    data <- read.table("SAMPLE_NAME.proportions")
    
    # setting name of figure
    png("SAMPLE_NAME.coverage.png")
    
    # scatter plot
    plot (data$V2,data$V1,type="l",log="x",ylim=c(0,100),
          ylab ="% On-target",xlab="Depth", main="On-target coverage SAMPLE_NAME")
    
    # adding lines at different coveraged
    abline(v=c(10,20,30,50),col="gray",lty=2)
    
    # adding lines at different % of on-target
    abline(h=c(50,70,80),col=c(51,"gray",51),lty=2)
    
    # adding X and % within the graph
    text(c(8,17,26,43),c(100,100,100,100),c("10x","20x","30x","50x"),col="gray")
    text(c(1,1,1),c(47,67,77),c("50%","70%","80%"),col=c(51,"gray",51))
    
    # closing graphical device
    dev.off()
  • Type q() to exit the R session

Q. Looking at the plots, do we need to be worried about under-sequencing?

PCR duplicates

When dealing with variant calling, an important step is to identify possible PCR duplicates, this to remove any bias when we are doing the actual calling.

  • Run MarkDuplicates from picard using (at least) these arguments:

    -I             -> YOUR_SAMPLE.bwa.sort.fix.bam 
    -O             -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam
    -M             -> YOUR_SAMPLE.metrics 
    --CREATE_INDEX -> TRUE
  • Generate flagstat and idxstats metrics for this new alignment file

  • Run multiqc to include these statistics

Q. How many duplicates do the samples have? Which sample has more sequencing reads?

Base Quality Score Recalibration (BQSR)

This pre-processing step detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call. Most of the short variant calling tools rely on the base quality score, so it is important to correct these systematic technical errors.

First, we will produce a recalibration file from our input data using the BaseRecalibrator tool. To speed up the processing time, let's generate a file containing only SNPs from chr11. (You can use your own code if you prefer):

# keeping the header of the VCF file
grep ^#  db/dbsnp_138.b37.vcf | \
sed 's/ID=11/ID=chr11/' | \
grep -P  "contig=<ID=[1-9,A-Z]" -v | \
grep -P  "<ID=GL" -v | \
sed 's/135006516/57311868/' > title  

# extracting SNPs in chr11
grep ^11 -w  db/dbsnp_138.b37.vcf | \
awk '$2 < 57311868 {print $0}' | \
sed 's/^11/chr11/'  > 11

# merging title and data, formatting to add "chr", removing unwanted rows
cat title 11 > dbsnp_138.b37.chr11.vcf
  • Index the file using gatk IndexFeatureFile and:

    -I  -> dbsnp_138.b37.chr11.vcf 
  • Run the gatk BaseRecalibrator for both samples, with :

    -R            -> chr11_reference_file 
    -I            -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam 
    -O            -> YOUR_SAMPLE.table 
    --known-sites -> THE_VCF_FILE_YOU_JUST_CREATED

Q. What are the --known_sites?

Now let's correct the scores of the sample.

  • Run gatk ApplyBQSR for both alignments, with:

    -R                -> chr11_reference_file 
    -I                -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam 
    -O                -> YOUR_SAMPLE.sort.fix.mkdup.recal.bam  
    --bqsr-recal-file -> THE_TABLE_FILE_YOU_JUST_CREATED 

Variant discovery (GATK)

The Genome Analysis Toolkit or GATK is a software package developed to analyze next-generation resequencing data, focusing on variant discovery and genotyping. For the variant calling we will use the HaplotypeCaller, which is an SNP/indel caller that uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples.

  • Run gatk HaplotypeCaller to do the variant calling, with these parameters:

    -R -> chr11_reference_file 
    -I -> YOUR_SAMPLE.bwa.sort.fix.mkdup.recal.bam 
    -O -> YOUR_SAMPLE.gatk.raw.vcf  
  • Inspect the VCF files you just created and get familiar with their information.

Q. How do you differentiate between a homozygous mutations from an heterozygous?
How many SNPs do you have in each sample?
Are there any insertions/deletions?

Let's convert the VCF files into a tab delimited files.

  • Use gatk VariantsToTable with:

    -O  -> YOUR_SAMPLE.gatk.raw.table 
    -V  -> YOUR_SAMPLE.gatk.raw.vcf 
    -F  -> CHROM  
    -F  -> POS 
    -F  -> ID 
    -F  -> REF 
    -F  -> ALT 
    -F  -> QUAL 
    -F  -> FILTER 
    -GF -> AC 
    -GF -> GT 
    -GF -> AD 
    -GF -> DP 

We will use this file later. Have a look and understand the output.

SNP-callers emit a number of false-positive mutations, as any prediction program. Some of these false-positives can be detected and rejected by various statistical tests. Filtering or flagging the variant calls in the VCF file is a useful step to detect such false-positives.

  • Run gatk VariantFiltration with:

    -O                  -> YOUR_SAMPLE.gatk.raw.flag.vcf 
    -V                  -> YOUR_SAMPLE.gatk.raw.vcf 
    --filter-name       -> "LowDP" 
    --filter-expression -> "DP < 10.0"

These are hard filters that are applied when a variant callset is too small or when training sets are not available. Try to understand the filtering done above. You may find some info at the beginning of the VCF file.

Q. How many variants have passed our filtering criteria for both files?

Variant annotation (Annovar)

Now we need to annotate our variants in terms of genomic context, aminoacid change and effect, previous mutation knowledge, etc. so we can evaluate and filter mutations that are irrelevant.

ANNOVAR is a tool that annotates genetic variants. The overall procedure is to create an input file from the VCF file and then annotate the variant positions with information from several databases.

  • Reformat both alignment files with convert2annovar.pl using:

    -format  -> vcf4 
    -outfile -> YOUR_FILE.gatk.raw.flag.annovar
    YOUR_FILE.gatk.raw.flag.vcf 
  • Add the annotation to the alignments using table_annovar.pl:

      -buildver  -> hg19 
      -protocol  -> dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316 
      -operation -> f,g,f,f,f 
      -outfile   -> YOUR_FILE.gatk.raw.flag.annovar.table 
      --nastring -> - 
      --remove 
      YOUR_FILE.gatk.raw.flag.annovar 
      /home/ftp_courses/NGS/Exome/db/humandb

Q. What kind of information are we adding to our mutations? Have a look at the file with .hg19_multianno.txt extension. Here you can find some information.

  • Merge the *hg19_multianno.txt file with the *gatk.raw.flag.table , and save it in a file with extension .tsv. Hint: you can use paste

    • The file should look like:

This file now has statistics and annotations for the mutations. If you have a look here is a description of some columns:

Column Description
AC Allele count
GT Genotype
AD Allele depth
DP Depth
SIFT D = damaging, T = tolerated
HDIV D = probably damaging, P = possibly damaging, B = benign
HVAR D = probably damaging, P = possibly damaging, B = benign
LRT D = deleterious, N = Neutral , U = Unknown
MutationTaster A=disease_causing_automatic, D=disease_causing, N=polymorphism, P=polymorphism_automatic
phyloP100way the higher the value the more conserved is the aminoacid in other organisms

This is how you read the AAChange column:

PGM1:NM_001172818:exon2:c.A316G:p.I106V,PGM1:NM_002633:exon2:c.A262G:p.I88V

PGM1 -> Gene name
	NM_001172818 => Transcript ID_1
		exon2 => exon where the mutation is found
			c.A316G => in the DNA, at position 316 the reference base A is changed to an G
			p.I106V => in the protein, at positon 106 the reference aminoacid I is changed to a V
PGM1 => Gene name
	NM_002633 => Transcript ID_2
		exon2 => exon where the mutation is found	
			A262G => in the DNA, at position 262 the reference base A is changed to a G
			p.I88V => in the protein, at position 88 the reference aminoacid I is changed to a V

Variant filtering

With all this information, it is possible to narrow down the list of mutations to a handful of SNPs. Think of a filtering strategy and filter your SNPs. Don't forget to look at both samples (Affected and Unaffected).

Q. Which are your candidate genes? What was your filtering strategy?

Once you have your list, visual inspection of your candidates helps in assessing them further. Have a look at your candidates in IGV, you may want to load both of your VCF files as well as the BAM files.

Automating visualization

When you have multiple files and positions to investigate in IGV you can save time by creating a bat script. The following script take snapshots of the coordinate chr11:1017135 across AS_2 and Control using both their bams and mutation files.

  • Copy this script to a file called bashIGV.bat, make sure that the paths to the bam and vcf files are correct.

    new 
    genome hg19 
    load PATH_TO_FILE/AS_2.bwa.sort.fix.mkdup.recal.bam 
    load PATH_TO_FILE/AS_2.gatk.raw.flag.vcf
    load PATH_TO_FILE/AS_ctrl.bwa.sort.fix.mkdup.recal.bam 
    load PATH_TO_FILE/AS_ctrl.gatk.raw.flag.vcf
    snapshotDirectory . 
    squish 
    goto chr11:1017135 
    snapshot 1017135_Exomedata.png 
    exit  
  • Run the script with the following command (don't forget to make it executable!):

    igv.sh -b bashIGV.bat
  • Examine the resulting png 1017135_Exomedata.png

  • Change the script to zoom into other mutations of interest, don't forget to change the output name of the png file to the corresponding position.

Q. How are the mutations distributed? Are they all in exons?
How do you differentiate between heterozygous and homozygous mutations?
Some mutations are slightly faded, what does this mean?

Q. Could you filter out some of your candidates with your visual inspection? Which are you final candidate genes?

Structural variants (OPTIONAL)

Now let's analyse WGS data to see if we can identify any structural variants. When analyzing tumor samples you normally have two datasets; one normal (germline) sample and one tumor sample (somatic). To save time we will only work with two tumor samples from chromosome 17, these are breast cancer cell lines: HCC1143 and HCC1954.

  • Create a soft link to the entire folder /home/ftp_courses/NGS/Exome/SV

Before running any variant calling, we need to mark duplicates.

  • Run MarkDuplicates for each alignment file, with the following parameters:

    -I -> YOUR_SAMPLE.tumor.bam 
    -O -> YOUR_SAMPLE.tumor.mkdup.bam 
    -M -> YOUR_SAMPLE.tumor.metrics 
    --CREATE_INDEX -> true
  • Run the variant calling with delly call to find any structural variants:

    delly call \
    -g Homo_sapiens_assembly19.fasta_WITHIN_/db
    -o ALL.tumor.bcf
    list_of_alignments_separated_by_space 

The default output is in bcf format.

  • Convert the file to vcf format using bcftools:

    bcftools view ALL.tumor.bcf > ALL.tumor.vcf

Take a look at the vcf file.

Q. How many structural variant did DELLY find?

  • Filter your vcf file, keeping only PASS quality variants.

Q. How many variants are left after filtering for quality PASS?

  • Remove all variants annotates as IMPRECISE

Q. How many variant are left after filtering based on PRECISE. Which variants are unique to sample HCC1143? and for HCC1954?

From the filtered vcf file, choose two regions (one with a deletion (DEL) and one with an inversion (INV)) to visualize in IGV.

  • Load the bam and vcf files in IGV, and zoom in to the regions you decided to look at.

Q. Visually, do you agree with DELLY? Is the variant classified correct as DEL and INV? And what about the zygousity, is the variant correctly classified as homozygous or heterozygous (look at GT field in vcf)?



Developed by Marcela Dávila, 2017. Modified by Marcela Davila, Sanna Abrahamsson and Vanja Börjesson, 2021. Updated by Marcela Dávila, 2023.

⚠️ **GitHub.com Fallback** ⚠️