NGS IV: Exome - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Analysis-of-next-generation-sequencing-data (SC0024)


The purpose of this exercise for you to become familiar with the workflow used to analyze targeted sequencing data and identify disease causing mutations.



The Data

Our data involves exome sequencing data from an affected individual with Alpers syndrome (which is a homozygous progressive neurodegenerative disorder) and one control sample from a healthy individual. The data was taken from Sofou K, et al., "Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome". Mol Genet Genomic Med. 2015.

Sample Description
AS_2 Affected
AS_ctrl Unaffected

The server

  • Connect to the server using MobaXterm (PC users) or your local Terminal (MACS users):

    ssh -Y your_account@remote_server 
  • Load the following modules:

    annovar/20200608
    bcftools/1.17
    bwa-mem2/2.2.1
    delly/1.1.6
    fastqc/0.12.1
    gatk/4.4.0.0
    igv/2.16.1
    multiqc/1.14
    R/4.3.1
    samtools/1.17
    picard/3.0.0
    trimgalore/0.6.10
  • Create a directory called Exome

  • Go to this directory

  • Create a soft link of the entire /home/courses/NGS/Exome/Fastq

    NOTE: if you need to remove a soft link, use:

    unlink <link_name>

Remember that you can create bash scripts so your jobs use the resources of the cluster rather than only the login node. Although it is not compulsory for this practical, it would be a good practice. Besides if you submit a job and you get disconnected from the cluster (for whatever reason) the job will keep on running/queued!

QC, filtering and mapping

Due to limit in time, we will be working with data from chr11 (so some results may look bad).

  • Run FastQC on your samples and evaluate the quality of the data

    NOTE: Remember that we can use a loop to run a command on multiple samples. Here an example on how to index all bam files in a directory

    # running in the command line
    for i in *bam; do samtools index $i; done
    
    # running as a job
    for i in *bam; do qsub runSamtools.sh $i;done
  • Run TrimGalore! as you see fit and check that your approach improved your data quality. Hint: You can use multiqc to gather your results

Q. How does the quality look like, before and after trimming? How many reads were trimmed for both the control and the patient sample? Have a look under the FastQC html files.

  • Map the filtered reads towards chr11.fa using bwa-mem2:

    • Save the alignment as: YOUR_SAMPLE.bwa.bam
    • The reference genome (chr11.fa) is under the /home/courses/NGS/Exome/db directory

NOTE: If you run bwa-mem2 mem you will get help to know how to run the tool.

In case you need help
Usage: bwa-mem2 mem [options] idxbase in1.fq in2.fq

% bwa-mem2 mem  /home/courses/NGS/Exome/db/chr11.fa Exome/AS_ctrl.chr11.R1.fastq Exome/AS_ctrl.chr11.R2.fastq | samtools view -Sb -o AS_ctrl.bwa.bam - &
  • Run AddOrReplaceReadGroups from picard:

     % java -jar /apps/bio/software/picard/3.0.0/picard.jar AddOrReplaceReadGroups -h
  • Use the following parameters (you can more if you want!):

    -I             -> YOUR_SAMPLE.bwa.bam 
    -O             -> YOUR_SAMPLE.bwa.sort.rg.bam 
    -SM            -> AS_2 or AS_ctrl 
    -LB            -> TruSeq 
    -PL            -> Illumina 
    -PU            -> CGG 
    -SO            -> coordinate
    --CREATE_INDEX -> true
In case you need help
% java -jar /apps/bio/software/picard/3.0.0/picard.jar AddOrReplaceReadGroups -I AS_ctrl.bwa.bam -O AS_ctrl.bwa.sort.rg.bam -SM AS_ctrl -LB TruSeq -PL Illumina, -PU CGG -SO coordinate --CREATE_INDEX true
  • Generate the flagstat and idxstats for each alignment file

Q. What is the mapping rate in the sample (% of mapped reads)? Is it really sequencing data from chr11?

On-target coverage (OPTIONAL)

Since the data was from a targeted approach, let's inspect the on-target coverage plot. GATK's DepthOfCoverage can help us in this task. The tool is labeled as being in BETA version however, this tool has been part of GATK since the beginning and works just fine.

  • Run the following code for each alignment file:
    gatk DepthOfCoverage \
       -R /home/courses/NGS/Exome/db/chr11.fa \
       -I YOUR_SAMPLE.bwa.sort.rg.bam  \
       -O YOUR_SAMPLE_NAME.metrics \
       -L chr11.exon.bed \
       --output-format TABLE \
       --summary-coverage-threshold 10 \
       --summary-coverage-threshold 30 \
       --summary-coverage-threshold 50    

Several files are generated:

  • Have a look at *.sample_summary files

Q. What is the mean coverage per sample? Do you think it is enough sequencing?

  • Inspect the *.sample_interval_summary files.
    • Each row shows the coverage of each targeted region (exons) at different depths (last three columns)

Q. For each sample, which exon (if any) is covered at least 300 times in average?
What percentage of the exons are fully covered at least 50 times?

  • The *.sample_cumulative_coverage_proportions files have two rows (really long rows!). Showing the first five columns (this values are for the entire dataset, not only chr21):

            gte_0   gte_1   gte_2   gte_3
    AS_ctrl 1.00    0.96    0.94    0.93
    • Each row is the coverage information per sample (only one in our case)
    • Each column shows the aggregated proportion of the regions with >= of the corresponding coverage. In other words,:
      • gte_0 - 1.00 tells us that 100% of the regions are covered cero or more times;
      • gte_1 - 0.96 tells us that 96% of the regions are covered one or more times;
      • gte_2 - 0.94 tells us that 94% of the regions are covered two or more times;
      • gte_3 - 0.93 tells us that 93% of the regions are covered three or more times;

To better understand this data, let's make a plot with this information.

  • First we transpose our file (changing columns to rows) and then, we convert it to percentage
    Remember you can use your own code or directly use R, this is just one way of doing it:

    tail -1 SAMPLE_NAME.sample_cumulative_coverage_proportions | \
       sed 's/\t/\n/g' | \
       tail -n +2 | \
       awk 'OFS="\t" {print $1*100,num++}' > SAMPLE_NAME.proportions
    • The first lines should look like:

      100     0
      29      1
      15      2
      10      3
  • Open R in the terminal by typing R

  • Type in the following code
    Remember to change the name of the file you want to use:

    # reading the data
    data <- read.table("SAMPLE_NAME.proportions")
    
    # setting name of figure
    png("SAMPLE_NAME.coverage.png")
    
    # scatter plot
    plot (data$V2,data$V1,type="l",log="x",ylim=c(0,100),
          ylab ="% On-target",xlab="Depth", main="On-target coverage SAMPLE_NAME")
    
    # adding lines at different coveraged
    abline(v=c(10,20,30,50),col="gray",lty=2)
    
    # adding lines at different % of on-target
    abline(h=c(50,70,80),col=c(51,"gray",51),lty=2)
    
    # adding X and % within the graph
    text(c(8,17,26,43),c(100,100,100,100),c("10x","20x","30x","50x"),col="gray")
    text(c(1,1,1),c(47,67,77),c("50%","70%","80%"),col=c(51,"gray",51))
    
    # closing graphical device
    dev.off()
  • Type q() to exit the R session

Q. Looking at the plots, do we need to be worried about under-sequencing?

PCR duplicates

When dealing with variant calling, an important step is to identify possible PCR duplicates, this to remove any bias when we are doing the actual calling.

  • Run MarkDuplicates from picard using (at least) these arguments:
    -I             -> YOUR_SAMPLE.bwa.sort.rg.bam 
    -O             -> YOUR_SAMPLE.bwa.sort.rg.mkdup.bam
    -M             -> YOUR_SAMPLE.metrics 
    -ASO           -> coordinate
    --CREATE_INDEX -> TRUE
In case you need help
% java -jar /apps/bio/software/picard/3.0.0/picard.jar MarkDuplicates -M AS_ctrl.metrics -I AS_ctrl.bwa.sort.rg.bam -O AS_ctrl.bwa.sort.rg.mkdup.bam  -ASO coordinate --CREATE_INDEX true

  • Generate flagstat and idxstats metrics for this new alignment file
  • Run multiqc to include these statistics

Q. How many duplicates do the samples have? Which sample has more sequencing reads?

Base Quality Score Recalibration (BQSR)

This pre-processing step detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call. Most of the short variant calling tools rely on the base quality score, so it is important to correct these systematic technical errors.

First, we will produce a recalibration file from our input data using the BaseRecalibrator tool. To speed up the processing time, let's generate a file containing only SNPs from chr11. (You can use your own code if you prefer):

# keeping the header of the VCF file
grep ^#  /home/courses/NGS/Exome/db/dbsnp_138.b37.vcf | \
sed 's/ID=11/ID=chr11/' | \
grep -P  "contig=<ID=[1-9,A-Z]" -v | \
grep -P  "<ID=GL" -v > title  

# extracting SNPs in chr11, formatting to add "chr", removing unwanted rows
grep ^11 -w  /home/courses/NGS/Exome/db/dbsnp_138.b37.vcf | \
#awk '$2 < 57311868 {print $0}' | \
sed 's/^11/chr11/'  > 11

# merging title and data 
cat title 11 > dbsnp_138.b37.chr11.vcf
  • Index the file using gatk IndexFeatureFile and:
    -I  -> dbsnp_138.b37.chr11.vcf 
In case you need help
% gatk IndexFeatureFile -I dbsnp_138.b37.chr11.vcf 
  • Run the gatk BaseRecalibrator for both samples, with :

    -R            -> chr11_reference_file 
    -I            -> YOUR_SAMPLE.bwa.sort.rg.mkdup.bam 
    -O            -> YOUR_SAMPLE.table 
    --known-sites -> THE_VCF_FILE_YOU_JUST_CREATED
In case you need help
% gatk BaseRecalibrator -R /home/courses/NGS/Exome/db/chr11.fa -I AS_ctrl.bwa.sort.rg.mkdup.bam -O AS_ctrl.table --known-sites dbsnp_138.b37.chr11.vcf 

Q. What are the --known_sites?

Now let's correct the scores

  • Run gatk ApplyBQSR for both alignments, with:

    -R                -> chr11_reference_file 
    -I                -> YOUR_SAMPLE.bwa.sort.rg.mkdup.bam 
    -O                -> YOUR_SAMPLE.bwa.sort.rg.mkdup.recal.bam  
    --bqsr-recal-file -> THE_TABLE_FILE_YOU_JUST_CREATED 
In case you need help
% gatk ApplyBQSR -R /home/courses/NGS/Exome/db/chr11.fa -I AS_ctrl.bwa.sort.rg.mkdup.bam -O AS_ctrl.bwa.sort.rg.mkdup.recal.bam --bqsr-recal-file AS_ctrl.table

Variant discovery (GATK)

The Genome Analysis Toolkit or GATK is a software package developed to analyze next-generation resequencing data, focusing on variant discovery and genotyping. For the variant calling we will use the HaplotypeCaller, which is an SNP/indel caller that uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples.

  • Run gatk HaplotypeCaller to do the variant calling, with these parameters:

    -R -> chr11_reference_file 
    -I -> YOUR_SAMPLE.bwa.sort.rg.mkdup.recal.bam 
    -O -> YOUR_SAMPLE.gatk.raw.vcf  
In case you need help
% gatk HaplotypeCaller-R /home/courses/NGS/Exome/db/chr11.fa -I  AS_ctrl.bwa.sort.rg.mkdup.recal.bam -O AS_ctrl.gatk.raw.vcf
  • Inspect the VCF files you just created and get familiar with their information.

Q. How do you differentiate between a homozygous mutations from an heterozygous?
How many SNPs do you have in each sample?
Are there any insertions/deletions?

Let's convert the VCF files into a tab delimited files.

  • Use gatk VariantsToTable with:

    -O  -> YOUR_SAMPLE.gatk.raw.table 
    -V  -> YOUR_SAMPLE.gatk.raw.vcf 
    -F  -> CHROM  
    -F  -> POS 
    -F  -> ID 
    -F  -> REF 
    -F  -> ALT 
    -F  -> QUAL 
    -F  -> FILTER 
    -GF -> AC 
    -GF -> GT 
    -GF -> AD 
    -GF -> DP 
In case you need help
% gatk VariantsToTable -O AS_ctrl.gatk.raw.table -V AS_ctrl.gatk.raw.vcf -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F FILTER -GF AC -GF GT -GF  AD -GF DP 

Have a look and understand the output.

SNP-callers emit a number of false-positive mutations, as any prediction program. Some of these false-positives can be detected and rejected by various statistical tests. Filtering or flagging the variant calls in the VCF file is a useful step to detect such false-positives.

  • Run gatk VariantFiltration with:

    -O                  -> YOUR_SAMPLE.gatk.raw.flag.vcf 
    -V                  -> YOUR_SAMPLE.gatk.raw.vcf 
    --filter-name       -> "LowDP" 
    --filter-expression -> "DP < 10.0"
In case you need help
% gatk VariantFiltration -O AS_ctrl.gatk.raw.flag.vcf -V AS_ctrl.gatk.raw.vcf --filter-name "LowDP" --filter-expression "DP < 10.0"

These are hard filters that are applied when a variant callset is too small or when training sets are not available. Try to understand the filtering done above. You may find some info at the beginning of the VCF file.

Q. How many variants have passed our filtering criteria for both files?

Variant annotation (Annovar)

Now we need to annotate our variants in terms of genomic context, aminoacid change and effect, previous mutation knowledge, etc. so we can evaluate and filter mutations that are irrelevant.

ANNOVAR is a tool that annotates genetic variants. The overall procedure is to create an input file from the VCF file and then annotate the variant positions with information from several databases.

  • Reformat both alignment files with convert2annovar.pl using:

    -format  -> vcf4 
    -outfile -> YOUR_FILE.gatk.raw.flag.annovar
    YOUR_FILE.gatk.raw.flag.vcf 
In case you need help
% convert2annovar.pl -format vcf4 -outfile AS_ctrl.gatk.raw.flag.annovar AS_ctrl.gatk.raw.flag.vcf 
  • Add the annotation to the alignments using table_annovar.pl:

      -buildver  -> hg19 
      -protocol  -> dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316 
      -operation -> f,g,f,f,f 
      -outfile   -> YOUR_FILE.gatk.raw.flag.annovar.table 
      --nastring -> - 
      --remove 
      YOUR_FILE.gatk.raw.flag.annovar 
      /home/courses/NGS/Exome/db/humandb
In case you need help
% table_annovar.pl -buildver hg19 -protocol  dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316 -operation f,g,f,f,f -outfile AS_ctrl.gatk.raw.flag.annovar.table --nastring - --remove AS_ctrl.gatk.raw.flag.annovar /home/courses/NGS/Exome/db/humandb

Q. What kind of information are we adding to our mutations? Have a look at the file with .hg19_multianno.txt extension. Here you can find some information.

  • Merge the *hg19_multianno.txt file with the *gatk.raw.flag.annovar, and save it in a file with extension .tsv. Note that the *gatk.raw.flag.annovar does not have column names, add them before you merge them. Hint: you can use paste
In case you need help
% # creating a header after inspecting AS_ctrl.gatk.raw.flag.annovar
%  echo -e "CHR\tSTART\tEND\tREF\tALT\tGT\tQUAL\tDP" > header
% # adding the header to the file
% cat header AS_ctrl.gatk.raw.flag.annovar > AS_ctrl.gatk.raw.flag.annovar.fix
% # merging both files
% paste AS_ctrl.gatk.raw.flag.annovar.fix AS_ctrl.gatk.raw.flag.annovar.table.hg19_multianno.txt > AS_ctrl

This file now has statistics and annotations for the mutations. It is a lot of data, however we are interested in:

  • Func.refGene: where the variant is located (intron, exon, etc)
  • ExonFunc.refGene: what type of variant is it (missense, nonsense, etc)
  • AAChange.refGene: nucleotide and aminoacid change
  • LJB2_PolyPhen2_HVAR: how damaging is the change the variant produces

Other columns may be self explanatory

Reading the AAChange.refGene column
PGM1:NM_001172818:exon2:c.A316G:p.I106V,PGM1:NM_002633:exon2:c.A262G:p.I88V

PGM1 -> Gene name
	NM_001172818 => Transcript ID_1
		exon2 => exon where the mutation is found
			c.A316G => in the DNA, at position 316 the reference base A is changed to an G
			p.I106V => in the protein, at positon 106 the reference aminoacid I is changed to a V
PGM1 => Gene name
	NM_002633 => Transcript ID_2
		exon2 => exon where the mutation is found	
			A262G => in the DNA, at position 262 the reference base A is changed to a G
			p.I88V => in the protein, at position 88 the reference aminoacid I is changed to a V

Variant filtering

With all this information, it is possible to narrow down the list of mutations to a (hopefully) handful of SNPs. For this let us use R.

In most of the cases, you should copy/paste the code. There will be some instances where you will be asked to decide on a value/threshold/color and this will be highlighted in CAPS within the code.

You are not expected to understand all the code, however you are more than welcome to ask!. Learning R takes a lot of practice!

Data upload to R

Start R in your local computer.

Select your working directory (where you saved the annotated files): Session -> Set Working Directory -> Choose directory

Let's load the packages we will be using:

# Install them first if you don't have them already
install.packages("tidyverse") # this is done only once
install.packages("ggvenn")  # this is done only once


# Load required packages
library(tidyverse) # data manipulation
library(ggvenn) # Venn diagram

Now we need to read the data in R. In the code below, replace the text with your TSV file names. Make sure the name is within double quotes:

# Read in data
A <- read_table("WRITE_THE_NAME_OF_YOUR_AS_2_FILE")  # affected
U <- read_table("WRITE_THE_NAME_OF_YOUR_AS_ctrl_FILE")  #unaffected or control

It is important to check at every step if the data we have make sense, so let's display the data:

# display variable
A
Click here for output
# A tibble: 9,900 × 40
   CHR    START    END REF   ALT   GT     QUAL    DP Chr    Start    End Ref   Alt   dbsnp_137_chr11_gatk Func.refGene Gene.refGene      GeneDetail.refGene   
   <chr>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>  <dbl>  <dbl> <chr> <chr> <chr>                <chr>        <chr>             <chr>                
 1 chr11 123778 123778 T     C     het    86.6    14 chr11 123778 123778 T     C     -                    intergenic   NONE;LINC01001    dist=NONE;dist=3356  
 2 chr11 124986 124986 G     A     het    59.6     4 chr11 124986 124986 G     A     -                    intergenic   NONE;LINC01001    dist=NONE;dist=2148  
 3 chr11 125918 125918 G     A     het    43.6    10 chr11 125918 125918 G     A     -                    intergenic   NONE;LINC01001    dist=NONE;dist=1216  
 4 chr11 176397 176397 A     G     hom    70.3     2 chr11 176397 176397 A     G     rs61871102-129       intergenic   LINC01001;SCGB1C2 dist=37280;dist=16683
 5 chr11 187658 187658 A     G     hom    37.3     2 chr11 187658 187658 A     G     rs2686910-100        intergenic   LINC01001;SCGB1C2 dist=48541;dist=5422 
 6 chr11 189683 189683 A     G     hom    70.3     2 chr11 189683 189683 A     G     rs78392844-131       intergenic   LINC01001;SCGB1C2 dist=50566;dist=3397 
 7 chr11 190650 190650 A     G     hom    35.5     1 chr11 190650 190650 A     G     rs75887666-131       intergenic   LINC01001;SCGB1C2 dist=51533;dist=2430 
 8 chr11 190688 190688 T     G     hom    52.3     2 chr11 190688 190688 T     G     rs76699989-131       intergenic   LINC01001;SCGB1C2 dist=51571;dist=2392 
 9 chr11 190761 190761 A     G     hom    34.5     1 chr11 190761 190761 A     G     rs2739234-100        intergenic   LINC01001;SCGB1C2 dist=51644;dist=2319 
10 chr11 192997 192997 G     A     het    38.6    12 chr11 192997 192997 G     A     rs2280540-100        upstream     SCGB1C1;SCGB1C2   dist=83              
# ℹ 9,890 more rows
# ℹ 23 more variables: ExonicFunc.refGene <chr>, AAChange.refGene <chr>, LJB2_SIFT <chr>, LJB2_PolyPhen2_HDIV <chr>, LJB2_PP2_HDIV_Pred <chr>,
#   LJB2_PolyPhen2_HVAR <chr>, LJB2_PolyPhen2_HVAR_Pred <chr>, LJB2_LRT <chr>, LJB2_LRT_Pred <chr>, LJB2_MutationTaster <chr>, LJB2_MutationTaster_Pred <chr>,
#   LJB_MutationAssessor <chr>, LJB_MutationAssessor_Pred <chr>, LJB2_FATHMM <chr>, `LJB2_GERP++` <chr>, LJB2_PhyloP <chr>, LJB2_SiPhy <chr>,
#   `1000g2012apr_all` <chr>, CLNALLELEID <chr>, CLNDN <chr>, CLNDISDB <chr>, CLNREVSTAT <chr>, CLNSIG <chr>
# ℹ Use `print(n = ...)` to see more rows
# display variable
U
Click here for output
# A tibble: 13,461 × 40
   CHR    START    END REF   ALT   GT     QUAL    DP Chr    Start    End Ref   Alt   dbsnp_137_chr11_gatk Func.refGene   Gene.refGene      GeneDetail.refGene  
   <chr>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>  <dbl>  <dbl> <chr> <chr> <chr>                <chr>          <chr>             <chr>               
 1 chr11 123756 123756 G     T     het    40.6    14 chr11 123756 123756 G     T     -                    intergenic     NONE;LINC01001    dist=NONE;dist=3378 
 2 chr11 124288 124288 A     T     het   106.     11 chr11 124288 124288 A     T     -                    intergenic     NONE;LINC01001    dist=NONE;dist=2846 
 3 chr11 124342 124342 C     T     het    46.6     9 chr11 124342 124342 C     T     -                    intergenic     NONE;LINC01001    dist=NONE;dist=2792 
 4 chr11 135531 135531 A     T     hom    37.3     2 chr11 135531 135531 A     T     -                    ncRNA_intronic LINC01001         -                   
 5 chr11 176397 176397 A     G     hom   112.      4 chr11 176397 176397 A     G     rs61871102-129       intergenic     LINC01001;SCGB1C2 dist=37280;dist=166…
 6 chr11 190488 190488 T     C     hom    79.8     3 chr11 190488 190488 T     C     rs2354751-100        intergenic     LINC01001;SCGB1C2 dist=51371;dist=2592
 7 chr11 193096 193096 T     C     hom   955.     28 chr11 193096 193096 T     C     rs2280539-100        UTR5           SCGB1C1;SCGB1C2   NM_145651:c.-4T>C;N…
 8 chr11 193112 193112 C     T     het   526.     35 chr11 193112 193112 C     T     rs7951297-116        exonic         SCGB1C1;SCGB1C2   -                   
 9 chr11 193194 193194 T     C     het   268.     23 chr11 193194 193194 T     C     rs2686896-100        intronic       SCGB1C1;SCGB1C2   -                   
10 chr11 193863 193863 T     C     het   341.     19 chr11 193863 193863 T     C     rs2294081-100        exonic         SCGB1C1;SCGB1C2   -                   
# ℹ 13,451 more rows
# ℹ 23 more variables: ExonicFunc.refGene <chr>, AAChange.refGene <chr>, LJB2_SIFT <chr>, LJB2_PolyPhen2_HDIV <chr>, LJB2_PP2_HDIV_Pred <chr>,
#   LJB2_PolyPhen2_HVAR <chr>, LJB2_PolyPhen2_HVAR_Pred <chr>, LJB2_LRT <chr>, LJB2_LRT_Pred <chr>, LJB2_MutationTaster <chr>, LJB2_MutationTaster_Pred <chr>,
#   LJB_MutationAssessor <chr>, LJB_MutationAssessor_Pred <chr>, LJB2_FATHMM <chr>, `LJB2_GERP++` <chr>, LJB2_PhyloP <chr>, LJB2_SiPhy <chr>,
#   `1000g2012apr_all` <chr>, CLNALLELEID <chr>, CLNDN <chr>, CLNDISDB <chr>, CLNREVSTAT <chr>, CLNSIG <chr>
# ℹ Use `print(n = ...)` to see more rows

Summarizing data

It is important to have an overview of the entire dataset, so summary statistics and graphs are quite helpful to scan our data. We will look at:

  1. Number of mutations per sample
  2. Distribution of mutations based on location (barplot)
  3. Distribution of mutations based on type of variants (pie chart)

Let's see how many mutations we have:

# Number of mutations based on location
A %>% 
  nrow
Click here for output
[1] 9900
  • Repeat for sample U

Now let's visualize where our variants:

# Summarizing the variant positions
A %>% 
     select(Func.refGene) %>% 
     mutate(Sample = "A") %>%  
     group_by(Func.refGene) %>% 
     summarise(N_variants=n())
Click here for output
# A tibble: 13 × 2
   Func.refGene        N_variants
   <chr>                    <int>
 1 UTR3                       333
 2 UTR5                       166
 3 downstream                 170
 4 exonic                    1987
 5 exonic;splicing              9
 6 intergenic                1857
 7 intronic                  4772
 8 ncRNA_exonic               121
 9 ncRNA_intronic             256
10 ncRNA_splicing               1
11 splicing                    15
12 upstream                   204
13 upstream;downstream          9

Repeat for sample U.

Now let's gather the data and create a barplot:

# selecting the data
A_func <- A %>% 
  select(Func.refGene) %>% 
  mutate(Sample = "A")

U_func <- U %>% 
  select(Func.refGene) %>% 
  mutate(Sample = "U")

# creating the barplot
bind_rows(A_func, U_func) %>% 
  ggplot(aes(x=Func.refGene, fill=Sample)) + 
  geom_bar(stat="count", position="dodge") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust=1))

Q. Where do we have more variants?

We can also summarize the type of variants that we have:

A_exonicFunc <- as.data.frame(table(A %>% 
                               select(ExonicFunc.refGene) %>% 
                               mutate(Sample = "A")))


ggplot(A_exonicFunc, aes(x="", y=Freq, fill=ExonicFunc.refGene)) + 
  geom_bar(stat="identity") + 
  coord_polar("y",start=0) + 
  theme_void() +
  labs(title="AS_2")

Repeat for sample U.

Filtering variants

Now that we have an overall look of our data, it is time to filter the mutations. A common strategy is to focus on mutations that:

  • are within exons (or protein_coding)
  • are annotated as a missense_variant
  • are classified other than tolerated/benign
# Counting our filtered variants
 A %>% 
   filter(Func.refGene == "exonic" & 
          ExonicFunc.refGene == "nonsynonymous" &
          str_detect(LJB2_PolyPhen2_HVAR, "D") &  # D = damaging
          str_detect(LJB2_LRT, "D")) %>%             # D = damaging
   nrow              

# Variable containing relevant information
A_mutation <- A %>% 
              filter(Func.refGene == "exonic" & 
                     ExonicFunc.refGene == "nonsynonymous" &           
                     str_detect(LJB2_PolyPhen2_HVAR, "D") &  # D = damaging
                     str_detect(LJB2_LRT, "D") & 
                     str_detect(LJB2_MutationTaster, "D")) %>%          
              mutate(Mutation=paste(CHR, START, END, REF, ALT, LJB2_SIFT, sep="|")) %>%
              pull(Mutation)

Repeat for Sample U.

Now we can compare our samples and create a VennDiagram:

ggvenn(list(AS_2=A_mutation, AS_ctrl=U_mutation), fill_color = c("ADD_A_COLOR", "ADD_ANOTHER_COLOR"))

Depending on your strategy you can use any of the following expressions to identify probable disease causing mutations:

  • intersect(x, y) finds all rows in both x and y.
  • union(x, y) finds all rows in either x or y, excluding duplicates.
  • union_all(x, y) finds all rows in either x or y, including duplicates.
  • setdiff(x, y) finds all rows in x that aren't in y.

Q. Which are your candidate genes? What was your filtering strategy? Which gene(s) is likely to be the disease causing mutation?

Once you have your list, visual inspection of your candidates helps in assessing them further. Have a look at your candidates in IGV, you may want to load both of your VCF files as well as the BAM files.

Automating visualization (OPTIONAL)

When you have multiple files and positions to investigate in IGV you can save time by creating a bat script. The following script take snapshots of the coordinate chr11:1017135 across AS_2 and Control using both their bams and mutation files.

  • Copy this script to a file called bashIGV.bat, make sure that the paths to the bam and vcf files are correct.

    new 
    genome hg19 
    load PATH_TO_FILE/AS_2.bwa.sort.fix.mkdup.recal.bam 
    load PATH_TO_FILE/AS_2.gatk.raw.flag.vcf
    load PATH_TO_FILE/AS_ctrl.bwa.sort.fix.mkdup.recal.bam 
    load PATH_TO_FILE/AS_ctrl.gatk.raw.flag.vcf
    snapshotDirectory . 
    squish 
    goto chr11:1017135 
    snapshot 1017135_Exomedata.png 
    exit  
  • Run the script with the following command (don't forget to make it executable!):

    igv.sh -b bashIGV.bat
  • Examine the resulting png 1017135_Exomedata.png

  • Change the script to zoom into other mutations of interest, don't forget to change the output name of the png file to the corresponding position.

Q. How are the mutations distributed? Are they all in exons?
How do you differentiate between heterozygous and homozygous mutations?
Some mutations are slightly faded, what does this mean?

Q. Could you filter out some of your candidates with your visual inspection? Which are you final candidate genes?

Structural variants (OPTIONAL)

Now let's analyse WGS data to see if we can identify any structural variants. When analyzing tumor samples you normally have two datasets; one normal (germline) sample and one tumor sample (somatic). To save time we will only work with two tumor samples from chromosome 17, these are breast cancer cell lines: HCC1143 and HCC1954.

  • Create a soft link to the entire folder /home/courses/NGS/Exome/SV

Before running any variant calling, we need to mark duplicates.

  • Run MarkDuplicates for each alignment file, with the following parameters:

    -I -> YOUR_SAMPLE.tumor.bam 
    -O -> YOUR_SAMPLE.tumor.mkdup.bam 
    -M -> YOUR_SAMPLE.tumor.metrics 
    --CREATE_INDEX -> true
  • Run the variant calling with delly call to find any structural variants:

    delly call \
    -g Homo_sapiens_assembly19.fasta_WITHIN_/db
    -o ALL.tumor.bcf
    list_of_alignments_separated_by_space 

The default output is in bcf format.

  • Convert the file to vcf format using bcftools:

    bcftools view ALL.tumor.bcf > ALL.tumor.vcf

Take a look at the vcf file.

Q. How many structural variant did DELLY find?

  • Filter your vcf file, keeping only PASS quality variants.

Q. How many variants are left after filtering for quality PASS?

  • Remove all variants annotates as IMPRECISE

Q. How many variant are left after filtering based on PRECISE. Which variants are unique to sample HCC1143? and for HCC1954?

From the filtered vcf file, choose two regions (one with a deletion (DEL) and one with an inversion (INV)) to visualize in IGV.

  • Load the bam and vcf files in IGV, and zoom in to the regions you decided to look at.

Q. Visually, do you agree with DELLY? Is the variant classified correct as DEL and INV? And what about the zygousity, is the variant correctly classified as homozygous or heterozygous (look at GT field in vcf)?



Developed by Marcela Dávila, 2017. Modified by Marcela Davila, Sanna Abrahamsson and Vanja Börjesson, 2021. Updated by Marcela Dávila, 2023. Updated by Marcela Dávila, 2024.

⚠️ **GitHub.com Fallback** ⚠️