BMA231 IV: Variant analysis - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)


The aim of this practical is to introduce you to a general workflow to identify disease causing mutations from DNA sequencing data. To be able to work within the allocated time, you will be processing just a portion of the real dataset to illustrate the overall workflow.



The Data

You will be working with data from two siblings with Papillion-Lefèvre Syndrome (PLS). PLS is a very rare disorder of autosomal recessive inheritance distinguished by palmar plantar hyperkeratosis and early onset of periodontitis affecting the dentition.

The data has been kindly provided by Johan Bylund, Professor at the Institute of Odontology at University of Gothenburg.

Data Sibling A Sibling B Both
Raw data A_R1_fastq_gz_fastqc.html A_R2_fastq_gz_fastqc.html B_R1_fastq_gz_fastqc.html B_R2_fastq_gz_fastqc.html MultiQC_raw.html
Filtered data A_R1_Trim_Galore_fastqc.html A_R2_Trim_Galore_fastqc.html B_R1_Trim_Galore_fastqc.html B_R2_Trim_Galore_fastqc.html MultiQC_filtered.html
Aligned data A_samtools_flagstat.txt A_samtools_idxstats.tsv B_samtools_flagstat.txt B_samtools_idxstats.tsv MultiQC_aligned.html
Alignment A_filtered_bwa-mem.bam B_filtered_bwa-mem.bam

You find this files in CANVAS under the SNP analysis module:

  • Sample_A.zip
  • Sample_B.zip
  • MultiQC_snp.zip
  • GRCh38_90.bed

Sequencing data pre-processing

Quality check with FastQC and MultiQC

The first thing to do when you receive data is to check its quality. The FastQC reports were already generated using FastQC in Galaxy. They have also been merged with MultiQC. Inspect either the FastQC html reports (*gz_fastqc.html) or the MulqiQC html report (MultiQC_raw.html) and answer the following:

Q1. What can you conclude from these graphs?
Focus on Basic Statistics, Per base sequence quality, Per sequence quality scores, Per base sequence content, Sequence Length Distribution, Sequende Duplication Levels and Adapter Content

Click here for output
Almost 1M per sample, too little, but this is just a fragment of the data
Low duplication, which is typical for DNA sequencing, when targeting large regions
Quality looks quite nice, probably been already trimmed
Most of the quality scores are above 30Q
It fails the per base sequence content because of the last position, high T content, not a problem
No ambiguos bases
There are different lengths of the reads, which tells us that the data has been already trimmed
No adapters found
Good dataset

Quality trimming with TrimGalore!

Although our data is good, and apparently it has been already trimmed, TrimGalore! was used in Galaxy with default parameters. You may see no difference at all, but remember that it is a good practice. FastQC and MultiQC were also run to obtain summary reports using the filtered files. Inspect either the FastQC html reports (*Trim_Galore_fastqc.html) or the MulqiQC html report (MultiQC_filtered.html) and answer the following:

Q2. What differences do you see between the raw files and the filtered ones?

Click here for output
830453 vs 827994 - A sample
722724 vs 720957 - B sample

Just less data, not a real change in quality 
since we already had a good dataset

Mapping with BWA

As we are dealing with DNA, we will use BWA, an aligner based on the Burrows-Wheeler transform that has been optimized for aligning short reads. The mapping was already run in Galaxy using BWA mem where the output files are in BAM format.

To evaluate if the mapping is acceptable, some statistics were generated with samtools and gathered with MultiQC, also in Galaxy. Inspect either the samtools reports (*samtools_flagstat.txt and *_samtools_idxstats.tsv) or the MultiQC html report (MultiQC_aligned.html) and answer the following:

Q3. What is the percentage of mapped reads for each sample? Is this acceptable? The data we are analyzing covers one chromosome, which one?

Click here for output
99.84%  sampleA
99.85%  sampleB

chr11 with 1 624 476 SampleA
chr11 with 1 414 610 SampleB

Variant calling

For this part, you will be using Galaxy.

  • Log in into your account
  • Create a history called SNP practical.
    • You do this in the right History panel.

Upload your data

  • Click on Upload on the left menu

  • The Upload from Disk or Web to .. window will open

  • Click on Choose local files

  • Select both of your filtered bam files (A_filtered_bwa-mem.bam and B_filtered_bwa-mem.bam)

    • Select bam under the Autodetect field
    • Select Human Dec. 2013 (GRCh38/hg38) (hg38) under the unspecified (?) field
  • Select also the GRCh38_90.bed file

    • Select bed under the Auto-detect field
    • Select Human Dec. 2013 (GRCh38/hg38) (hg38) under the unspecified (?) field
  • Click on Start

    Click here for output
  • Wait until the data has been uploaded

    • You will see that the three files were added on the History panel
      Click here for output
  • Click on the eye icon of one of the bam files You will see that there are more chromosomes than the canonical ones (chr1..chrX). To make the calculations easier let's remove them.
    Click here for output
  • Search for Samtools view in the Tools panel
  • Click on the tool
  • Set the following parameters as indicated:
    • SAM/BAM/CRAM data set

      • Select the icon for Multiple datasets
      • Select both alignments --> *_filtered_bwa-mem.bam
    • What would you like to look at --> A filtered/subsampled selection of reads

    • Configure filters

      • Filter by regions --> Regions from BED file
      • Filter by intervals in a bed file --> GRCh80_90.bed
    • Click Execute

      Click here for output

Remove duplicates with RmDup

When dealing with variant calling, an important step is to identify possible PCR duplicates, this to remove any bias when we are doing the actual calling.

  • Search for RmDup remove PCR duplicates in the Tools panel
  • Click on the tool
  • Set the following parameters as indicated:
    • BAM File

      • Select the icon for Multiple datasets
      • Select both alignments --> *Samtools view on data X and data X_: filtered alignments
    • Is this paired-end or single end data --> BAM is paired-end

    • Click Execute

      Click here for output

You will get 1 output file per alignment: RmDup on data X

  • Click on the eye icon
    You will see the alignment data, or at least part of it.

If you compare the alignments, you will see that the only difference is the amount of data. There are less reads in the alignment where we removed the duplicates as we are excluding reads that map to the same coordinates.

Click here for output

Realignment with BamLeftAlign

When calling indels, it is important to homogenize the positional distribution of insertions and deletions in the input by using left realignment. Left realignment will place all indels in homopolymer and microsatellite repeats at the same position, provided that doing so does not introduce mismatches between the read and reference other than the indel.

  • Search for BamLeftAlign in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Choose the source for the reference genome --> Locally cached

    • Select alignment file in BAM format

      • Select the icon for Multiple datasets
      • Select both alignments --> RmDup on data X
    • Using reference genome --> Human Dec. 2013 (GRCh38/hg38) (hg38)

    • Click Execute

      Click here for output

You will get 1 output file per alignment: BamLeftAlign on data X

  • Click on the eye icon
    You will see the alignment data, or at least part of it.

Variant calling with FreeBayes

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

  • Search for FreeBayes in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Choose the source for the reference genome --> Locally cached
    • Run in batch mode? --> Run individually
    • BAM dataset
      • Select the icon for Multiple datasets
      • Select both alignments --> BamLefAlign on data X (alignments)
    • Using reference genome --> Human Dec. 2013 (GRCh38/hg38) (hg38)
    • Limit variant calling to a set of regions --> Do not limit
    • Read coverage --> Specify coverage options
      • Require at least this coverage to process a site --> 10
    • Choose parameter selection level --> Simple diploid calling with filtering and coverage
    • Click Execute
      Click here for output

You will get 1 output file per alignment: FreeBayes on data x (variants)

  • Click on the eye icon
    You will see the variant data in VCF format, or at least part of it.

    • The first part starts with an # and it is part of the header
    • The second part is the detailed information of each one of the variants found
      Click here for output

Q4. How many variants did you call for each sample? Hint: Click on the process name

Click here for output
11 802 for sample A
12 031 for sample B 

Variants pre-filtering

When dealing with real data, the size of the resulting files may be a problem as some webtools have a limit on the size of the data they can analyze. This is not an issue when we are using command line tools. Thus to avoid any technical issues we will pre-filter our variants.

Q5. How many variants did you end up with? Hint: Click on the process name

Click here for output
11,965 for sample A
12,101 for sample B 

We can set up thresholds on different statistics like sequencing depth, allele balance, type of variant or genotype, just to mention a few. Here we will only exemplify how to filter based on sequencing depth using VCFfilter:

  • Search for VCFfilter in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • VCF dataset to filter

      • Select the icon for Multiple datasets
      • Select both VCF files --> FreeBayes on data X (variants)
    • more filters

      • Select the filter type --> Info filter (-f)
      • Specify filtering value --> DP > 19
    • Click Execute

      Click here for output

Q6. How many variants did you end up with? Hint: Click on the process name

Click here for output
5057 for sample A
7269 for sample B 

Now let's download both VCF files for further processing:

  • Display the information of the VCFfilter: on data X
  • Click on the Download icon
  • Save the VCF file in your computer
    • The name of the file will look like: GalaxyXX-[VCFfilter_on_data_X].vcf
    • Rename it to SampleA.vcf or SampleB.vcf depending on the file you are downloading.

Variant analysis

Variant annotation with VEP

Now we need to annotate our variants in terms of genomic context, aminoacid change and effect, previous mutation knowledge, etc. so we can evaluate and filter mutations that are irrelevant. For this we will use VEP website, an online tool that determines the effect of variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions, among other types of annotation.

  • Go to the VEP website

  • Click on Launch Ve!P

  • Set the following parameters as indicated:

    • Species --> GRCh38.p14

    • Name for this job (optional) --> SampleA

    • Input data --> Upload file --> SampleA.vcf

    • Transcript database to use: --> RefSeq transcripts

    • Additional configurations

      • Identifiers --> select ONLY Gene symbol
      • Variants and frequency data --> Find co-located known variants --> No
      • Filtering options --> Restrict results --> Show one selected consequence per variant
    • Click Run

      Click here for output

Repeat the same procedure with SampleB.

Once the annotations are done, click on View results

Click here for output

Q7. How many variants were processed for each sample? Where are most of the variants located within the genome for each sample? Which type of coding consequences are more common in each sample?

Click here for output
5057 sampleA
7269 sampleB

Most of them are within introns

These are the most common consequences:
synonymous => nt change without aa change
missense => nt change with aa change (the important ones)


You could do some filtering within in VEP, however, it will be quicker if we download the files and work with them locally.

  • Under the Download box
    • Click on TXT
    • Save the files with the corresponding sample name (keep the txt extension)

Data upload to R

Now let's summarize, compare and visualize our data. There are a lot of tools that can help with this, some are open source and some are licensed. Here we will be using R.

In most of the cases, you should copy/paste the code. There will be some instances where you will be asked to decide on a value/threshold/color and this will be highlighted in bold within the instructions and in CAPS within the code.

You are not expected to understand all the code, however you are more than welcome to ask!. Learning R takes a lot of practice!

Start R in your local computer.

Select your working directory (where you saved the annotated files): Session -> Set Working Directory -> Choose directory

Let's load the packages we will be using. Remember that packages bundle together code, data, documentation, and tests, and is easy to share with others. They are an excellent way to reuse code, specially because the chances that someone has already solved a problem that you’re working on are big and you don't need to start from scratch:

# Install them first if you don't have them already
#
# install.packages("tidyverse")
# install.packages("ggvenn") 


# Load required packages

library(tidyverse) # data manipulation
library(ggvenn) # Venn diagram

Now we need to read the data in R. In the code below, replace the text with your annotated file names. Make sure the name is within double quotes:

# Read in data
A <- read_table("WRITE_THE_NAME_OF_YOUR_SampleA_FILE")
B <- read_table("WRITE_THE_NAME_OF_YOUR_SampleB_FILE")

It is important to check at every step if the data we have make sense, so let's display the data:

# display variable
A
Click here for output
# A tibble: 5,057 x 41
   `#Uploaded_variati~` Location Allele Consequence IMPACT SYMBOL Gene  Feature_type Feature BIOTYPE EXON  INTRON HGVSc HGVSp cDNA_position
   <chr>                <chr>    <chr>  <chr>       <chr>  <chr>  <chr> <chr>        <chr>   <chr>   <chr> <chr>  <chr> <chr> <chr>        
 1 .                    11:8029~ T      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 2 .                    11:8029~ T      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 3 .                    11:8029~ A      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 4 .                    11:8029~ C      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 5 .                    11:8029~ T      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 6 .                    11:8029~ G      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 7 .                    11:8029~ G      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 8 .                    11:8029~ AAAAA~ regulatory~ MODIF~ -      -     RegulatoryF~ ENSR00~ enhanc~ -     -      -     -     -            
 9 .                    11:8029~ G      TF_binding~ MODIF~ -      -     MotifFeature ENSM00~ -       -     -      -     -     -            
10 .                    11:8029~ C      regulatory~ MODIF~ -      -     RegulatoryF~ ENSR00~ enhanc~ -     -      -     -     -            
# ... with 5,047 more rows, and 26 more variables: CDS_position <chr>, Protein_position <chr>, Amino_acids <chr>, Codons <chr>,
#   Existing_variation <chr>, DISTANCE <chr>, STRAND <chr>, FLAGS <chr>, SYMBOL_SOURCE <chr>, HGNC_ID <chr>, MANE_SELECT <chr>,
#   MANE_PLUS_CLINICAL <chr>, TSL <chr>, APPRIS <chr>, REFSEQ_MATCH <chr>, REFSEQ_OFFSET <chr>, GIVEN_REF <chr>, USED_REF <chr>,
#   BAM_EDIT <chr>, SIFT <chr>, PolyPhen <chr>, MOTIF_NAME <chr>, MOTIF_POS <chr>, HIGH_INF_POS <chr>, MOTIF_SCORE_CHANGE <chr>,
#   TRANSCRIPTION_FACTORS <chr>
# display variable
B
Click here for output
# A tibble: 7,269 x 41
   `#Uploaded_variati~` Location Allele Consequence IMPACT SYMBOL Gene  Feature_type Feature BIOTYPE EXON  INTRON HGVSc HGVSp cDNA_position
   <chr>                <chr>    <chr>  <chr>       <chr>  <chr>  <chr> <chr>        <chr>   <chr>   <chr> <chr>  <chr> <chr> <chr>        
 1 .                    11:8029~ A      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 2 .                    11:8029~ G      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 3 .                    11:8029~ C      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 4 .                    11:8029~ T      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 5 .                    11:8029~ G      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 6 .                    11:8029~ G      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 7 .                    11:8029~ G      intergenic~ MODIF~ -      -     -            -       -       -     -      -     -     -            
 8 .                    11:8029~ G      TF_binding~ MODIF~ -      -     MotifFeature ENSM00~ -       -     -      -     -     -            
 9 .                    11:8029~ C      regulatory~ MODIF~ -      -     RegulatoryF~ ENSR00~ enhanc~ -     -      -     -     -            
10 .                    11:8029~ C      regulatory~ MODIF~ -      -     RegulatoryF~ ENSR00~ enhanc~ -     -      -     -     -            
# ... with 7,259 more rows, and 26 more variables: CDS_position <chr>, Protein_position <chr>, Amino_acids <chr>, Codons <chr>,
#   Existing_variation <chr>, DISTANCE <chr>, STRAND <chr>, FLAGS <chr>, SYMBOL_SOURCE <chr>, HGNC_ID <chr>, MANE_SELECT <chr>,
#   MANE_PLUS_CLINICAL <chr>, TSL <chr>, APPRIS <chr>, REFSEQ_MATCH <chr>, REFSEQ_OFFSET <chr>, GIVEN_REF <chr>, USED_REF <chr>,
#   BAM_EDIT <chr>, SIFT <chr>, PolyPhen <chr>, MOTIF_NAME <chr>, MOTIF_POS <chr>, HIGH_INF_POS <chr>, MOTIF_SCORE_CHANGE <chr>,
#   TRANSCRIPTION_FACTORS <chr>

Q8. Are the number of variants the same as in the *txt files? This may be trivial, however it may happen an error while copying files or R connection problems.

Click here for output
yes

5057 vs 5057 sampleA
7269 vs 7269 sampleB

Summarizing data

It is important to have an overview of the entire dataset, so summary statistics and graphs are quite helpful to scan our data. We will look at:

  1. Number of mutations per sample
  2. Number of mutations per chromosome per sample
  3. Distribution of mutations based on selected consequences (barplot)
  4. Distribution of mutations based on Biotype (pie chart)

1. Number of mutations per sample

In a complete dataset we may have different alleles in one same position. For example: in chromosome 1, at position 1, our reference A could show in some reads the nucleotide C, while others show G:

    1:10-10   A > C
    1:10-10   A > G

Thus we need to differentiate them by checking not only the position, but the alleles. In our case, since we annotated the variants specifying Show one selected consequence per variant during the variant annotation with VEP, we will not have more variants than the ones we used as input. However let's exemplify how we would calculate this:

# Number of mutations based on location and allele
A %>% 
  distinct(Location, Allele, .keep_all = TRUE) %>% 
  nrow
Click here for output
[1] 5057

Repeat for sample B

Click here for output
B %>% 
  distinct(Location, Allele, .keep_all = TRUE) %>% 
  nrow
[1] 7269

2. Number of mutations per chromosome per sample

It is good to know the distribution of the variants per chromosome. In our case we expect to have most of them in chr11.

# Number of mutations based on location and allele grouped by chr
A %>% 
  mutate(CHROM = str_replace(Location, ":.+", "")) %>% 
  group_by(CHROM) %>% 
  distinct(Location, Allele) %>% 
  summarise(N_mutations=n())
Click here for output
# A tibble: 2 x 2
  CHROM N_mutations
  <chr>       <int>
1 11           5056
2 3               1

Repeat for sample B

Click here for output
B %>% 
  mutate(CHROM = str_replace(Location, ":.+", "")) %>% 
  group_by(CHROM) %>% 
  distinct(Location, Allele) %>% 
  summarise(N_mutations=n())
# A tibble: 1 x 2
  CHROM N_mutations
  <chr>       <int>
1 11           7269

3. Distribution of mutations based on selected consequences

Now that we have checked that our data makes sense, let's visualize some consequences as a barplot. First we select the consequences we want to plot. Let's display the available Consequences:

unique(c(A$Consequence,B$Consequence))
Click here for output
 [1] "intergenic_variant"                                                                                    
 [2] "regulatory_region_variant"                                                                             
 [3] "TF_binding_site_variant"                                                                               
 [4] "upstream_gene_variant"                                                                                 
 [5] "intron_variant,non_coding_transcript_variant"                                                          
 [6] "downstream_gene_variant"                                                                               
 [7] "non_coding_transcript_exon_variant"                                                                    
 [8] "3_prime_UTR_variant"                                                                                   
 [9] "intron_variant"                                                                                        
[10] "missense_variant"                                                                                      
[11] "5_prime_UTR_variant"                                                                                   
[12] "synonymous_variant"                                                                                    
[13] "splice_region_variant,splice_polypyrimidine_tract_variant,intron_variant,non_coding_transcript_variant"
[14] "splice_region_variant,splice_polypyrimidine_tract_variant,intron_variant"  

Q9. Which consequences would you select to create a simplified overview? Here is a reminder of what each consequence means.

Click here for output
* intron_variant: this is interesting because the data is exome, this of course may look different if we load the complete datasets
* intron_variant,non_coding_transcript_variant: overlaping ncRNAs are always interesting
* synonymous_variant: even if they do not make any damage (hopefully) is good to have an idea of how mutated are the coding regions
* missense_variant: most important
* regulatory and upstream_variants: may affect the level of transcription of a gene

This will of course depend on what you want to show!

For our example, let's display only the synonymous_variant, missense_variant and regulatory_region_variant.

Now, let's select the column that contains information about the Consequences:

# Data selection
A_tmp <- A %>% 
  select(Consequence) %>% 
  mutate(Sample = "A")
A_tmp
Click here for output
# A tibble: 5,057 x 2
   Consequence               Sample
   <chr>                     <chr> 
 1 intergenic_variant        A     
 2 intergenic_variant        A     
 3 intergenic_variant        A     
 4 intergenic_variant        A     
 5 intergenic_variant        A     
 6 intergenic_variant        A     
 7 intergenic_variant        A     
 8 regulatory_region_variant A     
 9 TF_binding_site_variant   A     
10 regulatory_region_variant A     
# ... with 5,047 more rows

Try to use the summarise() function, grouping by Consequence, to create a summary table of the Consequences.

Click here for output
# summarizing the consequences distribution
A_tmp %>% 
   group_by(Consequence) %>% 
   summarise(N_variants=n())
# A tibble: 14 × 2
   Consequence                                         N_variants
   <chr>                                                    <int>
 1 3_prime_UTR_variant                                         16
 2 5_prime_UTR_variant                                          2
 3 TF_binding_site_variant                                      8
 4 downstream_gene_variant                                    105
 5 intergenic_variant                                        1477
 6 intron_variant                                            2406
 7 intron_variant,non_coding_transcript_variant               764
 8 missense_variant                                            13
 9 non_coding_transcript_exon_variant                          17
10 regulatory_region_variant                                  140
11 splice_region_variant,splice_polypyrimidine_tract_…          1
12 splice_region_variant,splice_polypyrimidine_tract_…          1
13 synonymous_variant                                           6
14 upstream_gene_variant                                      101

Repeat for sample B and create B_tmp

Click here for output
# Data selection
B_tmp <- B %>% 
  select(Consequence) %>% 
  mutate(Sample = "B")
B_tmp
# A tibble: 7,269 x 2
   Consequence               Sample
   <chr>                     <chr> 
 1 intergenic_variant        B     
 2 intergenic_variant        B     
 3 intergenic_variant        B     
 4 intergenic_variant        B     
 5 intergenic_variant        B     
 6 intergenic_variant        B     
 7 intergenic_variant        B     
 8 TF_binding_site_variant   B     
 9 regulatory_region_variant B     
10 regulatory_region_variant B     
# ... with 7,259 more rows
B_tmp %>%
  group_by(Consequence) %>%
  summarise(N_variants=n()) 
# A tibble: 13 × 2
   Consequence                                         N_variants
   <chr>                                                    <int>
 1 3_prime_UTR_variant                                         24
 2 5_prime_UTR_variant                                          4
 3 TF_binding_site_variant                                     10
 4 downstream_gene_variant                                    119
 5 intergenic_variant                                        2086
 6 intron_variant                                            3573
 7 intron_variant,non_coding_transcript_variant              1092
 8 missense_variant                                            13
 9 non_coding_transcript_exon_variant                          23
10 regulatory_region_variant                                  186
11 splice_region_variant,splice_polypyrimidine_tract_…          1
12 synonymous_variant                                           7
13 upstream_gene_variant                                      131

Then we make the plot merging the data from both samples. In the code below, add the consequences we selected. Remember to add the term within double quotes:

bind_rows(A_tmp, B_tmp) %>% 
  filter(Consequence %in% c("ADD_CONSEQUENCE_TERM", "ADD_CONSEQUENCE_TERM", "ADD_CONSEQUENCE_TERM")) %>% 
  ggplot(aes(x=Consequence, fill=Sample)) + 
  geom_bar(stat="count", position="dodge") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust=1))
Click here for output
bind_rows(A_tmp, B_tmp) %>% 
  filter(Consequence %in% c("synonymous_variant", "missense_variant", "regulatory_region_variant")) %>% 
  ggplot(aes(x=Consequence, fill=Sample)) + 
  geom_bar(stat="count", position="dodge") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust=1))

4. Distribution of mutations based on Biotype

Biotypes are classifiers for genes of transcripts. This is a complete list of Biotypes used by Ensembl. Let's visualize the type of genes that have genetic variants with a pie chart.

We select the BIOTYPE column, summarize it and create a data.frame, which will give us a frequency table of the different Biotypes:

A %>% 
  select(BIOTYPE) %>% 
  mutate(Sample = "A")
Click here for output
# A tibble: 5,057 x 2
   BIOTYPE  Sample
   <chr>    <chr> 
 1 -        A     
 2 -        A     
 3 -        A     
 4 -        A     
 5 -        A     
 6 -        A     
 7 -        A     
 8 enhancer A     
 9 -        A     
10 enhancer A     
# ... with 5,047 more rows
table(A %>% 
  select(BIOTYPE) %>% 
  mutate(Sample = "A"))
Click here for output
                        Sample
BIOTYPE                     A
  -                      1485
  CTCF_binding_site        15
  enhancer                 86
  lncRNA                  798
  miRNA                     2
  misc_RNA                 29
  open_chromatin_region    35
  promoter                  1
  protein_coding         2589
  TF_binding_site           3
  transcribed_pseudogene   14
A_tmp <- as.data.frame(table(A %>% 
  select(BIOTYPE) %>% 
  mutate(Sample = "A")))
A_tmp
Click here for output
                  BIOTYPE Sample Freq
1                       -      A 1485
2       CTCF_binding_site      A   15
3                enhancer      A   86
4                  lncRNA      A  798
5                   miRNA      A    2
6                misc_RNA      A   29
7   open_chromatin_region      A   35
8                promoter      A    1
9          protein_coding      A 2589
10        TF_binding_site      A    3
11 transcribed_pseudogene      A   14

Then, we can plot the information as a pie chart. Don't forget to add the title:

ggplot(A_tmp, aes(x="", y=Freq, fill=BIOTYPE)) + 
  geom_bar(stat="identity") + 
  coord_polar("y",start=0) + 
  theme_void() +
  labs(title="ADD_YOUR_TITLE_HERE")
Click here for output

Repeat for Sample B

Click here for output
B %>% 
  select(BIOTYPE) %>% 
  mutate(Sample = "B")
# A tibble: 7,267 x 2
   BIOTYPE  Sample
   <chr>    <chr> 
 1 -        B     
 2 -        B     
 3 -        B     
 4 -        B     
 5 -        B     
 6 -        B     
 7 -        B     
 8 -        B     
 9 enhancer B     
10 enhancer B     
# ... with 7,257 more rows
table(B %>% 
  select(BIOTYPE) %>% 
  mutate(Sample = "B"))

                        Sample
BIOTYPE                     B
  -                      2096
  CTCF_binding_site        20
  enhancer                115
  lncRNA                 1136
  miRNA                     1
  misc_RNA                 32
  open_chromatin_region    48
  promoter                  1
  protein_coding         3790
  snoRNA                    7
  TF_binding_site           2
  transcribed_pseudogene   21
B_tmp <- as.data.frame(table(B %>% 
  select(BIOTYPE) %>% 
  mutate(Sample = "B")))
B_tmp
                  BIOTYPE Sample Freq
1                       -      B 2096
2       CTCF_binding_site      B   20
3                enhancer      B  115
4                  lncRNA      B 1136
5                   miRNA      B    1
6                misc_RNA      B   32
7   open_chromatin_region      B   48
8                promoter      B    1
9          protein_coding      B 3790
10                 snoRNA      B    7
11        TF_binding_site      B    2
12 transcribed_pseudogene      B   21
ggplot(B_tmp, aes(x="", y=Freq, fill=BIOTYPE)) + 
  geom_bar(stat="identity") + 
  coord_polar("y",start=0) + 
  theme_void() +
  labs(title="ADD_YOUR_TITLE_HERE")

A quick note on annotation tools

As we have discussed, there are a lot of algorithms and databases for one task, for instance the annotation of whether an amino acid substitution is likely to affect protein function or not (SIFT and Polyphen). In Bioinformatics, we need to set some initial assumptions to build up an algorithm and therefor, different tools give different results. Specially because we are dealing with predictions.

To exemplify this, let's compare the scores from SIFT and Polyphen-2:

A %>% 
  mutate(SIFT_num = parse_number(SIFT),
         PolyPhen_num = parse_number(PolyPhen)) %>% 
  ggplot(aes(x=SIFT_num, y=PolyPhen_num)) + 
  geom_point(alpha=0.2, size = 3) +
  scale_x_reverse()+
  annotate("text", x=0.8, y=0.2, label= "benign/tolerated") + 
  annotate("text", x=0.2, y=0.8, label= "damaging/deleterious") + 
  annotate(geom="rect", xmin=0.5, xmax=0, ymin=0, ymax=0.5, fill="#F8766D", alpha=0.2) +
  annotate(geom="rect", xmin=1, xmax=0.5, ymin=0.5, ymax=1, fill="#619CFF", alpha=0.2) +
  labs(title="Sample A") +
  theme_classic() 
Click here for output

As you can see, there are mutations being predicted as benign by SIFT while those same mutations are annotated as damaging by PolyPhen (red rectangle). Some people focus on those variants where both algorithms predict the same outcome. Just don't forget that even when there is contradictory data, those variants may be interesting to look at as well.

Data filtering

Now that we have an overall look of our data, it is time to filter the mutations. A common strategy is to focus on mutations that:

  • are within exons (or protein_coding)
  • are annotated as a missense_variant
  • are classified other than tolerated/benign

Let's do this step by step:

# 1. BIOTYPE == protein_coding 
A %>% 
  filter(BIOTYPE == "protein_coding") %>% 
  distinct(Location, Allele, .keep_all = TRUE) %>% 
  nrow
Click here for output
[1] 2589 out of 5057
# 2. CONSEQUENCE == missense_variant
A %>% 
  filter(BIOTYPE == "protein_coding" & Consequence == "missense_variant") %>% 
  distinct(Location, Allele) %>% 
  nrow
Click here for output
[1] 13 out of 5057

For this filtering step, select either SIFT or PolyPhen (without quotes in the code). Remember that the SIFT label is tolerated, while PolyPhen uses benign (in between double quotes in the code).

# 3. SIFT != tolerated and Polyphen != benign
A_filtered <- A %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(ADD_SIFT_OR_PolyPhen, "ADD_tolerated_OR_benign")) %>% 
  distinct(Location, Allele) %>% 
  nrow
A_filtered
Click here for output
A_sift <- A %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(SIFT, "tolerated")) %>%  
  distinct(Location, Allele) %>% 
  nrow
A_sift
[1] 3 out of 5057
A_poly <- A %>% 
  filter(BIOTYPE == "protein_coding" &
         Consequence == "missense_variant" & 
         !str_detect(PolyPhen, "benign")) %>% 
  distinct(Location, Allele) %>% 
  nrow
A_poly
[1] 3 out of 5057

Repeat for Sample B

Click here for output
B %>% 
  filter(BIOTYPE == "protein_coding") %>% 
  distinct(Location, Allele, .keep_all = TRUE) %>% 
  nrow
[1] 3790 out of 7267
# 2. CONSEQUENCE == missense_variant
B %>% 
  filter(BIOTYPE == "protein_coding" & Consequence == "missense_variant") %>% 
  distinct(Location, Allele) %>% 
  nrow
[1] 13 out of 7267
B_sift <- B %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(SIFT, "tolerated")) %>%  
  distinct(Location, Allele) %>% 
  nrow
B_sift
[1] 4 out of 7267
B_poly <- B %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(PolyPhen, "benign")) %>% 
  distinct(Location, Allele) %>% 
  nrow
B_poly
[1] 3 out of 7267

We can create a table to compare the results

knitr::kable(data.frame(A_sift,B_sift,A_poly,B_poly), caption="SIFT vs PolyPhen")
Table: SIFT vs PolyPhen

| A_sift| B_sift| A_poly| B_poly|
|------:|------:|------:|------:|
|      3|      4|      3|      3|

Now, let's look at the actual variants. In the code below, besides selecting SIFT or PolyPhen, add these column names:

  • name of the gene (SYMBOL)
  • the aminoacid change (Amino_acids) and
  • the dbSNP identifier (Existing_variation)

under the distinct() function

A_filtered <- A %>% 
              filter(BIOTYPE == "protein_coding" & 
                     Consequence == "missense_variant" & 
                     !str_detect(ADD_SIFT_OR_PolyPhen, "ADD_tolerated_OR_benign")) %>% 
              distinct(Location, Allele, ADD_HERE_THE_COLUMN_NAMES) 

knitr::kable(A_filtered, caption = "Filtered mutations from Sample A")
Click here for output
A_filtered_sift <- A %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(SIFT, "tolerated")) %>%  
  distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation)
A_filtered_sift
# A tibble: 3 × 5
  Location           Allele SYMBOL Amino_acids Existing_variation
  <chr>              <chr>  <chr>  <chr>       <chr>             
1 11:85725351-85725… C      SYTL2  D/G         -                 
2 11:85725825-85725… C      SYTL2  A/G         -                 
3 11:88309301-88309… C      CTSC   Y/C         -                        
A_filtered_poly <- A %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(PolyPhen, "benign")) %>% 
  distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation)
A_filtered_poly
# A tibble: 3 × 5
  Location           Allele SYMBOL Amino_acids Existing_variation
  <chr>              <chr>  <chr>  <chr>       <chr>             
1 11:85725351-85725… C      SYTL2  D/G         -                 
2 11:86409331-86409… A      CCDC81 I/K         -                 
3 11:88309301-88309… C      CTSC   Y/C         -                 

Repeat for Sample B

Click here for output
B_filtered_sift <- B %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(SIFT, "tolerated")) %>%  
  distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation) 
B_filtered_sift 
# A tibble: 4 × 5
  Location           Allele SYMBOL Amino_acids Existing_variation
  <chr>              <chr>  <chr>  <chr>       <chr>             
1 11:85725351-85725… C      SYTL2  D/G         -                 
2 11:85725825-85725… C      SYTL2  A/G         -                 
3 11:88309301-88309… C      CTSC   Y/C         -                 
4 11:90041372-90041… A      TRIM4… T/N         -                       
B_filtered_poly <- B %>% 
  filter(BIOTYPE == "protein_coding" & 
         Consequence == "missense_variant" & 
         !str_detect(PolyPhen, "benign")) %>% 
  distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation)
B_filtered_poly
# A tibble: 3 × 5
  Location           Allele SYMBOL Amino_acids Existing_variation
  <chr>              <chr>  <chr>  <chr>       <chr>             
1 11:85725351-85725… C      SYTL2  D/G         -                 
2 11:88309301-88309… C      CTSC   Y/C         -                 
3 11:90041372-90041… A      TRIM4… T/N         -             

Sample comparison (venn diagram)

Since we have two samples from two patients that suffer the same disease, we should focus on those mutations that are shared by both individuals. First we need to merge each column into a single string, so we can compare across samples:

A_mutations <- A_filtered %>%  
  mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>% 
  pull(Mutation)
A_mutations

B_mutations <- B_filtered %>% 
  mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>% 
  pull(Mutation)
B_mutations 

And then we can generate a venn diagram. In the code below, select two different colors for the venn diagram. Don't forget to add them within double quotes:

ggvenn(list(A=A_mutations, B=B_mutations), fill_color = c("ADD_COLOR1_HERE", "ADD_COLOR2_HERE"))
Click here for output

Filtered mutations based on SIFT:

A_mutations_sift <- A_filtered_sift %>%  
  mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>% 
  pull(Mutation)
A_mutations_sift
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:85725825-85725825|C|SYTL2|A/G|-"
[3] "11:88309301-88309301|C|CTSC|Y/C|-" 
B_mutations_sift <- B_filtered_sift %>% 
  mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>% 
  pull(Mutation)
B_mutations_sift
[1] "11:85725351-85725351|C|SYTL2|D/G|-"  
[2] "11:85725825-85725825|C|SYTL2|A/G|-"  
[3] "11:88309301-88309301|C|CTSC|Y/C|-"   
[4] "11:90041372-90041372|A|TRIM49C|T/N|-"
ggvenn(list(A=A_mutations_sift, B=B_mutations_sift), fill_color = c("red", "blue"))

Click here for output

Filtered mutations based on PolyPhen:

A_mutations_poly <- A_filtered_poly %>%  
  mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>% 
  pull(Mutation)
A_mutations_poly
[1] "11:85725351-85725351|C|SYTL2|D/G|-" 
[2] "11:86409331-86409331|A|CCDC81|I/K|-"
[3] "11:88309301-88309301|C|CTSC|Y/C|-"
B_mutations_poly <- B_filtered_poly %>% 
  mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>% 
  pull(Mutation)
B_mutations_poly 
[1] "11:85725351-85725351|C|SYTL2|D/G|-"  
[2] "11:88309301-88309301|C|CTSC|Y/C|-"   
[3] "11:90041372-90041372|A|TRIM49C|T/N|-"
ggvenn(list(A=A_mutations_poly, B=B_mutations_poly), fill_color = c("green", "purple"))

To get the final candidate mutations, we just extract the common variants:

# * Table/List of common mutations
intersect(A_mutations, B_mutations)
Click here for output
intersect(A_mutations_sift, B_mutations_sift)
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:85725825-85725825|C|SYTL2|A/G|-"
[3] "11:88309301-88309301|C|CTSC|Y/C|-" 
intersect(A_mutations_poly, B_mutations_poly)
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:88309301-88309301|C|CTSC|Y/C|-"

From this result, you should end up with 2 candidate genes when using PolyPhen or SIFT, respectively (CTSC, SYTL2).

The next step would be to read about these genes and try to find a connection to the disease, in our case, to PLS. Once we have justified some of those genes to be causative, then we would do some experimental validations to prove that tehse candidate variants are indeed responsible for PLS.

In our case, CTSC 88309301 is the mutation we are looking for! (hope you got it)

Well done! now you have the general workflow to identify disease causing mutations!



Developed by Marcela Dávila, Jari Martikainen and Björn Andersson, 2021. Updated by Marcela Dávila, 2023. Updated by Marcela Dávila, 2024

⚠️ **GitHub.com Fallback** ⚠️