BMA231 IV: Variant analysis - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)
The aim of this practical is to introduce you to a general workflow to identify disease causing mutations from DNA sequencing data. To be able to work within the allocated time, you will be processing just a portion of the real dataset to illustrate the overall workflow.
You will be working with data from two siblings with Papillion-Lefèvre Syndrome (PLS). PLS is a very rare disorder of autosomal recessive inheritance distinguished by palmar plantar hyperkeratosis and early onset of periodontitis affecting the dentition.
The data has been kindly provided by Johan Bylund, Professor at the Institute of Odontology at University of Gothenburg.
Data | Sibling A | Sibling B | Both |
---|---|---|---|
Raw data |
A_R1_fastq_gz_fastqc.html A_R2_fastq_gz_fastqc.html
|
B_R1_fastq_gz_fastqc.html B_R2_fastq_gz_fastqc.html
|
MultiQC_raw.html |
Filtered data |
A_R1_Trim_Galore_fastqc.html A_R2_Trim_Galore_fastqc.html
|
B_R1_Trim_Galore_fastqc.html B_R2_Trim_Galore_fastqc.html
|
MultiQC_filtered.html |
Aligned data |
A_samtools_flagstat.txt A_samtools_idxstats.tsv
|
B_samtools_flagstat.txt B_samtools_idxstats.tsv
|
MultiQC_aligned.html |
Alignment | A_filtered_bwa-mem.bam |
B_filtered_bwa-mem.bam |
You find this files in CANVAS under the SNP analysis
module:
Sample_A.zip
Sample_B.zip
MultiQC_snp.zip
GRCh38_90.bed
The first thing to do when you receive data is to check its quality. The FastQC reports were already generated using FastQC in Galaxy. They have also been merged with MultiQC. Inspect either the FastQC html reports (*gz_fastqc.html
) or the MulqiQC html report (MultiQC_raw.html
) and answer the following:
Q1. What can you conclude from these graphs?
Focus onBasic Statistics
,Per base sequence quality
,Per sequence quality scores
,Per base sequence content
,Sequence Length Distribution
,Sequende Duplication Levels
andAdapter Content
Click here for output
Almost 1M per sample, too little, but this is just a fragment of the data
Low duplication, which is typical for DNA sequencing, when targeting large regions
Quality looks quite nice, probably been already trimmed
Most of the quality scores are above 30Q
It fails the per base sequence content because of the last position, high T content, not a problem
No ambiguos bases
There are different lengths of the reads, which tells us that the data has been already trimmed
No adapters found
Good dataset
Although our data is good, and apparently it has been already trimmed, TrimGalore!
was used in Galaxy with default parameters. You may see no difference at all, but remember that it is a good practice. FastQC
and MultiQC
were also run to obtain summary reports using the filtered files. Inspect either the FastQC html reports (*Trim_Galore_fastqc.html
) or the MulqiQC html report (MultiQC_filtered.html
) and answer the following:
Q2. What differences do you see between the raw files and the filtered ones?
Click here for output
830453 vs 827994 - A sample
722724 vs 720957 - B sample
Just less data, not a real change in quality
since we already had a good dataset
As we are dealing with DNA, we will use BWA, an aligner based on the Burrows-Wheeler transform that has been optimized for
aligning short reads. The mapping was already run in Galaxy using BWA mem
where the output files are in BAM format
.
To evaluate if the mapping is acceptable, some statistics were generated with samtools
and gathered with MultiQC
, also in Galaxy. Inspect either the samtools reports (*samtools_flagstat.txt and *_samtools_idxstats.tsv
) or the MultiQC html report (MultiQC_aligned.html
) and answer the following:
Q3. What is the percentage of mapped reads for each sample? Is this acceptable? The data we are analyzing covers one chromosome, which one?
Click here for output
99.84% sampleA
99.85% sampleB
chr11 with 1 624 476 SampleA
chr11 with 1 414 610 SampleB
For this part, you will be using Galaxy.
- Log in into your account
- Create a history called
SNP practical
.- You do this in the right
History panel
.
- You do this in the right
-
Click on
Upload
on the left menu -
The
Upload from Disk or Web to ..
window will open -
Click on
Choose local files
-
Select both of your filtered bam files (
A_filtered_bwa-mem.bam
andB_filtered_bwa-mem.bam
)- Select
bam
under the Autodetect field - Select
Human Dec. 2013 (GRCh38/hg38) (hg38)
under the unspecified (?) field
- Select
-
Select also the
GRCh38_90.bed
file- Select
bed
under the Auto-detect field - Select
Human Dec. 2013 (GRCh38/hg38) (hg38)
under the unspecified (?) field
- Select
-
Click on
Start
Click here for output
-
Wait until the data has been uploaded
- You will see that the three files were added on the
History panel
Click here for output
- You will see that the three files were added on the
- Click on the eye icon of one of the bam files
You will see that there are more chromosomes than the canonical ones (chr1..chrX). To make the calculations easier let's remove them.
Click here for output
- Search for
Samtools view
in theTools panel
- Click on the tool
- Set the following parameters as indicated:
-
SAM/BAM/CRAM data set
- Select the icon for
Multiple datasets
- Select both alignments -->
*_filtered_bwa-mem.bam
- Select the icon for
-
What would you like to look at -->
A filtered/subsampled selection of reads
-
Configure filters
-
Filter by regions -->
Regions from BED file
-
Filter by intervals in a bed file -->
GRCh80_90.bed
-
Filter by regions -->
-
Click Execute
Click here for output
-
When dealing with variant calling, an important step is to identify possible PCR duplicates, this to remove any bias when we are doing the actual calling.
- Search for
RmDup remove PCR duplicates
in theTools panel
- Click on the tool
- Set the following parameters as indicated:
-
BAM File
- Select the icon for
Multiple datasets
- Select both alignments -->
*Samtools view on data X and data X_: filtered alignments
- Select the icon for
-
Is this paired-end or single end data -->
BAM is paired-end
-
Click Execute
Click here for output
-
You will get 1 output file per alignment: RmDup on data X
- Click on the
eye icon
You will see the alignment data, or at least part of it.
If you compare the alignments, you will see that the only difference is the amount of data. There are less reads in the alignment where we removed the duplicates as we are excluding reads that map to the same coordinates.
Click here for output
When calling indels, it is important to homogenize the positional distribution of insertions and deletions in the input by using left realignment. Left realignment will place all indels in homopolymer and microsatellite repeats at the same position, provided that doing so does not introduce mismatches between the read and reference other than the indel.
-
Search for
BamLeftAlign
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Choose the source for the reference genome -->
Locally cached
-
Select alignment file in BAM format
- Select the icon for
Multiple datasets
- Select both alignments -->
RmDup on data X
- Select the icon for
-
Using reference genome -->
Human Dec. 2013 (GRCh38/hg38) (hg38)
-
Click Execute
Click here for output
-
You will get 1 output file per alignment: BamLeftAlign on data X
- Click on the
eye icon
You will see the alignment data, or at least part of it.
FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
-
Search for
FreeBayes
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Choose the source for the reference genome -->
Locally cached
-
Run in batch mode? -->
Run individually
-
BAM dataset
- Select the icon for
Multiple datasets
- Select both alignments -->
BamLefAlign on data X (alignments)
- Select the icon for
-
Using reference genome -->
Human Dec. 2013 (GRCh38/hg38) (hg38)
-
Limit variant calling to a set of regions -->
Do not limit
-
Read coverage -->
Specify coverage options
-
Require at least this coverage to process a site -->
10
-
Require at least this coverage to process a site -->
-
Choose parameter selection level -->
Simple diploid calling with filtering and coverage
- Click Execute
Click here for output
-
Choose the source for the reference genome -->
You will get 1 output file per alignment: FreeBayes on data x (variants)
-
Click on the
eye icon
You will see the variant data inVCF
format, or at least part of it.- The first part starts with an
#
and it is part of the header - The second part is the detailed information of each one of the variants found
Click here for output
- The first part starts with an
Q4. How many variants did you call for each sample? Hint: Click on the process name
Click here for output
11 802 for sample A
12 031 for sample B
When dealing with real data, the size of the resulting files may be a problem as some webtools have a limit on the size of the data they can analyze. This is not an issue when we are using command line tools. Thus to avoid any technical issues we will pre-filter our variants.
Q5. How many variants did you end up with? Hint: Click on the process name
Click here for output
11,965 for sample A
12,101 for sample B
We can set up thresholds on different statistics like sequencing depth, allele balance, type of variant or genotype, just to mention a few. Here we will only exemplify how to filter based on sequencing depth using VCFfilter
:
-
Search for
VCFfilter
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
VCF dataset to filter
- Select the icon for
Multiple datasets
- Select both VCF files -->
FreeBayes on data X (variants)
- Select the icon for
-
more filters
-
Select the filter type -->
Info filter (-f)
-
Specify filtering value -->
DP > 19
-
Select the filter type -->
-
Click Execute
Click here for output
-
Q6. How many variants did you end up with? Hint: Click on the process name
Click here for output
5057 for sample A
7269 for sample B
Now let's download both VCF
files for further processing:
- Display the information of the
VCFfilter: on data X
- Click on the
Download icon
- Save the VCF file in your computer
- The name of the file will look like:
GalaxyXX-[VCFfilter_on_data_X].vcf
- Rename it to
SampleA.vcf
orSampleB.vcf
depending on the file you are downloading.
- The name of the file will look like:
Now we need to annotate our variants in terms of genomic context, aminoacid change and effect, previous mutation knowledge, etc. so we can evaluate and filter mutations that are irrelevant. For this we will use VEP website, an online tool that determines the effect of variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions, among other types of annotation.
-
Go to the VEP website
-
Click on
Launch Ve!P
-
Set the following parameters as indicated:
-
Species -->
GRCh38.p14
-
Name for this job (optional) -->
SampleA
-
Input data --> Upload file -->
SampleA.vcf
-
Transcript database to use: -->
RefSeq transcripts
-
Additional configurations
-
Identifiers --> select ONLY
Gene symbol
-
Variants and frequency data --> Find co-located known variants -->
No
-
Filtering options --> Restrict results -->
Show one selected consequence per variant
-
Identifiers --> select ONLY
-
Click Run
Click here for output
-
Repeat the same procedure with SampleB
.
Once the annotations are done, click on View results
Click here for output
Q7. How many variants were processed for each sample? Where are most of the variants located within the genome for each sample? Which type of
coding consequences
are more common in each sample?
Click here for output
5057 sampleA
7269 sampleB
Most of them are within introns
These are the most common consequences:
synonymous => nt change without aa change
missense => nt change with aa change (the important ones)
You could do some filtering within in VEP, however, it will be quicker if we download the files and work with them locally.
- Under the
Download
box- Click on
TXT
- Save the files with the corresponding sample name (keep the
txt
extension)
- Click on
Now let's summarize, compare and visualize our data. There are a lot of tools that can help with this, some are open source and some are licensed. Here we will be using R
.
In most of the cases, you should copy/paste
the code. There will be some instances where you will be asked to decide on a value/threshold/color and this will be highlighted in bold within the instructions and in CAPS
within the code.
You are not expected to understand all the code, however you are more than welcome to ask!. Learning R takes a lot of practice!
Start R
in your local computer.
Select your working directory (where you saved the annotated files):
Session -> Set Working Directory -> Choose directory
Let's load the packages we will be using. Remember that packages
bundle together code, data, documentation, and tests, and is easy to share with others. They are an excellent way to reuse code, specially because the chances that someone has already solved a problem that you’re working on are big and you don't need to start from scratch:
# Install them first if you don't have them already
#
# install.packages("tidyverse")
# install.packages("ggvenn")
# Load required packages
library(tidyverse) # data manipulation
library(ggvenn) # Venn diagram
Now we need to read the data in R. In the code below, replace the text with your annotated file names. Make sure the name is within double quotes:
# Read in data
A <- read_table("WRITE_THE_NAME_OF_YOUR_SampleA_FILE")
B <- read_table("WRITE_THE_NAME_OF_YOUR_SampleB_FILE")
It is important to check at every step if the data we have make sense, so let's display the data:
# display variable
A
Click here for output
# A tibble: 5,057 x 41
`#Uploaded_variati~` Location Allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 . 11:8029~ T intergenic~ MODIF~ - - - - - - - - - -
2 . 11:8029~ T intergenic~ MODIF~ - - - - - - - - - -
3 . 11:8029~ A intergenic~ MODIF~ - - - - - - - - - -
4 . 11:8029~ C intergenic~ MODIF~ - - - - - - - - - -
5 . 11:8029~ T intergenic~ MODIF~ - - - - - - - - - -
6 . 11:8029~ G intergenic~ MODIF~ - - - - - - - - - -
7 . 11:8029~ G intergenic~ MODIF~ - - - - - - - - - -
8 . 11:8029~ AAAAA~ regulatory~ MODIF~ - - RegulatoryF~ ENSR00~ enhanc~ - - - - -
9 . 11:8029~ G TF_binding~ MODIF~ - - MotifFeature ENSM00~ - - - - - -
10 . 11:8029~ C regulatory~ MODIF~ - - RegulatoryF~ ENSR00~ enhanc~ - - - - -
# ... with 5,047 more rows, and 26 more variables: CDS_position <chr>, Protein_position <chr>, Amino_acids <chr>, Codons <chr>,
# Existing_variation <chr>, DISTANCE <chr>, STRAND <chr>, FLAGS <chr>, SYMBOL_SOURCE <chr>, HGNC_ID <chr>, MANE_SELECT <chr>,
# MANE_PLUS_CLINICAL <chr>, TSL <chr>, APPRIS <chr>, REFSEQ_MATCH <chr>, REFSEQ_OFFSET <chr>, GIVEN_REF <chr>, USED_REF <chr>,
# BAM_EDIT <chr>, SIFT <chr>, PolyPhen <chr>, MOTIF_NAME <chr>, MOTIF_POS <chr>, HIGH_INF_POS <chr>, MOTIF_SCORE_CHANGE <chr>,
# TRANSCRIPTION_FACTORS <chr>
# display variable
B
Click here for output
# A tibble: 7,269 x 41
`#Uploaded_variati~` Location Allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 . 11:8029~ A intergenic~ MODIF~ - - - - - - - - - -
2 . 11:8029~ G intergenic~ MODIF~ - - - - - - - - - -
3 . 11:8029~ C intergenic~ MODIF~ - - - - - - - - - -
4 . 11:8029~ T intergenic~ MODIF~ - - - - - - - - - -
5 . 11:8029~ G intergenic~ MODIF~ - - - - - - - - - -
6 . 11:8029~ G intergenic~ MODIF~ - - - - - - - - - -
7 . 11:8029~ G intergenic~ MODIF~ - - - - - - - - - -
8 . 11:8029~ G TF_binding~ MODIF~ - - MotifFeature ENSM00~ - - - - - -
9 . 11:8029~ C regulatory~ MODIF~ - - RegulatoryF~ ENSR00~ enhanc~ - - - - -
10 . 11:8029~ C regulatory~ MODIF~ - - RegulatoryF~ ENSR00~ enhanc~ - - - - -
# ... with 7,259 more rows, and 26 more variables: CDS_position <chr>, Protein_position <chr>, Amino_acids <chr>, Codons <chr>,
# Existing_variation <chr>, DISTANCE <chr>, STRAND <chr>, FLAGS <chr>, SYMBOL_SOURCE <chr>, HGNC_ID <chr>, MANE_SELECT <chr>,
# MANE_PLUS_CLINICAL <chr>, TSL <chr>, APPRIS <chr>, REFSEQ_MATCH <chr>, REFSEQ_OFFSET <chr>, GIVEN_REF <chr>, USED_REF <chr>,
# BAM_EDIT <chr>, SIFT <chr>, PolyPhen <chr>, MOTIF_NAME <chr>, MOTIF_POS <chr>, HIGH_INF_POS <chr>, MOTIF_SCORE_CHANGE <chr>,
# TRANSCRIPTION_FACTORS <chr>
Q8. Are the number of variants the same as in the
*txt
files? This may be trivial, however it may happen an error while copying files or R connection problems.
Click here for output
yes
5057 vs 5057 sampleA
7269 vs 7269 sampleB
It is important to have an overview of the entire dataset, so summary statistics and graphs are quite helpful to scan our data. We will look at:
- Number of mutations per sample
- Number of mutations per chromosome per sample
- Distribution of mutations based on selected consequences (barplot)
- Distribution of mutations based on Biotype (pie chart)
In a complete dataset we may have different alleles in one same position. For example: in chromosome 1, at position 1, our reference A
could show in some reads the nucleotide C
, while others show G
:
1:10-10 A > C
1:10-10 A > G
Thus we need to differentiate them by checking not only the position, but the alleles. In our case, since we annotated the variants specifying Show one selected consequence per variant
during the variant annotation with VEP
, we will not have more variants than the ones we used as input. However let's exemplify how we would calculate this:
# Number of mutations based on location and allele
A %>%
distinct(Location, Allele, .keep_all = TRUE) %>%
nrow
Click here for output
[1] 5057
Repeat for sample B
Click here for output
B %>%
distinct(Location, Allele, .keep_all = TRUE) %>%
nrow
[1] 7269
It is good to know the distribution of the variants per chromosome. In our case we expect to have most of them in chr11
.
# Number of mutations based on location and allele grouped by chr
A %>%
mutate(CHROM = str_replace(Location, ":.+", "")) %>%
group_by(CHROM) %>%
distinct(Location, Allele) %>%
summarise(N_mutations=n())
Click here for output
# A tibble: 2 x 2
CHROM N_mutations
<chr> <int>
1 11 5056
2 3 1
Repeat for sample B
Click here for output
B %>%
mutate(CHROM = str_replace(Location, ":.+", "")) %>%
group_by(CHROM) %>%
distinct(Location, Allele) %>%
summarise(N_mutations=n())
# A tibble: 1 x 2
CHROM N_mutations
<chr> <int>
1 11 7269
Now that we have checked that our data makes sense, let's visualize some consequences as a barplot. First we select the consequences we want to plot. Let's display the available Consequences
:
unique(c(A$Consequence,B$Consequence))
Click here for output
[1] "intergenic_variant"
[2] "regulatory_region_variant"
[3] "TF_binding_site_variant"
[4] "upstream_gene_variant"
[5] "intron_variant,non_coding_transcript_variant"
[6] "downstream_gene_variant"
[7] "non_coding_transcript_exon_variant"
[8] "3_prime_UTR_variant"
[9] "intron_variant"
[10] "missense_variant"
[11] "5_prime_UTR_variant"
[12] "synonymous_variant"
[13] "splice_region_variant,splice_polypyrimidine_tract_variant,intron_variant,non_coding_transcript_variant"
[14] "splice_region_variant,splice_polypyrimidine_tract_variant,intron_variant"
Q9. Which consequences would you select to create a simplified overview? Here is a reminder of what each consequence means.
Click here for output
* intron_variant: this is interesting because the data is exome, this of course may look different if we load the complete datasets
* intron_variant,non_coding_transcript_variant: overlaping ncRNAs are always interesting
* synonymous_variant: even if they do not make any damage (hopefully) is good to have an idea of how mutated are the coding regions
* missense_variant: most important
* regulatory and upstream_variants: may affect the level of transcription of a gene
This will of course depend on what you want to show!
For our example, let's display only the synonymous_variant
, missense_variant
and regulatory_region_variant
.
Now, let's select the column that contains information about the Consequences
:
# Data selection
A_tmp <- A %>%
select(Consequence) %>%
mutate(Sample = "A")
A_tmp
Click here for output
# A tibble: 5,057 x 2
Consequence Sample
<chr> <chr>
1 intergenic_variant A
2 intergenic_variant A
3 intergenic_variant A
4 intergenic_variant A
5 intergenic_variant A
6 intergenic_variant A
7 intergenic_variant A
8 regulatory_region_variant A
9 TF_binding_site_variant A
10 regulatory_region_variant A
# ... with 5,047 more rows
Try to use the summarise()
function, grouping by Consequence
, to create a summary table of the Consequences.
Click here for output
# summarizing the consequences distribution
A_tmp %>%
group_by(Consequence) %>%
summarise(N_variants=n())
# A tibble: 14 × 2
Consequence N_variants
<chr> <int>
1 3_prime_UTR_variant 16
2 5_prime_UTR_variant 2
3 TF_binding_site_variant 8
4 downstream_gene_variant 105
5 intergenic_variant 1477
6 intron_variant 2406
7 intron_variant,non_coding_transcript_variant 764
8 missense_variant 13
9 non_coding_transcript_exon_variant 17
10 regulatory_region_variant 140
11 splice_region_variant,splice_polypyrimidine_tract_… 1
12 splice_region_variant,splice_polypyrimidine_tract_… 1
13 synonymous_variant 6
14 upstream_gene_variant 101
Repeat for sample B and create B_tmp
Click here for output
# Data selection
B_tmp <- B %>%
select(Consequence) %>%
mutate(Sample = "B")
B_tmp
# A tibble: 7,269 x 2
Consequence Sample
<chr> <chr>
1 intergenic_variant B
2 intergenic_variant B
3 intergenic_variant B
4 intergenic_variant B
5 intergenic_variant B
6 intergenic_variant B
7 intergenic_variant B
8 TF_binding_site_variant B
9 regulatory_region_variant B
10 regulatory_region_variant B
# ... with 7,259 more rows
B_tmp %>%
group_by(Consequence) %>%
summarise(N_variants=n())
# A tibble: 13 × 2
Consequence N_variants
<chr> <int>
1 3_prime_UTR_variant 24
2 5_prime_UTR_variant 4
3 TF_binding_site_variant 10
4 downstream_gene_variant 119
5 intergenic_variant 2086
6 intron_variant 3573
7 intron_variant,non_coding_transcript_variant 1092
8 missense_variant 13
9 non_coding_transcript_exon_variant 23
10 regulatory_region_variant 186
11 splice_region_variant,splice_polypyrimidine_tract_… 1
12 synonymous_variant 7
13 upstream_gene_variant 131
Then we make the plot merging the data from both samples. In the code below, add the consequences we selected. Remember to add the term within double quotes:
bind_rows(A_tmp, B_tmp) %>%
filter(Consequence %in% c("ADD_CONSEQUENCE_TERM", "ADD_CONSEQUENCE_TERM", "ADD_CONSEQUENCE_TERM")) %>%
ggplot(aes(x=Consequence, fill=Sample)) +
geom_bar(stat="count", position="dodge") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust=1))
Click here for output
bind_rows(A_tmp, B_tmp) %>%
filter(Consequence %in% c("synonymous_variant", "missense_variant", "regulatory_region_variant")) %>%
ggplot(aes(x=Consequence, fill=Sample)) +
geom_bar(stat="count", position="dodge") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust=1))
Biotypes are classifiers for genes of transcripts. This is a complete list of Biotypes used by Ensembl. Let's visualize the type of genes that have genetic variants with a pie chart.
We select the BIOTYPE
column, summarize it and create a data.frame, which will give us a frequency table of the different Biotypes
:
A %>%
select(BIOTYPE) %>%
mutate(Sample = "A")
Click here for output
# A tibble: 5,057 x 2
BIOTYPE Sample
<chr> <chr>
1 - A
2 - A
3 - A
4 - A
5 - A
6 - A
7 - A
8 enhancer A
9 - A
10 enhancer A
# ... with 5,047 more rows
table(A %>%
select(BIOTYPE) %>%
mutate(Sample = "A"))
Click here for output
Sample
BIOTYPE A
- 1485
CTCF_binding_site 15
enhancer 86
lncRNA 798
miRNA 2
misc_RNA 29
open_chromatin_region 35
promoter 1
protein_coding 2589
TF_binding_site 3
transcribed_pseudogene 14
A_tmp <- as.data.frame(table(A %>%
select(BIOTYPE) %>%
mutate(Sample = "A")))
A_tmp
Click here for output
BIOTYPE Sample Freq
1 - A 1485
2 CTCF_binding_site A 15
3 enhancer A 86
4 lncRNA A 798
5 miRNA A 2
6 misc_RNA A 29
7 open_chromatin_region A 35
8 promoter A 1
9 protein_coding A 2589
10 TF_binding_site A 3
11 transcribed_pseudogene A 14
Then, we can plot the information as a pie chart. Don't forget to add the title:
ggplot(A_tmp, aes(x="", y=Freq, fill=BIOTYPE)) +
geom_bar(stat="identity") +
coord_polar("y",start=0) +
theme_void() +
labs(title="ADD_YOUR_TITLE_HERE")
Click here for output
Repeat for Sample B
Click here for output
B %>%
select(BIOTYPE) %>%
mutate(Sample = "B")
# A tibble: 7,267 x 2
BIOTYPE Sample
<chr> <chr>
1 - B
2 - B
3 - B
4 - B
5 - B
6 - B
7 - B
8 - B
9 enhancer B
10 enhancer B
# ... with 7,257 more rows
table(B %>%
select(BIOTYPE) %>%
mutate(Sample = "B"))
Sample
BIOTYPE B
- 2096
CTCF_binding_site 20
enhancer 115
lncRNA 1136
miRNA 1
misc_RNA 32
open_chromatin_region 48
promoter 1
protein_coding 3790
snoRNA 7
TF_binding_site 2
transcribed_pseudogene 21
B_tmp <- as.data.frame(table(B %>%
select(BIOTYPE) %>%
mutate(Sample = "B")))
B_tmp
BIOTYPE Sample Freq
1 - B 2096
2 CTCF_binding_site B 20
3 enhancer B 115
4 lncRNA B 1136
5 miRNA B 1
6 misc_RNA B 32
7 open_chromatin_region B 48
8 promoter B 1
9 protein_coding B 3790
10 snoRNA B 7
11 TF_binding_site B 2
12 transcribed_pseudogene B 21
ggplot(B_tmp, aes(x="", y=Freq, fill=BIOTYPE)) +
geom_bar(stat="identity") +
coord_polar("y",start=0) +
theme_void() +
labs(title="ADD_YOUR_TITLE_HERE")
A quick note on annotation tools
As we have discussed, there are a lot of algorithms and databases for one task, for instance the annotation of whether an amino acid substitution is likely to affect protein function or not (SIFT and Polyphen). In Bioinformatics, we need to set some initial assumptions to build up an algorithm and therefor, different tools give different results. Specially because we are dealing with predictions.
To exemplify this, let's compare the scores from SIFT
and Polyphen-2
:
A %>%
mutate(SIFT_num = parse_number(SIFT),
PolyPhen_num = parse_number(PolyPhen)) %>%
ggplot(aes(x=SIFT_num, y=PolyPhen_num)) +
geom_point(alpha=0.2, size = 3) +
scale_x_reverse()+
annotate("text", x=0.8, y=0.2, label= "benign/tolerated") +
annotate("text", x=0.2, y=0.8, label= "damaging/deleterious") +
annotate(geom="rect", xmin=0.5, xmax=0, ymin=0, ymax=0.5, fill="#F8766D", alpha=0.2) +
annotate(geom="rect", xmin=1, xmax=0.5, ymin=0.5, ymax=1, fill="#619CFF", alpha=0.2) +
labs(title="Sample A") +
theme_classic()
Click here for output
As you can see, there are mutations being predicted as benign by SIFT
while those same mutations are annotated as damaging by PolyPhen
(red rectangle). Some people focus on those variants where both algorithms predict the same outcome. Just don't forget that even when there is contradictory data, those variants may be interesting to look at as well.
Now that we have an overall look of our data, it is time to filter the mutations. A common strategy is to focus on mutations that:
- are within exons (or
protein_coding
) - are annotated as a
missense_variant
- are classified other than
tolerated/benign
Let's do this step by step:
# 1. BIOTYPE == protein_coding
A %>%
filter(BIOTYPE == "protein_coding") %>%
distinct(Location, Allele, .keep_all = TRUE) %>%
nrow
Click here for output
[1] 2589 out of 5057
# 2. CONSEQUENCE == missense_variant
A %>%
filter(BIOTYPE == "protein_coding" & Consequence == "missense_variant") %>%
distinct(Location, Allele) %>%
nrow
Click here for output
[1] 13 out of 5057
For this filtering step, select either SIFT
or PolyPhen
(without quotes in the code). Remember that the SIFT label is tolerated
, while PolyPhen uses benign
(in between double quotes in the code).
# 3. SIFT != tolerated and Polyphen != benign
A_filtered <- A %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(ADD_SIFT_OR_PolyPhen, "ADD_tolerated_OR_benign")) %>%
distinct(Location, Allele) %>%
nrow
A_filtered
Click here for output
A_sift <- A %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(SIFT, "tolerated")) %>%
distinct(Location, Allele) %>%
nrow
A_sift
[1] 3 out of 5057
A_poly <- A %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(PolyPhen, "benign")) %>%
distinct(Location, Allele) %>%
nrow
A_poly
[1] 3 out of 5057
Repeat for Sample B
Click here for output
B %>%
filter(BIOTYPE == "protein_coding") %>%
distinct(Location, Allele, .keep_all = TRUE) %>%
nrow
[1] 3790 out of 7267
# 2. CONSEQUENCE == missense_variant
B %>%
filter(BIOTYPE == "protein_coding" & Consequence == "missense_variant") %>%
distinct(Location, Allele) %>%
nrow
[1] 13 out of 7267
B_sift <- B %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(SIFT, "tolerated")) %>%
distinct(Location, Allele) %>%
nrow
B_sift
[1] 4 out of 7267
B_poly <- B %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(PolyPhen, "benign")) %>%
distinct(Location, Allele) %>%
nrow
B_poly
[1] 3 out of 7267
We can create a table to compare the results
knitr::kable(data.frame(A_sift,B_sift,A_poly,B_poly), caption="SIFT vs PolyPhen")
Table: SIFT vs PolyPhen
| A_sift| B_sift| A_poly| B_poly|
|------:|------:|------:|------:|
| 3| 4| 3| 3|
Now, let's look at the actual variants. In the code below, besides selecting SIFT
or PolyPhen
, add these column names:
- name of the gene (
SYMBOL
) - the aminoacid change (
Amino_acids
) and - the dbSNP identifier (
Existing_variation
)
under the distinct()
function
A_filtered <- A %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(ADD_SIFT_OR_PolyPhen, "ADD_tolerated_OR_benign")) %>%
distinct(Location, Allele, ADD_HERE_THE_COLUMN_NAMES)
knitr::kable(A_filtered, caption = "Filtered mutations from Sample A")
Click here for output
A_filtered_sift <- A %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(SIFT, "tolerated")) %>%
distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation)
A_filtered_sift
# A tibble: 3 × 5
Location Allele SYMBOL Amino_acids Existing_variation
<chr> <chr> <chr> <chr> <chr>
1 11:85725351-85725… C SYTL2 D/G -
2 11:85725825-85725… C SYTL2 A/G -
3 11:88309301-88309… C CTSC Y/C -
A_filtered_poly <- A %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(PolyPhen, "benign")) %>%
distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation)
A_filtered_poly
# A tibble: 3 × 5
Location Allele SYMBOL Amino_acids Existing_variation
<chr> <chr> <chr> <chr> <chr>
1 11:85725351-85725… C SYTL2 D/G -
2 11:86409331-86409… A CCDC81 I/K -
3 11:88309301-88309… C CTSC Y/C -
Repeat for Sample B
Click here for output
B_filtered_sift <- B %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(SIFT, "tolerated")) %>%
distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation)
B_filtered_sift
# A tibble: 4 × 5
Location Allele SYMBOL Amino_acids Existing_variation
<chr> <chr> <chr> <chr> <chr>
1 11:85725351-85725… C SYTL2 D/G -
2 11:85725825-85725… C SYTL2 A/G -
3 11:88309301-88309… C CTSC Y/C -
4 11:90041372-90041… A TRIM4… T/N -
B_filtered_poly <- B %>%
filter(BIOTYPE == "protein_coding" &
Consequence == "missense_variant" &
!str_detect(PolyPhen, "benign")) %>%
distinct(Location, Allele, SYMBOL, Amino_acids, Existing_variation)
B_filtered_poly
# A tibble: 3 × 5
Location Allele SYMBOL Amino_acids Existing_variation
<chr> <chr> <chr> <chr> <chr>
1 11:85725351-85725… C SYTL2 D/G -
2 11:88309301-88309… C CTSC Y/C -
3 11:90041372-90041… A TRIM4… T/N -
Since we have two samples from two patients that suffer the same disease, we should focus on those mutations that are shared by both individuals. First we need to merge each column into a single string, so we can compare across samples:
A_mutations <- A_filtered %>%
mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>%
pull(Mutation)
A_mutations
B_mutations <- B_filtered %>%
mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>%
pull(Mutation)
B_mutations
And then we can generate a venn diagram. In the code below, select two different colors for the venn diagram. Don't forget to add them within double quotes:
ggvenn(list(A=A_mutations, B=B_mutations), fill_color = c("ADD_COLOR1_HERE", "ADD_COLOR2_HERE"))
Click here for output
Filtered mutations based on SIFT
:
A_mutations_sift <- A_filtered_sift %>%
mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>%
pull(Mutation)
A_mutations_sift
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:85725825-85725825|C|SYTL2|A/G|-"
[3] "11:88309301-88309301|C|CTSC|Y/C|-"
B_mutations_sift <- B_filtered_sift %>%
mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>%
pull(Mutation)
B_mutations_sift
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:85725825-85725825|C|SYTL2|A/G|-"
[3] "11:88309301-88309301|C|CTSC|Y/C|-"
[4] "11:90041372-90041372|A|TRIM49C|T/N|-"
ggvenn(list(A=A_mutations_sift, B=B_mutations_sift), fill_color = c("red", "blue"))
Click here for output
Filtered mutations based on PolyPhen
:
A_mutations_poly <- A_filtered_poly %>%
mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>%
pull(Mutation)
A_mutations_poly
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:86409331-86409331|A|CCDC81|I/K|-"
[3] "11:88309301-88309301|C|CTSC|Y/C|-"
B_mutations_poly <- B_filtered_poly %>%
mutate(Mutation = paste(Location, Allele, SYMBOL, Amino_acids, Existing_variation, sep="|")) %>%
pull(Mutation)
B_mutations_poly
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:88309301-88309301|C|CTSC|Y/C|-"
[3] "11:90041372-90041372|A|TRIM49C|T/N|-"
ggvenn(list(A=A_mutations_poly, B=B_mutations_poly), fill_color = c("green", "purple"))
To get the final candidate mutations, we just extract the common variants:
# * Table/List of common mutations
intersect(A_mutations, B_mutations)
Click here for output
intersect(A_mutations_sift, B_mutations_sift)
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:85725825-85725825|C|SYTL2|A/G|-"
[3] "11:88309301-88309301|C|CTSC|Y/C|-"
intersect(A_mutations_poly, B_mutations_poly)
[1] "11:85725351-85725351|C|SYTL2|D/G|-"
[2] "11:88309301-88309301|C|CTSC|Y/C|-"
From this result, you should end up with 2 candidate genes when using PolyPhen
or SIFT
, respectively (CTSC
, SYTL2
).
The next step would be to read about these genes and try to find a connection to the disease, in our case, to PLS. Once we have justified some of those genes to be causative, then we would do some experimental validations to prove that tehse candidate variants are indeed responsible for PLS.
In our case, CTSC 88309301
is the mutation we are looking for! (hope you got it)
Well done! now you have the general workflow to identify disease causing mutations!
Developed by Marcela Dávila, Jari Martikainen and Björn Andersson, 2021. Updated by Marcela Dávila, 2023. Updated by Marcela Dávila, 2024