NGS IV: Exome - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Analysis-of-next-generation-sequencing-data (SC0024)
The purpose of this exercise for you to become familiar with the workflow used to analyze targeted sequencing data and identify disease causing mutations.
Our data involves exome sequencing data from an affected individual with Alpers syndrome (which is a homozygous progressive neurodegenerative disorder) and one control sample from a healthy individual. The data was taken from Sofou K, et al., "Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome". Mol Genet Genomic Med. 2015.
Sample | Description |
---|---|
AS_2 | Affected |
AS_ctrl | Unaffected |
-
Connect to the server using MobaXterm (PC users) or your local Terminal (MACS users):
ssh -Y your_account@remote_server
-
Load the following modules:
annovar/20200608 bcftools/1.17 bwa-mem2/2.2.1 delly/1.1.6 fastqc/0.12.1 gatk/4.4.0.0 igv/2.16.1 multiqc/1.14 R/4.3.1 samtools/1.17 picard/3.0.0 trimgalore/0.6.10
-
Create a directory called
Exome
-
Go to this directory
-
Create a soft link of the entire
/home/courses/NGS/Exome/Fastq
NOTE: if you need to remove a soft link, use:
unlink <link_name>
Remember that you can create bash scripts so your jobs use the resources of the cluster rather than only the login node. Although it is not compulsory for this practical, it would be a good practice. Besides if you submit a job and you get disconnected from the cluster (for whatever reason) the job will keep on running/queued!
Due to limit in time, we will be working with data from chr11 (so some results may look bad).
-
Run
FastQC
on your samples and evaluate the quality of the dataNOTE: Remember that we can use a
loop
to run a command on multiple samples. Here an example on how to index all bam files in a directory# running in the command line for i in *bam; do samtools index $i; done # running as a job for i in *bam; do qsub runSamtools.sh $i;done
-
Run
TrimGalore!
as you see fit and check that your approach improved your data quality. Hint: You can usemultiqc
to gather your results
Q. How does the quality look like, before and after trimming? How many reads were trimmed for both the control and the patient sample? Have a look under the
FastQC
html files.
-
Map the filtered reads towards
chr11.fa
usingbwa-mem2
:- Save the alignment as:
YOUR_SAMPLE.bwa.bam
- The reference genome (
chr11.fa
) is under the/home/courses/NGS/Exome/db
directory
- Save the alignment as:
NOTE: If you run bwa-mem2 mem
you will get help to know how to run the tool.
In case you need help
Usage: bwa-mem2 mem [options] idxbase in1.fq in2.fq
% bwa-mem2 mem /home/courses/NGS/Exome/db/chr11.fa Exome/AS_ctrl.chr11.R1.fastq Exome/AS_ctrl.chr11.R2.fastq | samtools view -Sb -o AS_ctrl.bwa.bam - &
-
Run
AddOrReplaceReadGroups
frompicard
:% java -jar /apps/bio/software/picard/3.0.0/picard.jar AddOrReplaceReadGroups -h
-
Use the following parameters (you can more if you want!):
-I -> YOUR_SAMPLE.bwa.bam -O -> YOUR_SAMPLE.bwa.sort.rg.bam -SM -> AS_2 or AS_ctrl -LB -> TruSeq -PL -> Illumina -PU -> CGG -SO -> coordinate --CREATE_INDEX -> true
In case you need help
% java -jar /apps/bio/software/picard/3.0.0/picard.jar AddOrReplaceReadGroups -I AS_ctrl.bwa.bam -O AS_ctrl.bwa.sort.rg.bam -SM AS_ctrl -LB TruSeq -PL Illumina, -PU CGG -SO coordinate --CREATE_INDEX true
- Generate the
flagstat
andidxstats
for each alignment file
Q. What is the mapping rate in the sample (% of mapped reads)? Is it really sequencing data from chr11?
Since the data was from a targeted approach, let's inspect the on-target coverage plot. GATK
's DepthOfCoverage
can help us in this task. The tool is labeled as being in BETA version however, this tool has been part of GATK since the beginning and works just fine.
- Run the following code for each alignment file:
gatk DepthOfCoverage \ -R /home/courses/NGS/Exome/db/chr11.fa \ -I YOUR_SAMPLE.bwa.sort.rg.bam \ -O YOUR_SAMPLE_NAME.metrics \ -L chr11.exon.bed \ --output-format TABLE \ --summary-coverage-threshold 10 \ --summary-coverage-threshold 30 \ --summary-coverage-threshold 50
Several files are generated:
- Have a look at
*.sample_summary
files
Q. What is the mean coverage per sample? Do you think it is enough sequencing?
- Inspect the
*.sample_interval_summary
files.- Each row shows the coverage of each targeted region (exons) at different depths (last three columns)
Q. For each sample, which exon (if any) is covered at least 300 times in average?
What percentage of the exons are fully covered at least 50 times?
-
The
*.sample_cumulative_coverage_proportions
files have two rows (really long rows!). Showing the first five columns (this values are for the entire dataset, not only chr21):gte_0 gte_1 gte_2 gte_3 AS_ctrl 1.00 0.96 0.94 0.93
- Each row is the coverage information per sample (only one in our case)
- Each column shows the aggregated proportion of the regions with >= of the corresponding coverage. In other words,:
-
gte_0 - 1.00
tells us that 100% of the regions are covered cero or more times; -
gte_1 - 0.96
tells us that 96% of the regions are covered one or more times; -
gte_2 - 0.94
tells us that 94% of the regions are covered two or more times; -
gte_3 - 0.93
tells us that 93% of the regions are covered three or more times;
-
To better understand this data, let's make a plot with this information.
-
First we transpose our file (changing columns to rows) and then, we convert it to percentage
Remember you can use your own code or directly use R, this is just one way of doing it:tail -1 SAMPLE_NAME.sample_cumulative_coverage_proportions | \ sed 's/\t/\n/g' | \ tail -n +2 | \ awk 'OFS="\t" {print $1*100,num++}' > SAMPLE_NAME.proportions
-
The first lines should look like:
100 0 29 1 15 2 10 3
-
-
Open R in the terminal by typing
R
-
Type in the following code
Remember to change the name of the file you want to use:# reading the data data <- read.table("SAMPLE_NAME.proportions") # setting name of figure png("SAMPLE_NAME.coverage.png") # scatter plot plot (data$V2,data$V1,type="l",log="x",ylim=c(0,100), ylab ="% On-target",xlab="Depth", main="On-target coverage SAMPLE_NAME") # adding lines at different coveraged abline(v=c(10,20,30,50),col="gray",lty=2) # adding lines at different % of on-target abline(h=c(50,70,80),col=c(51,"gray",51),lty=2) # adding X and % within the graph text(c(8,17,26,43),c(100,100,100,100),c("10x","20x","30x","50x"),col="gray") text(c(1,1,1),c(47,67,77),c("50%","70%","80%"),col=c(51,"gray",51)) # closing graphical device dev.off()
-
Type
q()
to exit theR
session
Q. Looking at the plots, do we need to be worried about under-sequencing?
When dealing with variant calling, an important step is to identify possible PCR duplicates, this to remove any bias when we are doing the actual calling.
- Run
MarkDuplicates
frompicard
using (at least) these arguments:-I -> YOUR_SAMPLE.bwa.sort.rg.bam -O -> YOUR_SAMPLE.bwa.sort.rg.mkdup.bam -M -> YOUR_SAMPLE.metrics -ASO -> coordinate --CREATE_INDEX -> TRUE
In case you need help
% java -jar /apps/bio/software/picard/3.0.0/picard.jar MarkDuplicates -M AS_ctrl.metrics -I AS_ctrl.bwa.sort.rg.bam -O AS_ctrl.bwa.sort.rg.mkdup.bam -ASO coordinate --CREATE_INDEX true
- Generate
flagstat
andidxstats
metrics for this new alignment file - Run
multiqc
to include these statistics
Q. How many duplicates do the samples have? Which sample has more sequencing reads?
This pre-processing step detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call. Most of the short variant calling tools rely on the base quality score, so it is important to correct these systematic technical errors.
First, we will produce a recalibration file from our input data using the BaseRecalibrator
tool. To speed up the processing time, let's generate a file containing only SNPs from chr11
. (You can use your own code if you prefer):
# keeping the header of the VCF file
grep ^# /home/courses/NGS/Exome/db/dbsnp_138.b37.vcf | \
sed 's/ID=11/ID=chr11/' | \
grep -P "contig=<ID=[1-9,A-Z]" -v | \
grep -P "<ID=GL" -v > title
# extracting SNPs in chr11, formatting to add "chr", removing unwanted rows
grep ^11 -w /home/courses/NGS/Exome/db/dbsnp_138.b37.vcf | \
#awk '$2 < 57311868 {print $0}' | \
sed 's/^11/chr11/' > 11
# merging title and data
cat title 11 > dbsnp_138.b37.chr11.vcf
- Index the file using
gatk IndexFeatureFile
and:-I -> dbsnp_138.b37.chr11.vcf
In case you need help
% gatk IndexFeatureFile -I dbsnp_138.b37.chr11.vcf
-
Run the
gatk BaseRecalibrator
for both samples, with :-R -> chr11_reference_file -I -> YOUR_SAMPLE.bwa.sort.rg.mkdup.bam -O -> YOUR_SAMPLE.table --known-sites -> THE_VCF_FILE_YOU_JUST_CREATED
In case you need help
% gatk BaseRecalibrator -R /home/courses/NGS/Exome/db/chr11.fa -I AS_ctrl.bwa.sort.rg.mkdup.bam -O AS_ctrl.table --known-sites dbsnp_138.b37.chr11.vcf
Q. What are the
--known_sites
?
Now let's correct the scores
-
Run
gatk ApplyBQSR
for both alignments, with:-R -> chr11_reference_file -I -> YOUR_SAMPLE.bwa.sort.rg.mkdup.bam -O -> YOUR_SAMPLE.bwa.sort.rg.mkdup.recal.bam --bqsr-recal-file -> THE_TABLE_FILE_YOU_JUST_CREATED
In case you need help
% gatk ApplyBQSR -R /home/courses/NGS/Exome/db/chr11.fa -I AS_ctrl.bwa.sort.rg.mkdup.bam -O AS_ctrl.bwa.sort.rg.mkdup.recal.bam --bqsr-recal-file AS_ctrl.table
The Genome Analysis Toolkit or GATK is a software package developed to analyze
next-generation resequencing data, focusing on variant discovery and genotyping.
For the variant calling we will use the HaplotypeCaller
, which is an SNP/indel
caller that uses a Bayesian genotype likelihood model to estimate simultaneously
the most likely genotypes and allele frequency in a population of N samples.
-
Run
gatk HaplotypeCaller
to do the variant calling, with these parameters:-R -> chr11_reference_file -I -> YOUR_SAMPLE.bwa.sort.rg.mkdup.recal.bam -O -> YOUR_SAMPLE.gatk.raw.vcf
In case you need help
% gatk HaplotypeCaller-R /home/courses/NGS/Exome/db/chr11.fa -I AS_ctrl.bwa.sort.rg.mkdup.recal.bam -O AS_ctrl.gatk.raw.vcf
- Inspect the VCF files you just created and get familiar with their information.
Q. How do you differentiate between a homozygous mutations from an heterozygous?
How many SNPs do you have in each sample?
Are there any insertions/deletions?
Let's convert the VCF files into a tab delimited files.
-
Use
gatk VariantsToTable
with:-O -> YOUR_SAMPLE.gatk.raw.table -V -> YOUR_SAMPLE.gatk.raw.vcf -F -> CHROM -F -> POS -F -> ID -F -> REF -F -> ALT -F -> QUAL -F -> FILTER -GF -> AC -GF -> GT -GF -> AD -GF -> DP
In case you need help
% gatk VariantsToTable -O AS_ctrl.gatk.raw.table -V AS_ctrl.gatk.raw.vcf -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F FILTER -GF AC -GF GT -GF AD -GF DP
Have a look and understand the output.
SNP-callers emit a number of false-positive mutations, as any prediction program. Some of these false-positives can be detected and rejected by various statistical tests. Filtering or flagging the variant calls in the VCF file is a useful step to detect such false-positives.
-
Run
gatk VariantFiltration
with:-O -> YOUR_SAMPLE.gatk.raw.flag.vcf -V -> YOUR_SAMPLE.gatk.raw.vcf --filter-name -> "LowDP" --filter-expression -> "DP < 10.0"
In case you need help
% gatk VariantFiltration -O AS_ctrl.gatk.raw.flag.vcf -V AS_ctrl.gatk.raw.vcf --filter-name "LowDP" --filter-expression "DP < 10.0"
These are hard filters that are applied when a variant callset is too small or when training sets are not available. Try to understand the filtering done above. You may find some info at the beginning of the VCF file.
Q. How many variants have passed our filtering criteria for both files?
Now we need to annotate our variants in terms of genomic context, aminoacid change and effect, previous mutation knowledge, etc. so we can evaluate and filter mutations that are irrelevant.
ANNOVAR is a tool that annotates genetic variants. The overall procedure is to create an input file from the VCF file and then annotate the variant positions with information from several databases.
-
Reformat both alignment files with
convert2annovar.pl
using:-format -> vcf4 -outfile -> YOUR_FILE.gatk.raw.flag.annovar YOUR_FILE.gatk.raw.flag.vcf
In case you need help
% convert2annovar.pl -format vcf4 -outfile AS_ctrl.gatk.raw.flag.annovar AS_ctrl.gatk.raw.flag.vcf
-
Add the annotation to the alignments using
table_annovar.pl
:-buildver -> hg19 -protocol -> dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316 -operation -> f,g,f,f,f -outfile -> YOUR_FILE.gatk.raw.flag.annovar.table --nastring -> - --remove YOUR_FILE.gatk.raw.flag.annovar /home/courses/NGS/Exome/db/humandb
In case you need help
% table_annovar.pl -buildver hg19 -protocol dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316 -operation f,g,f,f,f -outfile AS_ctrl.gatk.raw.flag.annovar.table --nastring - --remove AS_ctrl.gatk.raw.flag.annovar /home/courses/NGS/Exome/db/humandb
Q. What kind of information are we adding to our mutations? Have a look at the file with
.hg19_multianno.txt
extension. Here you can find some information.
- Merge the
*hg19_multianno.txt
file with the*gatk.raw.flag.annovar
, and save it in a file with extension.tsv
. Note that the*gatk.raw.flag.annovar
does not have column names, add them before you merge them. Hint: you can usepaste
In case you need help
% # creating a header after inspecting AS_ctrl.gatk.raw.flag.annovar
% echo -e "CHR\tSTART\tEND\tREF\tALT\tGT\tQUAL\tDP" > header
% # adding the header to the file
% cat header AS_ctrl.gatk.raw.flag.annovar > AS_ctrl.gatk.raw.flag.annovar.fix
% # merging both files
% paste AS_ctrl.gatk.raw.flag.annovar.fix AS_ctrl.gatk.raw.flag.annovar.table.hg19_multianno.txt > AS_ctrl
This file now has statistics and annotations for the mutations. It is a lot of data, however we are interested in:
-
Func.refGene
: where the variant is located (intron, exon, etc) -
ExonFunc.refGene
: what type of variant is it (missense, nonsense, etc) -
AAChange.refGene
: nucleotide and aminoacid change -
LJB2_PolyPhen2_HVAR
: how damaging is the change the variant produces
Other columns may be self explanatory
Reading the AAChange.refGene column
PGM1:NM_001172818:exon2:c.A316G:p.I106V,PGM1:NM_002633:exon2:c.A262G:p.I88V
PGM1 -> Gene name
NM_001172818 => Transcript ID_1
exon2 => exon where the mutation is found
c.A316G => in the DNA, at position 316 the reference base A is changed to an G
p.I106V => in the protein, at positon 106 the reference aminoacid I is changed to a V
PGM1 => Gene name
NM_002633 => Transcript ID_2
exon2 => exon where the mutation is found
A262G => in the DNA, at position 262 the reference base A is changed to a G
p.I88V => in the protein, at position 88 the reference aminoacid I is changed to a V
With all this information, it is possible to narrow down the list of
mutations to a (hopefully) handful of SNPs. For this let us use R
.
In most of the cases, you should copy/paste
the code. There will be some instances where you will be asked to decide on a value/threshold/color and this will be highlighted in CAPS
within the code.
You are not expected to understand all the code, however you are more than welcome to ask!. Learning R takes a lot of practice!
Start R
in your local computer.
Select your working directory (where you saved the annotated files):
Session -> Set Working Directory -> Choose directory
Let's load the packages we will be using:
# Install them first if you don't have them already
install.packages("tidyverse") # this is done only once
install.packages("ggvenn") # this is done only once
# Load required packages
library(tidyverse) # data manipulation
library(ggvenn) # Venn diagram
Now we need to read the data in R. In the code below, replace the text with your TSV file names. Make sure the name is within double quotes:
# Read in data
A <- read_table("WRITE_THE_NAME_OF_YOUR_AS_2_FILE") # affected
U <- read_table("WRITE_THE_NAME_OF_YOUR_AS_ctrl_FILE") #unaffected or control
It is important to check at every step if the data we have make sense, so let's display the data:
# display variable
A
Click here for output
# A tibble: 9,900 × 40
CHR START END REF ALT GT QUAL DP Chr Start End Ref Alt dbsnp_137_chr11_gatk Func.refGene Gene.refGene GeneDetail.refGene
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 chr11 123778 123778 T C het 86.6 14 chr11 123778 123778 T C - intergenic NONE;LINC01001 dist=NONE;dist=3356
2 chr11 124986 124986 G A het 59.6 4 chr11 124986 124986 G A - intergenic NONE;LINC01001 dist=NONE;dist=2148
3 chr11 125918 125918 G A het 43.6 10 chr11 125918 125918 G A - intergenic NONE;LINC01001 dist=NONE;dist=1216
4 chr11 176397 176397 A G hom 70.3 2 chr11 176397 176397 A G rs61871102-129 intergenic LINC01001;SCGB1C2 dist=37280;dist=16683
5 chr11 187658 187658 A G hom 37.3 2 chr11 187658 187658 A G rs2686910-100 intergenic LINC01001;SCGB1C2 dist=48541;dist=5422
6 chr11 189683 189683 A G hom 70.3 2 chr11 189683 189683 A G rs78392844-131 intergenic LINC01001;SCGB1C2 dist=50566;dist=3397
7 chr11 190650 190650 A G hom 35.5 1 chr11 190650 190650 A G rs75887666-131 intergenic LINC01001;SCGB1C2 dist=51533;dist=2430
8 chr11 190688 190688 T G hom 52.3 2 chr11 190688 190688 T G rs76699989-131 intergenic LINC01001;SCGB1C2 dist=51571;dist=2392
9 chr11 190761 190761 A G hom 34.5 1 chr11 190761 190761 A G rs2739234-100 intergenic LINC01001;SCGB1C2 dist=51644;dist=2319
10 chr11 192997 192997 G A het 38.6 12 chr11 192997 192997 G A rs2280540-100 upstream SCGB1C1;SCGB1C2 dist=83
# ℹ 9,890 more rows
# ℹ 23 more variables: ExonicFunc.refGene <chr>, AAChange.refGene <chr>, LJB2_SIFT <chr>, LJB2_PolyPhen2_HDIV <chr>, LJB2_PP2_HDIV_Pred <chr>,
# LJB2_PolyPhen2_HVAR <chr>, LJB2_PolyPhen2_HVAR_Pred <chr>, LJB2_LRT <chr>, LJB2_LRT_Pred <chr>, LJB2_MutationTaster <chr>, LJB2_MutationTaster_Pred <chr>,
# LJB_MutationAssessor <chr>, LJB_MutationAssessor_Pred <chr>, LJB2_FATHMM <chr>, `LJB2_GERP++` <chr>, LJB2_PhyloP <chr>, LJB2_SiPhy <chr>,
# `1000g2012apr_all` <chr>, CLNALLELEID <chr>, CLNDN <chr>, CLNDISDB <chr>, CLNREVSTAT <chr>, CLNSIG <chr>
# ℹ Use `print(n = ...)` to see more rows
# display variable
U
Click here for output
# A tibble: 13,461 × 40
CHR START END REF ALT GT QUAL DP Chr Start End Ref Alt dbsnp_137_chr11_gatk Func.refGene Gene.refGene GeneDetail.refGene
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 chr11 123756 123756 G T het 40.6 14 chr11 123756 123756 G T - intergenic NONE;LINC01001 dist=NONE;dist=3378
2 chr11 124288 124288 A T het 106. 11 chr11 124288 124288 A T - intergenic NONE;LINC01001 dist=NONE;dist=2846
3 chr11 124342 124342 C T het 46.6 9 chr11 124342 124342 C T - intergenic NONE;LINC01001 dist=NONE;dist=2792
4 chr11 135531 135531 A T hom 37.3 2 chr11 135531 135531 A T - ncRNA_intronic LINC01001 -
5 chr11 176397 176397 A G hom 112. 4 chr11 176397 176397 A G rs61871102-129 intergenic LINC01001;SCGB1C2 dist=37280;dist=166…
6 chr11 190488 190488 T C hom 79.8 3 chr11 190488 190488 T C rs2354751-100 intergenic LINC01001;SCGB1C2 dist=51371;dist=2592
7 chr11 193096 193096 T C hom 955. 28 chr11 193096 193096 T C rs2280539-100 UTR5 SCGB1C1;SCGB1C2 NM_145651:c.-4T>C;N…
8 chr11 193112 193112 C T het 526. 35 chr11 193112 193112 C T rs7951297-116 exonic SCGB1C1;SCGB1C2 -
9 chr11 193194 193194 T C het 268. 23 chr11 193194 193194 T C rs2686896-100 intronic SCGB1C1;SCGB1C2 -
10 chr11 193863 193863 T C het 341. 19 chr11 193863 193863 T C rs2294081-100 exonic SCGB1C1;SCGB1C2 -
# ℹ 13,451 more rows
# ℹ 23 more variables: ExonicFunc.refGene <chr>, AAChange.refGene <chr>, LJB2_SIFT <chr>, LJB2_PolyPhen2_HDIV <chr>, LJB2_PP2_HDIV_Pred <chr>,
# LJB2_PolyPhen2_HVAR <chr>, LJB2_PolyPhen2_HVAR_Pred <chr>, LJB2_LRT <chr>, LJB2_LRT_Pred <chr>, LJB2_MutationTaster <chr>, LJB2_MutationTaster_Pred <chr>,
# LJB_MutationAssessor <chr>, LJB_MutationAssessor_Pred <chr>, LJB2_FATHMM <chr>, `LJB2_GERP++` <chr>, LJB2_PhyloP <chr>, LJB2_SiPhy <chr>,
# `1000g2012apr_all` <chr>, CLNALLELEID <chr>, CLNDN <chr>, CLNDISDB <chr>, CLNREVSTAT <chr>, CLNSIG <chr>
# ℹ Use `print(n = ...)` to see more rows
It is important to have an overview of the entire dataset, so summary statistics and graphs are quite helpful to scan our data. We will look at:
- Number of mutations per sample
- Distribution of mutations based on location (barplot)
- Distribution of mutations based on type of variants (pie chart)
Let's see how many mutations we have:
# Number of mutations based on location
A %>%
nrow
Click here for output
[1] 9900
- Repeat for sample
U
Now let's visualize where our variants:
# Summarizing the variant positions
A %>%
select(Func.refGene) %>%
mutate(Sample = "A") %>%
group_by(Func.refGene) %>%
summarise(N_variants=n())
Click here for output
# A tibble: 13 × 2
Func.refGene N_variants
<chr> <int>
1 UTR3 333
2 UTR5 166
3 downstream 170
4 exonic 1987
5 exonic;splicing 9
6 intergenic 1857
7 intronic 4772
8 ncRNA_exonic 121
9 ncRNA_intronic 256
10 ncRNA_splicing 1
11 splicing 15
12 upstream 204
13 upstream;downstream 9
Repeat for sample U
.
Now let's gather the data and create a barplot:
# selecting the data
A_func <- A %>%
select(Func.refGene) %>%
mutate(Sample = "A")
U_func <- U %>%
select(Func.refGene) %>%
mutate(Sample = "U")
# creating the barplot
bind_rows(A_func, U_func) %>%
ggplot(aes(x=Func.refGene, fill=Sample)) +
geom_bar(stat="count", position="dodge") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust=1))
Q. Where do we have more variants?
We can also summarize the type of variants that we have:
A_exonicFunc <- as.data.frame(table(A %>%
select(ExonicFunc.refGene) %>%
mutate(Sample = "A")))
ggplot(A_exonicFunc, aes(x="", y=Freq, fill=ExonicFunc.refGene)) +
geom_bar(stat="identity") +
coord_polar("y",start=0) +
theme_void() +
labs(title="AS_2")
Repeat for sample U
.
Now that we have an overall look of our data, it is time to filter the mutations. A common strategy is to focus on mutations that:
- are within exons (or
protein_coding
) - are annotated as a
missense_variant
- are classified other than
tolerated/benign
# Counting our filtered variants
A %>%
filter(Func.refGene == "exonic" &
ExonicFunc.refGene == "nonsynonymous" &
str_detect(LJB2_PolyPhen2_HVAR, "D") & # D = damaging
str_detect(LJB2_LRT, "D")) %>% # D = damaging
nrow
# Variable containing relevant information
A_mutation <- A %>%
filter(Func.refGene == "exonic" &
ExonicFunc.refGene == "nonsynonymous" &
str_detect(LJB2_PolyPhen2_HVAR, "D") & # D = damaging
str_detect(LJB2_LRT, "D") &
str_detect(LJB2_MutationTaster, "D")) %>%
mutate(Mutation=paste(CHR, START, END, REF, ALT, LJB2_SIFT, sep="|")) %>%
pull(Mutation)
Repeat for Sample U
.
Now we can compare our samples and create a VennDiagram:
ggvenn(list(AS_2=A_mutation, AS_ctrl=U_mutation), fill_color = c("ADD_A_COLOR", "ADD_ANOTHER_COLOR"))
Depending on your strategy you can use any of the following expressions to identify probable disease causing mutations:
- intersect(x, y) finds all rows in both x and y.
- union(x, y) finds all rows in either x or y, excluding duplicates.
- union_all(x, y) finds all rows in either x or y, including duplicates.
- setdiff(x, y) finds all rows in x that aren't in y.
Q. Which are your candidate genes? What was your filtering strategy? Which gene(s) is likely to be the disease causing mutation?
Once you have your list, visual inspection of your candidates helps
in assessing them further. Have a look at your candidates in IGV
,
you may want to load both of your VCF files as well as the BAM files.
When you have multiple files and positions to investigate in IGV you can save time by creating a bat script
. The following script take snapshots of the coordinate chr11:1017135 across AS_2 and Control using both their bams and mutation files.
-
Copy this script to a file called
bashIGV.bat
, make sure that the paths to thebam
andvcf
files are correct.new genome hg19 load PATH_TO_FILE/AS_2.bwa.sort.fix.mkdup.recal.bam load PATH_TO_FILE/AS_2.gatk.raw.flag.vcf load PATH_TO_FILE/AS_ctrl.bwa.sort.fix.mkdup.recal.bam load PATH_TO_FILE/AS_ctrl.gatk.raw.flag.vcf snapshotDirectory . squish goto chr11:1017135 snapshot 1017135_Exomedata.png exit
-
Run the script with the following command (don't forget to make it executable!):
igv.sh -b bashIGV.bat
-
Examine the resulting png
1017135_Exomedata.png
-
Change the script to zoom into other mutations of interest, don't forget to change the output name of the png file to the corresponding position.
Q. How are the mutations distributed? Are they all in exons?
How do you differentiate between heterozygous and homozygous mutations?
Some mutations are slightly faded, what does this mean?
Q. Could you filter out some of your candidates with your visual inspection? Which are you final candidate genes?
Now let's analyse WGS data to see if we can identify any structural variants. When analyzing tumor samples you normally have two datasets; one normal (germline) sample and one tumor sample (somatic). To save time we will only work with two tumor samples from chromosome 17, these are breast cancer cell lines: HCC1143 and HCC1954.
- Create a soft link to the entire folder
/home/courses/NGS/Exome/SV
Before running any variant calling, we need to mark duplicates.
-
Run
MarkDuplicates
for each alignment file, with the following parameters:-I -> YOUR_SAMPLE.tumor.bam -O -> YOUR_SAMPLE.tumor.mkdup.bam -M -> YOUR_SAMPLE.tumor.metrics --CREATE_INDEX -> true
-
Run the variant calling with
delly call
to find any structural variants:delly call \ -g Homo_sapiens_assembly19.fasta_WITHIN_/db -o ALL.tumor.bcf list_of_alignments_separated_by_space
The default output is in bcf
format.
-
Convert the file to
vcf
format usingbcftools
:bcftools view ALL.tumor.bcf > ALL.tumor.vcf
Take a look at the vcf file.
Q. How many structural variant did DELLY find?
- Filter your
vcf
file, keeping only PASS quality variants.
Q. How many variants are left after filtering for quality PASS?
- Remove all variants annotates as
IMPRECISE
Q. How many variant are left after filtering based on PRECISE. Which variants are unique to sample HCC1143? and for HCC1954?
From the filtered vcf file, choose two regions (one with a deletion (DEL) and one with an inversion (INV)) to visualize in IGV
.
- Load the bam and vcf files in
IGV
, and zoom in to the regions you decided to look at.
Q. Visually, do you agree with DELLY? Is the variant classified correct as DEL and INV? And what about the zygousity, is the variant correctly classified as homozygous or heterozygous (look at GT field in vcf)?
Developed by Marcela Dávila, 2017. Modified by Marcela Davila, Sanna Abrahamsson and Vanja Börjesson, 2021. Updated by Marcela Dávila, 2023. Updated by Marcela Dávila, 2024.