NGS IV: Exome - bcfgothenburg/HT23 GitHub Wiki
Course: HT23 Analysis-of-next-generation-sequencing-data (SC00204)
The purpose of this exercise is for you to become familiar with the workflow used to analyze targeted sequencing data and to identify disease-causing mutations.

Our data consists of exome sequencing data from an individual affected with Alpers syndrome (a progressive neurodegenerative disorder) and one control sample from a healthy individual. The data was taken from Sofou K, et al., "Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome". Mol Genet Genomic Med. 2015.
| Sample | Description |
|---|---|
| AS_2 | Affected |
| AS_ctrl | Unaffected |
- Connect to the server using MobaXterm (PC users) or your local Terminal (Mac users):

```
ssh -Y your_account@remote_server
```
- Load the following modules:

```
annovar/20200608 bcftools/1.17 bwa-mem2/2.2.1 delly/1.1.6 fastqc/0.12.1 gatk/4.4.0.0 igv/2.16.1 multiqc/1.14 R/4.3.1 samtools/1.17 picard/3.0.0 trimgalore/0.6.10
```
- Create a directory called `Exome`
- Go to this directory
- Create a soft link of the entire `/home/ftp_courses/NGS/Exome/Fastq`
- Create another soft link of the entire `/home/ftp_courses/NGS/Exome/db`

NOTE: if you need to remove a soft link, use: `unlink <link_name>`
Remember that you can create bash scripts so your jobs use the resources of the cluster rather than only the login node. Although it is not compulsory for this practical, it is good practice. Besides, if you submit a job and get disconnected from the cluster (for whatever reason), the job will keep running or stay queued!
Due to time constraints, we will be working only with data from chr11 (so some results may look poor).
- Run `FastQC` on your samples and evaluate the quality of the data

NOTE: Remember that we can use a `loop` to run a command on multiple samples. Here is an example of how to index all bam files in a directory:

```
# running in the command line
for i in *bam; do samtools index $i; done
# running as a job
for i in *bam; do qsub runSamtools.sh $i; done
```
- Run `TrimGalore!` as you see fit and check that your approach improved your data quality. Hint: you can use `multiqc` to gather your results.

Q. How does the quality look before and after trimming? How many reads were trimmed in the control and in the patient sample? Have a look under the `FastQC` folder for the raw fastq files, and under the `TrimGalore` folder for the filtered fastq files.
- Map the filtered reads towards `chr11.fa` using `bwa-mem2`
  - Save the alignment as: `YOUR_SAMPLE.bwa.bam`
  - The reference genome is under the `db` directory
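As a sketch, the mapping step could look like the following (the trimmed fastq file names are assumptions; use the names TrimGalore actually produced):

```shell
# index the reference once (skip if db/ already ships the bwa-mem2 index)
bwa-mem2 index db/chr11.fa

# map the trimmed reads and convert SAM to BAM on the fly
# (AS_2_R1_val_1.fq.gz / AS_2_R2_val_2.fq.gz are assumed file names)
bwa-mem2 mem db/chr11.fa AS_2_R1_val_1.fq.gz AS_2_R2_val_2.fq.gz | \
    samtools view -b -o AS_2.bwa.bam -
```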
- Run `AddOrReplaceReadGroups` from `picard` using the following parameters (you can add more if you want!):

```
-I  -> YOUR_SAMPLE.bwa.bam
-O  -> YOUR_SAMPLE.rg.sort.bwa.bam
-SM -> AS_2 or AS_ctrl
-LB -> TruSeq
-PL -> Illumina
-PU -> CGG
-SO -> coordinate
--CREATE_INDEX -> true
```
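Assembled into an actual command, the parameter list above might look like this (a sketch; the `gatk` wrapper is assumed, and `picard AddOrReplaceReadGroups` with the same flags should behave the same):

```shell
# one sample shown; repeat with -SM AS_ctrl for the control
gatk AddOrReplaceReadGroups \
    -I AS_2.bwa.bam \
    -O AS_2.rg.sort.bwa.bam \
    -SM AS_2 \
    -LB TruSeq \
    -PL Illumina \
    -PU CGG \
    -SO coordinate \
    --CREATE_INDEX true
```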
- Generate the `flagstat` and `idxstats` for each alignment file
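For instance, with `samtools` (a sketch; file names assumed from the read-group step):

```shell
# write one .flagstat and one .idxstats file per alignment
for i in *.rg.sort.bwa.bam; do
    samtools flagstat $i > ${i%.bam}.flagstat
    samtools idxstats $i > ${i%.bam}.idxstats
done
```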
Q. What is the mapping rate in the sample (% of mapped reads)? Is it really sequencing data from chr11?
Since the data comes from a targeted approach, let's inspect the on-target coverage. GATK's DepthOfCoverage can help us with this task. The tool is labeled as being in BETA; however, it has been part of GATK since the beginning and works just fine.
- Run the following code for each alignment file:

```
gatk DepthOfCoverage \
    -R db/chr11.fa \
    -I YOUR_SAMPLE.rg.sort.bwa.bam \
    -O YOUR_SAMPLE_NAME.metrics \
    -L chr11.exon.bed \
    --output-format TABLE \
    --summary-coverage-threshold 10 \
    --summary-coverage-threshold 30 \
    --summary-coverage-threshold 50
```
Several files are generated:
- Have a look at the `*.sample_summary` files
Q. What is the mean coverage per sample? Do you think it is enough sequencing?
- Inspect the `*.sample_interval_summary` files.
  - Each row shows the coverage of each targeted region (exons) at different depths (last three columns)
Q. For each sample, which exon (if any) is covered at least 300 times on average?
What percentage of the exons is fully covered at least 50 times?
- The `*.sample_cumulative_coverage_proportions` files have two rows (really long rows!). Showing the first five columns (these values are for the entire dataset, not only chr11):

```
        gte_0  gte_1  gte_2  gte_3
AS_ctrl  1.00   0.96   0.94   0.93
```

- Each row is the coverage information per sample (only one in our case)
- Each column shows the aggregated proportion of the regions with >= the corresponding coverage. In other words:
  - `gte_0 - 1.00` tells us that 100% of the regions are covered zero or more times;
  - `gte_1 - 0.96` tells us that 96% of the regions are covered one or more times;
  - `gte_2 - 0.94` tells us that 94% of the regions are covered two or more times;
  - `gte_3 - 0.93` tells us that 93% of the regions are covered three or more times.
- To better understand this data, let's make a plot with this information.
- First we transpose our file (changing columns to rows) and then we convert it to percentages.

Remember you can use your own code or directly use R; this is just one way of doing it:

```
tail -1 SAMPLE_NAME.sample_cumulative_coverage_proportions | \
sed 's/\t/\n/g' | \
tail -n +2 | \
awk 'OFS="\t" {print $1*100,num++}' > SAMPLE_NAME.proportions
```
- The first lines should look like:

```
100	0
29	1
15	2
10	3
```
- Open R in the terminal by typing `R`
- Type in the following code. Remember to change the name of the file you want to use:

```
# reading the data
data <- read.table("SAMPLE_NAME.proportions")
# setting name of figure
png("SAMPLE_NAME.coverage.png")
# scatter plot
plot(data$V2, data$V1, type="l", log="x", ylim=c(0,100),
     ylab="% On-target", xlab="Depth",
     main="On-target coverage SAMPLE_NAME")
# adding lines at different coverages
abline(v=c(10,20,30,50), col="gray", lty=2)
# adding lines at different % of on-target
abline(h=c(50,70,80), col=c(51,"gray",51), lty=2)
# adding X and % within the graph
text(c(8,17,26,43), c(100,100,100,100), c("10x","20x","30x","50x"), col="gray")
text(c(1,1,1), c(47,67,77), c("50%","70%","80%"), col=c(51,"gray",51))
# closing graphical device
dev.off()
```
- Type `q()` to exit the `R` session
Q. Looking at the plots, do we need to be worried about under-sequencing?
When dealing with variant calling, an important step is to identify possible PCR duplicates, to remove any bias before we do the actual calling.
- Run `MarkDuplicates` from `picard` using (at least) these arguments:

```
-I -> YOUR_SAMPLE.bwa.sort.fix.bam
-O -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam
-M -> YOUR_SAMPLE.metrics
--CREATE_INDEX -> TRUE
```
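Put together, the command might look like this sketch (one sample shown; whether you invoke it via `gatk` or `picard` depends on the loaded module):

```shell
# mark (not remove) PCR duplicates and write a metrics report
gatk MarkDuplicates \
    -I AS_2.bwa.sort.fix.bam \
    -O AS_2.bwa.sort.fix.mkdup.bam \
    -M AS_2.metrics \
    --CREATE_INDEX true
```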
- Generate `flagstat` and `idxstats` metrics for this new alignment file
- Run `multiqc` to include these statistics
Q. How many duplicates do the samples have? Which sample has more sequencing reads?
This pre-processing step detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call. Most of the short variant calling tools rely on the base quality score, so it is important to correct these systematic technical errors.
First, we will produce a recalibration file from our input data using the BaseRecalibrator tool. To speed up the processing time, let's generate a file containing only SNPs from chr11. (You can use your own code if you prefer):
```
# keeping the header of the VCF file
grep ^# db/dbsnp_138.b37.vcf | \
sed 's/ID=11/ID=chr11/' | \
grep -P "contig=<ID=[1-9,A-Z]" -v | \
grep -P "<ID=GL" -v | \
sed 's/135006516/57311868/' > title
# extracting SNPs in chr11
grep ^11 -w db/dbsnp_138.b37.vcf | \
awk '$2 < 57311868 {print $0}' | \
sed 's/^11/chr11/' > 11
# merging title and data, formatting to add "chr", removing unwanted rows
cat title 11 > dbsnp_138.b37.chr11.vcf
```
- Index the file using `gatk IndexFeatureFile` with: `-I -> dbsnp_138.b37.chr11.vcf`
- Run `gatk BaseRecalibrator` for both samples, with:

```
-R -> chr11_reference_file
-I -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam
-O -> YOUR_SAMPLE.table
--known-sites -> THE_VCF_FILE_YOU_JUST_CREATED
```
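The two steps above could be sketched like this (one sample shown; the reference path is assumed to be `db/chr11.fa` as in the DepthOfCoverage step):

```shell
# index the chr11-only dbSNP file created above
gatk IndexFeatureFile -I dbsnp_138.b37.chr11.vcf

# build the recalibration table for one sample
gatk BaseRecalibrator \
    -R db/chr11.fa \
    -I AS_2.bwa.sort.fix.mkdup.bam \
    -O AS_2.table \
    --known-sites dbsnp_138.b37.chr11.vcf
```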
Q. What are the `--known-sites`?
Now let's correct the scores of the sample.
- Run `gatk ApplyBQSR` for both alignments, with:

```
-R -> chr11_reference_file
-I -> YOUR_SAMPLE.bwa.sort.fix.mkdup.bam
-O -> YOUR_SAMPLE.sort.fix.mkdup.recal.bam
--bqsr-recal-file -> THE_TABLE_FILE_YOU_JUST_CREATED
```
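As a one-sample sketch (reference path assumed as before):

```shell
# apply the recalibration table to correct the base quality scores
gatk ApplyBQSR \
    -R db/chr11.fa \
    -I AS_2.bwa.sort.fix.mkdup.bam \
    -O AS_2.sort.fix.mkdup.recal.bam \
    --bqsr-recal-file AS_2.table
```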
The Genome Analysis Toolkit or GATK is a software package developed to analyze
next-generation resequencing data, focusing on variant discovery and genotyping.
For the variant calling we will use the HaplotypeCaller, which is an SNP/indel
caller that uses a Bayesian genotype likelihood model to estimate simultaneously
the most likely genotypes and allele frequency in a population of N samples.
- Run `gatk HaplotypeCaller` to do the variant calling, with these parameters:

```
-R -> chr11_reference_file
-I -> YOUR_SAMPLE.bwa.sort.fix.mkdup.recal.bam
-O -> YOUR_SAMPLE.gatk.raw.vcf
```
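A one-sample sketch (reference path assumed as before):

```shell
# call SNPs and indels on the recalibrated alignment
gatk HaplotypeCaller \
    -R db/chr11.fa \
    -I AS_2.bwa.sort.fix.mkdup.recal.bam \
    -O AS_2.gatk.raw.vcf
```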
- Inspect the VCF files you just created and get familiar with their information.

Q. How do you differentiate a homozygous mutation from a heterozygous one?
How many SNPs do you have in each sample?
Are there any insertions/deletions?
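One quick, assumption-laden way to count SNPs versus indels is to compare the lengths of the REF and ALT columns. This demo runs on a made-up mini VCF just to show the mechanics; on your real file, substitute its name (and note that multi-allelic sites would need extra care):

```shell
# mini VCF standing in for YOUR_SAMPLE.gatk.raw.vcf (made up for the demo)
printf '##fileformat=VCFv4.2\n' > demo_calls.vcf
printf 'chr11\t100\t.\tA\tG\nchr11\t200\t.\tAT\tA\nchr11\t300\t.\tC\tT\n' >> demo_calls.vcf
# SNPs: REF and ALT are both a single base
grep -v '^#' demo_calls.vcf | awk 'length($4)==1 && length($5)==1' | wc -l   # 2 in this demo
# indels: REF or ALT longer than one base
grep -v '^#' demo_calls.vcf | awk 'length($4)>1 || length($5)>1' | wc -l     # 1 in this demo
```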
Let's convert the VCF files into tab-delimited files.
- Use `gatk VariantsToTable` with:

```
-O -> YOUR_SAMPLE.gatk.raw.table
-V -> YOUR_SAMPLE.gatk.raw.vcf
-F -> CHROM
-F -> POS
-F -> ID
-F -> REF
-F -> ALT
-F -> QUAL
-F -> FILTER
-GF -> AC
-GF -> GT
-GF -> AD
-GF -> DP
```
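The repeated `-F`/`-GF` flags can all go on one command line; a one-sample sketch:

```shell
# export site-level (-F) and genotype-level (-GF) fields to a table
gatk VariantsToTable \
    -V AS_2.gatk.raw.vcf \
    -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F FILTER \
    -GF AC -GF GT -GF AD -GF DP \
    -O AS_2.gatk.raw.table
```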
We will use this file later. Have a look and understand the output.
Like any prediction program, SNP callers emit a number of false-positive mutations. Some of these false positives can be detected and rejected by various statistical tests. Filtering or flagging the variant calls in the VCF file is a useful step to detect such false positives.
- Run `gatk VariantFiltration` with:

```
-O -> YOUR_SAMPLE.gatk.raw.flag.vcf
-V -> YOUR_SAMPLE.gatk.raw.vcf
--filter-name -> "LowDP"
--filter-expression -> "DP < 10.0"
```
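A one-sample sketch of the filtering command:

```shell
# flag (not remove) variants with depth below 10 as "LowDP"
gatk VariantFiltration \
    -V AS_2.gatk.raw.vcf \
    -O AS_2.gatk.raw.flag.vcf \
    --filter-name "LowDP" \
    --filter-expression "DP < 10.0"
```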
These are hard filters that are applied when a variant callset is too small or when training sets are not available. Try to understand the filtering done above. You may find some info at the beginning of the VCF file.
Q. How many variants have passed our filtering criteria for both files?
Now we need to annotate our variants in terms of genomic context, amino acid change and effect, previous mutation knowledge, etc., so we can evaluate and filter out irrelevant mutations.
ANNOVAR is a tool that annotates genetic variants. The overall procedure is to create an input file from the VCF file and then annotate the variant positions with information from several databases.
- Reformat both VCF files with `convert2annovar.pl` using:

```
-format -> vcf4
-outfile -> YOUR_FILE.gatk.raw.flag.annovar
YOUR_FILE.gatk.raw.flag.vcf
```
- Add the annotation to the variants using `table_annovar.pl`:

```
-buildver -> hg19
-protocol -> dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316
-operation -> f,g,f,f,f
-outfile -> YOUR_FILE.gatk.raw.flag.annovar.table
--nastring -> -
--remove
YOUR_FILE.gatk.raw.flag.annovar
/home/ftp_courses/NGS/Exome/db/humandb
```
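Chained together, the two ANNOVAR steps above might look like this sketch (one sample shown; the positional-argument order follows standard ANNOVAR usage):

```shell
# step 1: VCF -> ANNOVAR input format
convert2annovar.pl -format vcf4 \
    -outfile AS_2.gatk.raw.flag.annovar \
    AS_2.gatk.raw.flag.vcf

# step 2: annotate against the course databases (input file, then humandb dir)
table_annovar.pl AS_2.gatk.raw.flag.annovar \
    /home/ftp_courses/NGS/Exome/db/humandb \
    -buildver hg19 \
    -protocol dbsnp_137_chr11_gatk,refGene,ljb2_all,1000g2012apr_all,clinvar_20200316 \
    -operation f,g,f,f,f \
    -outfile AS_2.gatk.raw.flag.annovar.table \
    --nastring - \
    --remove
```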
Q. What kind of information are we adding to our mutations? Have a look at the file with the `.hg19_multianno.txt` extension. Here you can find some information.
- Merge the `*hg19_multianno.txt` file with the `*gatk.raw.flag.table`, and save it in a file with the extension `.tsv`. Hint: you can use `paste`
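`paste` simply glues files together line by line, column-wise. A tiny self-contained demo (the two input files here are made up just to show the mechanics; your real inputs are the `*hg19_multianno.txt` and `*gatk.raw.flag.table` files):

```shell
# fake stand-ins for the annotation and statistics files
printf 'chr11\t100\tNARS2\nchr11\t200\tPARS2\n' > demo_anno.txt
printf '0/1\t25\n1/1\t30\n' > demo_stats.txt
# merge them column-wise, tab separated
paste demo_anno.txt demo_stats.txt > demo_merged.tsv
cat demo_merged.tsv
```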
This file now contains both statistics and annotations for the mutations. Here is a description of some of its columns:
| Column | Description |
|---|---|
| AC | Allele count |
| GT | Genotype |
| AD | Allele depth |
| DP | Depth |
| SIFT | D = damaging, T = tolerated |
| HDIV | D = probably damaging, P = possibly damaging, B = benign |
| HVAR | D = probably damaging, P = possibly damaging, B = benign |
| LRT | D = deleterious, N = Neutral , U = Unknown |
| MutationTaster | A=disease_causing_automatic, D=disease_causing, N=polymorphism, P=polymorphism_automatic |
| phyloP100way | the higher the value, the more conserved the amino acid is across organisms |
This is how you read the AAChange column:

```
PGM1:NM_001172818:exon2:c.A316G:p.I106V,PGM1:NM_002633:exon2:c.A262G:p.I88V
```

- PGM1 => Gene name
- NM_001172818 => Transcript ID_1
- exon2 => exon where the mutation is found
- c.A316G => in the DNA, at position 316 the reference base A is changed to a G
- p.I106V => in the protein, at position 106 the reference amino acid I is changed to a V
- PGM1 => Gene name
- NM_002633 => Transcript ID_2
- exon2 => exon where the mutation is found
- c.A262G => in the DNA, at position 262 the reference base A is changed to a G
- p.I88V => in the protein, at position 88 the reference amino acid I is changed to a V
With all this information, it is possible to narrow down the list of mutations to a handful of SNPs. Think of a filtering strategy and filter your SNPs. Don't forget to look at both samples (Affected and Unaffected).
Q. Which are your candidate genes? What was your filtering strategy?
Once you have your list, visual inspection of your candidates helps
in assessing them further. Have a look at your candidates in IGV,
you may want to load both of your VCF files as well as the BAM files.
When you have multiple files and positions to investigate in IGV, you can save time by creating a batch script. The following script takes snapshots of the coordinate chr11:1017135 across AS_2 and Control, using both their bams and mutation files.
- Copy this script to a file called `bashIGV.bat`; make sure that the paths to the `bam` and `vcf` files are correct.

```
new
genome hg19
load PATH_TO_FILE/AS_2.bwa.sort.fix.mkdup.recal.bam
load PATH_TO_FILE/AS_2.gatk.raw.flag.vcf
load PATH_TO_FILE/AS_ctrl.bwa.sort.fix.mkdup.recal.bam
load PATH_TO_FILE/AS_ctrl.gatk.raw.flag.vcf
snapshotDirectory .
squish
goto chr11:1017135
snapshot 1017135_Exomedata.png
exit
```
- Run the script with the following command (don't forget to make it executable!):

```
igv.sh -b bashIGV.bat
```
- Examine the resulting png `1017135_Exomedata.png`
- Change the script to zoom into other mutations of interest; don't forget to change the output name of the png file to the corresponding position.
Q. How are the mutations distributed? Are they all in exons?
How do you differentiate between heterozygous and homozygous mutations?
Some mutations are slightly faded, what does this mean?
Q. Could you filter out some of your candidates with your visual inspection? Which are your final candidate genes?
Now let's analyse WGS data to see if we can identify any structural variants. When analyzing tumor samples you normally have two datasets: one normal (germline) sample and one tumor (somatic) sample. To save time, we will only work with two tumor samples from chromosome 17; these are the breast cancer cell lines HCC1143 and HCC1954.
- Create a soft link to the entire folder `/home/ftp_courses/NGS/Exome/SV`
Before running any variant calling, we need to mark duplicates.
- Run `MarkDuplicates` for each alignment file, with the following parameters:

```
-I -> YOUR_SAMPLE.tumor.bam
-O -> YOUR_SAMPLE.tumor.mkdup.bam
-M -> YOUR_SAMPLE.tumor.metrics
--CREATE_INDEX -> true
```
- Run the variant calling with `delly call` to find any structural variants (the reference `Homo_sapiens_assembly19.fasta` is within the `db` directory):

```
delly call \
    -g Homo_sapiens_assembly19.fasta \
    -o ALL.tumor.bcf \
    list_of_alignments_separated_by_space
```
The default output is in bcf format.
- Convert the file to `vcf` format using `bcftools`:

```
bcftools view ALL.tumor.bcf > ALL.tumor.vcf
```
Take a look at the vcf file.
Q. How many structural variants did DELLY find?
- Filter your `vcf` file, keeping only PASS quality variants.
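One way to keep only PASS variants is `awk` on the FILTER column (column 7), keeping the header lines intact. A self-contained demo on a made-up mini VCF; on the real file the same one-liner would read `awk -F'\t' '/^#/ || $7=="PASS"' ALL.tumor.vcf > ALL.tumor.pass.vcf`:

```shell
# mini VCF standing in for ALL.tumor.vcf (made up for the demo)
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf 'chr17\t100\tDEL1\tN\t<DEL>\t.\tPASS\tPRECISE\n' >> demo.vcf
printf 'chr17\t500\tINV1\tN\t<INV>\t.\tLowQual\tIMPRECISE\n' >> demo.vcf
# keep header lines plus rows whose FILTER field equals PASS
awk -F'\t' '/^#/ || $7=="PASS"' demo.vcf > demo.pass.vcf
grep -c -v '^#' demo.pass.vcf    # prints 1
```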
Q. How many variants are left after filtering for quality PASS?
- Remove all variants annotated as `IMPRECISE`
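DELLY flags each call as PRECISE or IMPRECISE in the INFO column (column 8), so `awk` can drop the imprecise ones. Another self-contained demo with made-up rows; on the real file: `awk -F'\t' '/^#/ || $8 !~ /IMPRECISE/' ALL.tumor.pass.vcf`:

```shell
# mini VCF with one PRECISE and one IMPRECISE call (made up for the demo)
printf '##fileformat=VCFv4.2\n' > demo2.vcf
printf 'chr17\t100\tDEL1\tN\t<DEL>\t.\tPASS\tPRECISE;SVTYPE=DEL\n' >> demo2.vcf
printf 'chr17\t500\tDUP1\tN\t<DUP>\t.\tPASS\tIMPRECISE;SVTYPE=DUP\n' >> demo2.vcf
# keep headers plus rows whose INFO field lacks the IMPRECISE flag
# ("PRECISE" does not match the /IMPRECISE/ pattern, so those rows survive)
awk -F'\t' '/^#/ || $8 !~ /IMPRECISE/' demo2.vcf > demo2.precise.vcf
grep -c -v '^#' demo2.precise.vcf    # prints 1
```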
Q. How many variants are left after filtering based on PRECISE? Which variants are unique to sample HCC1143? And to HCC1954?
From the filtered vcf file, choose two regions (one with a deletion (DEL) and one with an inversion (INV)) to visualize in IGV.
- Load the bam and vcf files in `IGV`, and zoom in to the regions you decided to look at.
Q. Visually, do you agree with DELLY? Is the variant correctly classified as DEL or INV? And what about the zygosity: is the variant correctly classified as homozygous or heterozygous (look at the GT field in the vcf)?
Developed by Marcela Dávila, 2017. Modified by Marcela Davila, Sanna Abrahamsson and Vanja Börjesson, 2021. Updated by Marcela Dávila, 2023.