B2 II: Exome - bcfgothenburg/HT23 GitHub Wiki
Course: HT23 Bioinformatics 2 (SC00038)
The purpose of this exercise is to introduce you to common tools to assess the quality of a DNA sequencing experiment. You will become familiar with the workflow used to analyze targeted sequencing data as well as its visualization.
Your task is to identify possible disease-causing mutations in two patients with Alpers syndrome.
Due to time limitations, you won't be running some of the analyses yourself, but rather interpreting the results. These analyses can be performed using a web server (Galaxy, for instance), where you can choose from an array of tools to produce the same output files you will be provided with. Throughout the practical, we include the command lines we used in a Linux environment to generate the output files.
Our data involves exome sequencing data from two affected individuals with Alpers syndrome (which is a progressive neurodegenerative disorder) and one control sample from a healthy individual. The data was taken from Sofou K, Kollberg G, Holmström M, et al. Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome. Mol Genet Genomic Med. 2015;3(1):59-68. doi:10.1002/mgg3.115.
| Sample | ID |
|---|---|
| Affected individual 1 | AS_1 |
| Affected individual 2 | AS_2 |
| Healthy individual | AS_ctrl |
- Download the files from the `Exome` folder, as indicated in the course web platform.
The first thing to do when you receive data is to check its quality. FastQC is a program that generates general statistics from high-throughput sequencing data, creating an HTML report. Since we have several samples, it is convenient to merge all the reports into one summary file, and we can achieve this by using MultiQC:
# running fastqc for each sample
fastqc -o . AS_1_R1_001.fastq.gz
fastqc -o . AS_2_R1_001.fastq.gz
# Running multiqc to gather all html reports
multiqc --filename multiqc_report_Raw_data \ # output filename
--title Raw_data \ # title in the html file
. # directory to scan for the reports

- Open `multiqc_report_Raw_data.html` in your web browser.
Q1. What does the sequencing quality look like? Inspect the different statistics. Is the data worth working with? Would you do any trimming on this data?
There are several protocols to follow in order to keep only good quality reads. It is recommended to perform this step even when the data looks fine.
Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control. It is widely used and it is quite fast compared to other programs. We used the following command for the trimming step:
# processing R1 and R2 simultaneously
trim_galore --paired \ # processes the data as paired-end files (both reads must pass the filters to be retained)
--quality 20 \ # trims low-quality ends from the reads in addition to adapter removal
--max_n 0.2 \ # discards reads containing more N's than this threshold
--length 40 \ # discards reads that became shorter than the set length
--fastqc \ # runs FastQC on the trimmed set
-o trim_galore \ # sets the output directory
AS_ctrl_R1_001.fastq.gz \ # Read 1 file
AS_ctrl_R2_001.fastq.gz # Read 2 file

Remember that after any kind of trimming (or any other data processing) it is important to evaluate the output and check that everything looks fine. In our case, since we have processed three samples, it is easier (quicker) to merge the reports with MultiQC:
# running multiqc to gather all html reports
multiqc --filename multiqc_report_Trimmed_data --title Trimmed_data .

- Open the report file from the trimmed data, `multiqc_report_Trimmed_data.html`.
Q2. Is there any improvement in data quality after filtering? How many sequences were removed? What other trimming strategies would be suitable for sequencing data?
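One quick way to check how many sequences a filtering step removed is to count the reads before and after trimming; a FASTQ record always spans four lines. A minimal sketch on a toy gzipped file with hypothetical reads (not the course data):

```shell
# toy gzipped FASTQ standing in for a (trimmed) sample file
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > toy.fastq.gz

# line count divided by 4 gives the number of reads
echo $(( $(gzip -dc toy.fastq.gz | wc -l) / 4 ))   # prints 2
```

Run the same one-liner on the raw and trimmed files and compare the two counts.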
Now that we have a good dataset, it is time to map the reads to our reference genome. Remember that there are several aligners and, depending on which one you use, you may get slightly different results.
Here we are using BWA, an aligner based on the Burrows-Wheeler transform. bwa-mem2 is the next version of BWA's mem algorithm, suited for short reads (such as ours); being the latest development, it is generally recommended, as it is faster while producing the same alignments. We used this simplified command line for the mapping:
# mapping trimmed reads with default parameters
bwa-mem2 mem -t 10 \ # number of threads to run the process
-V \ # output the reference FASTA header in the SAM outfile
-o AS_1.sam \ # output SAM file name
../BWAIndex/genome.fa \ # reference genome index in bwa-mem2 format
trim_galore/AS_1_R1_001_val_1.fq.gz \ # trimmed Read 1 file
trim_galore/AS_1_R2_001_val_2.fq.gz # trimmed Read 2 file

The output is an alignment in SAM format, a rather unwieldy format to inspect by eye. To be able to work with it, we need to convert the alignment to BAM format (the binary version of SAM, in other words a compressed format). For this, we can use tools like samtools and Picard, both of which are sets of command-line tools for manipulating high-throughput sequencing data. For this exercise we used Picard:
# converting sam to bam format
java -jar picard.jar SamFormatConverter \ # program
--INPUT AS_1.sam \ # sam file
--OUTPUT AS_1.bam # bam file

An important step is to add the correct metadata. This information is useful for troubleshooting and for processing several samples at the same time without mixing them up (a good practice to keep our data FAIR). The metadata is included in the BAM file as lines annotated with RG (which stands for read group). We used the following command line to add this information:
# adding metadata and sorting bam file
java -jar picard.jar AddOrReplaceReadGroups \ # program
--INPUT AS_1.bam \ # input bam file
--OUTPUT AS_1.sort.bam \ # output bam file
--RGLB AS_1 \ # library
--RGPL Illumina \ # platform
--RGPU AS_1 \ # platform unit
--RGSM AS_1 \ # sample name
--SORT_ORDER coordinate \ # sorting bam file by
--CREATE_INDEX true # create an index of the coordinate-sorted bam file

- Have a look at the flags definition here
Q3. Here we assigned the sample name for all RG values. What would be more suitable values for the different flags?
For downstream processing, the BAM file needs to be sorted and indexed so that other tools have quick access to it. Years back, these steps were done separately with different Picard tools; today they can be included as extra flags, as in the previous command line. The output of the indexing step is referred to as a bai file.
When dealing with variant calling, an important step is to identify possible PCR duplicates, in order to remove any bias from the actual calling. For this we can also use Picard:
# removing duplicates
java -jar picard.jar MarkDuplicates \ # program
--INPUT AS_1.sort.bam \ # input bam file
--METRICS_FILE AS_1.metrics \ # statistics output file
--OUTPUT AS_1.sort.dedup.bam \ # output bam file
--REMOVE_DUPLICATES true \ # do not write duplicates to output bam file
--CREATE_INDEX true # create a bai file

Another strategy is to only flag the duplicates rather than removing them. However, sometimes disk space is an issue, and most of the duplicated reads are not used in the downstream analysis anyway. So choose accordingly!
Now that we have our alignment without duplicates, we calculate some general statistics. We can produce similar stats with picard and samtools, so now let's use samtools. As there are several reports, we summarize them with multiqc:
# calculating simple statistics
samtools flagstat AS_1.sort.dedup.bam > AS_1.sort.dedup.flagstat
samtools idxstats AS_1.sort.dedup.bam > AS_1.sort.dedup.idxstats
# running multiqc to gather all html reports
multiqc --filename multiqc_report_Dedup_data --title Dedup_data .

- Open `multiqc_report_Dedup_data.html` and inspect the stats.
Q4. Have all the reads been mapped to our reference genome? Which chromosome has the highest coverage? How many duplicates were there in each sample?
Now let's visualize the alignment files!
- Open `IGV` on your local computer. Here you can find some documentation.
- Make sure `GRCh37` is loaded.
- In the viewer, go to `File` -> `Load from file`.
- Choose `AS_1.sort.dedup.bam`.
- Zoom in to see the alignments.

As a reminder, try the following display schemes:

- Color alignment by:
  - insert size
  - read strand
  - first-of-pair strand
- View as pairs

- Generate a coverage track (`tdf` file):
  - Go to `Tools` -> `Run igvtools`
  - Select `count`
  - Select `AS_1.sort.dedup.bam`
- Load `AS_1.sort.dedup.bam.tdf`.
- Zoom out to identify regions with coverage.
- `right-click` on the `RefSeq Genes` track and select `Expanded`. This will display the different splice variants.
- Generate and load the coverage track files of `AS_2.sort.dedup.bam` and `AS_ctrl.sort.dedup.bam`.
Q5. Are the 3 samples covering the same regions? Do they have the same coverage?
The Genome Analysis Toolkit or GATK is a software package developed to analyze
next-generation resequencing data, focusing on variant discovery and genotyping.
For the variant calling we will use the HaplotypeCaller, which is an SNP/indel
caller that uses a Bayesian genotype likelihood model to estimate simultaneously
the most likely genotypes and allele frequency in a population of N samples:
# calling germline SNPs and indels via local re-assembly of haplotypes
gatk HaplotypeCaller \ # program
--input AS_1.sort.dedup.bam \ # input alignment
--output AS_1.gatk.raw.vcf \ # output VCF filename
--reference /db/Homo_sapiens/Ensembl/GRCh37/genome.fa # reference genome

The output file is in VCF (Variant Call Format). You can open it in a text editor or Excel, or use unix commands or load it into R.
- Inspect `AS_1.gatk.raw.vcf` and get familiar with its information.
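Unix one-liners are handy for this kind of inspection. A minimal sketch on a toy VCF with hypothetical records (space-delimited here for readability; a real VCF is tab-delimited and far larger):

```shell
# toy.vcf stands in for AS_1.gatk.raw.vcf (hypothetical records)
cat > toy.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AS_1
chr1 1000 . A G 50 . DP=30 GT:DP 0/1:30
chr1 2000 . C T 60 . DP=12 GT:DP 1/1:12
chr1 3000 . G GAT 40 . DP=25 GT:DP 0/1:25
EOF

grep -vc "^#" toy.vcf   # total variant records (header lines start with '#'): 3

# SNPs only: REF and ALT are both a single base (the third record is an insertion)
awk '!/^#/ && length($4) == 1 && length($5) == 1' toy.vcf | wc -l

# heterozygous (0/1) vs homozygous (1/1) genotypes in the sample column
awk '!/^#/ {split($10, gt, ":"); print gt[1]}' toy.vcf | sort | uniq -c
```

The same commands run unchanged on the real file once you substitute its name.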
Q6. How do you differentiate a homozygous mutation from a heterozygous one? How many SNPs do you have in each sample? Where can you find the coverage of each mutation? Are there any insertions/deletions?
Like any prediction program, SNP callers emit a number of false-positive mutations. Some of these false positives can be detected and rejected by various statistical tests, so filtering or flagging the variants in the VCF file is a useful step. For this we used VariantFiltration from GATK, following some of their guidelines for hard filtering:
# hard filtering variant calls based on the INFO field
gatk VariantFiltration \ # program
--output AS_1.gatk.raw.flag.vcf \ # vcf output filename
--variant AS_1.gatk.raw.vcf \ # vcf input filename
--filter-expression "DP < 10.0" --filter-name "LowDP" \ # expression used to filter the INFO field
--filter-expression "QD < 2.0" --filter-name "QD" \ # expression used to filter the INFO field
--filter-expression "FS > 60.0" --filter-name "FS" \ # expression used to filter the INFO field
--filter-expression "MQ < 40.0" --filter-name "MQD" # expression used to filter the INFO field

Try to understand the filtering done above. You may find some information at the beginning of the vcf file.
- Now, have a look at `AS_1.gatk.raw.flag.vcf`.
Q7. How many variants have passed our filtering criteria?
- Load `AS_1.gatk.raw.flag.vcf` in `IGV`.
Q8. How are the mutations distributed? Are they all in exons? Are all the mutations shared among the samples? How do you differentiate between heterozygous and homozygous mutations? Some mutations are slightly faded, what does this mean?
Now we need to annotate our variants in terms of genomic context, amino acid change and effect, previous mutation knowledge, etc., so we can evaluate and filter out irrelevant mutations. Several tools help with this kind of annotation; one of them is VEP, which can be used as a web tool to determine the effect of genetic variants.
- Create a filtered file from `AS_1.gatk.raw.flag.vcf` containing only the variants that have passed our filtering thresholds (you can also skip the header if you like, as the program only reads the mutations); save it as `AS_1.gatk.raw.flag.filtered.vcf`.
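A minimal sketch of this filtering step with unix commands, shown on a toy flagged VCF with hypothetical records (space-delimited for readability; a real VCF is tab-delimited):

```shell
# toy flagged VCF standing in for AS_1.gatk.raw.flag.vcf (hypothetical records)
cat > toy.flag.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 1000 . A G 50 PASS DP=30
chr1 2000 . C T 60 LowDP DP=5
chr1 3000 . G A 70 PASS DP=40
EOF

# keep the header plus every record whose FILTER column (7th field) is PASS
awk '/^#/ || $7 == "PASS"' toy.flag.vcf > toy.flag.filtered.vcf

grep -vc "^#" toy.flag.vcf            # variants before filtering: 3
grep -vc "^#" toy.flag.filtered.vcf   # variants that passed: 2
```

On the real data, substitute the toy names with `AS_1.gatk.raw.flag.vcf` and `AS_1.gatk.raw.flag.filtered.vcf`.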
- Go to the VEP's web interface. By default, the annotation is done against the GRCh38 version of the human genome; however, since our data was mapped to GRCh37, we need to change version, so go to the GRCh37 website. Note: remember to ALWAYS work with the same organism version.
- Upload the filtered file you just created.
- Under `Transcript database to use` select `RefSeq transcripts`.
- Name the job according to the Sample ID.
- Run.
There are `Additional configurations` where you can customize what kind of annotation you want to add. For the time being, we just run the analysis with the default parameters.
- Annotate the other two samples as well. Usually the annotation is quick; let's hope there are not a lot of jobs running while we do ours!
- Once the jobs are done, have a look at the results. You will see:
  - Job details
  - Summary statistics
  - Results preview, where you can:
    - Inspect the results using the web interface. Although this seems tempting, sometimes the service is quite slow!
    - Filter based on different thresholds/categories. Ideally, we would use this tool to download only the interesting mutations; however, it is also quite slow!
    - Download the information in `VCF`, `VEP` or `TXT` format, as well as query `BioMart` and extract information based on either the mutations or the genes to which they belong.
Q9. What are the most common variants in terms of `Consequence`? How many genes harbor variants in each sample?
- For `AS_1`, download the `TXT` annotation file as `AS_1_annotation.txt`.
This format is easier to work with, but you are more than welcome to work with any of the other formats.
Ideally we would use Unix commands or R to facilitate the filtering and comparison of samples. If you know any of these tools give it a try, otherwise just open the files in Excel.
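If you do give unix commands a try, the sketch below shows the kind of summaries that help here, run on a simplified, hypothetical stand-in for the VEP `TXT` output (space-delimited for readability; the real file is tab-delimited with many more columns):

```shell
# toy stand-in for AS_1_annotation.txt (hypothetical genes and consequences)
cat > toy_annotation.txt <<'EOF'
#Location Gene Consequence
chr1:1000 GENE_A missense_variant
chr1:2000 GENE_A synonymous_variant
chr2:3000 GENE_B missense_variant
chr2:4000 GENE_C intron_variant
EOF

# most common Consequence terms, most frequent first
awk '!/^#/ {print $3}' toy_annotation.txt | sort | uniq -c | sort -rn

# number of distinct genes harboring variants
awk '!/^#/ {print $2}' toy_annotation.txt | sort -u | wc -l   # prints 3
```

For the real file, check which column numbers hold the gene symbol and consequence before adapting the `awk` field indices.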
- Inspect `AS_1_annotation.txt`
Note: Remember that Excel is famous for converting numbers or gene names into dates, so instead of opening the file directly, do the following:
- Open a blank workbook in Excel.
- Go to `File` -> `Open` and select `AS_1_annotation.txt`.
- The `Text Import Wizard` will open:
  - Under `Choose the file type that best describes your data` select `Delimited`, then click `Next`.
  - Under `Delimiters` tick `Tab`, then click `Next`.
  - Under `Data preview` select the column `Symbol` and under `Column data format` select `Text`. Set the columns `Exon` and `Intron` also to `Text`, then click `Finish`.
Familiarize yourself with the annotated file.
With all this information, it is possible to narrow down the list of mutations to a handful of SNPs.
- Download the annotation files for `AS_2` and `AS_ctrl`.
- Filter the variants as you see fit.
- Compare the filtered variants among the 3 samples to identify candidate genes
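One way to sketch the comparison with unix tools, using hypothetical per-sample candidate gene lists (one gene symbol per line; note that `comm` requires sorted input):

```shell
# hypothetical candidate gene lists for each sample (already sorted)
printf 'GENE_A\nGENE_B\nGENE_C\n' > AS_1.genes
printf 'GENE_B\nGENE_C\nGENE_D\n' > AS_2.genes
printf 'GENE_B\nGENE_E\n' > AS_ctrl.genes

# genes shared by both affected individuals
comm -12 AS_1.genes AS_2.genes > affected_shared.genes

# ...and absent from the healthy control: a reasonable candidate set
comm -23 affected_shared.genes AS_ctrl.genes   # prints GENE_C
```

The same idea applies to your filtered annotation files once you have extracted and sorted one gene symbol per line for each sample.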
Q10. What was your filtering strategy? Which are your candidate genes?
Once you have your list, visual inspection of your candidates helps to assess them further. Have a look at your candidates in IGV; you may want to load `AS_2.gatk.raw.flag.vcf`, `AS_ctrl.gatk.raw.flag.vcf`, `AS_2.sort.dedup.bam` and `AS_ctrl.sort.dedup.bam`.
Q11. Could you filter out some of your candidates with your visual inspection? Which are your final candidate genes? What was your filtering strategy?
Developed by Marcela Dávila, 2010. Modified by Marcela Dávila, 2017. Modified by Marcela Dávila, 2021. Modified by Marcela Dávila, 2023.