B2 II: Exome - bcfgothenburg/HT23 GitHub Wiki

Course: HT23 Bioinformatics 2 (SC00038)


The purpose of this exercise is to introduce you to common tools to assess the quality of a DNA sequencing experiment. You will become familiar with the workflow used to analyze targeted sequencing data, as well as with its visualization.

Your task is to identify possible disease-causing mutations for two patients with Alpers syndrome.



Due to time constraints, you won't be running some of the analyses yourself, but rather interpreting the results. These analyses can be performed using a web server (Galaxy, for instance), where you can choose from an array of tools to produce the same output files you will be provided with. Throughout the practical, we include the command lines we used in a Linux environment to generate the output files.

The Data

The data consists of exome sequencing reads from two individuals affected with Alpers syndrome (a progressive neurodegenerative disorder) and one control sample from a healthy individual. The data was taken from Sofou K, Kollberg G, Holmström M, et al. Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome. Mol Genet Genomic Med. 2015;3(1):59-68. doi:10.1002/mgg3.115.

| Sample | ID |
| --- | --- |
| Affected individual 1 | AS_1 |
| Affected individual 2 | AS_2 |
| Healthy individual | AS_ctrl |
  • Download the files from the Exome folder, as indicated in the course web platform

Data quality check (FastQC)

The first thing to do when you receive data is to check its quality. FastQC is a program that generates general quality statistics from high-throughput sequencing data, creating an HTML report per input file. Since we have several files, it is convenient to merge all the reports into one summary file, which we can achieve using MultiQC:

# running fastqc for both read files of each sample
fastqc -o . AS_1_R1_001.fastq.gz AS_1_R2_001.fastq.gz
fastqc -o . AS_2_R1_001.fastq.gz AS_2_R2_001.fastq.gz
fastqc -o . AS_ctrl_R1_001.fastq.gz AS_ctrl_R2_001.fastq.gz

# Running multiqc to gather all html reports
multiqc --filename multiqc_report_Raw_data \ # output filename
    --title Raw_data                       \ # title in the html file
    .                                        # directory to scan for reports
  • Open multiqc_report_Raw_data.html in your web browser.

Q1. What does the sequencing quality look like? Inspect the different statistics. Is the data worth working with? Would you do any trimming on this data?

QC trimming (Trim Galore!)

There are several protocols for keeping only good-quality reads. It is recommended to perform this step even when the data looks fine.

Trim Galore! is a wrapper script that automates quality and adapter trimming as well as quality control. It is widely used and quite fast compared to other programs. We used the following command for the trimming step:

# processing R1 and R2 simultaneously
trim_galore --paired         \ # processes the data as paired-end files (both reads must pass the filters to be retained)
    --quality 20             \ # trims low-quality ends from the reads in addition to adapter removal
    --max_n 0.2              \ # discards reads that exceed this threshold of N's
    --length 40              \ # discards reads that become shorter than this length after trimming
    --fastqc                 \ # runs FastQC on the trimmed set
    -o trim_galore           \ # sets the output directory
    AS_ctrl_R1_001.fastq.gz  \ # Read 1 file
    AS_ctrl_R2_001.fastq.gz    # Read 2 file

Remember that after any kind of trimming (or any data processing) it is important to evaluate the output and check that everything looks fine. In our case, since we have processed three samples, it is easier (and quicker) to merge the reports with multiqc:

# running multiqc to gather all html reports
multiqc --filename multiqc_report_Trimmed_data --title Trimmed_data .                                    
  • Open the report file from the trimmed data multiqc_report_Trimmed_data.html.

Q2. Is there any improvement in the data quality after filtering? How many sequences were removed? What other trimming strategies would be suitable for sequencing data in general?
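One quick way to check the read counts yourself: a FASTQ record is always exactly four lines, so the number of reads is the line count divided by four (for the real gzipped files you would pipe through zcat first, e.g. `zcat AS_1_R1_001.fastq.gz | wc -l`). A self-contained sketch on a toy file:

```shell
# A FASTQ record is exactly 4 lines: header, sequence, separator, qualities
cat > toy.fastq <<'EOF'
@read1
ACGT
+
IIII
@read2
TTGG
+
IIII
EOF

# number of reads = number of lines / 4
echo $(( $(wc -l < toy.fastq) / 4 ))   # prints 2
```

Comparing this count before and after Trim Galore! tells you how many sequences were removed.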

Mapping to a reference genome (BWA, Picard)

Now that we have a good dataset, it is time to map the reads to our reference genome. Remember that there are several aligners, and depending on which one you use, you may get slightly different results.

Here we are using BWA, an aligner based on the Burrows-Wheeler transform. bwa-mem2 is a reimplementation of BWA's MEM algorithm, suitable for short reads (like the ones we are using); being the latest development, it is generally recommended, as it produces the same alignments as bwa mem while running considerably faster. We used this simplified command line for the mapping:

# mapping trimmed reads with default parameters
bwa-mem2 mem -t 10                      \ # number of threads to run the process
    -V                                  \ # output the reference FASTA header in the SAM outfile
    -o AS_1.sam                         \ # output SAM file name
    ../BWAIndex/genome.fa               \ # reference genome index in bwa-mem2 format
    trim_galore/AS_1_R1_001_val_1.fq.gz \ # trimmed Read 1 file
    trim_galore/AS_1_R2_001_val_2.fq.gz   # trimmed Read 2 file

The output is an alignment in SAM format, a rather unwieldy format to inspect by eye. To be able to work with it, we need to convert the alignment to BAM format (the binary, compressed version of SAM). For this, we can use tools like samtools or Picard, both of which are sets of command-line tools for manipulating high-throughput sequencing data. For this exercise we used Picard:

# converting sam to bam format
java -jar picard.jar SamFormatConverter \ # program
    --INPUT AS_1.sam                    \ # sam file
    --OUTPUT AS_1.bam                     # bam file

An important step is to add the correct metadata. This information is useful for troubleshooting and for processing several samples at the same time without mixing them up (a good practice to keep our data FAIR). The metadata is included in the BAM file as lines annotated with RG (which stands for read group). We used the following command line to add this information:

# adding metadata and sorting bam file
java -jar picard.jar AddOrReplaceReadGroups \ # program
    --INPUT AS_1.bam                        \ # input bam file
    --OUTPUT AS_1.sort.bam                  \ # output bam file
    --RGLB AS_1                             \ # library
    --RGPL Illumina                         \ # platform
    --RGPU AS_1                             \ # platform unit
    --RGSM AS_1                             \ # sample name
    --SORT_ORDER coordinate                 \ # sorting bam file by
    --CREATE_INDEX true                       # create an index of the coordinate sorted bam file
  • Have a look at the flags definition here

Q3. Here we assigned the sample name for all RG values. What would be more suitable values for the different flags?
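For reference, the metadata added above ends up as an @RG line in the BAM header, with tags as defined in the SAM specification (the values below simply mirror the command, where everything was set to the sample name; Picard uses a default read-group ID of 1 when --RGID is not given):

```
@RG	ID:1	LB:AS_1	PL:Illumina	PU:AS_1	SM:AS_1
```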

For downstream processing, the BAM file needs to be sorted and indexed so that other tools have quick access to it. Years back, these steps were done separately with different Picard tools; today, they can be included as extra flags, as you saw in the previous command line. The output of the indexing is referred to as a bai file.

PCR duplicates (Picard)

When dealing with variant calling, an important step is to identify possible PCR duplicates, in order to remove any bias from the actual calling. For this we can also use Picard:

# removing duplicates 
java -jar picard.jar MarkDuplicates \ # program
    --INPUT AS_1.sort.bam           \ # input bam file
    --METRICS_FILE AS_1.metrics     \ # statistics output file
    --OUTPUT AS_1.sort.dedup.bam    \ # output bam file
    --REMOVE_DUPLICATES true        \ # do not write duplicates to output bam file
    --CREATE_INDEX true               # create a bai file

Another strategy is to only flag the duplicates rather than removing them. However, sometimes disk space is an issue, and most of the duplicated reads are not even used in the downstream analysis. So choose accordingly!
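If duplicates are only flagged, they can still be counted from the FLAG field (column 2 of a SAM record), where bit 0x400 (decimal 1024) marks a duplicate per the SAM specification. A minimal sketch on toy records (hypothetical reads, fields truncated for readability):

```shell
# Three toy SAM records (header lines omitted, trailing fields dropped);
# read2 carries FLAG 1123 = 99 + 1024, i.e. it is flagged as a duplicate
cat > toy.sam <<'EOF'
read1	99	chr1	100	60
read2	1123	chr1	100	60
read3	99	chr1	500	60
EOF

dup=0
while read -r name flag rest; do
    # test bit 0x400 (1024): set when the read is marked as a duplicate
    if [ $(( flag & 1024 )) -ne 0 ]; then
        dup=$(( dup + 1 ))
    fi
done < toy.sam
echo "$dup"   # prints 1
```

With real data you would simply run samtools flagstat, which reports duplicates directly.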

Mapping statistics (samtools)

Now that we have our alignment without duplicates, we can calculate some general statistics. Similar stats can be produced with Picard or samtools; this time, let's use samtools. As there are several reports, we summarize them with multiqc:

# calculating simple statistics
samtools flagstat AS_1.sort.dedup.bam > AS_1.sort.dedup.flagstat
samtools idxstats AS_1.sort.dedup.bam > AS_1.sort.dedup.idxstats

# running multiqc to gather all html reports
multiqc --filename multiqc_report_Dedup_data --title Dedup_data .
  • Open multiqc_report_Dedup_data.html and inspect the stats.

Q4. Have all the reads been mapped to our reference genome? Which chromosome has the highest coverage? How many duplicates were there in each sample?
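The mapping rate can also be pulled straight from the flagstat files with a little awk. A self-contained sketch on toy input (the format below is approximated from samtools flagstat output; real files contain more lines, but these two are the relevant ones):

```shell
# Toy flagstat-style output (format approximated; numbers are hypothetical)
cat > toy.flagstat <<'EOF'
100000 + 0 in total (QC-passed reads + QC-failed reads)
95000 + 0 mapped (95.00% : N/A)
EOF

# mapping rate = mapped reads / total reads
awk '/in total/ {tot=$1} / mapped \(/ {map=$1} END {printf "%.2f\n", 100*map/tot}' toy.flagstat
```

On the toy numbers above this prints 95.00; the MultiQC report shows the same percentages without any scripting.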

Integrative Genomics Viewer - IGV

Now let's visualize the alignment files!

  • Open IGV on your local computer. Here you can find some documentation.
    • Make sure GRCh37 is loaded
    • In the viewer go to File -> Load from file
    • Choose AS_1.sort.dedup.bam
  • Zoom in to see the alignments

As a reminder, try the following display schemes:

- Color alignment by: 
   - insert size
   - read strand
   - first-of-pair strand

- View as pairs
  • Generate a coverage track (tdf file)

    • Go to Tools -> Run igvtools
    • Select count
    • Select AS_1.sort.dedup.bam
  • Load AS_1.sort.dedup.bam.tdf

  • Zoom out to identify regions with coverage.

  • Right-click on the RefSeq Genes track

    • Select Expanded
      This will display the different splice variants.
  • Generate and load the coverage track files of AS_2.sort.dedup.bam and AS_ctrl.sort.dedup.bam.

Q5. Are the 3 samples covering the same regions? Do they have the same coverage?

Variant calling with GATK

The Genome Analysis Toolkit, or GATK, is a software package developed to analyze next-generation resequencing data, focusing on variant discovery and genotyping. For the variant calling we will use HaplotypeCaller, a SNP/indel caller that uses a Bayesian genotype likelihood model to simultaneously estimate the most likely genotypes and allele frequencies in a population of N samples:

# calling germline SNPs and indels via local re-assembly of haplotypes
gatk HaplotypeCaller                                        \ # program
   --input AS_1.sort.dedup.bam                              \ # input alignment
   --output AS_1.gatk.raw.vcf                               \ # output VCF filename
   --reference /db/Homo_sapiens/Ensembl/GRCh37/genome.fa      # reference genome

The output file is in VCF (Variant Call Format). You can open it in a text editor or Excel, or you can inspect it with unix commands or load it into R.

  • Inspect AS_1.gatk.raw.vcf and get familiar with its information.
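As a pointer for reading the file: the genotype sits in the GT subfield of the sample column (0/1 = heterozygous, 1/1 = homozygous for the alternate allele) and the coverage in the DP subfield. A toy, self-contained illustration (hypothetical positions, simplified columns):

```shell
# Three toy VCF records: two SNPs (one het, one hom) and one deletion
cat > toy.vcf <<'EOF'
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	AS_1
1	1000	.	A	G	50	PASS	DP=30	GT:DP	0/1:30
1	2000	.	C	T	60	PASS	DP=25	GT:DP	1/1:25
1	3000	.	GTT	G	40	PASS	DP=20	GT:DP	0/1:20
EOF

# count heterozygous (GT 0/1) and homozygous (GT 1/1) calls in column 10
awk '!/^#/ && $10 ~ /^0\/1/ {het++}
     !/^#/ && $10 ~ /^1\/1/ {hom++}
     END {print het" het, "hom" hom"}' toy.vcf   # prints: 2 het, 1 hom
```

Indels are recognizable because REF and ALT have different lengths (here GTT→G is a 2-bp deletion).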

Q6. How do you differentiate a homozygous mutation from a heterozygous one? How many SNPs do you have in each sample? Where can you find the coverage of each mutation? Are there any insertions/deletions?

Like any prediction program, SNP callers emit a number of false-positive mutations. Some of these false positives can be detected and rejected by various statistical tests. Filtering or flagging the variants in the VCF file is a useful step to detect them. For this we used VariantFiltration from GATK, following some of their guidelines for hard filtering:

# hard filtering variant calls based on the INFO field
gatk VariantFiltration                                   \ # program
   --output AS_1.gatk.raw.flag.vcf                       \ # vcf output filename
   --variant AS_1.gatk.raw.vcf                           \ # vcf input filename
   --filter-expression "DP < 10.0" --filter-name "LowDP" \ # expression used to filter the INFO field
   --filter-expression "QD < 2.0"  --filter-name "QD"    \ # expression used to filter the INFO field
   --filter-expression "FS > 60.0" --filter-name "FS"    \ # expression used to filter the INFO field
   --filter-expression "MQ < 40.0" --filter-name "MQD"     # expression used to filter the INFO field

Try to understand the filtering done above. You may find some info at the beginning of the vcf file.

  • Now, have a look at AS_1.gatk.raw.flag.vcf.

Q7. How many variants have passed our filtering criteria?

  • Load AS_1.gatk.raw.flag.vcf in IGV

Q8. How are the mutations distributed? Are they all in exons? Are all the mutations shared among the samples? How do you differentiate between heterozygous and homozygous mutations? Some mutations appear slightly faded; what does this mean?

Variant annotation with the Variant Effect Predictor (VEP)

Now we need to annotate our variants in terms of genomic context, amino acid change and effect, prior knowledge of the mutation, etc., so we can evaluate and filter out mutations that are irrelevant. There are several tools that help with this kind of annotation; one of them is VEP, which can be used as a web tool to determine the effect of genetic variants.

  • Create a filtered file from AS_1.gatk.raw.flag.vcf containing only the variants that have passed our filtering thresholds (you can also skip the header if you like as the program only reads the mutations), save it as AS_1.gatk.raw.flag.filtered.vcf.
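On a Unix system, that filtered file can be produced with a one-liner such as `awk -F'\t' '/^#/ || $7 == "PASS"' AS_1.gatk.raw.flag.vcf > AS_1.gatk.raw.flag.filtered.vcf` (column 7 of a VCF is FILTER). A self-contained sketch on a toy flagged VCF:

```shell
# Toy flagged VCF: one variant failed the LowDP filter, two passed
cat > toy.flag.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	1000	.	A	G	50	PASS	DP=30
1	2000	.	C	T	15	LowDP	DP=5
1	3000	.	G	A	60	PASS	DP=40
EOF

# keep the header plus variants whose FILTER column (column 7) is PASS
awk '/^#/ || $7 == "PASS"' toy.flag.vcf > toy.filtered.vcf

grep -vc '^#' toy.filtered.vcf   # prints 2 (the variants that passed)
```

The same `grep -vc '^#'` count on your real filtered file answers Q7.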

  • Go to the VEP's web interface

By default, the annotation will be done against the GRCh38 version of the human genome; however, since our data was aligned to GRCh37, we need to change versions. So go to the GRCh37 website. Note: remember to ALWAYS work with the same genome version of the organism.

  • Upload the filtered file you just created
  • Under Transcript database to use select Refseq transcripts
  • Name the job according to the Sample ID
  • Run

There are Additional configurations, where you can customize what kind of annotation you want to add. For the time being, we just run the analysis with the default parameters.

  • Annotate also the other two samples.
    Usually the annotation is quick; let's hope there are not a lot of jobs running while we do the annotation!

  • Once the jobs are done, have a look at the results. You will see:

    • Job details
    • Summary statistics
    • Results preview, where you can
      • Inspect the results using the web interface. Although this seems tempting, sometimes the service is quite slow!
      • Filter, based on different thresholds/categories. Ideally, we would use this tool to only download interesting mutations, however it is also quite slow!
      • Download the information, either in VCF, VEP or TXT format as well as query BioMart and extract any information based on either the mutations or the genes to which they belong

Q9. What are the most common variants in terms of Consequences? How many genes harbor variants in each sample?

Variant filtering

  • For AS_1, download the TXT annotation file as AS_1_annotation.txt.
    This format is easier to work with, but you are more than welcome to work with any of the other formats.

Ideally, we would use Unix commands or R to facilitate filtering and comparing the samples. If you know any of these tools, give it a try; otherwise, just open the files in Excel.
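As a starting point for the Unix route, here is a toy sketch of filtering the annotation by consequence. The rows and gene names below are hypothetical; in VEP's default TXT output the Consequence column is typically the 7th, but verify the column index in your own file (e.g. with `head -1 AS_1_annotation.txt`):

```shell
# Toy VEP-style TXT annotation (hypothetical rows, simplified columns;
# column 7 = Consequence, column 4 = Gene, mirroring VEP's default layout)
cat > toy_annotation.txt <<'EOF'
#Uploaded_variation	Location	Allele	Gene	Feature	Feature_type	Consequence
1_1000_A/G	1:1000	G	GENE1	NM_0001	Transcript	missense_variant
1_2000_C/T	1:2000	T	GENE2	NM_0002	Transcript	synonymous_variant
1_3000_G/A	1:3000	A	GENE3	NM_0003	Transcript	missense_variant
EOF

# keep missense variants and print the affected gene IDs
awk '!/^#/ && $7 == "missense_variant" {print $4}' toy_annotation.txt
```

On the toy rows this prints GENE1 and GENE3; swapping in other consequence terms gives you different slices of the annotation.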

  • Inspect AS_1_annotation.txt

Note: Remember that Excel is famous for converting numbers or gene names into dates, so instead of opening the file directly, do the following:

  • Open a blank workbook in Excel
  • Go to File -> Open
    • Select AS_1_annotation.txt
  • The Text Import Wizard will open
    • Under Choose the file type that best describes your data select Delimited
      • Click Next
    • Under Delimiters tick Tab
      • Click Next
    • Under Data preview select the column Symbol
    • Under Column data format select Text
    • Set the columns Exon and Intron also to Text
      • Click Finish

Familiarize yourself with the annotated file.

With all this information, it is possible to narrow down the list of mutations to a handful of SNPs.

  • Download the annotation files for AS_2 and AS_ctrl
  • Filter the variants as you see fit
  • Compare the filtered variants among the 3 samples to identify candidate genes
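If you extract one gene list per sample, `comm` gives you their intersection on the command line (toy, hypothetical gene names below; `comm` requires sorted input). Genes shared by AS_1 and AS_2 but absent from AS_ctrl would be prime candidates:

```shell
# Toy candidate-gene lists for two samples
printf 'GENE1\nGENE2\nGENE3\n' | sort > AS_1_genes.txt
printf 'GENE2\nGENE3\nGENE4\n' | sort > AS_2_genes.txt

# -12 suppresses columns 1 and 2, leaving only lines common to both files
comm -12 AS_1_genes.txt AS_2_genes.txt   # prints GENE2 and GENE3
```

Conversely, `comm -23 shared.txt AS_ctrl_genes.txt` would drop anything also found in the control.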

Q10. What was your filtering strategy? Which are your candidate genes?

Once you have your list, visual inspection of your candidates helps in assessing them further. Have a look at your candidates in IGV, you may want to load AS_2.gatk.raw.flag.vcf, AS_ctrl.gatk.raw.flag.vcf, AS_2.sort.dedup.bam and AS_ctrl.sort.dedup.bam.

Q11. Could you filter out some of your candidates with your visual inspection? Which are you final candidate genes? What was your filtering strategy?

Well done! Now you have the general workflow to identify disease-causing mutations!



Developed by Marcela Dávila, 2010. Modified by Marcela Dávila, 2017. Modified by Marcela Dávila, 2021. Modified by Marcela Dávila, 2023.
