Unix VIII: Disease causing mutations - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)

In this exercise, we will prioritize possible disease causing mutations



Copy Samples.vcf.tar.gz from /home/courses/Unix/files and untar it. You will get five VCF files. Here is a quick reference to the VCF format.

Inspect the S1.vcf file and answer:

Q1. How many SNPs do you have per chromosome?

Q2. Are there any insertions or deletions?

The 7th column (FILTER) in a VCF file indicates which SNPs have passed certain thresholds (if you grep FILTER on your VCF file, you can see which filters have been used)

Q3. How many SNPs have passed all these filters?

The 10th column in a VCF file, has information about the predicted genotype of the mutation:

  • 0/1 indicates Heterozygous, while
  • 1/1 indicates Homozygous

Save the homozygous mutations that have passed all thresholds into a new file (S1.pass.homozygous.vcf), including only the following columns: CHROM, POS, REF and ALT, so it looks like:

 chr1    131263  CCAGT   C
 chr1    162463  T       C
 chr1    714427  G       A
 chr1    715348  T       G
 chr1    715996  A       T
 chr1    723891  G       C

Process all the other sample in the same way, using a for loop. As a reminder:

    for i in S*vcf
        do _your code here_
    done

Q4. How many confident homozygous SNPs in chr14 are found in each sample? Are there any SNPs found across all samples? If so, how many?

Create a tarball of the files you have created.

Keep working with S1.vcf. A common next step is to annotate our candidate SNPs to give priority to those SNPs in coding regions, especially those which nucleotide changes make an impact the aminoacid composition of the protein. We will use the ANNOVAR package to annotate the SNPs, which runs as a standalone application where standard Perl modules are installed.

  • Go to the webpage
  • Click on User Guide and then Download ANNOVAR
  • As you can see, you need to register to get the stand alone application. If you want go for it, otherwise this is the link you would get: http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz.
  • Copy the application using wget
  • Untar the file
  • Go into the newly created annovar directory. * You will see different executable perl scripts and the humandb directory that has several databases.
  • To run annovar via the command line without having to type the absolute path, open your ~/.bashrc with a text editor and add the following line (don´t forget to add the correct path):
export PATH=/home/path_to_annovar_installation:$PATH
  • Save the document and close it
  • Run this command to execute the export command you just added:
source ~/.bashrc

As usual, we need to modify our file to get the correct format to use for the annotation.

First use convert2annovar.pl and convert your vcf file to annovar format. If you run convert2annovar.pl you will get help on how to run the tool. For this exercise use the following arguments:

 * format     input format, in our case vcf4 (you can see this in the first line of any VCF file)
 * filter     output variants with this filter, use pass
 * coverage   read coverage threshold in pileup file, use 50
 * outfile    S1.annovar

Now, annotate the output file with table_annovar.pl using the following settings:

 * database-location         path_to_your_humandb/
 * buildver                  genome build version, use hg19
 * protocol                  comma-delimited string specifying database protocol, use refGene
 * operation                 comma-delimited string specifying type of operation, use g

Don't forget to understand what these options mean!

Several files will be created. For this exercise, focus on the one with extension annovar.hg19_multianno.txt. Have a look at the file, if you need more info to understand it, go to the ANNOVAR webpage.

Q5. What are the top 10 genes in S1 that have the highest number of aminoacid changes?

Repeat the same annotation for the other vcf files. Try to use a for loop.

Q6. Where do you find the largest number of mutations with respect to their genomic location? How many non-synonymous mutations are shared across all samples?


Home: Unix applied to genomic data


Developed by Marcela Davila, 2018. Modified by Marcela Dávila, 2021. Updated by Marcela Dávila, 2024.