Home - WheelerLab/gwasqc_pipeline GitHub Wiki

About

The GWAS QC Pipeline consists of three main bash scripts and six associated R scripts. Its main purpose is to automate the quality control of GWAS data, following the protocol typically used in the Wheeler Lab. The pipeline is based on code written by my lab fellows Angela Andeleon and Peter Fornica, and it further simplifies the process that they and others in this lab follow by breaking quality control into three steps:

Filtering out SNPs that are poorly genotyped, have low call rates (also known as filtering by SNP missingness), or are found to be significantly out of Hardy-Weinberg equilibrium
Filtering out individuals that are duplicates or show a high degree of relatedness
Merging sample data with a HapMap cohort and calculating principal components
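
To make the three steps concrete, here is a rough dry-run sketch of the kind of commands each one might involve. It assumes PLINK 1.9 as the underlying tool, which this page does not state. The file prefixes (my_study, hapmap_ref), the related_samples.txt list, and every threshold are hypothetical placeholders, not the pipeline's actual settings. The script only prints the commands; it does not run them.

```shell
#!/bin/bash
# Dry-run sketch: prints the commands each QC step might run, without executing them.
# Assumes PLINK 1.9; all prefixes, file names, and thresholds below are hypothetical.
IN="my_study"        # hypothetical PLINK binary fileset prefix (.bed/.bim/.fam)
HAPMAP="hapmap_ref"  # hypothetical HapMap reference fileset prefix

qc_plan() {
  # Step 1: drop SNPs with high missingness (--geno) or out of HWE (--hwe).
  echo "plink --bfile $IN --geno 0.01 --hwe 1e-6 --make-bed --out ${IN}_step1"

  # Step 2: estimate pairwise IBD to find duplicates and relatives, then remove
  # the samples listed in a hand-curated file (related_samples.txt is hypothetical).
  echo "plink --bfile ${IN}_step1 --genome --min 0.125 --out ${IN}_ibd"
  echo "plink --bfile ${IN}_step1 --remove related_samples.txt --make-bed --out ${IN}_step2"

  # Step 3: merge with the HapMap cohort and compute principal components.
  echo "plink --bfile ${IN}_step2 --bmerge ${HAPMAP}.bed ${HAPMAP}.bim ${HAPMAP}.fam --make-bed --out ${IN}_merged"
  echo "plink --bfile ${IN}_merged --pca 10 --out ${IN}_pca"
}

qc_plan
```

The pipeline's own scripts wrap this kind of work behind their flags, so treat the sketch as orientation only and check each script for the commands and thresholds it actually uses.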

Principles of Design

The goal of this pipeline was for each step to be highly reusable. Because GWAS data tends to be highly variable, the pipeline needs to stay flexible. It has therefore been broken into three parts, each with its own flags, so that quality control can be run in as few as three lines of code.

Defaults

The pipeline also has easily accessible defaults built into each step. These defaults are always written at the top of each script, right after the shebang line. If you find yourself passing the same flags on every run, open the script in your preferred text editor and change these defaults to suit your needs.
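
As a minimal sketch of this pattern, the script below keeps its defaults directly under the shebang and lets command-line flags override them. The variable names, flag letters (-i, -g, -o), and values are hypothetical and chosen only for illustration; they are not the pipeline's actual flags or defaults.

```shell
#!/bin/bash
# Hypothetical defaults, kept right after the shebang so they are easy to find and edit.
# Names and values are illustrative only, not the pipeline's real settings.
DEFAULT_INPUT="my_study"
DEFAULT_GENO="0.01"
DEFAULT_OUTDIR="./qc_output"

# Start from the defaults, then let any flags given on the command line override them.
resolve_settings() {
  local input="$DEFAULT_INPUT" geno="$DEFAULT_GENO" outdir="$DEFAULT_OUTDIR"
  local opt OPTIND=1
  while getopts "i:g:o:" opt; do
    case "$opt" in
      i) input="$OPTARG" ;;
      g) geno="$OPTARG" ;;
      o) outdir="$OPTARG" ;;
      *) echo "usage: [-i input] [-g geno] [-o outdir]" >&2; return 1 ;;
    esac
  done
  echo "input=$input geno=$geno outdir=$outdir"
}

resolve_settings "$@"
```

With this layout, editing DEFAULT_GENO once at the top of the file has the same effect as passing -g on every run.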