01MissingnessFiltering - WheelerLab/gwasqc_pipeline GitHub Wiki

Missingness Filtering

Missingness filtering is the first step of the pipeline. Its goal is to remove SNPs that are poorly genotyped. It consists of six substeps preceded by an optional prefiltering step (step 0), in the following order:

  0. Optional prefiltering step
  1. Determination of initial missingness benchmark
  2. Create new bfiles based on missingness threshold
  3. Determine new missingness status after filtering
  4. Plot generation and validation of call rate distribution
  5. Calculate and plot Hardy-Weinberg Equilibrium statistics
  6. Filter bfiles by HWE p-values and recalculate statistics

Users can supply the genotyping threshold with the -g (or --geno) option and a number between zero and one. For example, supplying --geno 0.1 filters out SNPs that have a call rate of less than 90%. Since this step is fairly fast, it can easily be rerun with different genotyping thresholds until the result is satisfactory. The results of this step can be evaluated using the missingness plots generated in the plots_stats folder.
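
The relationship between the threshold and the call rate can be sketched with a small helper. The function name min_call_rate is hypothetical and is not part of the pipeline; it only illustrates that a threshold t corresponds to a minimum call rate of (1 - t) * 100 percent.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): convert a missingness
# threshold, as passed to --geno, into the minimum call rate a SNP must reach.
min_call_rate() {
    awk -v t="$1" 'BEGIN {
        if (t < 0 || t > 1) {
            print "threshold must be between 0 and 1" > "/dev/stderr"
            exit 1
        }
        # A threshold of t keeps SNPs with a call rate of at least (1 - t).
        printf "%g%%\n", (1 - t) * 100
    }'
}

min_call_rate 0.1     # prints 90%
min_call_rate 0.01    # prints 99%
min_call_rate 0.001   # prints 99.9%
```

So the default threshold of 0.01 keeps SNPs with a call rate of at least 99%.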

Example

./01MissingnessFiltering -b ~/Data/examplebfile -a --geno 0.001 --hwe 0.001

This runs the script on the bfile set examplebfile.
-a runs autosome filtering on the initial file set.
--geno removes SNPs that have a genotyping rate <99.9%
--hwe removes SNPs that have a Hardy-Weinberg p-value < 0.001

Options

Most, if not all, options used in this pipeline are shared with plink, and their function can be expected to be the same in both.

      -a or --autosome 
          Flag for initial filtering by autosome. By default will not run.
      -b or --bfile 
          Path to the directory containing the bim/bed/fam files, followed by their shared prefix. For example, /path/to/directory/prefix will use prefix.bim, prefix.bed, and prefix.fam
      --bim 
          Full path to the bim file you wish to use. Used when bim/bed/fam do not share their prefix. For example, /path/to/file.bim
      --bed 
          Full path to the bed file you wish to use. Used when bim/bed/fam do not share their prefix. For example, /path/to/file.bed
      --fam 
          Full path to the fam file you wish to use. Used when bim/bed/fam do not share their prefix. For example, /path/to/file.fam
      -g or --geno
          Genotyping call rate threshold used for filtering. The default threshold is 0.01; in other words, SNPs that have a call rate <99% are filtered out.
      -h or --hwe
          Minimum p-value threshold for filtering by Hardy-Weinberg statistics. Note that this performs the plink equivalent of --hardy --hwe [p-val] midp. This threshold can remain relatively low, as serious genotyping errors often yield extreme p-values such as 1e-50, which is what we wish to filter out. The default is currently 0.0001.
      -k or --keep 
          Flag used for prefiltering. Takes a list of individuals you would like to keep. Mutually exclusive with the remove flag.
      -o or --output 
          Directory where you'd like to send all your QC results. By default ~/QC
      -r or --remove
          Flag used for prefiltering. Takes a list of individuals you would like to remove. Mutually exclusive with the keep flag.
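
The two ways of naming the input files (-b versus --bim/--bed/--fam) map onto plink's own --bfile and --bed/--bim/--fam options. A minimal sketch of that mapping follows. The function plink_input_args is hypothetical and only illustrates the idea; the actual script may handle its arguments differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the pipeline): build plink input
# arguments from either a shared prefix or three separate file paths.
# $1 = bfile prefix (or ""), $2 = bed path, $3 = bim path, $4 = fam path
plink_input_args() {
    local prefix="$1" bed="$2" bim="$3" fam="$4"
    if [ -n "$prefix" ]; then
        # Shared prefix: plink finds prefix.bed, prefix.bim and prefix.fam itself.
        echo "--bfile $prefix"
    elif [ -n "$bed" ] && [ -n "$bim" ] && [ -n "$fam" ]; then
        # Separate paths: all three files must be named explicitly.
        echo "--bed $bed --bim $bim --fam $fam"
    else
        echo "supply either a bfile prefix or all of bed, bim and fam" >&2
        return 1
    fi
}

plink_input_args /path/to/directory/prefix "" "" ""
plink_input_args "" /path/to/file.bed /path/to/file.bim /path/to/file.fam
```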

Default Settings

By default, this pipeline performs neither autosome filtering nor keep/remove filtering of individuals unless the corresponding flags are supplied. Of particular use to the user is the Bfile default: if the user wishes to perform multiple analyses on a particular set of bfiles, this default can be changed to point to them. By default, the HWE minimum p-value is set to 0.0001.

Complete Defaults

AutosomeDefault=False
BfileDefault=/home/wheelerlab2/Data/MESA_dbGaP_55081/phg000071.v2.NHLBI_SHARE_MESA.genotype-calls-matrixfmt.c1/SHARE_MESA_c1
GenotypingThresholdDefault=0.01
HWEpvalDefault=0.0001
OutputDirDefault=~/QC
PrefilterDefault=none

File details

All files generated by 01MissingnessFiltering will be placed in the directory $OUTPUTDIR/missingness_hwe_steps or $OUTPUTDIR/plots_stats, with $OUTPUTDIR being whatever is specified by the -o flag or $HOME/QC/ by default

0. Optional prefiltering step

One of five different file sets can be generated depending on which options are used. If one of these file sets is generated, the remaining missingness filtering steps are carried out on it instead of on the starting bfiles.

  • 00autosome_k
    Generated from the starting bfiles using plink's --autosome and --keep options.
  • 00autosome_r
    Generated from the starting bfiles using plink's --autosome and --remove options.
  • 00autosome
    Generated from the starting bfiles using plink's --autosome option.
  • 00filt_k
    Generated from the starting bfiles using plink's --keep option.
  • 00filt_r
    Generated from the starting bfiles using plink's --remove option.
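
The mapping from options to the step 0 file set listed above can be sketched as follows. The function prefilter_prefix and its argument order are hypothetical and serve only to illustrate the naming scheme.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the pipeline's own code): pick the step 0 output
# prefix from the options supplied.
# $1 = "true" if -a/--autosome was given
# $2 = path given to --keep, or ""
# $3 = path given to --remove, or ""
prefilter_prefix() {
    local autosome="$1" keep="$2" remove="$3" suffix=""
    if [ -n "$keep" ] && [ -n "$remove" ]; then
        echo "--keep and --remove are mutually exclusive" >&2
        return 1
    fi
    [ -n "$keep" ] && suffix="_k"
    [ -n "$remove" ] && suffix="_r"
    if [ "$autosome" = "true" ]; then
        echo "00autosome${suffix}"
    elif [ -n "$suffix" ]; then
        echo "00filt${suffix}"
    else
        # No prefiltering requested: later steps use the starting bfiles.
        echo ""
    fi
}

prefilter_prefix true keep_list.txt ""    # prints 00autosome_k
prefilter_prefix false "" drop_list.txt   # prints 00filt_r
```

An empty result means that no prefiltered file set is written.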

1. Determination of initial missingness benchmark

  • 01initial_missingness
    An estimate of the call rate of individual SNPs within the initial bfiles. Generated using the plink --missing option.

2. Create new bfiles based on missingness threshold

  • 02geno_0.01_filtered or 02geno_${Geno}_filtered
    Bfiles generated by filtering out SNPs whose missingness exceeds the genotyping threshold. Generated using the plink --geno option.

3. Determine new missingness status after filtering

  • 03missingness_validation
    Creates new estimates of the genotyping rate from 02geno_${Geno}_filtered bfiles. Generated using the plink --missing option.

4. Plot generation and validation of call rate distribution

  • /plots_stats/callRateDistributions.pdf
    A plot of the distribution of call rate before and after filtering by missingness. Generated using the Rscript CallRateDistributions.R

5. Calculate and plot Hardy-Weinberg Equilibrium statistics

  • 04initial_HWE_stats
    An hwe file containing Hardy-Weinberg statistics for the sample population and their associated p-values. Generated using the plink --hardy option.
  • hwestatsinitial.txt/.pdf
    A text file containing summary statistics for HWE, as well as a plot of the distribution of HWE among SNPs. Generated using the Rscript hwe.R

6. Filter bfiles by hwe pvalues and recalculate statistics.

  • 05filtered_HWE
    A new set of bfiles that has been filtered to remove HWE outliers, along with a fresh estimate of the HWE statistics. Generated using the plink --hardy and --hwe options.

  • hwestatsfiltered.txt/.pdf
    A text file containing summary statistics for HWE, as well as a plot of the distribution of HWE among SNPs after filtering. Generated using the Rscript hwe.R
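
Taken together, the plink calls behind steps 1, 2, 3, 5 and 6 can be approximated as below. This is a dry-run sketch that only prints commands. The function print_plink_steps is hypothetical, and the exact commands the script runs may differ. It assumes the HWE steps read the bfiles produced in step 2. The R plotting steps are left out because their arguments are not documented here.

```shell
#!/usr/bin/env bash
# Dry-run sketch (an assumption, not the pipeline's own code): print plink
# commands roughly equivalent to steps 1, 2, 3, 5 and 6.
# $1 = input bfile prefix, $2 = output directory,
# $3 = --geno threshold,   $4 = --hwe p-value
print_plink_steps() {
    local input="$1" steps="$2/missingness_hwe_steps" geno="$3" hwe="$4"
    local filtered="$steps/02geno_${geno}_filtered"
    # Step 1: missingness of the starting (or prefiltered) bfiles.
    echo "plink --bfile $input --missing --out $steps/01initial_missingness"
    # Step 2: write new bfiles without SNPs above the missingness threshold.
    echo "plink --bfile $input --geno $geno --make-bed --out $filtered"
    # Step 3: recompute missingness on the filtered bfiles.
    echo "plink --bfile $filtered --missing --out $steps/03missingness_validation"
    # Step 5: Hardy-Weinberg statistics before HWE filtering.
    echo "plink --bfile $filtered --hardy --out $steps/04initial_HWE_stats"
    # Step 6: filter by HWE p-value (mid-p) and recompute the statistics.
    echo "plink --bfile $filtered --hardy --hwe $hwe midp --make-bed --out $steps/05filtered_HWE"
}

print_plink_steps ~/Data/examplebfile ~/QC 0.01 0.0001
```

Printing the commands first makes it easy to check the thresholds and output names before running anything against real data.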
