01MissingnessFiltering - WheelerLab/gwasqc_pipeline GitHub Wiki
Missingness filtering is the first step of the pipeline. Its goal is to remove SNPs that are poorly genotyped. It consists of 5 additional substeps with an optional 0th step. Missingness filtering proceeds as follows:
- Optional prefiltering step
- Determination of initial missingness benchmark
- Create new bfiles based on missingness threshold
- Determine new missingness status after filtering
- Plot generation and validation of call rate distribution
- Calculate and plot Hardy-Weinberg Equilibrium statistics
- Filter bfiles by hwe pvalues and recalculate statistics.
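The route above can be sketched as a sequence of plink invocations. This is only an illustration, not the pipeline's actual code: the function name `build_plink_steps` is hypothetical, the `--make-bed` flag and the choice of input for each step are assumptions, and the output names follow the file names documented further down this page. The function only builds the argument lists; it does not run plink.

```python
from pathlib import Path


def build_plink_steps(bfile, outdir, geno=0.01, hwe=0.0001):
    """Return plink argument lists for substeps 01 to 05 (nothing is executed)."""
    # Assumption: the numbered file sets live in $OUTPUTDIR/missingness_hwe_steps.
    work = Path(outdir) / "missingness_hwe_steps"
    geno_set = str(work / f"02geno_{geno}_filtered")
    return [
        # 01: call rate of each SNP in the starting bfiles
        ["plink", "--bfile", bfile, "--missing",
         "--out", str(work / "01initial_missingness")],
        # 02: new bfiles without SNPs whose missingness exceeds the threshold
        ["plink", "--bfile", bfile, "--geno", str(geno), "--make-bed",
         "--out", geno_set],
        # 03: missingness recomputed on the filtered bfiles
        ["plink", "--bfile", geno_set, "--missing",
         "--out", str(work / "03missingness_validation")],
        # 04: Hardy-Weinberg statistics before HWE filtering
        # (assumption: computed on the geno-filtered set)
        ["plink", "--bfile", geno_set, "--hardy",
         "--out", str(work / "04initial_HWE_stats")],
        # 05: bfiles filtered by HWE p-value, with fresh statistics
        ["plink", "--bfile", geno_set, "--hardy", "--hwe", str(hwe), "midp",
         "--make-bed", "--out", str(work / "05filtered_HWE")],
    ]


if __name__ == "__main__":
    for cmd in build_plink_steps("data/examplebfile", "QC"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check the thresholds and output names before handing each list to something like `subprocess.run(cmd, check=True)`.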
Users can supply the genotyping threshold with the -g (or --geno) option and a number between zero and one. For example, supplying -g 0.1 filters out SNPs that have a call rate of less than 90%. Since this step is fairly fast, it can easily be rerun with different genotyping thresholds until the result is satisfactory. The results of this step can be evaluated using the missingness plots generated in the QCstats folder.
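To make the relationship between the threshold and the call rate concrete, here is a minimal sketch. It assumes the plink behaviour of removing a SNP when its missing rate exceeds the threshold; the helper names and the example missing rates are made up for illustration and do not come from the pipeline.

```python
def min_call_rate(geno):
    """Smallest call rate a SNP may have and still pass the given threshold."""
    if not 0.0 <= geno <= 1.0:
        raise ValueError("threshold must be between zero and one")
    return 1.0 - geno


def count_retained(missing_rates, thresholds):
    """For each threshold, count the SNPs whose missing rate does not exceed it."""
    return {g: sum(1 for m in missing_rates if m <= g) for g in thresholds}


# Hypothetical per-SNP missing rates, for illustration only.
EXAMPLE_MISSING_RATES = [0.0, 0.002, 0.02, 0.08, 0.25]

if __name__ == "__main__":
    counts = count_retained(EXAMPLE_MISSING_RATES, [0.1, 0.01, 0.001])
    for geno, kept in counts.items():
        print(f"geno {geno}: call rate >= {min_call_rate(geno):.1%}, {kept} SNPs kept")
```

Sweeping a few thresholds like this mirrors rerunning the step with different values and comparing how many SNPs survive each one.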
./01MissingnessFiltering -b ~/Data/examplebfile -a --geno 0.001 --hwe 0.001
Will run the script on the bfile set examplebfile.
-a will run autosome filtering on the initial file set.
--geno removes SNPs that have a genotyping rate <99.9%
--hwe removes SNPs that have a Hardy-Weinberg p-value < 0.001
Most, if not all, options used in this pipeline are shared with plink, and they can be expected to behave the same way in both.
-a or --autosome
Flag for initial filtering by autosome. By default will not run.
-b or --bfile
Path to the directory containing the bim/bed/fam files, followed by their shared prefix. For example, /path/to/directory/prefix will use prefix.bim, prefix.bed, and prefix.fam
--bim
Full path to the bim file you wish to use, used when bim/bed/fam do not share their prefix. For example, /path/to/file.bim
--bed
Full path to the bed file you wish to use, used when bim/bed/fam do not share their prefix. For example, /path/to/file.bed
--fam
Full path to the fam file you wish to use, used when bim/bed/fam do not share their prefix. For example, /path/to/file.fam
-g or --geno
Genotyping call rate threshold used for filtering. By default uses a threshold of 0.01; in other words, it filters out SNPs that have a call rate <99%
-h or --hwe
Minimum p-value threshold for filtering by Hardy-Weinberg statistics. Note that this performs the plink equivalent of --hardy --hwe [p-val] midp. This threshold can remain relatively low, as serious genotyping errors often yield extreme p-values such as 1e-50, which are what we wish to filter out. The default is currently set to 0.0001
-k or --keep
Flag used for prefiltering. List of individuals you would like to keep. Mutually exclusive with the remove flag.
-o or --output
Directory where you'd like to send all your QC results. By default ~/QC
-r or --remove
Flag used for prefiltering. List of individuals you would like to remove. Mutually exclusive with the keep flag.
By default, this pipeline performs neither autosome filtering nor keep/remove prefiltering of individuals; these only run when the corresponding flags are supplied. Of particular use to the user is the Bfile default. If the user wishes to perform multiple analyses on a particular set of bfiles, this default can be changed to meet their needs. By default, the HWE minimum is set to 0.0001.
Complete Defaults
AutosomeDefault=False
BfileDefault=/home/wheelerlab2/Data/MESA_dbGaP_55081/phg000071.v2.NHLBI_SHARE_MESA.genotype-calls-matrixfmt.c1/SHARE_MESA_c1
GenotypingThresholdDefault=0.01
HWEpvalDefault=0.0001
OutputDirDefault=~/QC
PrefilterDefault=none
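For readers who want to see the options and defaults in one place, here is a minimal sketch of an equivalent command-line interface using Python's `argparse`. It is not the pipeline's own argument handling; the function name `build_parser` is hypothetical. The automatic help option is disabled because `-h` is used for the HWE threshold here.

```python
import argparse
import os

# Default bfile prefix as listed under Complete Defaults.
BFILE_DEFAULT = (
    "/home/wheelerlab2/Data/MESA_dbGaP_55081/"
    "phg000071.v2.NHLBI_SHARE_MESA.genotype-calls-matrixfmt.c1/SHARE_MESA_c1"
)


def build_parser():
    """Build a parser that mirrors the documented options and defaults."""
    # add_help=False frees -h so it can be used for the HWE threshold.
    parser = argparse.ArgumentParser(prog="01MissingnessFiltering", add_help=False)
    parser.add_argument("-a", "--autosome", action="store_true",
                        help="filter the initial file set to autosomes")
    parser.add_argument("-b", "--bfile", default=BFILE_DEFAULT,
                        help="shared prefix of the bim/bed/fam files")
    parser.add_argument("--bim", help="full path to the bim file")
    parser.add_argument("--bed", help="full path to the bed file")
    parser.add_argument("--fam", help="full path to the fam file")
    parser.add_argument("-g", "--geno", type=float, default=0.01,
                        help="genotyping threshold")
    parser.add_argument("-h", "--hwe", type=float, default=0.0001,
                        help="minimum Hardy-Weinberg p-value")
    parser.add_argument("-o", "--output", default=os.path.expanduser("~/QC"),
                        help="directory for QC results")
    # --keep and --remove are documented as mutually exclusive.
    prefilter = parser.add_mutually_exclusive_group()
    prefilter.add_argument("-k", "--keep", help="list of individuals to keep")
    prefilter.add_argument("-r", "--remove", help="list of individuals to remove")
    return parser


if __name__ == "__main__":
    example = ["-b", "~/Data/examplebfile", "-a", "--geno", "0.001", "--hwe", "0.001"]
    print(build_parser().parse_args(example))
```

Placing keep and remove in a mutually exclusive group means a command line that supplies both is rejected up front rather than silently choosing one.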
All files generated by 01MissingnessFiltering will be placed in the directory $OUTPUTDIR/missingness_hwe_steps or $OUTPUTDIR/plots_stats, with $OUTPUTDIR being whatever is specified by the -o flag or $HOME/QC/ by default.
One of five different file sets can be generated depending on which options are used. If one of these file sets is generated, the remaining missingness filtering is carried out on it.
- 00autosome_k: Generated from the starting bfiles using plink's --autosome and --keep options.
- 00autosome_r: Generated from the starting bfiles using plink's --autosome and --remove options.
- 00autosome: Generated from the starting bfiles using plink's --autosome option.
- 00filt_k: Generated from the starting bfiles using plink's --keep option.
- 00filt_r: Generated from the starting bfiles using plink's --remove option.
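The mapping from options to the five prefilter file sets can be written down as a small function. This is a sketch of the naming rule described above, not code taken from the pipeline; `prefilter_name` is a hypothetical helper.

```python
def prefilter_name(autosome=False, keep=None, remove=None):
    """Return the name of the prefiltered file set, or None if no prefiltering runs."""
    if keep and remove:
        raise ValueError("keep and remove are mutually exclusive")
    # _k marks a keep list, _r a remove list, no suffix means neither was given.
    suffix = "_k" if keep else "_r" if remove else ""
    if autosome:
        return "00autosome" + suffix
    if suffix:
        return "00filt" + suffix
    return None  # the starting bfiles are used as they are


if __name__ == "__main__":
    print(prefilter_name(autosome=True, keep="keep.txt"))
    print(prefilter_name(remove="remove.txt"))
    print(prefilter_name())
```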
- 01initial_missingness: An estimate of the call rate of individual SNPs within the initial bfiles. Generated using the plink --missing option.
- 02geno_0.01_filtered or 02geno_${Geno}_filtered: Bfiles generated by filtering out SNPs whose missingness exceeds the genotyping threshold. Generated using the plink --geno option.
- 03missingness_validation: New estimates of the genotyping rate from the 02geno_${Geno}_filtered bfiles. Generated using the plink --missing option.
- /plots_stats/callRateDistributions.pdf: A plot of the distribution of call rate before and after filtering by missingness. Generated using the Rscript CallRateDistributions.R.
- 04initial_HWE_stats: An hwe file containing Hardy-Weinberg statistics for the sample population and their associated p-values. Generated using the plink --hardy option.
- hwestatsinitial.txt/.pdf: A text file containing summary statistics for HWE as well as a plot of the distribution of HWE among SNPs. Generated using the Rscript hwe.R.
- 05filtered_HWE: A new set of bfiles filtered to remove HWE outliers, together with a fresh estimate of the HWE statistics. Made using the plink --hardy and --hwe options.
- hwestatsfiltered.txt/.pdf: A text file containing summary statistics for HWE as well as a plot of the distribution of HWE among SNPs. Generated using the Rscript hwe.R.
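The missingness outputs can also be inspected without the R scripts. The sketch below reads a plink `.lmiss` report and converts each SNP's missing rate to a call rate. It assumes the whitespace-delimited layout written by plink's --missing option, with `SNP` and `F_MISS` columns; the function name and the two example rows are made up for illustration.

```python
import io


def read_call_rates(handle):
    """Map each SNP ID to its call rate (1 - F_MISS) from an .lmiss report."""
    header = handle.readline().split()
    snp_col = header.index("SNP")
    miss_col = header.index("F_MISS")
    call_rates = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        call_rates[fields[snp_col]] = 1.0 - float(fields[miss_col])
    return call_rates


# Hypothetical report contents, for illustration only.
EXAMPLE_LMISS = """ CHR  SNP  N_MISS  N_GENO  F_MISS
   1  rs1       0     100       0
   1  rs2       5     100    0.05
"""

if __name__ == "__main__":
    for snp, rate in read_call_rates(io.StringIO(EXAMPLE_LMISS)).items():
        print(f"{snp}: call rate {rate:.2%}")
```

For a real run, the same function could be given an open file handle for the `.lmiss` file written alongside 01initial_missingness or 03missingness_validation.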