02RelatednessFiltering - WheelerLab/gwasqc_pipeline GitHub Wiki

The purpose of relatedness filtering is to remove duplicate records and individuals who may be highly related. Relatedness filtering consists of five basic parts:

Create a pruned list of SNP IDs
Determine the identity by descent of samples
Check the heterozygosity distribution of samples and the existence of outlier samples
LD prune samples
Optionally remove duplicate or related samples & regenerate heterozygosity estimates from filtered samples

Removing samples based on relatedness is optional because some sample groups are too small to lose a large number of individuals to cryptic relatedness. In that case the samples are simply LD pruned and heterozygosity outliers are noted.

Example

./02RelatednessFiltering -b $DATA/examplebfiles --rel 0.25

In this example, the bfiles with the prefix examplebfiles undergo LD pruning and heterozygosity estimation, and individuals with a relatedness estimate greater than 0.25 are removed.

Options

      -b or --bfile 
          Path to the directory containing the bim/bed/fam files, followed by their shared prefix. For ex /path/to/directory/prefix will use prefix.bim, prefix.bed, and prefix.fam
      --bim 
          Full path to the bim file you wish to use, used when bim/bed/fam do not share their prefix. For ex /path/to/file.bim
      --bed 
          Full path to the bed file you wish to use, used when bim/bed/fam do not share their prefix. For ex /path/to/file.bed
      --fam 
          Full path to the fam file you wish to use, used when bim/bed/fam do not share their prefix. For ex /path/to/file.fam
      -k or --keep 
          List of individuals you would like to keep. Mutually exclusive with the remove flag.
      -o or --output 
          Directory where you'd like to send all your QC results. By default ~/QC
      --offplotibd 
          Plotting IBD is typically necessary only once per data set. Since it can also take a considerable amount of time for a large population, this option turns off IBD plotting for a particular run.
      --rel or --relatedness
          Relatedness threshold you'd like to filter by, a number zero to one. Will remove individuals who show identity by descent of greater than threshold.
      -r or --remove
          Flag used for prefiltering. List of individuals you would like to remove. Mutually exclusive with the keep flag.

Default Settings

By default this script uses the set of bfiles generated in 01MissingnessFiltering; the exact path is listed as BfileDefault below. The relatedness threshold defaults to 1 (RelatednessDefault below), which leaves related individuals in place unless a lower threshold, such as the 0.25 used in the example above, is passed with --rel.

Complete Defaults

BfileDefault=~/QC/missingness_hwe_steps/05filtered_HWE
PrefilterDefault=none
OutputDirDefault=~/QC
RelatednessDefault=1
RunIBD=true

File Details

1. Creates a pruned list of SNP IDs

01LD_prune_list
Contains the list of SNPs to keep or remove based on linkage disequilibrium pruning. Created using the plink option --indep-pairwise with the inputs 50 5 0.3 (a window of 50 SNPs, a step of 5 SNPs, and an r^2 threshold of 0.3).
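
As a rough illustration of this step, the call below shows what an --indep-pairwise run with these inputs could look like. The function name make_prune_list and the PLINK variable are hypothetical helpers introduced here, not part of the pipeline; the pipeline's own invocation may differ.

```shell
# Hypothetical helper, not taken from the pipeline.
# Set PLINK=echo for a dry run that only prints the arguments.
make_prune_list() {
    # $1 = input bfile prefix, $2 = output prefix (e.g. .../01LD_prune_list)
    "${PLINK:-plink}" --bfile "$1" --indep-pairwise 50 5 0.3 --out "$2"
}

# Example (commented out so that sourcing this file runs nothing):
# make_prune_list "$DATA/examplebfiles" ~/QC/01LD_prune_list
```

plink writes the SNPs to keep to a file ending in .prune.in and the SNPs to drop to a file ending in .prune.out, both under the given output prefix.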

2. Determine the identity by descent of samples

02relatedness
Pairwise identity-by-descent estimates for the samples. Generated using the plink --genome option.
IBD.png
A graph of the identity by descent of samples. Generated from the .genome file using the Rscript ibd.R
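
To make the IBD output easier to inspect, here is a minimal sketch that runs --genome and then lists the sample pairs whose PI_HAT exceeds a threshold. The function names run_ibd and related_pairs and the PLINK variable are assumptions made for this example, and the 0.25 default simply mirrors the example at the top of this page. It assumes the standard .genome layout with a header row containing a PI_HAT column.

```shell
# Hypothetical helpers, not taken from the pipeline.
run_ibd() {
    # $1 = input bfile prefix, $2 = output prefix (plink appends .genome)
    "${PLINK:-plink}" --bfile "$1" --genome --out "$2"
}

related_pairs() {
    # $1 = .genome file, $2 = PI_HAT threshold (assumed default: 0.25)
    awk -v t="${2:-0.25}" '
        # Locate the PI_HAT column from the header instead of hard-coding it.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "PI_HAT") c = i
                  if (!c) { print "no PI_HAT column" > "/dev/stderr"; exit 1 }
                  next }
        # Print FID1 IID1 FID2 IID2 PI_HAT for pairs above the threshold.
        $c + 0 > t + 0 { print $1, $2, $3, $4, $c }
    ' "$1"
}

# Example (commented out):
# run_ibd "$DATA/examplebfiles" ~/QC/02relatedness
# related_pairs ~/QC/02relatedness.genome 0.25
```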

3. Check the heterozygosity distribution of samples and the existence of outlier samples

03het_unfiltered
Heterozygosity estimates of the unfiltered bfiles. Generated using the plink option --het
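
One way to look for outlier samples in the --het output is sketched below. It flags samples whose F value lies more than a chosen number of standard deviations from the mean. The 3 standard deviation default is a common convention assumed for this example and is not necessarily the rule the pipeline applies. The names run_het, het_outliers, and PLINK are hypothetical.

```shell
# Hypothetical helpers, not taken from the pipeline.
run_het() {
    # $1 = input bfile prefix, $2 = output prefix (plink appends .het)
    "${PLINK:-plink}" --bfile "$1" --het --out "$2"
}

het_outliers() {
    # $1 = .het file, $2 = number of standard deviations (assumed default: 3)
    awk -v k="${2:-3}" '
        # Locate the F column from the header row.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "F") c = i
                  if (!c) { print "no F column" > "/dev/stderr"; exit 1 }
                  next }
        # Store each sample and accumulate sums for the mean and variance.
        NF { n++; id[n] = $1 " " $2; f[n] = $c; sum += $c; sumsq += $c * $c }
        END {
            if (n < 2) exit
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)   # sample variance
            sd = (var > 0) ? sqrt(var) : 0
            for (r = 1; r <= n; r++)
                if (f[r] + 0 > mean + k * sd || f[r] + 0 < mean - k * sd)
                    print id[r], f[r]                   # FID IID F
        }
    ' "$1"
}

# Example (commented out):
# run_het "$DATA/examplebfiles" ~/QC/03het_unfiltered
# het_outliers ~/QC/03het_unfiltered.het
```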

4. LD prune samples

04LD_pruned
The bfiles after LD pruning of the SNPs. Generated using the plink option --extract
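
A sketch of how the pruned SNP list from step 1 might be applied with --extract follows. The function name apply_ld_prune and the PLINK variable are hypothetical, and the example assumes the list comes from a .prune.in file written by --indep-pairwise.

```shell
# Hypothetical helper, not taken from the pipeline.
apply_ld_prune() {
    # $1 = input bfile prefix
    # $2 = prefix used for the prune list in step 1 (plink wrote $2.prune.in)
    # $3 = output prefix for the pruned bfiles
    "${PLINK:-plink}" --bfile "$1" --extract "$2.prune.in" --make-bed --out "$3"
}

# Example (commented out):
# apply_ld_prune "$DATA/examplebfiles" ~/QC/01LD_prune_list ~/QC/04LD_pruned
```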

5. Optionally remove duplicate or related samples & regenerate heterozygosity estimates from filtered samples

05without_relateds
Bfiles with related individuals removed, based on the relatedness cutoff provided by the user. Generated using plink's --rel-cutoff option.
06het_without_relateds
Heterozygosity estimates of the population after related individuals have been removed.
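
The final step could be sketched as follows, assuming plink 1.9 and a two-pass approach: --rel-cutoff first writes the retained sample IDs to a .rel.id file, and --keep then builds the filtered bfiles. The pipeline may do this in a single call. The names remove_relateds, rerun_het, and PLINK are hypothetical. Note that plink's --rel-cutoff works from a genomic relationship matrix rather than from the PI_HAT column of the .genome file.

```shell
# Hypothetical helpers, not taken from the pipeline.
remove_relateds() {
    # $1 = input bfile prefix (assumed here to be the LD-pruned set)
    # $2 = relatedness cutoff, $3 = output prefix
    "${PLINK:-plink}" --bfile "$1" --rel-cutoff "$2" --out "$3" &&
    "${PLINK:-plink}" --bfile "$1" --keep "$3.rel.id" --make-bed --out "$3"
}

rerun_het() {
    # $1 = filtered bfile prefix, $2 = output prefix (plink appends .het)
    "${PLINK:-plink}" --bfile "$1" --het --out "$2"
}

# Example (commented out):
# remove_relateds ~/QC/04LD_pruned 0.25 ~/QC/05without_relateds
# rerun_het ~/QC/05without_relateds ~/QC/06het_without_relateds
```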