02RelatednessFiltering - WheelerLab/gwasqc_pipeline GitHub Wiki
The purpose of relatedness filtering is to remove duplicate records and individuals who may be highly related. Relatedness filtering consists of five basic parts:
- Create a pruned list of SNP IDs
- Determine the identity by descent of samples
- Check the heterozygosity distribution of samples and the existence of outlier samples
- LD prune samples
- Optionally remove duplicate or related samples & regenerate heterozygosity estimates from filtered samples
Removing samples based on relatedness is optional because some sample groups may be too small to lose a large number of individuals to cryptic relatedness. In that case the samples are simply LD pruned and heterozygosity outliers are noted.
Example
./02RelatednessFiltering -b $DATA/examplebfiles --rel 0.25
In this example, the bfiles with the prefix examplebfiles undergo LD pruning and heterozygosity estimation, and individuals with a relatedness estimate >0.25 are removed.
Options
-b or --bfile
Path to the directory containing the bim/bed/fam files, followed by their shared prefix. For example, /path/to/directory/prefix will use prefix.bim, prefix.bed, and prefix.fam
--bim
Full path to the bim file you wish to use. Used when bim/bed/fam do not share their prefix. For example, /path/to/file.bim
--bed
Full path to the bed file you wish to use. Used when bim/bed/fam do not share their prefix. For example, /path/to/file.bed
--fam
Full path to the fam file you wish to use. Used when bim/bed/fam do not share their prefix. For example, /path/to/file.fam
-k or --keep
List of individuals you would like to keep. Mutually exclusive with --remove.
-o or --output
Directory where you'd like to send all your QC results. Defaults to ~/QC
--offplotibd
Plotting IBD is typically necessary only once per data set. Since it can also take a considerable amount of time for a large population, this option turns off IBD plotting for a particular run.
--rel or --relatedness
Relatedness threshold you'd like to filter by, a number between zero and one. Individuals whose estimated identity by descent exceeds the threshold are removed.
-r or --remove
Flag used for prefiltering. List of individuals you would like to remove. Mutually exclusive with --keep.
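As an illustration of combining these options, the sketch below assembles a call that points at separately named bim/bed/fam files, sets a threshold, skips IBD plotting, and writes to a custom output directory. The file names, the 0.125 threshold, and the output directory are hypothetical placeholders, and the function only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical locations; replace with your own paths.
DATA="${DATA:-/path/to/data}"
OUT="${OUT:-$HOME/QC_run2}"

# Print (rather than run) a 02RelatednessFiltering call that uses
# separately named bim/bed/fam files and skips the IBD plot.
relatedness_cmd() {
    echo ./02RelatednessFiltering \
        --bim "$DATA/variants.bim" \
        --bed "$DATA/genotypes.bed" \
        --fam "$DATA/samples.fam" \
        --rel 0.125 \
        --offplotibd \
        -o "$OUT"
}

relatedness_cmd
```

Once the printed line looks right, it can be pasted into the shell from the directory that holds the script.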
Default Settings
By default this script uses a set of bfiles generated in 01MissingnessFiltering named QCStep2. By default it filters out individuals whose relatedness estimate is greater than 0.25.
Complete Defaults
BfileDefault=~/QC/missingness_hwe_steps/05filtered_HWE
PrefilterDefault=none
OutputDirDefault=~/QC
RelatednessDefault=1
RunIBD=true
File Details
1. Create a pruned list of SNP IDs
- 01LD_prune_list
Contains a list of SNPs that can be kept or removed based on linkage disequilibrium pruning. Created using the plink option --indep-pairwise with the arguments 50 5 0.3 (a window of 50 SNPs, a step of 5 SNPs, and an r^2 threshold of 0.3).
2. Determine the identity by descent of samples
- 02relatedness
An analysis of the sample genomes to determine the identity by descent of individuals. Generated using the plink --genome option.
- IBD.png
A graph of the identity by descent of samples. Generated from the .genome file using the Rscript ibd.R
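To inspect the IBD estimates without plotting, related pairs can be pulled straight out of the .genome table. The sketch below is not part of the pipeline; it relies only on the standard plink .genome layout, in which PI_HAT is the tenth column. The function name and the default threshold of 0.25 are assumptions made for illustration.

```shell
#!/bin/sh
# List sample pairs whose PI_HAT (column 10 of a plink .genome file)
# exceeds a threshold. Usage: related_pairs FILE [THRESHOLD]
related_pairs() {
    awk -v thr="${2:-0.25}" '
        NR == 1 { next }                                  # skip the header row
        $10 + 0 > thr + 0 { print $1, $2, $3, $4, $10 }   # FID1 IID1 FID2 IID2 PI_HAT
    ' "$1"
}

# Example (the file name assumes plink was run with --out 02relatedness):
#   related_pairs 02relatedness.genome 0.25
```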
3. Check the heterozygosity distribution of samples and the existence of outlier samples
- 03het_unfiltered
Heterozygosity estimates of the unfiltered bfile. Generated using the plink option --het
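One way to note heterozygosity outliers from a .het file is to flag samples whose F estimate (the sixth column in plink's --het output) lies far from the sample mean. The sketch below is an assumption about how outliers might be defined, not the pipeline's own rule: the function name and the default cutoff of 3 standard deviations are illustrative choices.

```shell
#!/bin/sh
# Flag samples whose F (column 6 of a plink .het file) is more than
# K standard deviations from the mean. Usage: het_outliers FILE [K]
# The file is read twice: pass 1 accumulates sums, pass 2 prints outliers.
het_outliers() {
    awk -v k="${2:-3}" '
        FNR == 1 { next }                                   # skip header in both passes
        NR == FNR { n++; sum += $6; sumsq += $6 * $6; next }  # pass 1: accumulate
        !done {
            mean = sum / n
            var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
            if (var < 0) var = 0                            # guard against rounding
            sd = sqrt(var)
            done = 1
        }
        $6 > mean + k * sd || $6 < mean - k * sd { print $1, $2, $6 }
    ' "$1" "$1"
}

# Example (the file name assumes plink was run with --out 03het_unfiltered):
#   het_outliers 03het_unfiltered.het 3
```

With very small sample groups a 3 SD cutoff can never be reached, so a smaller K may be more useful there.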
4. LD prune samples
- 04LD_pruned
Bfiles restricted to the LD-pruned SNPs. Generated using the plink option --extract
5. Optionally remove duplicate or related samples & regenerate heterozygosity estimates from filtered samples
- 05without_relateds
Bfiles with related individuals removed, based on the relatedness cutoff provided by the user. Generated using plink's --rel-cutoff option.
- 06het_without_relateds
Heterozygosity estimates of the population after related individuals have been removed.
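The files above can be tied together with a rough sketch of the plink calls that would produce them. This is a reconstruction under assumptions, not the script's actual code: the plink options and output names come from the descriptions above, the .prune.in suffix is what plink writes for --indep-pairwise, and which input each step reads (for example, running --genome on the pruned SNP set) is a guess. The function only prints the commands.

```shell
#!/bin/sh
# Hypothetical input prefix, output directory, and cutoff; adjust to your data.
BFILE="${BFILE:-$HOME/QC/missingness_hwe_steps/05filtered_HWE}"
OUT="${OUT:-$HOME/QC}"
REL="${REL:-0.25}"

# Print each plink call in order; nothing is executed.
relatedness_steps() {
    # 1. LD prune list
    echo "plink --bfile $BFILE --indep-pairwise 50 5 0.3 --out $OUT/01LD_prune_list"
    # 2. Identity by descent (assumed here to use the pruned SNPs)
    echo "plink --bfile $BFILE --extract $OUT/01LD_prune_list.prune.in --genome --out $OUT/02relatedness"
    # 3. Heterozygosity of the unfiltered bfile
    echo "plink --bfile $BFILE --het --out $OUT/03het_unfiltered"
    # 4. Bfiles restricted to the pruned SNPs
    echo "plink --bfile $BFILE --extract $OUT/01LD_prune_list.prune.in --make-bed --out $OUT/04LD_pruned"
    # 5. Optional removal of related samples, then heterozygosity again
    echo "plink --bfile $OUT/04LD_pruned --rel-cutoff $REL --make-bed --out $OUT/05without_relateds"
    echo "plink --bfile $OUT/05without_relateds --het --out $OUT/06het_without_relateds"
}

relatedness_steps
```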