03MergeHapMap - WheelerLab/gwasqc_pipeline GitHub Wiki

Merge Hapmap performs multiple functions. As its name suggests it primary handles merging our sample data with a hapmap population for pca. Understandably then, it also runs pca but before these the merge script also creates the bfile set based on our relatedness threshold. This script has seven parts as well as an optional prefiltering step. Step 0 is meant to provide flexibility on removing heterozygous outliers from your data set.

  1. Optionally remove outliers, translate snp IDs to rsID format, and/or perform liftover
  2. Create new bfiles based on whatever relatedness threshold you decided
  3. An optional Liftover step, depending if your hapmap build is out of date or not.
  4. Create an initial merge of hapmap, determining what snps are missing
  5. create new hapmap bfiles without those snps that are missing
  6. Merge again without the missing snps
  7. Filter your merged files based on genotyping rate and minor allele frequency
  8. Run PCA

Please note the importance of optional step zero. While it is not always necessary to remove samples, it should always be considered necessary to translate snp IDs. While step says translation must be to rsID, in reality that is not the case. What's most important is that SNP IDs are the same format between reference population and experimental population (That is, between the populations we are merging together.) RsIDs are simply and currently the most common format.

Note on the affy option Please be aware that the affy option has a limited scope. While it will likely work with the test data and with affymetrix IDs in general, this may not always be the case. Because of the variety of variant IDs and mapping files it is not feasible to code for all possible translations in this pipeline. For best results the user should translate the IDs themselves to their desired format. Feel free to contact the developer for tips in this regard.

Example

~/gwasqc_pipeline/shellscripts/03MergeHapMap -h ~/HAPMAP3_hg18/ --affy /home/peter/AA_GAIN_SCZ/summarystatistics.txt -b ~/QC/relatedness_steps/05without_relateds 

Complete Options

      --affy
          Flag signaling that snp IDs are not in rsID format and must be translated. Originally implemented to translate the affymetrix format to rsID format, it is now more universally applied to translate all SNP ID's to rsID format. This relies on the use of a summarystatistics.txt file which contains information on the chromosome and locus of all snps being translated. SNPs that cannot be translated maintain their original untranslated name.
      -b or --bfile 
          Path to the directory containing bim/bed/fam files as well as their shared prefix for ex /path/to/directory/prefix covers will use prefix.bim, prefix.bed, and prefix.fam
      --bim 
          Full path to the bim file you wish to use, used when bim/bed/fam do not share their prefix. For ex /path/to/file.bim
      --bed 
          Full path to the bed file you wish to use, used when bim/bed/fam do not share their prefix. For ex /path/to/file.bed
      --fam 
          Full path to the bim file you wish to use, used when bim/bed/fam do not share their prefix. For ex /path/to/file.fam
      -g or --geno
          Genotyping call rate threshold used for filtering. By default uses a threshold of 0.01 in other words filters out snps that have a call rate <99%
      -h or --hapmap
          Full path to the directory containing hapmap bim/bed/fam files, assumes they are all in the same directory. Do not have to have to the same name
      --liftover
          Calls the liftover.py script to perform liftover and synchonize the files prior to any of the rest of the script.
      --maf
          Minor allele frequency threshold, defines the minimum maf you'd like to filter by
      -o or --output 
          Directory where you'd like to send all your QC results. By Default ~/QC
      -pca 
          Number of principle components you would like to calculate
      -r or  --remove
          When provided with a list of individuals, will remove these individuals and make new bfiles without them. In particular, this is used to optionally remove heterozygous outliers from the data set.

Defaults

By default the pipeline does not run Affymetrix translation of the SNP IDs to rsIDs. It also filters snps by call rate of 0.01 and maf of 0.05. By default it calculates 10 PCs

Complete Defaults

AutosomeDefault=False
BfileDefault=~/QC/relatedness_steps/04without_relateds
GenotypingThresholdDefault=0.01
HapMapDirDefault=/home/wheelerlab2/Data/HAPMAP3_hg18/
LiftOverDefault=False
MafDefault=0.05
OutputDirDefault=~/QC
PCADefault=10
PrefilterDefault=none
RunAffyDefault=F