05 Make QQ plots - WheelerLab/gwasqc_pipeline GitHub Wiki

qqplot_tophits.R is an R script that generates a QQ plot and is controlled by six basic flags. It is designed to be run from the command line, using these flags as necessary. The script can generate either partial or complete QQ plots; a partial QQ plot displays only the top proportion of hits, as specified by the user. The script accepts several kinds of input, described as follows:

  • Data with or without a header (Default is with)

The script assumes by default that the input file has a header. If the input file does not have a header, signal this by supplying the --noheader flag at runtime.

  • Data in .gz format

The script reads its input with the fread function from data.table, and it decides whether the data is in .gz format based on whether the file name ends with .gz.

  • Single or multi column data

If the input file has only one column, the script interprets that column as p-values and proceeds accordingly. If the input file has multiple columns, it is assumed that exactly one of them contains p-values. In this case the user must supply the column number at runtime using the --column flag (column indexing starts at 1).
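One way to find the number to pass to --column is to list the header fields with their 1-based positions. The following is a minimal sketch, assuming a tab-delimited file with a header row; the file name example_results.txt and its contents are made up for illustration.

```shell
# Hypothetical tab-delimited results file with a header row (made-up values).
printf 'SNP\tBETA\tPVAL\nrs1\t0.12\t0.04\nrs2\t-0.30\t0.0007\n' > example_results.txt

# Print each header field next to its 1-based position.
header_index=$(head -n 1 example_results.txt | tr '\t' '\n' | awk '{print NR, $0}')
echo "$header_index"
# 1 SNP
# 2 BETA
# 3 PVAL
```

In this example the p-values sit in column 3, so the corresponding run would add --column 3 to the command.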

  • PLINK-format GWAS results

If the data are GWAS results in PLINK format, the user does not need any additional flags, including the column number.

  • Partial or complete data sets

Using the --limit flag at runtime limits the number of p-values plotted in the final output. The flag takes a proportion between 0 and 1 and plots only that top proportion of p-values. For example, --limit 0.5 restricts the output to half of all p-values, and the half that is kept is the most significant half. With a large number of p-values, the limit can afford to be very strict: when run on test data of over 2.4e7 data points, --limit 0.01 produced reasonable output that still contained the significant signal.
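To get a feel for how strict a given limit is, the number of plotted points can be estimated as the total count times the proportion. This is only a rough sketch: the total of 24000000 is a hypothetical round figure standing in for the "over 2.4e7" test data, and how the script itself rounds the result is an assumption, not something documented here.

```shell
# Hypothetical count standing in for the "over 2.4e7" test data set.
total=24000000
limit=0.01

# Approximate number of p-values kept (assumes simple rounding of total * limit).
kept=$(awk -v n="$total" -v p="$limit" 'BEGIN { printf "%.0f", n * p }')
echo "$kept"
# 240000
```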

It is also possible to supply the script with a subset of your p-value data. If you still want a QQ plot that shows only the significant signal in your data, the input must contain only your top p-values. Along with that subset, supply the --range flag, which specifies how many p-values are in the superset that the input was derived from. The expected p-values are then calculated from this presumed number of p-values in the parental set. Note: the --range flag cannot correct the value you provide for any NAs or redundant lines in your parental set, so it is important that the provided value be accurate.
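Preparing such a subset by hand might look like the sketch below. It assumes a single-column file with no header, one p-value per line, and missing values written literally as NA; the file names and the subset size of 3 are hypothetical choices for illustration.

```shell
# Hypothetical parental set: one p-value per line, no header, one missing value.
printf '0.5\n3e-8\nNA\n0.02\n0.9\n1e-4\n0.3\n' > all_pvalues.txt

# Drop NA lines first so the count passed to --range covers usable p-values only.
grep -v '^NA$' all_pvalues.txt > clean_pvalues.txt
total=$(wc -l < clean_pvalues.txt | tr -d ' ')

# Keep the 3 smallest p-values (-g sorts general numeric values such as 3e-8).
sort -g clean_pvalues.txt | head -n 3 > top_pvalues.txt

echo "$total"          # 6
cat top_pvalues.txt    # 3e-8, 1e-4, 0.02 (one per line)
```

The subset would then be passed as --input top_pvalues.txt together with --noheader and --range 6. Rows duplicated in the full results would need to be removed before counting, since identical p-values from different variants are legitimate and should not be collapsed.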

All options

--input or -i
Full path to the input file. The file can be single or multi column, and in PLINK format or not.
--column or -c
The number of the column that contains your p-values, for a multi-column file that is not in PLINK format.
--noheader
Signifies that the input file does not have a header. If not used, the script assumes the file does have a header.
--limit
Limits output to the top proportion of values, as defined by the user. Accepts a value between 0 and 1.
--out or -o
Use as in PLINK: an output prefix, optionally including a file path but not a file type (i.e. not including .txt, .png, etc.).
--range
Specifies the total number of p-values within the parent set of the input. For use when your input is a subset of your actual data (for example, when the full file is too big to use).

Example

To run the script from the command line, use a command such as:

Rscript qqplot_tophits.R --input pvalues.txt --out ./output_prefix --limit 0.01