PARIS Pathway Analysis - pcgoddard/Burchardlab_Tutorials GitHub Wiki

PARIS Tutorial: Pathway Analysis

Pagé Goddard

Oct 9, 2017

Resources
Context: Biofilter ubmrella program
Intro: What is PARIS?
PARIS Vignette
PARIS Pipeline

Resources

Context: biofilter umbrella program

PARIS is a subprogram of biofilter 2.4. Here's a little bit about the master program in general

multi-purpose software program developed by the Ritchie Lab
inputs:
- SNP identifiers (eg rsIDs)
- rare variant identifiers
- base pair positions
- gene symbols
- genetic regions
- copy number variant location
what it can do:
- Loci/Region annotation
- generate gene-gene interaction models
- integrating database information

What is PARIS?

Pathway Analysis by Randomization Incorporating Structure

enrichment analysis tool - determines aggregated association signals from GWAS results
pathway analysis tool - highlights biological pathways associated with phenotypes
approach: permutation testing of genomic features to evaluate genomic structure of interrogated pathways; data reduction from SNP to gene and gene to pathway
- claims to eliminate "many over-testing concerns with other pathway analysis approaches"
- TBH, not entirely sure what this means yet but it sounds good!

input	output
list of SNPs	list of pathways with p-values
-OR-	-AND-
Positions	list of genes contributing to pathways

PARIS Vignette

1) Build your database
2) Prepare input files
3) Run PARIS
- Customize your script
- Troubleshooting
4) Understand PARIS results
5) Parse PARIS results
6) Visualize results

NB: This vignette assumes you already have biofilter installed as it has already been installed on CAESAR

# check you have the most updated version
biofilter.py --version
# Biofilter version 2.4.1 (2017-04-17)
#      LOKI version 2.2.4 (2017-04-14)
#    SQLite version 3.9.2
#      APSW version 3.9.2-r1

1) Build your database

Biofilter uses a local database interface called Library of Knowledge Integration (LOKI) which needs to be intialized and then updated periodically as the reference databases are updated.

This can take a while (4-24 hours) depending on you internet connection speed. This will download and process the bulk data files from all supported knowledge sources, storing the result in the file “loki.db” (recommended to rename with current date, such as “loki-yymmdd.db”)

NB: you need ~30 GB of disk space to build the database: 10-20 GB for temp storage and 5-10 GB for final database file.

# check disk space
df -h

# Filesystem            Size  Used Avail Use% Mounted on
# /dev/vzfs              20G  3.5G   180G  1.9% /

# prepare to submit as a job as a bash script 
# or run command on detachable screen
# this vignette uses screen
screen -N loki-build

# build loki database
loki-build.py --verbose --knowledge loki.db --update

# detach screen
[ctrl+A] [ctrl+D] 

# to check on process
screen -r loki-build

# rename database file
date="170831" # update current yymmdd
mv loki.db loki-${date}.db
ls
# loki-170831.db

If the build seems to freeze with warnings like the following, usually a keyboard interrupt will tell loki-build to keep going.

# 2017/09/15
# WARNING: data from these sources was not updated:
#  mint
#  pfam

the build should work but will finish with this verbiage: # WARNING: errors encountered during knowledge database update; skipping optimization step

note: the LOKI database has been build on the server and is available in the pgoddard directory

database sources

Source	URL	Summary
KEGG	http://www.genome.jp/kegg/pathway.html	metabolic and pathway data
BioGRID	http://thebiogrid.org	Repository of genetic and protein interaction data from model organisms and humans; used to link position and region data to interaction information.
NCBI dbSNP	http://www.ncbi.nlm.nih.gov/snp	Database of SNPs and multiple small-scale variations (insertions, deletions, microsatellites and non-polymorphic variants). Includes a complete list of known human SNPs and their base pair positions relative to the human reference genome. Biofilter uses the data of dbSNP in two ways: connecting SNP identifiers to genomic positions and updating retired identifiers to current identifiers.
NCBI Gene / “Entrez”	http://www.ncbi.nlm.nih.gov/gene	Search engine for many discrete health sciences databases at the NCBI. Provides an extensive list of known human genes, their beginning and ending base pair positions, and many alternate names and cross-referenced database identifiers. This data is used to connect gene symbols to their genomic regions, and to connect equivalent gene symbols and identifiers to each other.
Gene Ontology (GO)	http://www.geneontology.org	The Gene Ontology database defines terms representing gene product properties, such as cellular components, molecular function, and biological processes, within a hierarchical tree of ontology groups and related proteins.
NHGRI GWAS Catalog	http://www.genome.gov/gwastudies	The NHGRI GWAS Catalog provides associations between SNPs and various phenotypes which were discovered via genome-wide association studies (GWAS).
MINT	http://mint.bio.uniroma2.it/mint/Welcome.do	The Molecular Interaction database contains experimentally verified protein-protein interactions from the scientific literature, which are used in Biofilter for linking position and region data to interacting protein pairs.
NetPath	http://www.netpath.org	The NetPath database consists of curated human signaling pathways which are used by Biofilter.
OregAnno	http://www.oreganno.org/oregano	The Open REGulatory ANNOtation database is used by Biofilter for curation information about known regulatory elements from the scientific literature.
Pfam	http://pfam.sanger.ac.uk	The Pfam database is a large collection of protein families. The annotation of data respective to proteins within Biofilter is based on the information from Pfam.
PharmGKB	http://www.pharmgkb.org	Biofilture currently uses this database for pathway based data, future releases of Biofilter will also include gene-drug associations and pharmacological association study results.
Reactome	http://www.reactome.org/ReactomeGWT/entrypoint.html	Biofilter uses the information contained in Reactome to establish pathway and network relationships between genes.
UCSC genome browser	http://genome.ucsc.edu	This source provides access to a growing database of genomic sequence and annotations for a wide variety of organisms, currently we use the UCSC for location information for evolutionary conserved regions (ECRs) for Biofilter and to acess OregAnno’s regulatory region data.

2) Set up data

PARIS inputs

SNP/test file (note the commented headers)

a. --paris-snp-file snps.txt

#rs	chr	pval
rs123456	1	0.0001
rs678910	1	0.025

b. --paris-position-file *pos.txt*

#chr	label	ignored	position	chr	pval
1	rs123456	-	123456	1	0.0001
5	cpg388	-	89101112	5	0.025

2. LD Block Regions

--region-file *LD.file*

#chr	start	stop
1	12356	14000

or

#chr	label	start	stop
1	block1	12356	14000

Generate SNP/Position File

note*:*** not all GWAS/snp association results will have rsIDs. genotyped snps may have rsIDs but imputed snps will not. When rsIDs are not given for every variant, you will need to format a position file.

Genotyped SNP GWAS output (from plink):

# genotyped SNPs will have rsIDs

 CHR               SNP         BP   A1       TEST    NMISS       BETA         STAT            P
   1     rs112750067     10327      G        ADD      165     -304.4       -1.107       0.2698
   1     rs56289060     10433    A        ADD      165      475.5        1.267        0.207
   ...

## convert to `--paris-snp-file` format in R

# read in data
df <- read.table("PATH/TO/mygwas.assoc.linear", sep = "\t", header = T)

# extract pertinent columns in required order
paris <- df[ , c("SNP", "CHR", "P")]

# rename columns with commented header
colnames(paris) <- c("#rsid","chr","pval")

# save file
write.table(paris, "PATH/TO/paris_snps.txt", row.names = F, quote = F)

Imputed GWAS output (from plink):

# imputed SNPs do not have rsIDs assigned to them; nor will other genome features like methylation sites

 CHR               SNP         BP   A1       TEST    NMISS       BETA         STAT            P
   1      1:567867:A:G     567867    G        ADD      165     -304.4       -1.107       0.2698
   1      1:636975:A:G     636975    A        ADD      165      475.5        1.267        0.207
   ...

## convert to `--paris-position-file` format in R

# read in data
df <- read.table("PATH/TO/mygwas.assoc.linear", sep = "\t", header = T)

# extract pertinent columns in required order
paris <- df[ , c("CHR", "SNP", "A1", "BP", "CHR", "P")] # be sure to include the repeated "CHR" column

# rename columns with commented header
colnames(paris) <- c("#chr", "label", "ref_allele", "position", "chr", "pval") # note: "ref_allele" column is our 'ignored' column

# save file
write.table(paris, "PATH/TO/paris_positions.txt", row.names = F, quote = F)

Generate LD regions

calculate your own simple high-LD region list using plink
typical threshold for "high LD" is an r^2 >= 0.8
LD-Spline can be used for a more robust calculation
see LD-Spline publication for discussion of linkage disequilibrium measurements & concept

# set variables
date="171009" # update current date
wrkdir="~/wrkdir" # path to working directory
geno="/media/BurchardRaid01/LabShare/Data/data_freeze/burchardlab_temp/axiom_merged-genotypes/GALA2_merge/GALA2_mergedLAT-LATP_noParents_030816_rsID" # path to your plink genotype binary files
ldout="GALA2_mergedLAT-LATP_noParents_030816_rsID_r2_0.8" # prefix for output

# run plink; save only regions with r2 >= 0.8
plink --bfile $geno --r2 --ld-window-r2 0.8 --out $ldout
    # example log
    # 4427 people (2179 males, 2248 females) loaded from .fam.
    # Total genotyping rate is 0.995559.
    # 733668 variants and 4427 people pass filters and QC.
    # Among remaining phenotypes, 2267 are cases and 2158 are controls.  (2 phenotypes are missing.)

 head GALA2_mergedLAT-LATP_noParents_030816_rsID_r2_0.8.ld
    #  CHR_A         BP_A          SNP_A  CHR_B         BP_B          SNP_B           R2
    #      1       754182      rs3131969      1       785989      rs2980300     0.811966
    #      1      1121794     rs11260549      1      1122196      rs4634847     0.943057
    #      1      1151917     rs78479912      1      1152601   Affx-4564046      0.82517
    #      1      1249187     rs12142199      1      1337837      rs1240709     0.942998

# in R* reformat to PARIS LD input 
# ( *or use whatever your preferred method is. I think this is easiest)

# set up
setwd("~/wrkdir")
plink <- read.table("GALA2_mergedLAT-LATP_noParents_030816_rsID_r2_0.8.ld", header=T)

# subset appropriate columns
paris <- with(plink, data.frame(chr=CHR_A, start=BP_A, stop=BP_B)) # can also copy SNP or R2 value as the "label" column if you want

# save file
write.table(paris, "paris_LD_GALA2_mergedLAT-LATP_noParents_030816_rsID_r2_0.8.txt", quote=F, row.names=F)

# in bash, check the final LD regions file
head paris_LD_GALA2_mergedLAT-LATP_noParents_030816_rsID_r2_0.8.txt
    # chr start stop
    # 1 754182 785989
    # 1 1121794 1122196
    # 1 1151917 1152601
    # 1 1249187 1337837

3) run PARIS

# Set variables
wd="/media/BurchardRaid01/LabShare/Home/pgoddard/wrkdir_paris_wgsbdr"
loki="/media/BurchardRaid01/LabShare/Home/pgoddard/bin/loki-170831.db"
today="yymmdd" # set to run date
ld="/media/BurchardRaid01/LabShare/Home/pgoddard/keys_grm_genesis_gwas/LD_sage_rsq03_snpid_updated_nodup_snp_rsid_merged_r2_0.8.ld" # ld regions
snp="$wd/SNP_sage_mergedLAT-LATP_noParents_030816_rsID.txt" # position / snp file
out_sage="p0001_Result_sage_wgsbdr_perm${permute}_kegg_${today}" # name of output

# optional variables
permute="10000" # default # is 1000; recommend at least 10,000
maxp="0.05" # sets maximum output p-value for PARIS; reduces computation burden by not calculating p-values above cutoff threshold of 0.05
sigp="0.0001" # sets significance threshold for input data (default is 0.05); recommended if working with WGS or imputed data, else default is fine
list="kegg" # netpath reactome go # space-delimited list of databases of interests (reduces number of tests and run time)


# Run Paris
time biofilter.py --knowledge loki-170831.db --grch-build-version 37 --source $list \ # biofilter inputs
--paris --paris-position-file $snp --region-file $ld \ # paris input files
--paris-p-value $sigp --paris-details --prefix $out # paris options

# 'time' is not necessary; it will just tell you how long your command needed to run

log:

knowledge database genome build: GRCh38 / UCSC hg38
user input genome build: GRCh37 / UCSC hg19

real    4m15.565s # run time
user    4m11.877s
sys     0m3.947s

Customize your pathway analysis

--source dbSNP KEGG etc. - limit databases
--region-match-bases [n] --region-match-percentage [%] - replace [x] with numeric value; by default PARIS considers a feature-gene overlap valid if at least one basepair is shared (i.e. default is --region-match-bases 1); higher value requires more overlap; negative value indicates "within n bases of gene"
--paris-p-value [p] - control p-value threshold of input file for PARIS to consider significant; default = 0.05. this should be adjusted if you are working with whole genome or imputed data (0.05 is too lenient with that many data points)
--paris-permutation-count [n] - default is 1000 permutations yielding precision of 0.001 false positives; this gives you a minimum p-value of <0.001. If you perform 10 permutations, you'll have a min p of <0.1. You sacrifice precision for speed. Recommended permutation count is 10,000.
--paris-bin-size [n] - default 10,000. when PARIS permutes a gene/pathway, it randomly replaces all the features with features of similar size (relative to input SNP/position count, not genomic region size), pulling from a pool of [n] features
--grch-build-version 37 - default grch 38; be sure to match your position file/gwas analysis
--verbose - "PARIS, talk to me"
--overwrite - screw that last run
--user-defined-knowledge file - specify candidate gene lists or pathways

User-defined knowledge

Sometimes it can be useful to run enrichment analysis on candidate investigation lists. For instance, you may want to check previously identified genes or pathways without spending copious computation resources and time on a full agnostic run. To do this, you will need a group list file.

Input file

The first line will start with the nickname you will use to call your file with the --source option. This will be followed by section headers, which are indicated by the word GROUP before the section name. For your source nickname and section names, if you want to employ spaces, use an underscore _ to separate words. Biofilter looks for the first white space to distinguish the source nickname from descriptive information.

Biofilter will restrict its analysis to the groups and genes in your user-defined list. For instance, the pathway-p-value calculated for the "lung" group below will be based on the gene-level p-values for A1BC and D2EF. In the output files, the groups will be listed as "lung" and "liver_cancer"

mylist="path/to/candidate_gene_list"
head $mylist

cancer My custom cancer gene sets # file header: nickname (description)
GROUP lung (My lung cancer genes) # group header: GROUP name (description)
A1BC
D2EF
GROUP liver_cancer (My liver cancer genes)
G3HI
J4KL
M5NO

Calling candidate list

time biofilter.py --knowledge loki-170831.db --grch-build-version 37 \ #biofilter inputs
--user-defined-knowledge $mylist --source cancer \ # user defined knowledge; note that --source is followed by the first space-delimited term in your candidate gene list
--paris --paris-position-file $snp --region-file $ld \ # paris input files
--paris-p-value $sigp --paris-details --prefix $out # paris options

troubleshooting (growing list)

note: if you get errors or your input file is being printed to the screen strangely, rewrite the -- in your code; if you copied from the pdf it is highly likely that you have dashes that are encoded incorrectly because pdf's suck for code sharing and biofilter is trying to read through your input file for the proper commands*
If you get errors regarding Loki and/or python modules (APSW module for instance), you may need to adjust your bash profile to ensure that python is correctly directed to the modules. To do so you can run these lines:

export PYTHONPATH=/media/BurchardRaid01/LabShare/Home/cmak/bin/biofilter/lib/python2.7/site-packages:$PYTHONPATH
export PYTHONPATH=/media/BurchardRaid01/LabShare/Home/cmak/lib/python2.7/site-packages:$PYTHONPATH
export PYTHONPATH=/media/BurchardRaid01/LabShare/Home/cmak/lib64/python2.7/site-packages:$PYTHONPATH

PARIS run times

# SNPs / p-value	0.05	0.01	0.001	0.0001

           | -  |    |     |

4) PARIS Results

prefix.paris-summary file columns

group - pathway or gene group analyzed
description - description, if any
genes - number of genes in group/pathway
features - total # genomic features covered
simple - number simple features (containing only one input SNP/position)
(sig) - proportion that were significant according to SNP association p-values
complex - features containing >1 SNP (e.g. LD block with multiple tagged SNPs)
(sig) - same, but wrt complex features
pval - probability of a randomized gene grouping with the feature containing at least as many significant features as the original

prefix.paris-details file

Pathway Investigation:  biogrid:745788


genes   features    simple  (sig)   complex (sig)   pval
2   5   5   0   0   0   0.93

Gene Breakdown: biogrid:745788

gene    features    simple  (sig)   complex (sig)   pval
MRPL45  5   5   0   0   0   1
SLAC8  5   5   0   0   0   0.85

This isn't super easy to parse so we will want to reformat it to look like this:

Group   group.pval  gene    gene.features   gene.simple.features gene.simple.sig gene.complex.features  gene.complex.sig    gene.pval
biogrid:745788  1   MRPL45  5   5   0   0   0   1
biogrid:745788  1   gene2  5   5   0   0   0   1

# ignoring other group-level features because they are covered in the .paris-summary file

Reformat with linux

1.sed command

# remove unwanted subheaders and blank lines within our blocks
sed -i '/^\s*$/d' ${mydata}.paris-detail # remove blanks; -i tells sed to alter input directly
sed -i '/^gene\t/d' ${mydata}.paris-detail # remove subheader lines beginning with 'gene' (but not 'genes'); edit input file directly

sed - linux command to find regex expressions
-i - tells sed to modify input file directly
'/pattern\d' - delete lines containing pattern
^\s*$ - from start of line (^), all white space/blank (\s*) till end of the line ($) # removes blank lines
'this \t is \t a \t tabbed \t header' - (spaces added to make reading easier); \t indicates table delimiter in strings

#check file
head ${mydata}.paris-detail
# Pathway Investigation:  biogrid:745788
# genes   features        simple  (sig)   complex (sig)   pval
# 1       5       5       0       0       0       1
# Gene Breakdown: biogrid:745788
# MRPL45  5       5       0       0       0       1
# Pathway Investigation:  biogrid:564461
# genes   features        simple  (sig)   complex (sig)   pval
# 1       9       9       0       0       0       1
# Gene Breakdown: biogrid:564461
# MYC     9       9       0       0       0       1

2.echo command

# write header to new file
echo -e "Group\tgene.count\tpath.pval\tgene\tgene.features\tgene.simple.features\tgene.simple.sig\tgene.complex.features\tgene.complex.sig\tgene.pval" > ${mydata}.paris-detail-parsed

-e tells echo to recognize regular expressions (i.e. read \t as 'tab')

3.awk command

# extract and reformat desired lines with awk
awk 'BEGIN {FS="\t"; OFS="\t"};
$1=="Pathway Investigation:" {pathway=$2}; $1~/^[0-9]{1,10}/ {pval=$7; count=$1}; /Gene Breakdown:/{flag=1;next}/Pathway Investigation:/{flag=0} flag {line=$0; print pathway, count, pval, line}' \
${mydata}.paris-detail >> ${mydata}.paris-detail-parsed

awk 'BEGIN{} ... - calls awk in linux
{FS="\t"; OFS="\t"} - sets input/output delimeters (Field Separators) to tab (\t)
$1=="Gene Breakdown:" {path=$2} - when field 1 reads "Gene Breakdown:", field 2 is the pathway name; capture pathway name as variable "pathway"
/Gene Breakdown:/{flag=1;next} - matches lines with /pattern A/; sets flag when pattern A is found in a line; skips the line containing pattern A to avoid printing it
/Pathway Investigation:/{flag=0} - unsets flag when /pattern B/ is found in a line; skips the line containing pattern to avoid printing it
flag - pattern with the default action, which is to print $0: if flag is equal 1 the line is printed. This way, it will print all those lines occurring from the time PAT1 occurs and up to the next PAT2 is seen. This will also print the lines from the last match of PAT1 up to the end of the file.
{print pathway, $0} - instructs flag to print pathway variable before flagged line
; and \ - just a breaks for awk and linux respectively to make the code easier to parse visually
input.file > output.file - read in from input, write parsed data to output

for more explanation of this awk script approach, see: stackoverflow response

check file

head ${mydata}.paris-detail-parsed
# Group   gene    gene.features   gene.simple.features    gene.simple.sig gene.complex.features   gene.complex.sig    gene.pval
# biogrid:745788  MRPL45  5   5   0   0   0   1
# biogrid:564461  MYC 9   9   0   0   0   1
# biogrid:925550  SH3GL1  11  11  0   0   0   1
# biogrid:678627  FBXO25  5   5   0   0   0   1
# biogrid:1267483 SNRNP70 6   6   1   0   0   0.077
# biogrid:1267483 PRPF40A 1   1   0   0   0   1
# biogrid:1057661 CUL7    2   2   0   0   0   1
# biogrid:869076  WASL    10  10  1   0   0   0.186
# biogrid:749106  EIF4B   7   7   4   0   0   < 0.001

5) Parse PARIS results

Now that we have easy-to-parse files, its time to extract only the pathways of interest: those that are significant, driven by multiple affected genes, and are of biological pertinance. We will do some parsing in R to pare down the list, and some lit review to identify candidate pathways for downstream analysis.

# read in your data
df <- read.table("prefix.paris-details-parsed", header=T, sep="\t") # setting the separator to tab prevents read.table from reading '< 0.0001' as two white-space delimited fields

# 1. recode < 0.001 as 0.0001
# just be sure to adjust your legend labels back to < 0.001
df$path.pval[df$path.pval=="< 0.001"] <- 0.0001 # recode p-min to numeric value
df$path.pval <- as.numeric(df$path.pval) # change class to numeric
table(df$path.pval)

df$gene.pval[df$gene.pval=="< 0.001"] <- 0.0001 # recode p-min to numeric value
df$gene.pval <- as.numeric(df$gene.pval) # change class to numeric
table(df$gene.pval)

# 2. restrict to pathways with p <= 0.05 and more than one gene present
df2 <- df[df$gene.count>1 & df$path.pval<=0.05,]
df3 <- df2[df2$gene.count<1000,] # some ontology groups like "protein binding" have thousands of genes so you may want to weed those out here

# 3. identify pathways driven by only one gene
out <- split(df2, df2$Group) # create list split by the Group/pathways

## to get to know this data format, try
head(out)
names(out) # or head(names(out)) for first 6 only
out[1](/pcgoddard/Burchardlab_Tutorials/wiki/1) # show entry 1 by calling number
out["'de novo' actin filament nucleation"](/pcgoddard/Burchardlab_Tutorials/wiki/"'de-novo'-actin-filament-nucleation") # show entry 1 by calling name (your order may differ)

## Set up variables
sig <- 0.05 # set gene signifiance threshold
pathways.to.remove <- c() # initialize list

## for if loop to find pathways with <2 significant genes
for (i in names(out)){ # iterate through pathway names
  if (sum(out[i](/pcgoddard/Burchardlab_Tutorials/wiki/i)$gene.pval < sig, na.rm=TRUE) < 2) { # if number of pathway genes with significant pval is < 2
    pathways.to.remove <- c(i,pathways.to.remove) # write the pathway name to the removal list
  }
}

## Pathway counts
length(unique(df2$Group)) # number of unique pathways (path.pval>0.05) in dataset # 6478
length(pathways.to.remove) # number of pathways driven by single gene # 6357

## subset
pathways.of.interest <- df3[-which(df3$Group %in% pathways.to.remove),]
length(unique(pathways.of.interest$Group)) # 121

6) Visualize Results

# tree map
if (!require(treemap)){ 
  install.packages("treemap")
  library(treemap)
} 

# note: if you cannot install the treemap package you may need to update your R/Rstudio environment

library(RColorBrewer)
palette <- rev(brewer.pal(9, "YlGnBu"))

treemap(pathways.of.interest,
        index="Group",
        vSize="gene.count",
        vColor="path.pval",
        type="value",
        palette=palette,
        title="PARIS test map",
        fontsize.labels=c(12),                    # size of labels. Give the size per level of aggregation: size for group, size for subgroup, sub-subgroups...
        fontcolor.labels=c("black"),             # Color of labels
        fontface.labels=c(2),                    # Font of labels: 1,2,3,4 for normal, bold, italic, bold-italic...
        bg.labels=c("transparent"),              # Background color of labels
        overlap.labels=0.5,                      # number between 0 and 1 that determines the tolerance of the overlap between labels. 0 means that labels of lower levels are not printed if higher level labels overlap, 1  means that labels are always printed. In-between values, for instance the default value .5, means that lower level labels are printed if other labels do not overlap with more than .5  times their area size.
        inflate.labels=F
)

The output will be a treemap (example below)

this

PARIS Pipeline

Here I have concatenated all the commands for the PARIS analysis pipeline with minimal commenting. Extensive explanations of each can be found above in the vignette.

## commandline

# build/update database
screen -S loki
loki-build.py --verbose --knowledge loki.db --update
# detach screen

## r

# get position file
df <- read.table("PATH/TO/mygwas.assoc.linear", sep = "\t", header = T)
paris <- df[ , c("CHR", "SNP", "A1", "BP", "CHR", "P")] # note repeated "CHR" column
colnames(paris) <- c("#chr", "label", "ref_allele", "position", "chr", "pval") # note commented header
write.table(paris, "PATH/TO/paris_snps.txt", row.names = F, quote = F)

## commandline

# LD regions
date="171009" # update current date
wrkdir="~/wrkdir" # path to working directory
geno="PATH/TO/genotypedata" # path to your plink genotype binary files
ldout="mygenodata_r2_0.8" # prefix for output

plink --bfile $gala2geno --r2 --ld-window-r2 0.8 --out $ldout # save only regions with r2 >= 0.8

## r

# reformat LD region file
setwd("~/wrkdir") # path to working directory
plink <- read.table("mygenodata_r2_0.8.ld", header=T)
paris <- with(plink, data.frame(chr=CHR_A, start=BP_A, stop=BP_B))
write.table(paris, "paris_mygenodata_r2_0.8.txt", quote=F, row.names=F)

## commandline

# run paris

#date="171009" # already set
#wrkdir="~/wrkdir" # already set
ldregions="paris_mygenodata_r2_0.8.txt"
positions="paris_snps.txt"
mydata="nameyouroutput"
sig="0.0000001556" # multiple testing corrected p-value (bonferroni 0.05/n)
sugg="0.00000311" # 0.1/n

time biofilter.py --knowledge loki-170831.db --grch-build-version 37 --paris --paris-position-file $positions --region-file $ldregions --paris-p-value $sig --paris-details --prefix ${mydata}

# reformat .paris-detail file

sed -i '/^\s*$/d' ${mydata}.paris-detail
sed -i '/^gene\t/d' ${mydata}.paris-detail
echo -e "Group\tgene.count\tpath.pval\tgene\tgene.features\tgene.simple.features\tgene.simple.sig\tgene.complex.features\tgene.complex.sig\tgene.pval" > ${mydata}.paris-detail-parsed 
awk 'BEGIN {FS="\t"; OFS="\t"}; $1=="Pathway Investigation:" {pathway=$2}; $1~/^[0-9]{1,10}/ {pval=$7; count=$1}; /Gene Breakdown:/{flag=1;next}/Pathway Investigation:/{flag=0} flag {line=$0; print pathway, count, pval, line}' ${mydata}.paris-detail >> ${mydata}.paris-detail-parsed

## r

# pare down results

# read in your data
df <- read.table("prefix.paris-details-parsed", header=T, sep="\t")

# 1. recode < 0.001 as 0.0001
df$path.pval[df$path.pval=="< 0.001"] <- 0.0001
df$path.pval <- as.numeric(df$path.pval)
df$gene.pval[df$gene.pval=="< 0.001"] <- 0.0001
df$gene.pval <- as.numeric(df$gene.pval)

# 2. restrict to pathways with p <= 0.05 and more than one gene present
df2 <- df[df$gene.count>1 & df$path.pval<=0.05,]
df3 <- df2[df2$gene.count<1000,] # optional

# 3. identify pathways driven by only one gene
out <- split(df2, df2$Group) # create list split by the Group/pathways

sig <- 0.05 # set gene signifiance threshold
pathways.to.remove <- c() # initialize list

for (i in names(out)){ # iterate through pathway names
  if (sum(out[i](/pcgoddard/Burchardlab_Tutorials/wiki/i)$gene.pval < sig, na.rm=TRUE) < 2) {
    pathways.to.remove <- c(i,pathways.to.remove)
  }
}

## Pathway counts
length(unique(df2$Group))
length(pathways.to.remove)

## subset
pathways.of.interest <- df3[-which(df3$Group %in% pathways.to.remove),]
length(unique(pathways.of.interest$Group))

## r 

# Plot results

# tree map
if (!require(treemap)){ 
  install.packages("treemap")
  library(treemap)
} 

library(RColorBrewer)
palette <- rev(brewer.pal(9, "YlGnBu"))

treemap(pathways.of.interest,
        index="Group",
        vSize="gene.count",
        vColor="path.pval",
        type="value",
        palette=palette,
        title="PARIS test map",
        fontsize.labels=c(12),
        fontcolor.labels=c("black"),
        fontface.labels=c(2),
        bg.labels=c("transparent"),
        overlap.labels=0.5,
        inflate.labels=F
)