Dowstreamer ‐ PascalX - molgenis/systemsgenetics GitHub Wiki
In order to run Downstreamer gene scores (p-values) need to be obtained from GWAS summary statistics using PascalX. If you intend to run Downstreamer using the EUR reference panel used in the manuscript you just need to follow Steps 1-3. If you wish to use a different reference panel an additional Step 4 is required.
The scripts and singularity image to achieve this are part of the Downstreamer download package. All commands should work when run within the directory pascalx_code/
.
Using PascalX to prepare a GWAS for Downstreamer
Step 1. Download reference panel of genotypes
First download the EUR reference panel of genotypes from 1000 Genomes Project. Please run the get1KGGRCh37_custom_samples.sh
script with the following command:
cd pascalx_code/
bash download_reference/get1KGGRCh37_custom_samples.sh pascalx_reference_storage/ EUR 4 tped download_reference/PascalX_phase3_Europeans_404.txt
You can provide a different selection of samples by providing a different sample file PascalX_phase3_Europeans_404.txt
, select another ancestry by modifying EUR
, or choose different output directory by modifying pascalx_reference_storage/
. If you wish to do this please read Step 4.
Note: We recommend to confine to European reference ancestry samples when working GWAS studies performed on European cohorts.
This script will download and transform the necessary files into TPED format and place them in pascalx_reference_storage/
.
Script location: download_reference/
Estimated Resources: ~36GB download, 4 threads, 8GB
Step 2. Loading reference panel into PascalX db format
After downloading the reference genotypes they still need to be transformed into the .db
format required by PascalX. You can do this with the following command:
singularity exec singularity/pascalx.sif python3 load_reference/run_import.py pascalx_reference_storage/EUR.1KG.GRCh37
You can provide a different location for the downloaded reference genotypes pascalx_reference_storage/EUR.1KG.GRCh37
but remember to include the prefix of the TPED files in the path.
Script location: pascalx_code/load_reference/
Estimated resources: +5 hours, 4 threads, 16GB
Please refer to PascalX documentation for further options and input formats for this step.
Step 3. Running PascalX on GWAS summary statistics.
After loading the genotypes in the db
format we are now ready to calculate gene scores (p-values) from GWAS summary statistics using PascalX. You will need to provide your GWAS sumary statistics file in a \t
delimited format and specify in the parameters which column contains the rsids and which column the pvalues (0-based). The pvalue column must not contain empty values or NAs. You can do this with the following command:
singularity exec singularity/pascalx.sif python3 runPascalX/runRealGWAS.py \
--refpanel pascalx_reference_storage/EUR.1KG.GRCh37 \
--gwas your_gwas.txt \
--annotation "gene_annotation/genes_Ensembl94.txt" \
--threads 8 \
--outfile "out.txt" \
--rscol 0 \
--pcol 1 \
--window 25000
Corresponding script: runPascalX/runRealGWAS.sh
Estimated Resources: 24 hours, 8 threads, 50GB
After obtaining the gene scores (p-values) you can now continue with the Downstreamer manual
Remarks: Here minimally adapted PascalX code is used to obtain additional metrics per gene (number of snps and minimum p-value). We pass Ensembl as IDs in the symbol column (csymb) using the gene annotation 'genes_Ensembl94.txt'
Step 4. Prepare null phenotypes for alternative PascalX configuration and/or reference panel of genotypes.
Note: This step is time and computationally intensive and is only necessary when using a different reference panel of genotypes or if you wish to run PascalX with other settings (e.g. window, scoring method) different from the ones used in the manuscript. If you intend to use Downstreamer in the same way as in the manuscript for a EUR ancestry GWAS only Steps 1-3 are needed.
For custom use cases, Downstreamer will additionally require as input both gene scores (p-values) calculated from null phenotypes and gene-scores from the tested phenotype (GWAS) using identical reference genotypes and PascalX parameters.
Remarks: The purpose of the null gene scores is to empirically estimate how often gene-pvalues can be correlated by chance. This can occur if for example genes are in close proximity and share variants when being scored by PascalX.
How to prepare a custom reference panel of genotypes and/or modify PascalX scorer settings
For example if you wish to use another reference panel of genotypes from 1000 Genomes Project you can modify the Step1 script and then follow Step2. In other cases please load your reference genotypes into a format accepted by PascalX by following the PascalX documentation.
Then score your GWAS as in Step3. (Optional) Modify the parameters for PascalX scoring in the script (window, annotation) and/or modify the script runRealGWAS.py
.
After scoring your GWAS you can continue to calculate gene-scores from null genotypes using the same reference genotypes and PascalX parameters used to score the GWAS. In the case you modified the runRealGWAS.py
you must also later modify runNullGWAS.py
in the same way.
Preparing null phenotypes for a custom reference set
This is done in 4 sub-steps.
Since PascalX is a gene scorer with a same chromosome window this step can be split for each chromosome to allow parallelization since the following substeps are memory/time intensive.
1: Prepare Null phenotypes matrix that is conformal to your reference panel
First we generate a matrix .npy
filled with values obtained from the "standard normal" distribution that serve as random phenotypes. The dimensions of this matrix are Permutations x Samples. We recommend 10000 for the permutations. The samples must match the number of samples in your custom reference panel of genotypes (here 404 as example). Please run the following script:
singularity exec singularity/pascalx.sif python3 runNullGWAS/prepareNullGWAS.py --permutations 10000 --samples 404 --outdir nullGWASdata/
Corresponding script:pascalx_code/runNullGWAS/prepareNullGWAS.py
Estimated resources: Seconds, 1 Thread, 8GB
2: Calculate null GWAS per chromosome (Memory intensive).
We then calculate a matrix *Null.npy
and *Null.cols.npy
for each chromosome. This resulting matrix contains the null summary statistics (null GWAS) obtained by Pearson correlation of the gene dosages of the reference panel of genotypes and the Null phenotypes matrix. Please run the following command:
for chromosome in {1..22}; do \
singularity exec singularity/pascalx.sif \
python3 -u runNullGWAS/prepareChromosome.py \
--refpanel "pascalx_reference_storage/EUR.1KG.GRCh37" \
--annotation "gene_annotation/genes_Ensembl94.txt" \
--phenotypes "nullGWASdata/NullGwasPhenotypes.npy" \
--chromosome $chromosome \
--maf 0.05 \
--outdir "chromosomes_null_GWAS/" \
;done
Corresponding script: runNullGWAS/prepareChromosome.sh
Estimated Resources: 2 hours, 1 cpu, 100Gb
Remarks: No PascalX scoring ocurrs on this step but we use PascalX to obtain the dosages from the .db
file
3: Score all Null GWAS with PascalX (Time intensive)
This is the most time-consuming step.
This script scores the Null GWAS precalculated on the previous step. It uses a special function of PascalX to load a single chromosome and score 1 gene at a for multiple GWAS (here multiple null GWAS). We want to thank Daniel Krelf (lead author of PascalX) for implementing this functionality. To further reduce the time to score null GWAS we subdivide chromosomes into chunks of genes.
Produces: 'genePvaluesNullGwasX*.tsv' 'geneMinPvaluesNullGwasX*.tsv' 'NullGwasX*.cols.txt' 'NullGwasX*.rows.txt' 'geneResultsNullGwasX*.csv'
Corresponding script:runNullGWAS/runNullGWAS.sh
Estimated resources: 24h (1 chunk), 1 thread, 16GB + 60GB (local storage)
Example:
PHENOTYPES=$PHENOTYPESDIR/chr1Null.npy
PHENOTYPESCOLS=$PHENOTYPESDIR/chr1Null.cols.npy
cp $PHENOTYPES ${TMPDIR}/
cp $PHENOTYPESCOLS ${TMPDIR}/
singularity exec -B /local/ \
singularity/pascalx.sif \
python3 -u runNullGWAS/runNullGWAS.py \
--refpanel pascalx_reference/EUR.1KG.GRCh37 \
--annotation gene_annotation/output_chunks/chr1_chunk1.txt \
--part 1 \
--chromosome 1 \
--maf 0.05 \
--outdir nullGWASdata \
--scratch ${TMPDIR}
Remarks To facilitate IO. If possible the corresponding '*Null.npy' and '*Null.cols.npy' files can be placed in the local storage when running in a cluster (i.e. scratch in SLURM).
4: Correlate the gene p-values for use in Downstreamer
For each chunk in step 3 the following files are created that will be used here
genePvaluesNullGwasX_CHR_CHUNK.tsv
a tab-seperated matrix with p-valuesNullGwasX_CHR_CHUNK.rows.txt
the row names (genes) of the p-value matrixNullGwasX_4_16.cols.txt
the column names (dummy phenotype names) of the p-value matrix
We also need a list of the chunks. This can be obtained as follows.
pathToChunkFolder=""
ls -1 ${pathToChunkFolder} | gzip > chunkList.txt.gz
Update the paths in: pascalx_code/runNullGWAS/correlateGenePvalues.sh
Note: the gene definition file needs to be in the same format as: gene_annotation/genes_Ensembl94_protein_coding.txt
Run the script
sh correlateGenePvalues.sh
Appendix
If need be the image can be updated using this command.
sudo singularity build pascalx.sif pascalx.def
If you need chunks for a different gene annotation please run
singularity exec singularity/pascalx.sif python3 gene_annotation/makeChunks.py gene_annotation/genes_Ensembl94.txt gene_annotation/output_chunks