MultipleOrganisms - GeneMANIA/pipeline GitHub Wiki

Creating a multi-organism dataset, such as the one that runs http://genemania.org, currently requires an extra processing step to merge the individual datasets. Here's an example workflow:

mkdir parent/ # or whatever name you like
cd parent

git clone https://github.com/GeneMANIA/pipeline.git human
git clone https://github.com/GeneMANIA/pipeline.git yeast

git clone https://github.com/GeneMANIA/pipeline.git merged

This creates one pipeline instance per organism (here, human and yeast) plus an additional instance for the merged dataset. Build each organism as usual, then run the merged instance:

cd human
# populate data/ folder for human
snakemake

cd ../yeast
# populate data/ folder for yeast
snakemake

cd ../merged
snakemake --config merge=1

The additional step comes after the individual organisms have built successfully: run the pipeline in the merged folder with the merge config option enabled. Results appear under parent/merged/results/. Unfortunately, this results in a lot of data copying (with new indices) and regeneration of some artifacts to match.
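The ordering described here (build every organism, then merge only if all builds succeeded) could be scripted roughly as follows. This is a minimal sketch, not part of the pipeline: `build_and_merge` is a hypothetical helper name, and it assumes it is called from parent/ with the folder layout used in this example, including a folder named merged.

```shell
# Hypothetical helper (not provided by the pipeline).
# Usage, from parent/:  build_and_merge human yeast
build_and_merge() (
    for org in "$@"; do
        # Build one organism; stop at the first failure so the merge
        # never runs against an incomplete set of organisms.
        (cd "$org" && snakemake) || { echo "build failed: $org" >&2; exit 1; }
    done
    # All organism builds succeeded: run the merge step.
    cd merged && snakemake --config merge=1
)
```

The function body runs in a subshell, so the `cd` calls do not change the caller's working directory.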

The merge config automatically scans for other pipeline instances that are peers of the merged folder, that is, folders in the same parent directory. If you prefer, you can provide an explicit list of folders, for example to merge just a subset of the organisms you have prepared:

snakemake --config merge=1 orgs=../yeast,../human,../anotherone

Note that the list of organisms is comma-delimited with no spaces.
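If the folder list is assembled in a script, a small helper can build the comma-delimited value and catch stray spaces early. The sketch below is illustrative only: `join_orgs` is a hypothetical name, and rejecting paths that contain spaces or commas is an assumption drawn from the format described above, not documented pipeline behavior.

```shell
# Hypothetical helper: join folder paths into the comma-delimited,
# space-free form used for the orgs config value.
join_orgs() (
    for p in "$@"; do
        case $p in
            # A space or comma inside a path would break the list format.
            *" "*|*,*) echo "unsupported character in path: $p" >&2; exit 1 ;;
        esac
    done
    IFS=,                 # "$*" joins arguments with the first character of IFS
    printf '%s\n' "$*"
)

# Example use (run from the merged folder):
#   snakemake --config merge=1 orgs="$(join_orgs ../yeast ../human)"
```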