Setup and Initialization - GenomicsDB/GenomicsSampleAPIs GitHub Wiki
The Ansible playbooks in the infrastructure/ directory are provided for automated installs. If you plan to use Ansible, skip this wiki page and follow the Ansible details listed below:
- The Ansible playbook genomicsdb automates all of the steps described on this page, with the exception of the Loading data into GenomicsDB section below.
  - Details of the Ansible playbook genomicsdb are at https://github.com/Intel-HLS/GenomicsSampleAPIs/wiki/Install-Genomics-DB-Infrastructure-using-Ansible
  - Tips on how to run this playbook can be found in the AnsibleTips section.
- Once data is loaded into Genomics DB (using Loading data into GenomicsDB), and if the genomicsdb Ansible playbook was used to set up the infrastructure, a web server instance can be quickly instantiated using the setup_webserver.py script. See wiki.
- If the Genomics DB has already been set up and needs to be replicated, the genomicsdb-webserver playbook can be used instead of executing the steps under Loading data into GenomicsDB.
  - Details of the Ansible playbook genomicsdb-webserver are at https://github.com/Intel-HLS/GenomicsSampleAPIs/wiki/Import-Genomics-DB-data-using-Ansible
  - Tips on how to run this playbook can be found in the AnsibleTips section.
If you are not planning to use Ansible, please follow the instructions on this page.
NOTE: The following instructions are specifically for setting up on CentOS7. Refer to the Ansible playbooks for Debian installs.
The Genomics Sample APIs depend on several repositories, which are handled by the submodule update command below. The two main repositories are:
- GenomicsDB has the core TileDB implementation, augmented with variant-specific handling.
- GenomicsSampleAPIs has the abstraction library, written in C++, and the Python API, built with ctypes to interface with the C++ library.
For the C++ code (GenomicsDB and the Genomics Sample API search library)
GenomicsDB dependencies are included with a recursive pull of the repo. More details on GenomicsDB requirements can be found here.
- gcc version 4.9.1 or higher. The suggested easiest way to install it is:
  yum install centos-release-scl devtoolset-3
  scl enable devtoolset-3 bash
- openmpi compiler and library. On CentOS7, run:
  yum install openmpi-devel
  and set your environment variable LD_LIBRARY_PATH=/usr/lib64/openmpi/bin/. If you do not want to set the environment variable, run yum install environment-modules and module load mpi/openmpi-x86_64.
- libcsv. On CentOS7, run:
  yum install libcsv-devel
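The gcc requirement above (4.9.1 or higher) can be checked before starting the build. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes that the gcc found on PATH is the one the build will use and that gcc -dumpversion prints a dotted version number.

```python
import re
import subprocess

MINIMUM_GCC = (4, 9, 1)  # minimum version stated in the requirements above


def parse_version(text):
    """Return the first dotted version number in text as a 3-tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    parts = [int(p) for p in match.group(0).split(".")]
    # Pad short versions such as "7" or "4.9" so tuple comparison is well defined.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def gcc_is_new_enough(version_text, minimum=MINIMUM_GCC):
    """True if version_text describes a gcc at least as new as minimum."""
    return parse_version(version_text) >= minimum


def installed_gcc_version():
    """Ask the gcc on PATH for its version; return None if gcc is unavailable."""
    try:
        out = subprocess.check_output(["gcc", "-dumpversion"])
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode().strip()


if __name__ == "__main__":
    version = installed_gcc_version()
    if version is None:
        print("gcc not found on PATH")
    else:
        print("gcc %s new enough: %s" % (version, gcc_is_new_enough(version)))
```

Run it inside the shell started by scl enable devtoolset-3 bash, since that shell is where the newer compiler is placed on PATH.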
For the GA4GH API
- Python version 2.7
- All the necessary packages are listed in GenomicsSampleAPIs/requirements.txt.
  - We suggest a Python virtual environment for this. Instructions are below.
- PostgreSQL for MetaDB
- nginx and mod_wsgi for web deployment
  - If nginx is not installed, run yum install nginx
  - If mod_wsgi is not installed, run yum install mod_wsgi
The third command will recursively pull the GenomicsDB dependencies.
git clone git@github.com:Intel-HLS/GenomicsSampleAPIs.git
cd GenomicsSampleAPIs
git submodule update --recursive --init
Note that a recursive pull for GenomicsDB will get RapidJSON, htslib, and TileDB which are all required for GenomicsDB.
cd <relative path to GenomicsSampleAPIs repo>/search_library
make BUILD=release GENOMICSDB_DIR=$PWD/dependencies/GenomicsDB/
To use the import script to load your data, add the GenomicsDB utilities directory to your path:
export PATH=$PATH:$PWD/dependencies/GenomicsDB/bin
- To create a virtualenv in a place of your choosing and install the required packages:
  pip install virtualenv
  virtualenv venv
  . venv/bin/activate
  cd /path/to/GenomicsSampleAPIs/
  pip install -r requirements.txt
- Make all GenomicsSampleAPIs modules available:
  python setup.py develop
- Set your PYTHONPATH to include the Genomics Sample APIs modules:
  export PYTHONPATH=$PYTHONPATH:/path/to/GenomicsSampleAPIs:/path/to/GenomicsSampleAPIs/web
This step will require that you are running inside the virtual environment, as specified in the GA4GH API requirements above.
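Whether the current interpreter is running inside a virtual environment, and whether PYTHONPATH contains the two repository paths, can be checked with a short script. This is a sketch under stated assumptions: the function names are hypothetical, and /path/to/GenomicsSampleAPIs is the same placeholder used in the export line, to be replaced with your checkout location.

```python
import os
import sys


def in_virtualenv():
    """Best-effort check that the interpreter runs inside a virtual environment.

    Older virtualenv releases set sys.real_prefix; newer virtualenv releases
    and the stdlib venv module make sys.base_prefix differ from sys.prefix.
    """
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


def missing_from_pythonpath(required, pythonpath):
    """Return the entries of required that do not appear in pythonpath."""
    present = set(
        os.path.normpath(entry)
        for entry in pythonpath.split(os.pathsep)
        if entry
    )
    return [path for path in required if os.path.normpath(path) not in present]


if __name__ == "__main__":
    repo = "/path/to/GenomicsSampleAPIs"  # placeholder, as in the export line
    required = [repo, os.path.join(repo, "web")]
    print("inside virtualenv:", in_virtualenv())
    print("missing from PYTHONPATH:",
          missing_from_pythonpath(required, os.environ.get("PYTHONPATH", "")))
```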
- Create a Postgres database:
  createdb <db_name>
- Tell Postgres to use triggers:
  createlang plpgsql <db_name>
- Copy alembic.ini.example to alembic.ini.
- Edit the line sqlalchemy.url = driver... in alembic.ini. If you have issues, see the alembic reference documentation. For example:
  sqlalchemy.url = postgresql+psycopg2://@:5432/metadb
- In the GenomicsSampleAPIs repo (where alembic.ini is), run:
  alembic upgrade head
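The sqlalchemy.url value follows the usual SQLAlchemy URL layout of driver, credentials, host, port, and database name. The sketch below shows one way to assemble such a string. The helper name is hypothetical, and the user name and password in the usage note are made-up illustration values. Special characters in the credentials are percent-encoded, which SQLAlchemy expects in URLs.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7


def build_sqlalchemy_url(database, user="", password="", host="", port=5432):
    """Assemble a postgresql+psycopg2 URL for alembic.ini.

    With every optional argument left at its default, the result has the
    same shape as the example line above.
    """
    credentials = quote(user, safe="")
    if password:
        # Encode characters such as "@" or "/" so they cannot be confused
        # with the URL's own separators.
        credentials += ":" + quote(password, safe="")
    return "postgresql+psycopg2://%s@%s:%d/%s" % (
        credentials, host, port, database)


if __name__ == "__main__":
    print("sqlalchemy.url = " + build_sqlalchemy_url("metadb"))
```

For example, build_sqlalchemy_url("metadb", user="ga4gh", password="p@ss/word", host="localhost") percent-encodes the password so that the "@" and "/" inside it are not read as separators.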
To import VCFs, see the Loading VCF into Genomics DB section for details.
To import MAFs, see the Loading MAF into Genomics DB section for details.
To set up the GA4GH server, follow these steps in order.
If you need only a local instance for testing and development purposes, do steps #1, #2, and #3, and then run the following from the web directory:
python <relative path to GenomicsSampleAPIs repo>/web/runserver.py
This will start a server running at http://localhost:5000.
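A quick way to confirm that the local instance came up is to test whether anything accepts TCP connections on that port. The sketch below does only that; it is an assumption-free check in the sense that it does not call any API route, since the available endpoints are not listed on this page. The function name is hypothetical.

```python
import socket


def is_listening(host="localhost", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (OSError, socket.error):
        # Connection refused, timeout, or name resolution failure.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    print("server reachable on localhost:5000:", is_listening())
```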
If you haven't already done so, run:
virtualenv venv
. venv/bin/activate
pip install -r <relative path to GenomicsSampleAPIs repo>/requirements.txt
The GenomicsSampleAPIs/web/ga4gh_test.conf file needs to be updated with the system configuration for the GA4GH API and TileDB. The [auto_configuration] section of ga4gh_test.conf will be autopopulated when the install.py script is run. The other sections of ga4gh_test.conf have the following entries that need to be updated by the user:
| Field | Description |
|---|---|
| workspace | Path to where TileDB has been set up |
| arrayname | Name of the array in TileDB |
| fields | Comma-separated list of fields that are valid for the array that was loaded in TileDB |
| sqlalchemy_database_uri | URI of the database that was used to store all the header and meta information about the samples |
| virtualenv | Path to the activate_this.py script for the virtualenv that was created |
| site_packages | Paths that need to be included during apache httpd deployment. Include the site-packages path for the virtual environment |
| debug | Run server in debug mode (Unsupported) |
Here is an example file:
[web_configuration]
debug = False
host = localhost
[tiledb]
workspace = /home/variantdb/DB
arrayname = test
fields = END,REF,ALT,QUAL,FILTER,BaseQRankSum,ClippingRankSum,MQRankSum,ReadPosRankSum,DP,MQ,MQ0,DP_FORMAT,MIN_DP,GQ,SB,AD,PL,AF,AN,AC,GT,PS
sqlalchemy_database_uri = postgresql+psycopg2://@:5432/metadb
[virtualenv]
virtualenv = /home/variantdb/venv
site_packages = /home/variantdb/venv/lib/python2.7/site-packages
[auto_configuration]
Before running the installation script, populate the file above with your information and save it as web/ga4gh_test.conf.
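Before running the installation script, it can help to confirm that the configuration file parses and that no entry is missing. The sketch below uses the standard-library INI parser. The REQUIRED mapping is an assumption taken from the sections and keys in the example file, not from the install.py source, and the function names are hypothetical.

```python
try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2.7

# Sections and keys as they appear in the example file (an assumption about
# what install.py needs, not taken from its source).
REQUIRED = {
    "web_configuration": ["debug", "host"],
    "tiledb": ["workspace", "arrayname", "fields", "sqlalchemy_database_uri"],
    "virtualenv": ["virtualenv", "site_packages"],
}


def load_config(path):
    """Parse the INI file at path; a missing file yields an empty parser."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser


def missing_entries(parser):
    """List required entries absent from parser, as 'section.key' strings."""
    missing = []
    for section, keys in sorted(REQUIRED.items()):
        for key in keys:
            if not parser.has_option(section, key):
                missing.append("%s.%s" % (section, key))
    return missing


def field_list(parser):
    """Split the comma-separated tiledb.fields entry into a list of names."""
    raw = parser.get("tiledb", "fields")
    return [name.strip() for name in raw.split(",") if name.strip()]


if __name__ == "__main__":
    config = load_config("ga4gh_test.conf")
    print("missing entries:", missing_entries(config))
```

Run from GenomicsSampleAPIs/web, an empty list means every entry from the example file is present.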
Inside GenomicsSampleAPIs/web, run:
python install.py
This fills in the [auto_configuration] section of your ga4gh_test.conf file, prints a ga4gh.ini file that will be used for wsgi from the nginx service, and prints a ga4gh.service file to set up the nginx ga4gh service (setup below). Port ID defaults to 8008, but can be changed by the user.
Follow the instructions that are printed to the console.
- Copy the nginx server config to a config file (for example, nginx_ga4gh.conf) created in /etc/nginx/conf.d. The socket path specified in this config (/var/uwsgi) needs to exist, and the user running the application (the account the Genomics Sample APIs code runs under) needs permission to write the socket file to this location.
- Copy ga4gh.service to /etc/systemd/system/ga4gh.service.
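The socket directory requirement above can be verified with a few lines of Python, run as the user that will run the application. This is a sketch with a hypothetical function name; /var/uwsgi is the path named in the step above.

```python
import os


def socket_dir_problems(path="/var/uwsgi"):
    """Return a list of reasons why a socket file could not be created in path."""
    if not os.path.isdir(path):
        return ["%s does not exist or is not a directory" % path]
    problems = []
    # Creating a file in a directory needs both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("current user cannot create files in %s" % path)
    return problems


if __name__ == "__main__":
    issues = socket_dir_problems()
    print("socket directory ok" if not issues else "\n".join(issues))
```

Note that os.access reports permissions for the user running the script, so a check run as root says nothing about the application account.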
- For the nginx service to work on CentOS7, SELinux needs to be disabled (setenforce 0 switches it to permissive mode until the next reboot). Run:
  setenforce 0
service ga4gh restart
Note: If a ga4gh.service file already existed, you may need to run systemctl daemon-reload before the above command.