Setup and Initialization - GenomicsDB/GenomicsSampleAPIs GitHub Wiki

System Setup and Initialization

Ansible to automate Genomics DB setup

The Ansible playbooks in the infrastructure/ directory automate the install. If you plan to use Ansible, skip the manual steps on this page and use the Ansible notes below:

  1. The genomicsdb Ansible playbook automates all of the steps described on this page, except those in the Loading data into GenomicsDB section below.

  2. If the genomicsdb Ansible playbook was used to set up the infrastructure, then once data is loaded into GenomicsDB (see Loading data into GenomicsDB), a web server instance can be quickly started with the setup_webserver.py script. See the wiki.

  3. If GenomicsDB has already been set up and needs to be replicated, the genomicsdb-webserver playbook can be used instead of executing the steps under Loading data into GenomicsDB.

If you are not planning to use Ansible, please follow the instructions on this page.

NOTE: The following instructions are specifically for setting up on CentOS7. Refer to ansible for Debian installs.

Requisites

Repositories

The Genomics Sample APIs depend on several repositories, which are fetched by the submodule update command below. The two main repositories are:

  1. GenomicsDB has the core TileDB implementation, augmented with variant-specific handling.
  2. GenomicsSampleAPIs has the abstraction library, written in C++, and the Python API, built with ctypes to interface with the C++ library.

Software

The following are needed for the C++ code (GenomicsDB and the Genomics Sample API search library). GenomicsDB dependencies are included with a recursive pull of the repo. More details on GenomicsDB requirements can be found here.

  1. gcc version 4.9.1 or higher. The easiest way to install it is:

    yum install centos-release-scl devtoolset-3
    scl enable devtoolset-3 bash
  2. OpenMPI compiler and library. On CentOS7, run yum install openmpi-devel and set the environment variable LD_LIBRARY_PATH=/usr/lib64/openmpi/bin/. If you do not want to set the environment variable yourself, run yum install environment-modules and then module load mpi/openmpi-x86_64.

  3. libcsv. On CentOS7 run: yum install libcsv-devel
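
The gcc requirement above (4.9.1 or higher) can be checked before building. The following is a minimal sketch, assuming Python 3; the helper names parse_version and meets_minimum are hypothetical and not part of the repository.

```python
import re

# Minimum gcc version stated in the requirements list above.
MINIMUM_GCC = (4, 9, 1)


def parse_version(text):
    """Turn a version string such as '4.9.2' into a 3-tuple of ints.

    Missing components are padded with zeros, so '11' becomes (11, 0, 0).
    """
    parts = [int(p) for p in re.findall(r"\d+", text)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(version_text, minimum=MINIMUM_GCC):
    """Return True when the reported version is at least the minimum."""
    return parse_version(version_text) >= minimum
```

One way to use it is to pass in the output of gcc -dumpversion, for example meets_minimum(subprocess.check_output(["gcc", "-dumpversion"]).decode()), run inside the devtoolset shell if that is how gcc was installed.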

For the GA4GH API

  1. Python Version 2.7
  2. All the packages that are necessary are under GenomicsSampleAPIs/requirements.txt.
    • We suggest a python virtual environment for this. Instructions below.
  3. PostgreSQL for MetaDB
  4. nginx and mod_wsgi for web deployment
    1. If nginx is not installed, then run yum install nginx
    2. If mod_wsgi is not installed, then run yum install mod_wsgi

Setup the Software and Environment

Fetching the repos

The third command below recursively pulls the GenomicsDB dependencies.

git clone git@github.com:Intel-HLS/GenomicsSampleAPIs.git
cd GenomicsSampleAPIs
git submodule update --recursive --init

Note that a recursive pull for GenomicsDB will get RapidJSON, htslib, and TileDB which are all required for GenomicsDB.

Preparing the repos

cd <relative path to GenomicsSampleAPIs repo>/search_library
make BUILD=release GENOMICSDB_DIR=$PWD/dependencies/GenomicsDB/

To use the import script to load your data, add the GenomicsDB utilities directory to your path:

export PATH=$PATH:$PWD/dependencies/GenomicsDB/bin

Setup Python Environment

  1. Create a virtualenv in a place of your choosing and install the required packages:

    pip install virtualenv
    virtualenv venv
    . venv/bin/activate
    cd /path/to/GenomicsSampleAPIs/
    pip install -r requirements.txt
  2. Make all GenomicsSampleAPIs modules available:

    python setup.py develop
  3. Set your PYTHONPATH to include the Genomic Sample APIs modules:

    export PYTHONPATH=$PYTHONPATH:/path/to/GenomicsSampleAPIs:/path/to/GenomicsSampleAPIs/web
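
The PYTHONPATH value in step 3 is the existing value plus the repository root and its web subdirectory. As an illustration only, the sketch below builds that string in Python 3 without adding duplicates; extended_pythonpath is a hypothetical helper, and the repository path is whatever location you cloned into.

```python
import os


def extended_pythonpath(repo_root, current=""):
    """Append the repo root and its web/ directory to a PYTHONPATH string.

    Entries already present are not added again.
    """
    entries = [p for p in current.split(os.pathsep) if p]
    for extra in (repo_root, os.path.join(repo_root, "web")):
        if extra not in entries:
            entries.append(extra)
    return os.pathsep.join(entries)
```

Calling extended_pythonpath("/path/to/GenomicsSampleAPIs", os.environ.get("PYTHONPATH", "")) yields the value that the export line above would produce.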

Setting up MetaDB

This step will require that you are running inside the virtual environment, as specified in the GA4GH API requirements above.

  1. Create a Postgres database: createdb <db_name>.

  2. Tell Postgres to use triggers: createlang plpgsql <db_name>

  3. Copy alembic.ini.example to alembic.ini

  4. Edit the line sqlalchemy.url = driver... in alembic.ini. If you have issues, see the alembic reference documentation. For example:

    sqlalchemy.url = postgresql+psycopg2://@:5432/metadb
  5. In the GenomicsSampleAPIs repo (where alembic.ini is) run

    alembic upgrade head
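
If alembic cannot connect, it can help to see how the sqlalchemy.url value breaks down. The sketch below uses Python 3's generic URL parser, not SQLAlchemy's own, so treat it as an approximation; describe_db_url is a hypothetical helper.

```python
from urllib.parse import urlsplit


def describe_db_url(url):
    """Split a database URL into its parts using the standard library."""
    parts = urlsplit(url)
    # The scheme has the form dialect+driver, e.g. postgresql+psycopg2.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username or None,   # empty user is reported as None
        "host": parts.hostname,           # None when no host is given
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_db_url("postgresql+psycopg2://@:5432/metadb")
```

For the example URL above, the user and host come back empty, the port is 5432, and the database name is metadb.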

Loading Genomics Data into GenomicsDB

To import VCFs, see the Loading VCF into GenomicsDB section for details.
To import MAFs, see the Loading MAF into GenomicsDB section for details.

GA4GH Setup

To set up the GA4GH server, follow these steps in order:

  1. Setup virtual environment
  2. Setup GA4GH configuration file
  3. Run the install script
  4. Start nginx service

If you only need a local instance for testing and development, complete steps 1, 2, and 3, and then from the web directory run:

python <relative path to GenomicsSampleAPIs repo>/web/runserver.py

This will start a server running at http://localhost:5000.
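
To confirm that the local server is actually listening, a plain TCP connection attempt is enough. This is a minimal sketch assuming Python 3; port_is_open is a hypothetical helper, and the defaults mirror the address stated above.

```python
import socket


def port_is_open(host="localhost", port=5000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

This only shows that something accepts connections on the port; it does not verify that the GA4GH endpoints respond correctly.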

Setup virtual environment

If you haven't already done so, run:

virtualenv venv
. venv/bin/activate
pip install -r <relative path to GenomicsSampleAPIs repo>/requirements.txt

Setup GA4GH configuration file

The GenomicsSampleAPIs/web/ga4gh_test.conf file must be updated with the system configuration for the GA4GH API and TileDB. The [auto_configuration] section of ga4gh_test.conf is populated automatically when the install.py script is run. The other sections have the following entries, which the user must update:

| Field | Description |
| --- | --- |
| workspace | Path to where TileDB has been set up |
| arrayname | Name of the array in TileDB |
| fields | Comma-separated list of fields that are valid for the array that was loaded in TileDB |
| sqlalchemy_database_uri | URI of the database used to store all the header and meta information about the samples |
| virtualenv | Path to the activate_this.py script for the virtualenv that was created |
| site_packages | Paths that need to be included during Apache httpd deployment. Include the site-packages path for the virtual environment |
| debug | Run server in debug mode (unsupported) |

Here is an example file:

[web_configuration]
debug = False
host = localhost

[tiledb]
workspace = /home/variantdb/DB
arrayname = test
fields = END,REF,ALT,QUAL,FILTER,BaseQRankSum,ClippingRankSum,MQRankSum,ReadPosRankSum,DP,MQ,MQ0,DP_FORMAT,MIN_DP,GQ,SB,AD,PL,AF,AN,AC,GT,PS
sqlalchemy_database_uri = postgresql+psycopg2://@:5432/metadb

[virtualenv]
virtualenv = /home/variantdb/venv
site_packages = /home/variantdb/venv/lib/python2.7/site-packages

[auto_configuration]
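
Before running the install script, it can be useful to check that the user-supplied entries are present. The sketch below assumes Python 3 and the section layout of the example file above; check_config and the REQUIRED mapping are hypothetical and are not part of install.py.

```python
import configparser

# Entries the user must fill in, grouped as in the example file above.
REQUIRED = {
    "tiledb": ["workspace", "arrayname", "fields", "sqlalchemy_database_uri"],
    "virtualenv": ["virtualenv", "site_packages"],
}


def check_config(text):
    """Return (missing, fields) for the given ga4gh_test.conf contents.

    missing lists 'section.key' names that are absent or empty;
    fields is the comma-separated fields entry split into a list.
    """
    # Interpolation is disabled so that '%' in a database URI is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            if not parser.has_option(section, key) or not parser.get(section, key).strip():
                missing.append(section + "." + key)
    fields = []
    if parser.has_option("tiledb", "fields"):
        raw = parser.get("tiledb", "fields")
        fields = [f.strip() for f in raw.split(",") if f.strip()]
    return missing, fields
```

For a file on disk, pass its contents, for example check_config(open("web/ga4gh_test.conf").read()). An empty missing list means every entry named in REQUIRED has a value.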

Auto populate ga4gh_test.conf and set up nginx configuration

Before running the installation script, fill in the example configuration file shown above with your information and save it as web/ga4gh_test.conf.

Inside GenomicsSampleAPIs/web run:

python install.py

This fills in the [auto_configuration] section of your ga4gh_test.conf file, prints a ga4gh.ini file that the nginx service will use for WSGI, and prints a ga4gh.service file for setting up the ga4gh service (setup below). The port defaults to 8008 but can be changed by the user.

Follow the instructions that are printed to the console.

  1. Copy the nginx server config into a config file (for example, nginx_ga4gh.conf) in /etc/nginx/conf.d. The socket path specified in this config (/var/uwsgi) must exist, and the user running the application (that is, the Genomics Sample APIs code) must have permission to write the socket file to this location.

  2. Copy ga4gh.service to your /etc/systemd/system/ga4gh.service.

  3. For the nginx service to work on CentOS7, SELinux must not be enforcing. The command below switches it to permissive mode until the next reboot. Run:

setenforce 0
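
Step 1 above requires that the socket directory exists and is writable by the user running the application. A minimal sketch of that check, assuming Python 3 and run as that user; socket_dir_ready is a hypothetical helper.

```python
import os


def socket_dir_ready(path="/var/uwsgi"):
    """Return True if path is a directory the current user can write into."""
    # Creating a socket file needs both write and search (execute) permission.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

Note that os.access reports on the current user, so running this as root says little about the application user.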

Start nginx service

service ga4gh restart

Note: If a ga4gh.service file already existed, you may need to run systemctl daemon-reload before the above command.
