System Setup and Initialization

Ansible to automate Genomics DB setup

Ansible playbooks for automated installs are provided in the infrastructure/ directory. If you plan to use Ansible, skip this wiki page and follow the Ansible details listed below:

The genomicsdb Ansible playbook automates all of the steps described on this page, except those in the Loading Genomics Data into GenomicsDB section below.
- Details of Ansible playbook genomicsdb are at https://github.com/Intel-HLS/GenomicsSampleAPIs/wiki/Install-Genomics-DB-Infrastructure-using-Ansible
- Tips on how to run this playbook can be found in the AnsibleTips section.
Once data is loaded into Genomics DB (see Loading Genomics Data into GenomicsDB), and if the genomicsdb Ansible playbook was used to set up the infrastructure, a web server instance can be started quickly using the setup_webserver.py script. See the wiki.
If Genomics DB has already been set up and needs to be replicated, the genomicsdb-webserver playbook can be used instead of executing the steps under Loading Genomics Data into GenomicsDB.
- Details of Ansible playbook genomicsdb-webserver are at https://github.com/Intel-HLS/GenomicsSampleAPIs/wiki/Import-Genomics-DB-data-using-Ansible
- Tips on how to run this playbook can be found in the AnsibleTips section.

If you are not planning to use Ansible, please follow the instructions on this page.

NOTE: The following instructions are specific to CentOS7. For Debian installs, refer to the Ansible playbooks.

Requisites

Repositories

Genomics Sample APIs depends on several repositories, which are fetched by the submodule update command below. The two main repositories are:

- GenomicsDB contains the core TileDB implementation, augmented with variant-specific handling.
- GenomicsSampleAPIs contains the abstraction library, written in C++, and the Python API, which uses ctypes to interface with the C++ library.

Software

The following are required for the C++ code (GenomicsDB and the Genomics Sample API search library). GenomicsDB dependencies are included with a recursive pull of the repo. More details on GenomicsDB requirements can be found here.

gcc version 4.9.1 or higher. The suggested easiest way to install it is:

yum install centos-release-scl devtoolset-3
scl enable devtoolset-3 bash

openmpi compiler and library. On CentOS7, run yum install openmpi-devel and set the environment variable LD_LIBRARY_PATH=/usr/lib64/openmpi/bin/. If you do not want to set the environment variable, run yum install environment-modules and then module load mpi/openmpi-x86_64.
libcsv. On CentOS7 run: yum install libcsv-devel
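
The gcc requirement above can be checked before starting a build. The following is a minimal sketch, not part of the repository: the helper names parse_version and meets_minimum are hypothetical, and it assumes the version is given as a dotted string such as the one printed by gcc -dumpversion.

```python
import re

MIN_GCC = (4, 9, 1)  # minimum gcc version stated in the requirements above


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version_text, minimum=MIN_GCC):
    """Return True if version_text is at least `minimum`."""
    version = parse_version(version_text)
    # Pad the shorter tuple with zeros so that "5" compares as (5, 0, 0).
    width = max(len(version), len(minimum))
    version += (0,) * (width - len(version))
    minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return version >= minimum
```

One way to use it is to pass in the output of subprocess.check_output(["gcc", "-dumpversion"]), decoded to text, from inside the shell opened by scl enable devtoolset-3 bash.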

For the GA4GH API

Python version 2.7
All required packages are listed in GenomicsSampleAPIs/requirements.txt.
- We suggest using a Python virtual environment for this. Instructions are below.
PostgreSQL for MetaDB
nginx and mod_wsgi for web deployment
1. If nginx is not installed, then run yum install nginx
2. If mod_wsgi is not installed, then run yum install mod_wsgi

Setup the Software and Environment

Fetching the repos

The third command below recursively pulls the GenomicsDB dependencies.

git clone git@github.com:Intel-HLS/GenomicsSampleAPIs.git
cd GenomicsSampleAPIs
git submodule update --recursive --init

Note that a recursive pull for GenomicsDB fetches RapidJSON, htslib, and TileDB, all of which GenomicsDB requires.

Preparing the repos

cd <relative path to GenomicsSampleAPIs repo>/search_library
make BUILD=release GENOMICSDB_DIR=$PWD/dependencies/GenomicsDB/

To use the import script to load your data, add the GenomicsDB utilities directory to your path:

export PATH=$PATH:$PWD/dependencies/GenomicsDB/bin

Setup Python Environment

Create a virtualenv in a location of your choosing and install the required packages:

pip install virtualenv
virtualenv venv
. venv/bin/activate
cd /path/to/GenomicsSampleAPIs/
pip install -r requirements.txt

Make all GenomicsSampleAPIs modules available:
```
python setup.py develop
```

Set your PYTHONPATH to include the Genomics Sample APIs modules:

export PYTHONPATH=$PYTHONPATH:/path/to/GenomicsSampleAPIs:/path/to/GenomicsSampleAPIs/web

Setting up MetaDB

This step must be run inside the virtual environment described in the GA4GH API requirements above.

Create a Postgres database: createdb <db_name>
Tell Postgres to use triggers: createlang plpgsql <db_name>
Copy alembic.ini.example to alembic.ini
Edit the line sqlalchemy.url = driver... in alembic.ini. If you have issues, see the alembic reference documentation. For example:
```
sqlalchemy.url = postgresql+psycopg2://@:5432/metadb
```
In the GenomicsSampleAPIs repo (where alembic.ini is located), run:
```
alembic upgrade head
```
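
The sqlalchemy.url value has the general form driver://user:password@host:port/database. As an illustration only, the sketch below assembles such a string from its parts; build_db_url is a hypothetical helper name, not something provided by the repository or by alembic, and it does not escape special characters in the user name or password.

```python
def build_db_url(database, port=5432, host="", user="", password="",
                 driver="postgresql+psycopg2"):
    """Assemble a SQLAlchemy-style database URL from its components.

    Leaving user and host empty reproduces the form used in the
    alembic.ini example above.
    """
    credentials = user
    if password:
        credentials += ":" + password
    return "%s://%s@%s:%d/%s" % (driver, credentials, host, port, database)


# Same value as the alembic.ini example above.
print(build_db_url("metadb"))
```

The same value is needed later for the sqlalchemy_database_uri entry of ga4gh_test.conf, so building it in one place helps keep the two files consistent.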

Loading Genomics Data into GenomicsDB

To import VCFs, see the Loading VCF into Genomics DB section for details.
To import MAFs, see the Loading MAF into Genomics DB section for details.

GA4GH Setup

To set up the GA4GH server, follow these steps in order:

1. Setup virtual environment
2. Setup GA4GH configuration file
3. Run the install script
4. Start nginx service

If you only need a local instance for testing and development, complete steps 1, 2, and 3, then run the following from the web directory:

python <relative path to GenomicsSampleAPIs repo>/web/runserver.py

This will start a server running at http://localhost:5000.

Setup virtual environment

If you haven't already done so, run:

virtualenv venv
. venv/bin/activate
pip install -r <relative path to GenomicsSampleAPIs repo>/requirements.txt

Setup GA4GH configuration file

The GenomicsSampleAPIs/web/ga4gh_test.conf file must be updated with the system configuration for the GA4GH API and TileDB. The [auto_configuration] section of ga4gh_test.conf is populated automatically when the install.py script is run. The other sections contain the following entries, which the user must update:

| Field | Description |
| --- | --- |
| workspace | Path to where TileDB has been set up |
| arrayname | Name of the array in TileDB |
| fields | Comma-separated list of fields that are valid for the array that was loaded in TileDB |
| sqlalchemy_database_uri | URI of the database that stores all the header and meta information about the samples |
| virtualenv | Path to the activate_this.py script for the virtualenv that was created |
| site_packages | Paths that need to be included during apache httpd deployment. Include the site-packages path for the virtual environment |
| debug | Run server in debug mode (unsupported) |

Here is an example file:

[web_configuration]
debug = False
host = localhost

[tiledb]
workspace = /home/variantdb/DB
arrayname = test
fields = END,REF,ALT,QUAL,FILTER,BaseQRankSum,ClippingRankSum,MQRankSum,ReadPosRankSum,DP,MQ,MQ0,DP_FORMAT,MIN_DP,GQ,SB,AD,PL,AF,AN,AC,GT,PS
sqlalchemy_database_uri = postgresql+psycopg2://@:5432/metadb

[virtualenv]
virtualenv = /home/variantdb/venv
site_packages = /home/variantdb/venv/lib/python2.7/site-packages

[auto_configuration]
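
Before running the install script, it can help to confirm that every user-supplied entry is present. The following is a minimal standalone sketch written for Python 3 (the project itself targets Python 2.7, where the module is named ConfigParser and has a different interface). The helper names missing_entries and field_list are hypothetical, and the list of required keys is taken from the example file above rather than from the install script.

```python
import configparser

# Sections and keys taken from the example file above.
REQUIRED = {
    "tiledb": ["workspace", "arrayname", "fields", "sqlalchemy_database_uri"],
    "virtualenv": ["virtualenv", "site_packages"],
}


def _parse(config_text):
    # Interpolation is disabled so that '%' characters in a URI are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return parser


def missing_entries(config_text):
    """Return "section.key" names that are absent or empty."""
    parser = _parse(config_text)
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            value = parser.get(section, key, fallback="")
            if not value.strip():
                missing.append("%s.%s" % (section, key))
    return missing


def field_list(config_text):
    """Split the comma-separated tiledb.fields entry into a list of names."""
    raw = _parse(config_text).get("tiledb", "fields", fallback="")
    return [name.strip() for name in raw.split(",") if name.strip()]
```

To check a real file, read it with open("web/ga4gh_test.conf").read() and pass the text to missing_entries; an empty list means all of the entries listed in REQUIRED are filled in.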

Auto populate ga4gh_test.conf and set up nginx configuration

Before running the installation script, fill in web/ga4gh_test.conf with your information, as described in the Setup GA4GH configuration file section above, and save it.

Inside GenomicsSampleAPIs/web run:

python install.py

This fills in the [auto_configuration] section of your ga4gh_test.conf file, prints a ga4gh.ini file that will be used for wsgi from the nginx service, and prints a ga4gh.service file used to set up the nginx ga4gh service (set up below). The port defaults to 8008 but can be changed by the user.

Follow the instructions that are printed to the console.

Copy the nginx server config to a new config file (for example, nginx_ga4gh.conf) in /etc/nginx/conf.d. The socket path specified in this config (/var/uwsgi) must exist, and the user running the application (the account under which the Genomics Sample APIs code runs) must have permission to write the socket file to this location.
Copy ga4gh.service to /etc/systemd/system/ga4gh.service.
For the nginx service to work on CentOS7, SELinux must not be in enforcing mode. The following command switches it to permissive mode until the next reboot:

setenforce 0
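
The socket directory requirement described above (/var/uwsgi must exist and be writable by the application user) is easy to verify before starting the service. The following is a small sketch under that assumption; socket_dir_problems is a hypothetical helper name, and it should be run as the same user that will run the application.

```python
import os


def socket_dir_problems(path="/var/uwsgi"):
    """Return a list of human-readable problems with the socket directory."""
    if not os.path.isdir(path):
        return ["%s does not exist or is not a directory" % path]
    problems = []
    # Creating a socket file needs write and search (execute) permission
    # on the directory for the current user.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("current user cannot write to %s" % path)
    return problems


if __name__ == "__main__":
    for problem in socket_dir_problems():
        print(problem)
```

If the script prints nothing, the directory exists and the current user can create files in it.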

Start nginx service

service ga4gh restart

Note: If a ga4gh.service file already existed, you may need to run systemctl daemon-reload before the above command.
