Python API Setup and Initialization - GenomicsDB/GenomicsSampleAPIs GitHub Wiki

Setup and Initialization

Pre-requisite

The instructions below assume that the Ansible playbook has already been run. If it has not, setup instructions can be found here.

Loading data into TileDB

Follow the instructions here to load data into TileDB.

#### API Configuration
The `GenomicsSampleAPIs/python_api/tiledb.cfg` file must be updated to match the configuration that was used to load data into TileDB. The `[mpi]` section of `tiledb.cfg` contains the following entries, which the user needs to update:

| Field | Optional | Description |
|---|---|---|
| `mpirun` | No | Path to the mpirun executable, e.g. `/usr/lib64/openmpi/bin/mpirun` |
| `num_processes` | No | Number of MPI processes that were used to load the data |
| `hostflag` | No | Host file flag for your MPI implementation, e.g. `-hostfile` for openmpi or `-f` for mpich |
| `hosts` | No | Hosts file that lists the hosts and the max number of arrays in each host. This is the same file that was used in the loading step. |
| `btl_tcp_if_include` | Yes | Selects the network interface that MPI uses for TCP/IP on machines that have more than one. For example, `192.168.100.0/24` tells MPI to use the interface whose IP address is in the range 192.168.100.1-254. This value is highly specific to the network and will likely not work on other clusters. If your cluster has InfiniBand (IB), consider using the IB interface for higher bandwidth. See the OpenMPI page for information. |
| `include_env` | Yes | Ensures that the env variable is the same across all nodes |
| `executable` | No | `gt_mpi_gather` is the executable that will be used. It is available in the TileDB repo at `TileDB/variant/example/bin/gt_mpi_gather` |
| `output` | No | Output format. Cotton-JSON is currently the only supported format. |
| `temp_dir` | No | Temporary directory in which the input JSON files for mpirun are generated. These files are deleted after mpirun completes. |
| `loader_json` | No | JSON file that was used in the loading step |
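
Before starting any service, it can help to check that the `[mpi]` section is complete. The sketch below is illustrative only: it assumes `tiledb.cfg` is an INI-style file that Python's `configparser` can read, and the helper names (`read_mpi_section`, `missing_fields`, `mpirun_prefix`) and all sample paths are hypothetical. The mapping of fields onto an mpirun command line is an assumption about how these settings are typically used, not a description of what `mpi_service.py` actually does.

```python
import configparser

# Fields marked "No" in the Optional column of the table above.
REQUIRED_MPI_FIELDS = (
    "mpirun", "num_processes", "hostflag", "hosts",
    "executable", "output", "temp_dir", "loader_json",
)


def read_mpi_section(cfg_text):
    """Parse INI-style text and return its [mpi] section as a plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return dict(parser["mpi"])


def missing_fields(mpi):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_MPI_FIELDS
            if not mpi.get(name, "").strip()]


def mpirun_prefix(mpi):
    """Assemble the launcher part of an mpirun command (assumed mapping)."""
    cmd = [mpi["mpirun"], "-np", mpi["num_processes"],
           mpi["hostflag"], mpi["hosts"]]
    if mpi.get("btl_tcp_if_include"):
        # Open MPI only: restrict TCP traffic to the given subnet.
        cmd += ["--mca", "btl_tcp_if_include", mpi["btl_tcp_if_include"]]
    return cmd


# Placeholder values; substitute the ones used when loading your data.
SAMPLE_CFG = """
[mpi]
mpirun = /usr/lib64/openmpi/bin/mpirun
num_processes = 2
hostflag = -hostfile
hosts = /path/to/hostfile
btl_tcp_if_include = 192.168.100.0/24
executable = /path/to/TileDB/variant/example/bin/gt_mpi_gather
output = Cotton-JSON
temp_dir = /tmp
loader_json = /path/to/loader.json
"""

if __name__ == "__main__":
    mpi = read_mpi_section(SAMPLE_CFG)
    print("missing:", missing_fields(mpi))
    print(" ".join(mpirun_prefix(mpi)))
```

To check a real file, read its contents and pass the text to `read_mpi_section`; an empty list from `missing_fields` means every non-optional entry has a value.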

#### Setup for Remote Invocation
Pyro4 can be installed either in a virtual environment created with virtualenv or directly on the system; the choice is left to the user. Details on Pyro4 can be found here. Follow these steps to install Pyro4:

cd /home/variantdb/ 
. venv/bin/activate # skip this step if you want to install pyro4 on your system directly
pip install pyro4

Once Pyro4 is installed, start the Pyro4 NameServer. A detailed list of Pyro4 configuration options can be found here. The NameServer is a long-running process that acts as a DNS for Pyro4 objects, so the node and network on which you run the NameServer must be reachable from the nodes on which you plan to run:

  1. the slaves (TileDB instances), and
  2. the consumer program (that consumes TileDB data).

You can use either screen or nohup to detach the process so that it keeps running after you exit the terminal. Using whichever you prefer, start the NameServer as follows:

pyro4-ns --host <hostname or ip>
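
Because both the TileDB instances and the consumer program must be able to reach the NameServer, a quick connectivity check from each of those nodes can save debugging time later. The following is a minimal sketch using only the standard library. It assumes the NameServer listens on Pyro4's default port, 9090; adjust the port if you started `pyro4-ns` with a different one. The helper name `can_connect` is hypothetical.

```python
import socket

PYRO4_NS_DEFAULT_PORT = 9090  # Pyro4's default NameServer port


def can_connect(host, port=PYRO4_NS_DEFAULT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Replace "localhost" with the hostname or IP passed to pyro4-ns --host.
    print(can_connect("localhost"))
```

A successful connection only shows that something is listening on that port; it does not confirm that the listener is a Pyro4 NameServer.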

#### Starting MPI Service
In this step, `GenomicsSampleAPIs/python_api/mpi_service.py` is registered as a tile master service with the NameServer. To start the mpi_service, run `GenomicsSampleAPIs/python_api/start_mpi_service.py` with the following options.

| Option | Optional | Description |
|---|---|---|
| `-h`, `--help` | NA | Show this help message and exit |
| `-H HOST`, `--host HOST` | No | Hostname or IP address that the master is running on |

The object is registered with the Pyro4 NameServer under a name of the following form:

"tile.master.{0}".format(os.getenv('HOSTNAME'))

Example command to start the master:

./start_mpi_service.sh -H 192.168.100.20
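
Once the master is running, a consumer can locate it through the NameServer by rebuilding the registered name. The sketch below is a hedged illustration rather than part of the project: `master_name` and `connect_to_master` are hypothetical helpers, and the remote object's methods are not shown because they are not documented here. It relies only on the Pyro4 calls `Pyro4.locateNS`, the NameServer's `lookup`, and `Pyro4.Proxy`. Note that `os.getenv('HOSTNAME')` returns `None` when the variable is not exported, which would yield the name `tile.master.None`, so the helper accepts an explicit hostname.

```python
import os


def master_name(hostname=None):
    """Build the registered name, mirroring the format shown above."""
    if hostname is None:
        # Same lookup as the registration code; None if HOSTNAME is unset.
        hostname = os.getenv("HOSTNAME")
    return "tile.master.{0}".format(hostname)


def connect_to_master(ns_host, hostname=None):
    """Look up the master via the NameServer and return a Pyro4 proxy.

    ns_host is the host given to pyro4-ns --host; hostname is the value of
    HOSTNAME on the node where the master was started.
    """
    import Pyro4  # third-party; imported lazily so master_name works without it

    name_server = Pyro4.locateNS(host=ns_host)
    uri = name_server.lookup(master_name(hostname))
    return Pyro4.Proxy(uri)


if __name__ == "__main__":
    print(master_name("node01"))
```

Calling `connect_to_master` requires a running NameServer and a registered master; `master_name` on its own can be used to confirm which name to expect in the output of Pyro4's name server listing.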