Python API Setup and Initialization - GenomicsDB/GenomicsSampleAPIs GitHub Wiki
Setup and Initialization
Pre-requisite
The instructions below assume that the Ansible playbook has been run. If not, setup instructions can be found here.
Loading data into TileDB
Follow the instructions here to load data into TileDB.
####API Configuration
The `GenomicsSampleAPIs/python_api/tiledb.cfg` file needs to be updated with the configuration that was used to load data into TileDB. The `[mpi]` section of `tiledb.cfg` has the following entries that need to be updated by the user:
Field | Optional | Description |
---|---|---|
mpirun | No | Path to the mpirun executable, e.g. /usr/lib64/openmpi/bin/mpirun |
num_processes | No | Number of MPI processes that were used to load the data |
hostflag | No | Flag for the host file, which depends on your MPI implementation. For example, use -hostfile for OpenMPI or -f for MPICH |
hosts | No | Hosts file that lists the hosts and the max number of arrays in each host. This is the same file that was used in the loading step. |
btl_tcp_if_include | Yes | Some machines may have multiple network interfaces. For example, 192.168.100.0/24 specifies that when using the TCP/IP interface, MPI should use the network interface with an IP address in the range 192.168.100.1-254. This value is highly specific to the network and will likely not work on other clusters. If your cluster has InfiniBand (IB), consider using the IB interface for higher bandwidth. See the OpenMPI page for more information. |
include_env | Yes | This ensures that the env variable is the same across all nodes |
executable | No | gt_mpi_gather is the executable that will be used, and it is available under the TileDB repo at TileDB/variant/example/bin/gt_mpi_gather |
output | No | Specifies the output format. Cotton-JSON is currently the only supported format. |
temp_dir | No | Temporary directory where the input JSON files for mpirun are generated. The files that are created are deleted after mpirun completes. |
loader_json | No | JSON file that was used in the loading step |
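
As a quick sanity check before starting any service, the required entries from the table above can be verified with a short script. This is a minimal sketch, assuming `tiledb.cfg` is an INI-style file with `key = value` entries that Python's `configparser` can read. All paths and values in `SAMPLE_CFG` are hypothetical placeholders, and `missing_mpi_fields` is an illustrative helper, not part of GenomicsSampleAPIs.

```python
import configparser

# Hypothetical example values; replace them with the settings used in your loading step.
SAMPLE_CFG = """
[mpi]
mpirun = /usr/lib64/openmpi/bin/mpirun
num_processes = 2
hostflag = -hostfile
hosts = /path/to/hosts
btl_tcp_if_include = 192.168.100.0/24
executable = /path/to/TileDB/variant/example/bin/gt_mpi_gather
output = Cotton-JSON
temp_dir = /tmp
loader_json = /path/to/loader.json
"""

# Entries marked "No" in the Optional column of the table above.
REQUIRED_FIELDS = (
    "mpirun", "num_processes", "hostflag", "hosts",
    "executable", "output", "temp_dir", "loader_json",
)


def missing_mpi_fields(cfg_text):
    """Return the required [mpi] entries that are absent or empty in cfg_text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section("mpi"):
        return list(REQUIRED_FIELDS)
    return [
        field for field in REQUIRED_FIELDS
        if not parser.get("mpi", field, fallback="").strip()
    ]


print(missing_mpi_fields(SAMPLE_CFG))
```

To check a real file, the same helper could be called with the file contents, for example `missing_mpi_fields(open("tiledb.cfg").read())`.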
####Setup for Remote Invocation
`virtualenv` can be used to create a virtual environment for the Pyro4 installation, or you can install Pyro4 on the system directly; the choice is left to the user. Details on Pyro4 can be found here. Follow these steps to install Pyro4.
cd /home/variantdb/
. venv/bin/activate # skip this step if you want to install pyro4 on your system directly
pip install pyro4
Once Pyro4 is installed, start the Pyro4 NameServer. A detailed list of configuration options for Pyro4 can be found here. The NameServer is an always-running process that acts as a DNS for Pyro4. The node/network that you run the NameServer on must therefore be reachable by the nodes on which you plan to run
- the slaves (TileDB instances), and
- the consumer program (that consumes TileDB data).
You can use either screen or nohup to detach the process and prevent it from closing when you exit the terminal. Pick one of the two and start the NameServer as
pyro4-ns --host <hostname or ip>
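
If you prefer to launch the NameServer from Python rather than from screen or nohup, the following is a minimal sketch of one way to do it. It assumes a POSIX system and that `pyro4-ns` is on the `PATH`. The helper names `build_ns_command` and `start_name_server` and the log file name are hypothetical. Starting the child in its own session is used here as a stand-in for nohup, so the process is not tied to the terminal that launched it.

```python
import subprocess


def build_ns_command(host):
    """Build the same command line shown above: pyro4-ns --host <hostname or ip>."""
    return ["pyro4-ns", "--host", host]


def start_name_server(host, log_path="pyro4-ns.log"):
    """Start the NameServer detached from the current terminal (POSIX only).

    log_path is a hypothetical log file that collects the NameServer output.
    """
    log_file = open(log_path, "ab")
    return subprocess.Popen(
        build_ns_command(host),
        stdin=subprocess.DEVNULL,
        stdout=log_file,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # run in a new session, detached from this terminal
    )


# Example usage (not executed here):
# process = start_name_server("192.168.100.20")
print(" ".join(build_ns_command("192.168.100.20")))
```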
####Starting MPI Service
In this step, `GenomicsSampleAPIs/python_api/mpi_service.py` is registered as a tile master service with the NameServer. To start the mpi_service, run `GenomicsSampleAPIs/python_api/start_mpi_service.py` with the following options.
Option | Optional | Description |
---|---|---|
-h, --help | NA | show this help message and exit |
-H HOST, --host HOST | No | Hostname or IP address that the master is running on |
The object is registered with the Pyro4 NameServer in the format
"tile.master.{0}".format(os.getenv('HOSTNAME'))
Example command to start the master:
./start_mpi_service.sh -H 192.168.100.20
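
A consumer program can derive the registered name from the format above and look it up through the NameServer. The following is a minimal sketch under stated assumptions: `master_name`, `master_uri`, and `connect_to_master` are hypothetical helpers, not part of GenomicsSampleAPIs, and the methods exposed by the registered object are not shown because they are not described here. The sketch relies on Pyro4's `PYRONAME:` URI form, which asks the NameServer at the given host to resolve the object name.

```python
import os


def master_name(hostname=None):
    """Return the registered name, following the format shown above.

    If hostname is omitted, HOSTNAME is read from the environment, as the
    service does. Note that an unset HOSTNAME yields "tile.master.None".
    """
    if hostname is None:
        hostname = os.getenv("HOSTNAME")
    return "tile.master.{0}".format(hostname)


def master_uri(hostname, ns_host):
    """Build a Pyro4 URI that resolves the master via the NameServer at ns_host."""
    return "PYRONAME:{0}@{1}".format(master_name(hostname), ns_host)


def connect_to_master(hostname, ns_host):
    """Return a Pyro4 proxy for the master (requires Pyro4 to be installed)."""
    import Pyro4  # imported lazily so the helpers above work without Pyro4
    return Pyro4.Proxy(master_uri(hostname, ns_host))


# "node1" is a hypothetical value of HOSTNAME on the master.
print(master_uri("node1", "192.168.100.20"))
```

Because the name is built from the `HOSTNAME` environment variable of the master, the consumer needs that value, which may differ from the address passed with `-H`.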