Training Infrastructure - TarteelAI/tarteel-ml GitHub Wiki

We have one machine with a couple of GPUs available for training. Please contact one of the core developers to request access to the server. Once you have access, use the following instructions to get started:

SSH'ing

To ssh into the server, you can use the following command:

ssh <username>@<server-name> -p <port-number>

Directory Structure

The root filesystem is organized as follows:

- data               # All audio files are centralized here instead of being spread across directories. This avoids wasting space on duplicate audio files and ensures that everyone works with consistently preprocessed data
  - original_audio   # Audio in its original form as available in the tarteel v1 dataset
  - converted_audio  # Audio preprocessed into single-channel WAV files with a 16000 Hz sampling rate
- software           # All common stable code/toolkits should go in here to enable sharing
  - miniconda
  - OpenNMT-py
- recitation2text    # All recitation to text experiments
  - fatihah-baseline
  - juz-amma-system
- gender-detector    # All gender detector experiments
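
To confirm that a file under converted_audio matches the format described above (16000 Hz, single channel), you can read its header with Python's standard wave module. This is only a minimal sketch: the helper name is_converted_format is hypothetical and is not part of the repository.

```python
import wave

EXPECTED_RATE = 16000    # Hz, per the converted_audio description above
EXPECTED_CHANNELS = 1    # single channel (mono)


def is_converted_format(path):
    """Return True if the WAV file at `path` is 16000 Hz, single channel."""
    # wave.open raises wave.Error if the file is not a readable PCM WAV file.
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE
                and wav.getnchannels() == EXPECTED_CHANNELS)
```

Calling is_converted_format on a file from original_audio that has not been preprocessed would be expected to return False, or to raise wave.Error if the file is not a WAV file at all.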

Transferring files to/from server

You can use rsync to transfer files to/from the server:

# Copy to server
rsync --progress -avz -e 'ssh -p <port-number>' <file/folder-to-copy> <username>@<server-name>:<absolute-destination-path-on-server> 

# Copy from server
rsync --progress -avz -e 'ssh -p <port-number>' <username>@<server-name>:<absolute-source-path-on-server> <local-destination-path>
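
If you transfer files often, the same rsync invocation can be wrapped in a small Python helper. The sketch below assumes rsync and ssh are installed locally; the function names build_rsync_command and copy_to_server are hypothetical, and the arguments stand in for the placeholders used in the commands above.

```python
import subprocess


def build_rsync_command(source, destination, port):
    """Build the rsync argument list shown above for a given SSH port."""
    return [
        "rsync", "--progress", "-avz",
        "-e", f"ssh -p {port}",   # tell rsync to use ssh on the custom port
        source, destination,
    ]


def copy_to_server(local_path, username, server, port, remote_path):
    """Copy a local file or folder to an absolute path on the server."""
    remote = f"{username}@{server}:{remote_path}"
    # check=True raises CalledProcessError if rsync exits with a non-zero status.
    subprocess.run(build_rsync_command(local_path, remote, port), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. Copying from the server works the same way with the remote path as the source and the local path as the destination.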

Training

Environments

Since we use Python for most of our code, the environment for each (sub)project should be managed using conda. conda is already installed and ready to use on the server. The currently installed environments that are meant for shared use are:

opennmtpy

GPU Selection

Since the server has more than one GPU (run nvidia-smi for details), you can use the CUDA_VISIBLE_DEVICES environment variable to select the GPU that you want to use. If your command is python train.py ...., you can change it to the following to use the first GPU:

CUDA_VISIBLE_DEVICES=0 python train.py ....
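
The same variable can also be set from inside a Python script rather than on the command line. The sketch below uses a hypothetical helper, select_gpu; the important assumption is that it runs before the deep learning framework initializes CUDA, so setting it before importing the framework is the safest option.

```python
import os


def select_gpu(index):
    """Restrict this process to a single GPU, identified by its nvidia-smi index."""
    # CUDA reads this variable when it initializes, so set it early.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


select_gpu(0)  # equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0
# Import the framework (for example PyTorch) only after this point.
```

Note that when only one GPU is visible, it appears as device 0 inside the process regardless of its index in nvidia-smi.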

Recitation2Text

Currently, we use OpenNMT-py for all recitation2text training. A conda environment is set up for convenience. Run the following to activate the environment, after which all OpenNMT-py code can be run without any further setup:

source activate opennmtpy

OpenNMT-py is currently installed at ~/software/OpenNMT-py. See ~/recitation2text/fatihah-baseline/exp0-run2/train.sh for a sample bash script that loads the environment and starts training using OpenNMT-py.

Accessing web servers/apps hosted on the server

You can use SSH tunneling to access ports on the machine that are otherwise blocked by the firewall. If your web app (TensorBoard, Gradio, etc.) is running on port 8593 on the server, you can use the following command to start the SSH tunnel:

ssh <username>@<server-name> -p <port-number> -L 3001:0.0.0.0:8593

The app will now be accessible at localhost:3001 on your local machine.
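
If the page does not load, it can help to check whether the local end of the tunnel is accepting connections at all. The following is a minimal sketch using only the standard library; port_is_open is a hypothetical helper name, and 3001 is the local port from the command above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open("localhost", 3001) should return True while the SSH tunnel is running. A True result only shows that the tunnel is listening locally; it does not confirm that the app on the server is responding.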