Training Infrastructure - TarteelAI/tarteel-ml GitHub Wiki
We have one machine with a couple of GPUs available for training. Please contact one of the core developers to request access to the server. Once you have access, you can use the following instructions to get started:
To ssh into the server, you can use the following command:
ssh <username>@<server-name> -p <port-number>
The root filesystem is organized as follows:
- data # Instead of having the audio files spread across directories, we centralize all of them in the data directory. This avoids wasting space on duplicate audio files and ensures that we all work with consistently preprocessed audio.
  - original_audio # Audio in its original form, as available in the tarteel v1 dataset
  - converted_audio # Audio preprocessed into single-channel WAV files with a 16000 Hz sampling rate
- software # All common stable code/toolkits should go in here to enable sharing
  - miniconda
  - OpenNMT-py
- recitation2text # All recitation-to-text experiments
  - fatihah-baseline
  - juz-amma-system
- gender-detector # All gender detector experiments
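The format described for converted_audio (16000 Hz, single channel, WAV) can be produced with a tool such as ffmpeg. The following is a minimal sketch, assuming ffmpeg is installed; the helper name `to_16k_mono` is hypothetical, and this is not necessarily the command that was used to build the shared directory.

```shell
# Hypothetical helper: convert one audio file to a 16000 Hz, single-channel WAV.
# Usage: to_16k_mono <input-file> <output.wav>
to_16k_mono() {
  # -n         : never overwrite an existing output file
  # -ar 16000  : resample to 16000 Hz
  # -ac 1      : downmix to a single channel
  ffmpeg -n -i "$1" -ar 16000 -ac 1 "$2"
}
```

Check with a core developer before writing into the shared data directory, so that everyone keeps working from the same preprocessed files.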
You can use rsync to transfer files to/from the server:
# Copy to server
rsync --progress -avz -e 'ssh -p <port-number>' <file/folder-to-copy> <username>@<server-name>:<absolute-destination-path-on-server>
# Copy from server
rsync --progress -avz -e 'ssh -p <port-number>' <username>@<server-name>:<absolute-source-path-on-server> <local-destination-path>
Since we use Python for most of our code, the environment for each (sub)project should be managed using conda. conda is already installed and ready to use on the server. The environments currently installed for shared use are:
opennmtpy
Since the server has more than one GPU (run nvidia-smi for details), you can use the environment variable CUDA_VISIBLE_DEVICES to select the GPU you want to use. If your command is python train.py ...., you can change it to the following to use the first GPU:
CUDA_VISIBLE_DEVICES=0 python train.py ....
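Because the machine is shared, it can help to pick whichever GPU is least busy instead of hard-coding an index. The sketch below is one possible approach, not an established project convention: the helper name `pick_free_gpu` is hypothetical, and it expects the `index, memory.used` lines that nvidia-smi prints in CSV query mode.

```shell
# Hypothetical helper: read "index, memory.used" lines on stdin and print
# the index of the GPU with the least memory in use.
pick_free_gpu() {
  awk -F', *' 'NR == 1 || $2 + 0 < min { min = $2 + 0; idx = $1 } END { print idx }'
}

# Usage on the server (requires nvidia-smi):
#   gpu=$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | pick_free_gpu)
#   CUDA_VISIBLE_DEVICES=$gpu python train.py ....
```

Low memory use is only a rough proxy for an idle GPU, so it is still worth checking the full nvidia-smi output before starting a long run.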
Currently, we use OpenNMT-py for all recitation2text training. A conda environment is set up for convenience. Run the following to activate the environment, after which all OpenNMT-py code can be run without any further setup:
source activate opennmtpy
OpenNMT-py is currently installed at ~/software/OpenNMT-py. See ~/recitation2text/fatihah-baseline/exp0-run2/train.sh for a sample bash script that loads the environment and starts training using OpenNMT-py.
You can use SSH tunneling to access ports on the machine that are otherwise blocked by the firewall. If your web app (tensorboard, gradio, etc.) is running on port 8593 on the server, you can use the following command to start the SSH tunnel:
ssh <username>@<server-name> -p <port-number> -L 3001:0.0.0.0:8593
The app will now be accessible at localhost:3001 on your local machine.
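If you open tunnels often, the command can be wrapped in a small shell function. This is a minimal sketch: the function name `tunnel` and its argument order are hypothetical, and it forwards to `localhost` on the server, which assumes the app is listening on the server's loopback interface.

```shell
# Hypothetical helper: forward a local port to a port on the training server.
# Usage: tunnel <username> <server-name> <port-number> <local-port> <remote-port>
tunnel() {
  # -N : do not run a remote command, only forward ports
  # -L : listen on <local-port> locally and forward to <remote-port> on the server
  ssh -N -p "$3" -L "$4:localhost:$5" "$1@$2"
}

# Example matching the command above:
#   tunnel <username> <server-name> <port-number> 3001 8593
```

With -N the session stays in the foreground without opening a remote shell; press Ctrl-C to close the tunnel.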