Working with the DSI computation cluster
This guide contains notes on setting up everything for working with BERT (a language model) on the computing cluster of the UT, which is managed by the Digital Society Institute (DSI), formerly known as CTIT.
HPC stands for High Performance Computing (anything involving heavy computation, such as machine learning projects and benchmarking), and Slurm is the name of the software that manages the computation jobs. The cluster also has a Hadoop subsystem, but we're not using Hadoop in our NLP research.
The rest of this guide uses 'Korenvliet' instead of 'cluster', as Korenvliet is the head node of the HPC subsystem of the cluster. This is the machine where you log in and do all the setting up. When you submit a computation job, Korenvliet distributes the computational work to the other nodes (computers) in the cluster.
Korenvliet (the main machine) runs Ubuntu. This guide assumes you know how to work with the Linux command line.
- Lines prefixed with "
$
" (user) and "#
" (root) are commands for the linux commandline. - Lines prefixed with "
>
" are Python interpreter lines.
Cluster documentation can be found on the HPC wiki (read at least the basic instructions & workflow before you start doing anything).
The cluster is powerful, but it is generally very busy and harder to debug (your program cannot run in interactive mode). For smaller projects (e.g. anything that requires less than 16-24 GB of VRAM) it is easier and faster to try an alternative platform, such as those suggested here.
First discuss with the supervisor of your project/course and the lab manager that you would like to use the resources of the cluster. Then send an email with your AD username (the username you use to log in to your email) to Jan Flokstra (Flokstra, J. (EEMCS) [email protected]) so he can activate your account (add your supervisor and the lab manager in CC).
Example: if your name is "First Middle Lastname", then your AD username is "LastnameFM". We use `your_ad_username` throughout this tutorial; replace it with your own credentials.
You need to be on campus or connected to the UT VPN if you want SSH access to Korenvliet.
Access Korenvliet by logging in via SSH; use your regular UTwente password when prompted. Download PuTTY if you use Windows. If you use macOS or Linux, you can connect from the terminal:
$ ssh [email protected]
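Optional: to save some typing on macOS/Linux, you can add a host alias to your local SSH configuration. A minimal sketch (the alias name `korenvliet` is our own choice; put this in `~/.ssh/config` on your own machine, not on Korenvliet):
# ~/.ssh/config
Host korenvliet
    HostName korenvliet.ewi.utwente.nl
    User your_ad_username
After this, `$ ssh korenvliet` suffices.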
We need to make sure our development environment on Korenvliet can successfully run the code/experiments we want to submit to the cluster.
Log in to Korenvliet and check whether you have a (non-empty) `~/.profile` and `~/.bashrc` in your home folder.
On korenvliet:
$ ls -la
If you don't see these files in your home folder, copy them from `/etc/skel/.profile` and `/etc/skel/.bashrc`:
$ cd
$ cp /etc/skel/.profile .
$ cp /etc/skel/.bashrc .
We will use these files to activate certain properties of the development environment, such as specific versions of Python and CUDA.
Cluster nodes do not always have the latest release version available, so you need to check here which version is the most recent one installed on the cluster.
At the time of writing, the latest official release of Python is 3.11. The latest version available on Korenvliet, however, is 3.10, so we'll just have to work with that.
Add this line to your `.bashrc` file (`/home/your_ad_username/.bashrc` on Korenvliet):
module load python/3.10.7
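If you are unsure which versions are installed, the module system can list them. For example (assuming the module name prefixes match the `module load` lines used in this guide):
$ module avail python
$ module avail nvidia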
Watch out: when you want to run the Python interpreter, always use `$ python3` or `$ python3.10`, as `$ python` might in unexpected cases revert to an older Python (3.8.5 being the default version at the time of writing), leading to weird errors.
We are going to need CUDA support (support for running PyTorch/Transformers on a GPU instead of the CPU). Check the software list of the HPC cluster to find the latest version of CUDA that is supported by the cluster.
Add the following line to your `.bashrc` file as well:
module load nvidia/cuda-11.2
Log out of Korenvliet and log in again to reload your environment and make the changes to your `.bashrc` persistent. You can check whether it worked by running `$ python3 -V` after login: it should show Python version 3.10.7.
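You can check the CUDA module in a similar way, assuming the module puts the CUDA compiler on your PATH:
$ nvcc --version
The reported release should match the module version you loaded.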
Before installing or running anything, double-check that you are running the right version of Python and pip:
- `$ python3 -V` should print "Python 3.10.7"
- `$ pip -V` should print "pip 19.x.x from /deepstore/software/python/3.10.7/lib/python3.10/site-packages/pip (python 3.10)"
To work with BERT we need the Python library Transformers (by Hugging Face); see the Transformers documentation. Transformers uses either PyTorch or TensorFlow as its back-end. In this guide we use PyTorch.
If you want to use CUDA and TensorFlow, check out this compatibility overview.
To install Transformers, follow the regular installation instructions.
TL;DR:
- Create a venv with Python version 3.10:
$ python3.10 -m venv venv
$ source venv/bin/activate
- Install PyTorch according to the docs here (select the CUDA version that you added to `.bashrc` in an earlier step). Your command will look something like this (with different torch versions):
$ pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
- Install Transformers with pip:
$ pip install transformers
Now we have PyTorch and Transformers with CUDA support.
Make sure the venv is active:
$ source venv/bin/activate
Then run the following one-liner to check Transformers functionality:
$ python3 -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
Output:
/home/your_ad_username/git/affective_BERT/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 8000). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
Downloading: 100%|██████████████████████████████████████████████████████| 629/629 [00:00<00:00, 303kB/s]
Downloading: 100%|███████████████████████████████████████████████████| 268M/268M [00:04<00:00, 56.5MB/s]
Downloading: 100%|████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 618kB/s]
Downloading: 100%|██████████████████████████████████████████████████████| 230/230 [00:00<00:00, 200kB/s]
[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
This takes about 3 seconds to run (slow!), but it works! Python complains that it is not using the right CUDA module (8 instead of 10.1). That is probably because we ran this Python script interactively on Korenvliet (just once, for testing), and not as a job on the cluster nodes (which is what we would normally do). You can see that Transformers automatically downloads the default models for performing sentiment analysis (currently: BERT models) to our hard drive. Finally, it gives us a sentiment rating for "we love you": a POSITIVE label, with a score of 0.99987(...).
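You can also check directly whether PyTorch was built with CUDA and sees a GPU (a quick sanity check; on the Korenvliet head node this may well print False, since the GPUs live on the compute nodes):
$ python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"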
You can use Git (e.g. GitHub or GitLab) to get your code onto Korenvliet. You can also use scp to copy files from your local PC to Korenvliet.
If your code has extra dependencies besides Transformers, install these into the virtual environment as well! For example, if `requirements.txt` contains your dependencies, use:
$ source venv/bin/activate
$ pip install -r requirements.txt
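If you haven't written a `requirements.txt` yet, you can generate one from the working venv on your development machine; this pins exact versions, which helps reproducibility on the cluster:
$ pip freeze > requirements.txt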
Sidenote: if you run into problems with the venv, using Anaconda instead of pip could also be an option, as it is installed on Korenvliet. We have NOT used it for this guide. Note that `transformers` currently does not work with Anaconda and should always be installed with pip.
Normally, Transformers automatically downloads the models it needs when we use `from transformers import pipeline`. However, the cluster nodes appear to be air-gapped and/or not allowed to automatically download large files.
One of our Python code/Slurm jobs gave the following error:
Output of the Slurm job:
ctit082
Gpu devices: 1
Test 2/?
Python version: Python 3.7.3
Venv:
certifi==2020.11.8 chardet==3.0.4 click==7.1.2 dataclasses==0.6 filelock==3.0.12 future==0.18.2 idna==2.10 joblib==0.17.0 numpy==1.19.4 packaging==20.7 Pillow==8.0.1 pyparsing==2.4.7 regex==2020.11.13 requests==2.25.0 sacremoses==0.0.43 six==1.15.0 tokenizers==0.9.4 torch==1.7.0+cu101 torchaudio==0.7.0 torchvision==0.8.1+cu101 tqdm==4.54.0 transformers==4.0.0 typing-extensions==3.7.4.3 urllib3==1.26.2
Starting worker:
Traceback (most recent call last):
File "/deepstore/software/python/3.7.3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/deepstore/software/python/3.7.3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/your_ad_username/git/affective_BERT/src/one_sentiment.py", line 5, in <module>
print(pipeline('sentiment-analysis')('we love you'))
File "/home/your_ad_username/git/affective_BERT/venv/lib/python3.7/site-packages/transformers/pipelines.py", line 2969, in pipeline
tokenizer = AutoTokenizer.from_pretrained(tokenizer, revision=revision, use_fast=use_fast)
File "/home/your_ad_username/git/affective_BERT/venv/lib/python3.7/site-packages/transformers/models/auto/tokenization_auto.py", line 343, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/your_ad_username/git/affective_BERT/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1747, in from_pretrained
local_files_only=local_files_only,
File "/home/your_ad_username/git/affective_BERT/venv/lib/python3.7/site-packages/transformers/file_utils.py", line 1007, in cached_path
local_files_only=local_files_only,
File "/home/your_ad_username/git/affective_BERT/venv/lib/python3.7/site-packages/transformers/file_utils.py", line 1177, in get_from_cache
"Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
We fixed this problem by working only with pre-downloaded models, instead of asking Transformers to download the necessary models on the fly. For this, you need to manually specify in your script which model Transformers should use, and make sure this model is already present on Korenvliet before you start the job. You can find out how to do this by reading the Hugging Face Transformers tutorial, or by looking at the code example below (under submitting jobs on the HPC cluster).
To download a model (in the case of `preloaded_sentiment.py`, the `distilbert-base-uncased-finetuned-sst-2-english` model):
$ git lfs install   # this enables large file support in git
$ cd src            # this is the folder with your own code
$ git clone https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
This downloads the `distilbert-base-uncased-finetuned-sst-2-english` model to a folder with the same name. Refer to the new local model folder in your Python script.
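A minimal sketch of pointing your script at the local clone; passing `local_files_only=True` makes Transformers fail immediately instead of attempting a download (the relative path assumes the script runs from the folder that contains the clone):
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# load tokenizer and model strictly from the local folder, never from the network
local_path = "./distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
model = AutoModelForSequenceClassification.from_pretrained(local_path, local_files_only=True)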
Please take a look at the information about the different GPUs and the information on usage. Before submitting jobs, please note the maximum number of jobs and the maximum number of job steps per job which can be scheduled. These numbers can be obtained using the `scontrol show config` command on Korenvliet. `sbatch` is used to submit a job script for later execution. The script will typically contain one task or (if required) multiple `srun` commands to launch parallel tasks.
See the slurm_sbatch and slurm_srun wiki pages for more details.
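For example, to look up those limits (assuming the standard Slurm configuration keys `MaxJobCount` and `MaxStepCount`):
$ scontrol show config | grep -iE 'maxjobcount|maxstepcount'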
To run a job, you create an `.sbatch` file in which you specify the resources needed for your program. For example, this is our main Python program `src/preloaded_sentiment.py`, with a simple example taken from the Transformers tutorial:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# this time, we set up everything by hand
# the models are preloaded (git clone from model repository)
# model source: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/tree/main
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(classifier('I hate you'))
and this is `preload_sentiment.sbatch`, the sbatch file which describes the job to Slurm:
#!/bin/bash
#SBATCH -J BERT_SA_prep # job name, don't use spaces
#SBATCH -c 1 # number of cores, 1
#SBATCH --mail-type=END,FAIL # email status changes
#SBATCH --time=1:00:00 # time limit 1h
# add nvidia cuda
module load nvidia/cuda-10.1
module load python/3.7.3
source venv/bin/activate
# log hostname
hostname
echo "Gpu devices: "$CUDA_VISIBLE_DEVICES
echo "Test 2/?"
pyv=`python -V`
echo "Python version: "$pyv
echo "Venv:"
echo `pip freeze`
echo "Starting worker: "
python3 -m src.preloaded_sentiment
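Submit the job and follow its progress with the standard Slurm commands (by default, the output ends up in `slurm-<jobid>.out` in the directory you submitted from):
$ sbatch preload_sentiment.sbatch
$ squeue -u your_ad_username
$ cat slurm-<jobid>.out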
Good luck with your experiments!