Getting Started - ucsf-wynton/tutorials GitHub Wiki
- Getting a Wynton HPC Account
- Connecting to Wynton HPC
- Linux operating system on Wynton HPC
- Using the Linux command line
- Storage
- Overview of the different kinds of nodes on Wynton
- A little about Linux environment modules
- Submitting a job to Wynton
- Interactive sessions on Wynton HPC
- Parallel jobs
- GPU scheduling
- Best practices
- Troubleshooting tips
- Getting Additional Help
- Fill out the Wynton account request form
- note: if you are from Gladstone, ask IT for a UID/GID and check the box for "Gladstone" in the form
- After the form is submitted, the Wynton admins will set up your Wynton HPC account and work with you to make sure you can access the cluster
- Read the User Agreement
- If you need to change your password, go to the Wynton password change page
- For password resets, contact the Wynton system administrators
ssh
Open an ssh client. An ssh client is already installed if you are using OS X or Linux. On Windows you might need to download an ssh client application.
- Mac
- Terminal (built-in)
- iTerm2
- Linux
- Terminal (built-in)
- Windows
In the example below, replace `alice` with your actual Wynton user name. Type the first line and your password, when prompted:

```
{local}$ ssh alice@log2.wynton.ucsf.edu
alice@log2.wynton.ucsf.edu's password:
Last login: Thu Jul 16 17:03:28 2020
[alice@wynlog2 ~]$
```
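Optionally, you can save yourself some typing by adding a host alias to the ssh configuration on your own computer. A minimal sketch; the alias `wynton` and user name `alice` are placeholders:

```sh
# Run this on your local machine, not on Wynton
cat >> ~/.ssh/config <<'EOF'
Host wynton
    HostName log2.wynton.ucsf.edu
    User alice
EOF
ssh wynton    # now equivalent to: ssh alice@log2.wynton.ucsf.edu
```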
sftp

sftp is a common method used to transfer files between two computers. If you are using OS X or Linux, an sftp client application is already installed. Under Windows, you might need to download an additional application.

In the example below, replace `alice` with your actual Wynton user name. Type the first line and your password, when prompted:

```
{local}$ sftp alice@log2.wynton.ucsf.edu
alice@log2.wynton.ucsf.edu's password:
Connected to log2.wynton.ucsf.edu.
sftp>
```
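Once connected, files can be uploaded and downloaded from the `sftp>` prompt. A brief sketch; the file names and path are placeholders:

```
sftp> put results.tar.gz                   # upload a local file to the current remote directory
sftp> get /wynton/home/alice/data.csv      # download a remote file (hypothetical path)
sftp> bye                                  # end the session
```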
For more information on how to transfer files on Wynton see Wiki - How to move files
Troubleshooting logging in to Wynton HPC
- If you have difficulty connecting, make sure you have received confirmation that your account has been created and the username you are using is correct.
- Make sure the server hostname you are connecting to is correct. From the outside, you can only log directly into the Wynton login nodes or the data transfer nodes.
- If your password needs to be reset, please contact the Wynton system administrators.
- Wynton HPC runs the Linux operating system, specifically CentOS 7 Linux
- Becoming comfortable with the Linux command line and the bash "shell" is a very useful skill for interacting with the Wynton HPC environment
- A good intro to using the Linux command line is available at Software Carpentry - The Unix Shell
Video recording of Software Carpentry - The Unix Shell (UCSF login required) [running time 2:17:48] by Geoffrey Boushey at UCSF, a member of the Library's Data Science Initiative team
- Topics covered in the 2 hour recording include
- Introducing the Shell
- Navigating Files and Directories
- Working with Files and Directories
- Finding things
IMPORTANT: Wynton storage is NOT backed up. If your data are important, do not keep the only copy on Wynton.
- BeeGFS is the parallel file system used by Wynton. It is optimized for HPC
- Home directory
  - mounted under `/wynton/home`
  - user home directory quotas are 500 GiB
- Group directory
  - mounted under `/wynton/group`
  - to check the quota for group members: `beegfs-ctl --getquota --gid <group>` (see the example below)
  - For example, there is 100 TB of Gladstone space under `/wynton/group/gladstone`
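To see current usage against these quotas, `beegfs-ctl --getquota` can also be pointed at a user; `mygroup` below is a placeholder for your actual Unix group name:

```sh
beegfs-ctl --getquota --uid "$USER"    # your own usage and quota
beegfs-ctl --getquota --gid mygroup    # your group's usage and quota (replace mygroup)
```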
- Global scratch space
  - mounted as `/wynton/scratch` and available as a shared directory from all Wynton nodes
  - If you are copying files that will only be needed temporarily, for example as input to a job, you have the option of copying them directly to the global scratch space at `/wynton/scratch`. There is 492 TiB of space available for this purpose.
  - /wynton/scratch is automatically purged after 2 weeks, but you should go ahead and delete the files when you no longer need them.
  - note: it is good practice to first create your own subdirectory here and copy to that location, for example:

    ```sh
    mkdir /wynton/scratch/my_own_space
    scp filename.tsv alice@log2.wynton.ucsf.edu:/wynton/scratch/my_own_space
    ```
- Local scratch space
  - mounted as `/scratch`
  - each node has its own /scratch directory that is not shared with other nodes
  - it is good practice to create a directory under /scratch to write to (see the sketch below)
  - https://wynton.ucsf.edu/hpc/scheduler/using-local-scratch.html
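A minimal sketch of that staging pattern in a batch script; `input.dat`, `output.dat`, and `my_analysis` are placeholders, and the full recommended recipe is on the local-scratch page linked above:

```sh
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# Stage data on the node-local /scratch, compute there, then copy results back
WORKDIR=$(mktemp -d /scratch/"$USER"-XXXXXX)
cp input.dat "$WORKDIR"/
cd "$WORKDIR"
my_analysis input.dat > output.dat      # placeholder for the real computation
cp output.dat "$SGE_O_WORKDIR"/         # SGE_O_WORKDIR = directory the job was submitted from
cd /
rm -rf "$WORKDIR"                       # clean up local scratch when done
```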
- There are a few different kinds of nodes (Linux hosts): login, development, data transfer, compute, and GPU compute
- login nodes
  - login nodes can be logged into directly
  - minimal compute resources
  - dedicated solely to basic tasks such as copying and moving files on the shared file system, submitting jobs, and checking the status of existing jobs
  - node names: `log1.wynton.ucsf.edu`, `log2.wynton.ucsf.edu`
- development nodes
  - development nodes cannot be logged into directly; they can be accessed from the login nodes
  - node names: `dev1.wynton.ucsf.edu`, `dev2.wynton.ucsf.edu`, `dev3.wynton.ucsf.edu`, `gpudev1.wynton.ucsf.edu`
  - intended for validating scripts, prototyping pipelines, compiling software, etc.
  - interactive jobs (Python, R, MATLAB)
- data transfer nodes
  - like login nodes, the data transfer nodes can be logged into directly
  - node names: `dt1.wynton.ucsf.edu`, `dt2.wynton.ucsf.edu`
  - have access to the outside internet
  - the data transfer nodes each have 10 Gbps network connections; for comparison, the login nodes have 1 Gbps network connections
  - for large transfers, Globus is the preferred transfer method
  - Gladstone users have additional options for high-speed data transfers to/from Gladstone, local, and Dropbox locations: see the internal Confluence docs.
- compute nodes
  - cannot be logged into directly
  - the scheduler sends jobs to the compute nodes
  - the majority of compute nodes have Intel processors; a few have AMD
  - local /scratch is either a hard disk drive (HDD), a solid state drive (SSD), or a Non-Volatile Memory Express (NVMe) drive
  - each node has a tiny /tmp (4-8 GiB)
- gpu nodes (for GPU computation)
  - cannot be logged into directly
  - as of 2019-09-20:
    - 38 GPU nodes with a total of 132 GPUs available to all users
    - among these, 31 GPU nodes, with a total of 108 GPUs, were contributed by different research groups
    - GPU jobs are limited to 2 hours in length when run on GPUs not contributed by the running user's lab; contributors are not limited to 2-hour GPU jobs on the nodes they contributed
    - there is also one GPU development node that is available to all users
- https://wynton.ucsf.edu/hpc/software/software-modules.html
- available module repositories (need to be loaded)
- CBI: Repository of software shared by Computational Biology and Informatics (http://cbi.ucsf.edu) at the UCSF Helen Diller Family Comprehensive Cancer Center
- Sali: Repository of software shared by the UCSF Sali Lab
- A list of the available modules in the CBI and Sali repositories is available at https://wynton.ucsf.edu/hpc/software/software-repositories.html, or by running `module avail` after loading a repository with `module load`. For example, to list all the modules in the CBI repository, run `module load CBI` followed by `module avail`.
- To load a module, use `module load`. For example, to load the R module from the CBI module repository: `module load CBI r`
- To see what gets set when a module is loaded, use `module show`. For example, to see what gets set when the mpi module is loaded: `module show mpi`
- To see what software modules you currently have loaded, use `module list`
- To see what software modules are currently available (in the software repositories you have loaded), use `module avail`
- To disable ("unload") a previously loaded module, use `module unload`. For example, to unload the R module if it had been loaded previously: `module unload r`
- To disable all loaded software modules and repositories, use `module purge`
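Putting these commands together, a typical session on a development node might look like the following:

```sh
module load CBI          # make the CBI repository's modules visible
module avail             # list the modules that are now available
module load CBI r        # load R from the CBI repository
module list              # show which modules are currently loaded
module unload r          # unload R when finished
module purge             # or unload everything, including the repository
```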
- Other ways of loading software
- CentOS Software Collections (SCL)
- Build the software in your home directory
- Use a Singularity container (similar to Docker and Docker container images can be converted to Singularity images)
- The current job scheduler is SGE 8.1.9 (Son of Grid Engine); however, Wynton will be transitioning to the Slurm job scheduler in Q4 2020
- The scheduler coordinates distributing jobs, which get submitted as batch scripts, to the compute nodes of the cluster
- Example SGE job submission: `qsub -l h_rt=00:01:00 -l mem_free=1G my_job.sge` (replace the time, memory, and file name with your choices)
  - `my_job.sge` = the batch script file to be submitted
  - `-l h_rt` = maximum runtime (hh:mm:ss or seconds)
  - `-l mem_free` = maximum memory (`K` for kilobytes, `M` for megabytes, `G` for gigabytes)
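For reference, a minimal `my_job.sge` script consistent with the submission above might look like this sketch (the echo/date commands are placeholders for real work):

```sh
#!/bin/bash
#$ -S /bin/bash          # run the job with bash
#$ -cwd                  # start in the directory the job was submitted from
#$ -l h_rt=00:01:00      # maximum runtime
#$ -l mem_free=1G        # memory request
echo "Running on $HOSTNAME"
date
```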
- Jobs always run on the compute nodes whether they are submitted from a login node or from a development node.
- To check on the job
  - Current status: `qstat` or `qstat -j 191442` (replace 191442 with the actual SGE job id)
  - After the job ran successfully: `grep "usage" my_job_sge.0284740` (replace the output file name with the actual output file name)
  - After a failed job: `tail -100000 /opt/sge/wynton/common/accounting | qacct -f -j 191442` (replace 191442 with the actual SGE job id)
- How much memory to request when submitting a job?
- With experience and trial & error, you can estimate the memory requirements for various types of jobs
- Logs, reports, and accounting records can help provide clues (see the sketch below)
- Wynton is relatively forgiving on memory estimates
- If unsure, try 8GB and then increase/decrease accordingly
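As one way to gather those clues, the SGE accounting record for a finished job includes its peak memory use (`maxvmem`); the job id below is a placeholder:

```sh
# Look up peak memory and wallclock time for job 191442
tail -100000 /opt/sge/wynton/common/accounting | qacct -f -j 191442 | grep -E "maxvmem|wallclock"
```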
- Tips on submitting jobs
  - For intensive jobs during busy times, you can reserve resources for your job as soon as they become available by including the `-R y` parameter
  - Compute nodes do not have access to the internet, i.e., you cannot run jobs that include steps like downloading files from online resources.
  - Development nodes DO have access to the internet.
  - If your script or pipeline requires access to the internet, consider splitting up the work: run a script on a dev node that retrieves the online files and then submits jobs to be run on the compute nodes (see the sketch below).
  - Cron jobs can also be run on a dev node to periodically download files, separate from the compute-heavy jobs that are submitted to the compute nodes.
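A minimal sketch of that split, run on a development node; the URL, file names, and job script are placeholders:

```sh
#!/bin/bash
# Run on a dev node (it has internet access): fetch the input, then hand off to the scheduler.
wget -O input.fasta "https://example.org/data/input.fasta"   # hypothetical download
qsub -l h_rt=04:00:00 -l mem_free=4G my_job.sge              # the compute happens on a compute node
```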
- To check the job queue metrics of the cluster, go to https://wynton.ucsf.edu/hpc/status/index.html
- https://wynton.ucsf.edu/hpc/scheduler/submit-jobs.html
- A parallel environment for multithreaded (SMP) jobs is available for use on the cluster
- This environment must be used for all multithreaded jobs. Such jobs not running in this PE are subject to being killed by the cluster systems administrator without warning.
- Example submission script for a parallel BLAST job (see below for how to submit it):

  ```sh
  #!/bin/bash
  #
  #$ -S /bin/bash
  #$ -l arch=linux-x64   # Specify architecture, required
  #$ -l mem_free=1G      # Memory usage, required. Note that this is per slot
  #$ -pe smp 2           # Specify parallel environment and number of slots, required
  #$ -R yes              # SGE host reservation, highly recommended
  #$ -cwd                # Current working directory
  blastall -p blastp -d nr -i in.txt -o out.txt -a $NSLOTS
  ```
Notes on the example
- In the above example, the '-a' flag tells blastall the number of processors it should use.
- $NSLOTS is the number of slots requested for the parallel environment
- more information on parallel and MPI jobs, https://salilab.org/qb3cluster/Parallel_jobs
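To submit the example script above, save it to a file and pass the file to `qsub`; the file name here is a placeholder:

```sh
qsub blast_smp.sge    # the parallel environment, memory, and architecture requests are read from the #$ lines
```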
- Compiling GPU applications
- The CUDA Toolkit is installed on the development nodes
- Several versions of CUDA are available via software modules. To see the currently available versions, run the command:
module avail cuda
- more information on GPU jobs, https://wynton.ucsf.edu/hpc/scheduler/gpu.html
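A quick sanity check on the GPU development node might look like the sketch below; whether an unversioned `module load cuda` selects a default version is an assumption, so append a specific version if needed:

```sh
module avail cuda     # list the CUDA toolkit versions installed as modules
module load cuda      # load one (append /<version> to pick a specific release)
nvcc --version        # confirm the CUDA compiler is now on your PATH
```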
- It is currently not possible to request interactive jobs via the scheduler
- There are dedicated development nodes (dev1, dev2, dev3, gpudev1) that can be used for short-term interactive development such as building software and prototyping scripts before submitting them to the scheduler.
- Interactive Python session
  1) ssh to a login node
  2) ssh to a dev node
  3) type `python3` to enter the Python REPL for an interactive session
  4) when done, type `exit()` to quit the session
- Interactive R session
  1) ssh to a login node
  2) ssh to a dev node
  3) type `R` to enter the R interactive session
  4) when done, type `q()` to quit the session
- Interactive MATLAB session
  1) ssh to a login node
  2) ssh to a dev node
  3) type `module load Sali matlab`
  4) type `matlab`
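Putting the steps together, an interactive R session might look like the following sketch (the user name, prompts, and choice of dev node are illustrative):

```
{local}$ ssh alice@log2.wynton.ucsf.edu   # 1) log in to a login node
[alice@wynlog2 ~]$ ssh dev1               # 2) hop to a development node
[alice@dev1 ~]$ module load CBI r         # load R from the CBI repository
[alice@dev1 ~]$ R                         # 3) start the interactive R session
> q()                                     # 4) quit when done
```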
More information on working with
GUI apps / X-Windows / X2Go
- X2Go is accelerated remote desktop software
- It should be significantly faster than using X Windows tunneled through ssh
- To use X2Go on Wynton, you will need to install the X2Go client on your computer
- Backup your data if it is important
- Use login nodes (or dev nodes) to submit batch jobs to the cluster; use dev nodes for interactive work
- Use local scratch for staging data and computations
- If using conda environments in Anaconda Python, this is best done inside a Singularity container
- If writing many files to the file system, for example thousands or more, avoid writing all of the files to a single directory.
- instead, spread out the files into a number of different directories for better performance
- For interactively using GUI applications, using X2Go will have better performance than X-forwarding
- Check the job scheduler logs
  - error log: unless otherwise specified, this will be in the directory that the job was launched from, and the file name will be the job script name followed by `.e<jobid>`
  - output log: unless otherwise specified, this will be in the directory that the job was launched from, and the file name will be the job script name followed by `.o<jobid>`
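To inspect those logs for a particular job, something like this works; the script name and job id are placeholders:

```sh
cat my_job.sge.o191442   # standard output from the job
cat my_job.sge.e191442   # standard error (a good first place to look when a job fails)
```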