A DEA Guide for using Gadi - GeoscienceAustralia/dea-notebooks GitHub Wiki

What’s a Gadi?

Gadi is the word for 'to search for' in the language of the Ngunnawal people, the traditional custodians of the land where the National Computational Infrastructure (NCI) is located in Canberra. It is also the name for the new(ish) supercomputer at the NCI.

How do I get onto Gadi?

There are resources that explain how to get onto Gadi; see this page.

First, you need to have an account with the National Computational Infrastructure (NCI), and the instructions to do that can be found here.

Once you have an account you can request access to the projects that you will need to be a part of. This is so that you can access storage space, data, and the time to run jobs.

The above link also has information about the Virtual Desktop Infrastructure (VDI). This is one of the best ways to access the NCI, especially if your machine does not have a native terminal and ssh client. This page has information about how to set up the VDI. NCI released a new version of the VDI on 28/10/2020; there is an FAQ page if you are having issues since the update.

Once you are on the VDI, you can open a terminal and ssh into Gadi by running ssh gadi and then entering your NCI password. This is often necessary if you want to launch jobs. For more information, see this page.

If you’re going to be using the VDI a lot, it might be helpful to get the system monitor blocks to appear at the top of your screen. These show memory usage and other handy information. Go to this page, install the Firefox extension it asks for, refresh the page, and toggle the On/Off button to make the blocks show up.

If you were using the old VDI and had some desktop icons that you would like to have again, there are a couple of things you can do:

  • Move the files from your ~/old_vdi_home/Desktop folder to ~/.local/share/applications
  • At this point they should show up in the 'Applications' menu, probably in the 'Other' submenu.
  • You can restart the gnome-shell by pressing Alt+F2 and then typing "restart".

Okay I can log in, now what?

Once you can log in to Gadi, either through the VDI or through a command line, a few setup steps will be helpful. (If you aren’t sure about using the command line, see “Help! The command line is confusing!” below.)

The installation and software setup page and the command line usage page have some good advice on getting set up with DEA, including settings that make the required modules load every time you log in.

If there are other modules you would like to load, you can list what is available with module avail and see what you already have loaded with module list. To load a module use module load <name of module>, and to unload it use module unload <name of module>. If you know how to edit your bash or login startup files, you can add these commands there so that the modules are loaded automatically. See the NCI page for more information on modules.
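As a rough sketch of what loading modules automatically can look like, the lines below could be appended to a shell startup file such as ~/.bashrc. The module path and the module name dea are assumptions for illustration, not taken from this page; check the setup page for the values that apply to you. The guard makes the snippet harmless on machines that do not have the module command.

```shell
# Sketch for ~/.bashrc: load DEA modules at login.
# ASSUMPTION: the modulefiles path and the module name "dea" are
# illustrative; confirm them against the DEA setup page.
if command -v module >/dev/null 2>&1; then
    module use /g/data/v10/public/modules/modulefiles
    module load dea
fi
```

After editing the file, open a new terminal (or run source ~/.bashrc) and confirm with module list.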

Some people have run into trouble because they don’t have a .pgpass file in their home directory. This seems to happen especially to people who already had an NCI account before they started using DEA, who used Raijin (Gadi’s predecessor) but haven’t used Gadi yet, or who are logging in for the first time after the VDI update. To check for this file, use ls .pgpass or ls -a in your home directory. (Files with a dot in front of their name are hidden, so .pgpass won’t show up if you just type ls.) If you are missing the file, there are a few places to look:

  • See if there is one in your Raijin home directory (these were migrated to Gadi).
  • See if there is one in your VDI home directory.
  • See if there is one in your VDI old_vdi_home directory.
  • See if there is one in your Gadi home directory.

Hopefully you will have one in one of these places. If you do, then copy it to the home directory where you don’t have one. Use scp <path/to/file> <path/to/destination>. The .pgpass file should be of the format described on this page.
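The check described above can be sketched as a small shell function. The name check_pgpass is hypothetical (it is not an NCI or DEA tool). The chmod 600 step reflects the PostgreSQL requirement that a password file must not be accessible to group or others; otherwise clients ignore it.

```shell
# check_pgpass is a hypothetical helper, not an NCI or DEA tool.
# Usage: check_pgpass [path]   (defaults to ~/.pgpass)
check_pgpass() {
    pgpass="${1:-$HOME/.pgpass}"
    if [ ! -f "$pgpass" ]; then
        echo "missing"
        return 1
    fi
    # PostgreSQL clients ignore the file if group or others can access it.
    chmod 600 "$pgpass"
    echo "ok"
}
```

Running check_pgpass with no arguments in each of the home directories listed above shows which ones already have the file.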

To submit a job script to Gadi, you need to be connected to it. You can ssh in directly, or, from the VDI, you can either ssh in or run remote-hpc-cmd init, as described here.

Help! The command line is confusing!

If you’re feeling a bit rusty or lost with your command line skills, this website goes through some frequently used commands that might be handy.

How much compute am I using?

The supercomputer at the NCI is called Gadi and it has A LOT of compute cores and nodes. Each node has 48 CPUs and 4GB of memory per CPU.

We will talk more about running things on Gadi later – see “I want to run a job”, but here is some handy information on the computational costs associated with running on Gadi:

To estimate how much a job will cost, use the formula ncpus*walltime*2, where walltime is how long the job will take to run in hours, and the *2 is the “gadi charge”.

This means that a job that is going to run for 10 hours on 16 CPUs has a cost of 16*10*2 = 320SU, and you would request 16*4GB = 64GB of memory.

The costs of jobs are measured in service units or SU. In some cases the quota for everyone for the quarter might be 1MSU (1,000,000SU), and your job might take up 30kSU (30,000SU).

If your job terminates early, then the remaining quota (reserved up to the requested wall-time) is refunded (pro rata). Try to avoid greatly overestimating the wall-time, in case the code inadvertently hangs (wasting that full amount of time and quota). On the other hand, if you under-estimate the memory or wall-time requirements, then the job will get killed after it exceeds them (potentially losing unsaved progress and therefore also wasting quota).

If you request more memory than ncpus*4GB, which you might do for a big remote sensing job, it gets charged as 2 * (memory/4GB) * walltime instead of by the number of CPUs.
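The two charging rules above can be combined into one estimate: the job is charged on whichever is larger, the CPU count or memory/4GB. The function below is a minimal sketch of that arithmetic. The name su_cost and its optional fourth argument (the charge rate, defaulting to 2) are hypothetical conveniences, not NCI tools.

```shell
# su_cost is a hypothetical helper for estimating service units.
# Usage: su_cost <ncpus> <mem_gb> <walltime_hours> [rate]
su_cost() {
    awk -v n="$1" -v m="$2" -v h="$3" -v r="${4:-2}" 'BEGIN {
        eff = m / 4              # memory expressed as "equivalent CPUs"
        if (n > eff) eff = n     # charge on the larger of the two
        print r * eff * h
    }'
}

su_cost 16 64 10    # the worked example above: prints 320
su_cost 16 128 10   # same CPUs but double the memory: prints 640
```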

I want to run a job!

To run a job, you need to submit a job script. So before we talk about submitting jobs, we need to discuss what should go into our job script.

There are a couple of options here: you can include everything in the job script, or you can allow for some input from the command line.

There are also a few different queues available on Gadi, such as normal, express and copyq. These have differing charges. For example, express costs three times as much as the normal queue and most of the time is probably not necessary (other than for quick debugging).

On Gadi, your job scripts need a few additional things, most notably -l storage=<project/path+project/path>. Without it, you will run into issues with modules and scripts not being found once the job reaches the node.

So the preamble to your job script will look something like:

#!/bin/bash
#PBS -P <project>
#PBS -l walltime=00:30:00
#PBS -l ncpus=16
#PBS -l mem=64GB
#PBS -l storage=gdata/<project>+gdata/rs0+gdata/v10
#PBS -l software=python
#PBS -l wd

After this you put in the command to run, e.g. python my_script.py. The preamble above requests a walltime of 30 minutes.
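If you write many job scripts, one option is to generate the preamble so that the memory request always matches 4GB per CPU. The sketch below prints a complete job script to standard output. The function name make_job_script and the project code ab12 are hypothetical; the storage list simply mirrors the example preamble.

```shell
# make_job_script is a hypothetical helper; "ab12" below is a made-up project.
# Usage: make_job_script <project> <ncpus> <walltime hh:mm:ss> <command>
make_job_script() {
    project="$1"; ncpus="$2"; walltime="$3"; run_cmd="$4"
    mem=$((ncpus * 4))   # 4GB of memory per CPU
    cat <<EOF
#!/bin/bash
#PBS -P ${project}
#PBS -l walltime=${walltime}
#PBS -l ncpus=${ncpus}
#PBS -l mem=${mem}GB
#PBS -l storage=gdata/${project}+gdata/rs0+gdata/v10
#PBS -l software=python
#PBS -l wd

${run_cmd}
EOF
}

# Print a 16-CPU, 30-minute job script; redirect to a file to keep it.
make_job_script ab12 16 00:30:00 "python my_script.py"
```

Redirecting the output, for example with > job.pbs, gives a file you can inspect before submitting.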

You can also add these at the command line when you are submitting the job, but then you wouldn’t have them in the preamble.

If you have never made a job script or never run one before, ask someone if you can see one of theirs to use as a template or ask them to check over yours to see if it will work.

So how do I submit a job?

To submit a job you use the PBS command qsub. (PBS is what is used on many supercomputers to organise the scheduling of jobs.)

The qsub command stands for submit to queue. You can expect a typical command to look something like qsub job.pbs, where job.pbs is the name of your job script. If you are adding options from the command line, it would look like qsub -P <project> -l storage=gdata/<project>+gdata/rs0+gdata/v10 job.pbs.

PBS has a lot of commands that are associated with using the queues on Gadi (and other supercomputers). This page and this page list some useful commands; note that not all of them may be available on Gadi.

How do I know if my job is running?

To check on your own jobs, use qstat, which reports the status of jobs in the queue. Use it as qstat -u $USER or qstat -u <username>. The jobs you have running or in the queue will be listed. If none are listed, then you have no jobs running or queued. If the job has a Q next to it then it is in the queue, if it has an R it is running, and if it has an H it is on hold. Hold happens when the project has run out of space or time, or if there is maintenance happening on the supercomputer.

You can also use nqstat_anu, which will give you more information about jobs on a variety of queues.

OMG I didn’t want to run that job!!!!!!

Do not fear, there are options if you submitted the wrong job, you think it’s broken, or you accidentally submitted too many jobs.

If it is just a couple of jobs, you can grab the job id when you do a qstat (above) and then use qdel to delete it from the queue: qdel <job id>.

If there are a bunch of jobs, you will be better off selecting them all and deleting them from the queue in one go: qselect -u <username> | xargs qdel.
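Because a bulk delete cannot be undone, it can be worth previewing what would be removed first. The sketch below reads job ids on standard input and prints the qdel commands without running them. The name preview_qdel is hypothetical, and the job ids in the example are made up.

```shell
# preview_qdel is a hypothetical dry-run helper: it only prints commands.
preview_qdel() {
    xargs -n 1 echo qdel
}

# Stand-in for the output of qselect (made-up job ids):
printf '1234567.gadi-pbs\n1234568.gadi-pbs\n' | preview_qdel
```

On Gadi you would pipe the real selection through it, i.e. qselect -u $USER | preview_qdel, and switch to xargs qdel once the list looks right.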

How do I know how much space/time a project has available?

A handy way of checking this is to use the nci_account command. To see how much time a project has available: nci_account -P <project>, then to see how much each user has used: nci_account -P <project> -v.

To check on space, you can use lquota. To check your own usage, run df or du -h in the directory you are interested in.
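When a directory is fuller than expected, a sorted du listing usually shows where the space went. The function below is a minimal sketch. The name biggest_dirs is hypothetical, and it skips hidden entries because the shell glob * does not match them.

```shell
# biggest_dirs is a hypothetical helper.
# Usage: biggest_dirs [directory] [how_many]
biggest_dirs() {
    dir="${1:-$HOME}"
    # -s gives one total per entry; -k reports KiB so sort -n orders correctly.
    # The largest entries end up at the bottom of the list.
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "${2:-10}"
}
```

For example, biggest_dirs ~ 5 lists the five largest visible entries in your home directory, largest last.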

Navigating on Gadi is hard :(

If you aren’t used to doing a lot of navigating from the command line, or are getting tired of remembering a bunch of long paths to the directories that you frequently visit, you can set up some shortcuts or aliases to make navigating easier.

Create a symbolic link (a shortcut) in your home directory to your project directory. For example having a directory that links to /g/data/<project>/users/<user>/path/to/dir in your home directory on the VDI would be handy. There is some info on how to do this here but running ln -s <sourcedir> <linkdir> should work.
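A slightly more defensive version of the ln -s command is sketched below: it refuses to create the link if the target is missing or the link name is already taken. The function name make_shortcut is hypothetical, and the paths in the usage note keep the same placeholders as the text.

```shell
# make_shortcut is a hypothetical wrapper around ln -s.
# Usage: make_shortcut <target_directory> <link_path>
make_shortcut() {
    target="$1"; link="$2"
    if [ ! -d "$target" ]; then
        echo "no such directory: $target" >&2
        return 1
    fi
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "already exists: $link" >&2
        return 1
    fi
    ln -s "$target" "$link"
}
```

For example, make_shortcut /g/data/<project>/users/<user>/path/to/dir ~/mydir would let you reach that directory with cd ~/mydir (mydir is just an example name).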

Remember that your VDI home directory is separate from your Gadi home directory. (Only the latter, and project directories, are accessible by FTP.)

If you store too much in your home directory, you may be blocked from logging back in.

If you ever need to change file/directory permissions, do so from Gadi rather than VDI. (VDI has only partial insight into the access control lists that govern filesystems at the NCI, and so can be misleading.)

I use the sandbox, but want to put something on the VDI (or vice versa)

You can open the sandbox from the VDI, by opening Firefox and navigating to the sandbox page and logging in. You can find information on using the Sandbox here.

If you want to download/upload data from/to the sandbox to the VDI/Gadi then you can do so using the download/upload buttons in the sandbox. It might be a good idea to have a symbolic/soft link in your VDI home directory that points to the directory you want to move things to or from. See above for symbolic link instructions.

What about something from my computer?

This really depends on what it is, how familiar you are with the command line, and your storage arrangements on Gadi. You should be able to scp something from your computer to somewhere on Gadi, but you need to know where you’re planning to put it. You could expect this command to look something like scp <file> <username>@gadi-dm.nci.org.au:</g/data/path/to/directory>.

You can also use FileZilla for this (available via Software Centre on your PC if you work for GA, or via the internet if not). It gives you a visual interface with folder structures and lets you sftp things up and down to the VDI and Gadi, and you can use the data-mover queue to get big stuff up and down on Gadi faster. Handy addresses are sftp://<you>@vdi-sftp.nci.org.au, sftp://<you>@gadi-dm.nci.org.au, and sftp://<you>@gadi.nci.org.au; use port 22 for all of them. FileZilla is also super useful if you can’t get into your VDI because you put too much stuff in your home directory, got locked out, and need to clean it up before you can log in.

Do you have any helpful tips?

Yes, yes we do:

ASK FOR HELP! If you’re confused or unsure or need some help, ask in one of the Slack channels. We are friendly folk who are happy to help, and the DEA beginners channel is an excellent place to start.

See this gist for some handy hints from Robbi.

Test things thoroughly before trying to run at scale. Make sure you know the expected behaviour and test a range of situations.

You may like to configure an ssh key or similar for git and GitHub, so that you can push commits without needing to transmit your account password.
