Wharton High Performance Computing Cluster - bblockwood/lab GitHub Wiki

Applying for an Account

To apply for an account, follow the link here. All you need is a PennKey account and a Wharton domain account (i.e., your email ends in wharton.upenn.edu). You can contact [email protected] with any questions. For more information on how to apply for an account, see the "Getting an Account" page.

Note that an RA's account will be linked to a faculty member, so resource limits apply to a faculty member and all of their RAs combined. The current limits (e.g., 64 slots per user with 1 core per slot) are outlined on this page.

Connecting to the HPCC (also referred to as HPC3)

Terminal

To access the cluster, whether on Penn's campus or at home, you will first need a VPN to establish a remote connection. Wharton uses the FortiClient VPN software, which can be downloaded here. Instructions on how to download, install, configure, and use the VPN depend on your computer's operating system and can be found here.

Once you have established the necessary network connection, how you connect to the cluster will depend on your computer's operating system.

Windows

For Windows users, the preferred client is MobaXterm (Home Edition). Instructions on how to download MobaXterm can be found here. A short instructional video on MobaXterm installation can also be viewed here.

Once you have MobaXterm installed, open it and click the "Start local terminal" button.

Mac

For Mac users, the preferred client is the Terminal, which should not require any additional installation (you can do a Spotlight Search for "Terminal" to locate it).

Both

Then, enter:

ssh [email protected]

(where username is your username) and press Enter. Finally, enter your PennKey password to connect. Note that your password will not be displayed as you type it.

More information on accessing the HPCC for both Windows and Mac users can be found here. Note: these instructions may differ slightly for those not based in the Business Economics and Public Policy department.

Using a graphical interface

For most work, it is most expedient to use the terminal. However, a graphical interface may sometimes be useful for running certain software. For example, setting up Rclone to sync Dropbox folders requires this interface. See this page on Rclone for further details.

Setting up a Project

Windows

On MobaXterm, you can view your directories and files on the cluster in the Sftp tab in the sidebar (pictured below).

Images/MobaXterm.PNG

At the top of the tab are several buttons that allow you to, e.g., upload files from your computer to the cluster or download them back (the up and down arrows, respectively), create new directories (the folder with a plus in the bottom right corner), or delete files (the circle with an X).

Mac

To move files to and from your HPCC home folder, you should use an FTP/SFTP transfer utility. Cyberduck is a good free, open-source option.

Submitting Jobs

To submit a job, you will need two main ingredients:

  1. A job script: This script is a few lines of code containing instructions for what you want to run on the cluster. To create a job script, open a code/text editor (like Atom, BBEdit, or VS Code), write your code, and save it as a .sh file (in Atom or VS Code, just append .sh to the end of the file name when saving). Below is an example of a script that runs the MATLAB file run_simulations.m (note that we don't write the .m extension in the script), with a few additional options specified. Specifically:

    • -N simulations_test names the job simulations_test
    • -j y joins the output and error files (when the job is complete, a file is written to your working directory showing what would have appeared in the MATLAB command window; the file will have a name like simulations_test.o#######).
    • -P ProfessorTeamName charges the job against a specific faculty member's resource allotment. This is required whenever you submit a job for a faculty member other than the one your account is linked to (i.e., the faculty member you registered your account with). Ask that faculty member for their HPCC team name and use it in place of ProfessorTeamName in your bash file. If you are running an interactive session for a different faculty member, you must put -P ProfessorTeamName after the qlogin command to use their resources.

    When saving your file, make sure it has Unix (not Windows) line endings. In Atom or VS Code, check that "LF" (rather than "CRLF") is displayed at the bottom of the window (you can see LF displayed in the image below). If CRLF is displayed, simply click it and choose the LF option.

    A list of additional options can be found under the "More qsub Options" heading on this page.

    Images/VSCode_example.png

  2. The code that you want to run: This code needs to be accessible on the cluster. In MobaXterm, this means uploading the file you need to the correct location. In the example above, run_simulations is in the working directory, so we don't need to specify a path to it. To change your working directory, use the cd path command where path is the path of the directory you want relative to your current working directory.

    Note: If you are outputting content (e.g., a plot) and using MobaXterm, make sure that your code is designed to output it to a location relative to the directories you have set up on the cluster.
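Putting the pieces together, the job script described above might look like the following sketch. The file name job_script.sh and the exact MATLAB invocation flags are assumptions and may differ on the cluster; check Wharton's MATLAB documentation for the precise command.

```shell
#!/bin/bash
#$ -N simulations_test
#$ -j y
#$ -P ProfessorTeamName

# Run run_simulations.m without a GUI; note that the .m extension is omitted
matlab -nodisplay -r run_simulations
```

You would then upload both run_simulations.m and job_script.sh to the same directory on the cluster and submit the job from there.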

To learn how to submit a job, view the short instructional video here.

The primary command you will use has the form qsub options job_script.sh, where options stands for any additional options you would like to specify that you didn't already write in your job script, and job_script.sh is the name/location of your job script (in this example, it is in our current working directory). This command submits the job as specified in job_script.sh.

One useful option is -m e -M your_email_address, which sends an email to that address once the job is complete, containing information such as the runtime.

Other helpful commands include:

  • qstat: This provides information about your submitted jobs, such as when they were submitted and if they're running or in the queue.
  • cat filename: This displays the contents of filename, which is helpful for reviewing output files, e.g., to check for errors.
  • qdel jobID: This deletes the job with ID or name jobID. You can view the job IDs or names of your jobs using qstat.
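As an illustration, a typical post-submission workflow might look like the following (the job ID and output file name here are hypothetical; these commands only work in the cluster's SGE environment):

```shell
qstat                           # list your jobs and their states (r = running, qw = queued)
cat simulations_test.o1234567   # review the joined output/error file after the job finishes
qdel 1234567                    # cancel job 1234567 if something went wrong
```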

Parallelization

Parallelization in MATLAB

With access to the Parallel Computing Toolbox in MATLAB (which you will likely have installed by default), you can run operations in parallel using different "workers" in what is called a parallel pool. Each core on your system gives you access to one worker, and workers in the same pool can run operations in parallel with one another. Your personal computer likely has only a few cores, which limits how many operations you can run in parallel. On the HPCC, you have access to up to 64 cores at a time (and potentially more if you contact [email protected] to discuss your needs). You can also increase the amount of RAM per worker on the HPCC (see the instructions under "Using More RAM" here; the default limit is 1024GB of total RAM).

However, using more workers incurs additional overhead, so maximizing the number of workers does not always minimize runtime, and runtime does not decrease linearly with the number of workers. It is recommended that you start with a smaller number of workers (e.g., 15 or 20) and see how changing the number affects runtime. It may also be more efficient to run several different jobs at the same time (i.e., submit different job scripts) or to run an array job (see this video for a discussion of array vs. parallel jobs) than to increase your number of workers.

To start up a parallel pool and choose your number of workers, you will need to add some code like in the example below to the beginning of your MATLAB code (note that the number of workers does not need to be specified in your job script), where we've started a parallel pool with 20 workers:

% Start parallel pool
mypool = parpool(20);

At the end of your code, you can close the pool, like in the example below:

% Close parallel pool
delete(mypool);

In order to take advantage of your additional workers, you will need to parallelize some components of your code. One common way to do this is to replace a for loop with a parfor loop in your MATLAB code (see here for the parfor documentation). In a parfor loop, iterations run in parallel rather than in sequence, which means that not every for loop can be converted into a parfor loop. For example, a loop in which each iteration uses the output of the previous iteration would not work as a parfor loop.

See here for more discussion of MATLAB on the HPCC, and see here for more general information about parallel computing in MATLAB.

Parallelization in Stata

To run a Stata script as an array job, you just need to specify the number of tasks you would like to run in parallel, along with the amount of memory for each task, at the top of the bash file. You can do this in the following way:

#!/bin/bash
#$ -t 1-100
#$ -l m_mem_free=5G

stata-se test_parallel_script.do $SGE_TASK_ID

  • -t 1-100 sets the number of tasks to run. This will run 100 scripts with task IDs of 1 to 100.
  • -l m_mem_free=5G sets the amount of memory per task. This will allocate 5GB of RAM per task.
  • $SGE_TASK_ID passes the task ID to Stata as a command-line argument, in case you would like to use it. You can read the task ID from the arguments and store it as a local in Stata in the following way:

* Get the task ID passed in from the bash script
args task_id
* Store it as a local
local array_day = `task_id'

Many of the same principles described above for MATLAB apply to Stata as well. In particular, when working with array jobs in Stata on the HPCC, be mindful that increasing the number of tasks adds overhead and does not always reduce runtime proportionally. It is often useful to experiment with different array sizes or to run multiple separate jobs in parallel. Also note that each project team is limited to 1024GB of total RAM across all jobs, and individual compute hosts are capped at 240GB. For example, a 100-task array job requesting 50GB per task will have only 20 tasks running concurrently (20 × 50GB = 1000GB), with the rest queued until resources free up. Refer to the recommendations above when deciding how best to distribute your Stata workloads.

Check out the HPCC Training page for additional useful tips and tricks.

Cloning a GitHub Repository

To connect to GitHub on the cluster, you will first need to generate an SSH key for authentication purposes (see here for more info about connecting to GitHub with SSH). Use the following command to generate a key:

ssh-keygen -t rsa

You can choose a passphrase, which you will then be required to enter when connecting to GitHub, for additional security. The public key will be saved at ~/.ssh/id_rsa.pub. (If you encounter issues with these commands in MobaXterm, go to Settings -> Configuration -> Terminal and make sure that "Use Windows PATH" is checked under "Local shell settings.")

Next, you will need to add your SSH key to your GitHub account. You can do so at https://github.com/settings/keys by clicking the "New SSH key" button and pasting in the public key saved in the id_rsa.pub file from the previous step (the entire contents of the file constitute the key). Never paste or share the private key (the id_rsa file without the .pub extension).
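To copy the key, you can simply print it in the terminal. The sketch below generates a throwaway key pair at a hypothetical temporary path so the commands are self-contained; on the cluster, you would just run cat ~/.ssh/id_rsa.pub on the key you generated above.

```shell
# Remove any leftover demo key, then generate a fresh RSA pair
# with no passphrase (-N "") at a temporary path (-f), quietly (-q).
# On the cluster, ssh-keygen -t rsa writes to ~/.ssh/id_rsa by default.
rm -f /tmp/demo_key /tmp/demo_key.pub
ssh-keygen -t rsa -N "" -f /tmp/demo_key -q

# Print the public half -- this entire line is what you paste into GitHub
cat /tmp/demo_key.pub
```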

Now, to clone a repository, you can use a command of the following form:

git clone [email protected]:username/reponame

where username and reponame identify the repository on GitHub; if you chose a passphrase, you will be prompted for it. The repository will be cloned into your working directory. Once the repository is cloned, you can use all your usual Git commands to modify and inspect it (e.g., git commit, git push, git log). You can search this site to learn more about various commands and how to use Git.