How to run on the FAU HPC

Instructions to connect:

Since our team already has access, this section is not elaborated.

FAU students can log in to the HPC portal using their "Benutzerkennung" (FAU user ID). There you can accept pending invitations, and your account will be set up within the next day. Once your account has been set up, you can also view it in the portal.

2. SSH setup:

Create an SSH key via a terminal/PowerShell/etc.: `ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_nhr_fau`. You need to enter a passphrase twice. Then you will receive a message like:

Your identification has been saved in EXAMPLE_LOCATION_fau
Your public key has been saved in EXAMPLE_LOCATION_fau.pub
The key fingerprint is: ***
The key's randomart image is: ***

Save the information.

Navigate to the EXAMPLE_LOCATION and open the .pub file. More Information
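If you used the default path from the command above, a quick way to display the public key for copying is:

```bash
# Print the public key so it can be copied into the HPC portal
# (assumes the key was created as ~/.ssh/id_ed25519_nhr_fau above)
cat ~/.ssh/id_ed25519_nhr_fau.pub
```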

3. Secure Connection

  • Go back to the HPC portal.
  • Click on Add new SSH Key.
  • Copy everything from the .pub file into Key Content.
  • Give the key a name so that you can identify it later.
  • Click on Submit. The result should be a Distribution Interval Time notice. Additionally, it should display your alias and a fingerprint. More information

4. Configure the Template

The SSH config file is essential for the connection.

  • The NHR provides a template. Copy it and save it as a file named exactly config (not config. or anything like config.txt).
  • The entry for every computing cluster contains a line User <HPC account>. Replace all occurrences of <HPC account> in the file with your HPC account name. You can find it in the HPC portal under Your accounts → Active accounts → Account details → HPC-Account:.
  • Place the file in the same location where the private and public SSH keys are saved.
  • If your private SSH key is not named id_ed25519_nhr_fau or is not in the ~/.ssh/ folder, set its actual location instead of ~/.ssh/id_ed25519_nhr_fau at the IdentityFile occurrences.
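As a sketch of the placeholder replacement described above (the account name ab12cdef is made up, and the GNU sed -i syntax shown here differs slightly on macOS):

```bash
# Replace every <HPC account> placeholder in the saved template with your own account name.
# "ab12cdef" is a hypothetical example value; take your real one from the HPC portal.
sed -i 's/<HPC account>/ab12cdef/g' ~/.ssh/config
```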


5. Connecting to csnhr.nhr.fau.de

The following steps work both when you are in an official FAU network (FAU.fm or eduroam) and when you are on a personal network. They assume that the config file and the public and private SSH keys from the previous steps are already in place.

  • Paste this command in the terminal: ssh csnhr.nhr.fau.de
  • You will receive a message Enter passphrase for key 'EXAMPLE_LOCATION_fau':
  • Enter your password (the one for that specific SSH key).
  • Then you will receive a lot of system information and a welcome message from csnhr.nhr.fau.de.
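If ssh does not pick up the config file automatically, the key and account can also be passed explicitly (ab12cdef is a placeholder for your HPC account name):

```bash
# Explicitly select the private key and HPC account for the dialog server
ssh -i ~/.ssh/id_ed25519_nhr_fau ab12cdef@csnhr.nhr.fau.de
```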

6. Connect to a specific cluster.

(use ctrl + D to exit csnhr)

  • Connect to the correct cluster.

We want to use TinyGPU, since the other clusters require more specific application forms and may exceed our use case.

  • Use the command ssh tinyx in the terminal. More commands for connecting to other clusters can be found at [Connecting to one of the cluster frontend nodes](https://doc.nhr.fau.de/access/ssh-command-line/#testing-the-connection).

Now your terminal prompt should begin with USERNAME@tinyx.


How to execute programs from the terminal

1.1 Create the file you want to execute

touch myFile.py

1.2 Edit this file with

cat >> myFile.py

1.3 Enter all the content you want and press Ctrl + D to finish the input.

print("Hello World!")

2.1 Check for available modules

module avail

2.2 Activate the current Python version

module load python/3.12-conda

2.3 Check if it is actually loaded

python --version
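The module system itself can also confirm what is loaded:

```bash
# List all currently loaded modules; the python module should appear here
module list
```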

3.1 Create a .sh file to execute the other file

touch submit.sh

3.2 Enter the necessary SBATCH directives and the commands to execute. Use the exact Python module version you just checked for.

#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --gres=gpu:1

module load python/3.12.9

python myFile.py

3.3 Make the submission script executable in the terminal

chmod +x submit.sh

3.4 Submit the job

sbatch submit.sh
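While the job is queued or running, it can also be monitored from the terminal; squeue is the standard Slurm command for this (cluster-specific wrappers analogous to sbatch.tinygpu may exist as well):

```bash
# Show your own pending and running jobs
squeue -u $USER
```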

4.1 Check whether it worked. In the HPC portal you can also see that a job has been run, along with the details of the job.

4.2 When a job is done, you can see the results while logged into the tinyx cluster. You can list all files with

ll

4.3 You can view a file called hello_world_JOBNUMBER.err containing any error output, or hello_world_JOBNUMBER.out containing the normal output.

cat FILENAME
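For example, with the job name hello_world from the script above (the job number is part of the file name, hence the wildcard):

```bash
# List and print the output files produced by the job
ls -l hello_world_*.out hello_world_*.err
cat hello_world_*.out
cat hello_world_*.err
```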

More info on monitoring your jobs can be found under Job monitoring with ClusterCockpit.

How the cluster works (very basic understanding)

"All clusters at NHR@FAU use the batch system Slurm for resource management and job scheduling. When logging into an HPC system, you are placed on a login node. From there, you can manage your data, set up your workflow, and prepare and submit jobs."

A batch job consists of resource requirements, the maximum job runtime, the environment setup, and the commands for the applications to run. All of this is written into the job script.

sbatch [options] <job_script>

We will probably use TinyGPU, so we should use sbatch.tinygpu instead of sbatch.

After submission, sbatch will output a unique job ID.
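A sketch of what working with that job ID looks like, using standard Slurm commands (the ID 1234567 below is made up):

```bash
# sbatch prints something like: Submitted batch job 1234567
scontrol show job 1234567   # show detailed information about the job
scancel 1234567             # cancel the job if necessary
```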

Definitions for better understanding:

NHR user = a scientist or researcher affiliated with a German university who utilizes the computing resources and services provided by the National High Performance Computing (NHR) network.

Batch system Slurm = used for resource management and job scheduling, i.e. handling the queuing and prioritization of jobs.

The HPC hardware:

| Cluster name | #nodes | Target applications | Parallel filesystem | Local harddisks | Description |
| --- | --- | --- | --- | --- | --- |
| Fritz (NHR+Tier3) | 992 | high-end massively parallel | Yes | No | open for NHR and Tier3 after application |
| Alex (NHR+Tier3) | 82 | GPGPU (304 Nvidia A100 and 352 Nvidia A40 GPGPUs) | Yes (but only via Ethernet) | Yes (NVMe SSDs) | open for NHR and Tier3 after application |
| Meggie (Tier3) | 728 | parallel | no longer | No | RRZE's main working horse, intended for parallel jobs |
| Woody (Tier3) | 288 | serial throughput | No | Yes | cluster with fast (single- and dual-socket) CPUs for serial throughput workloads |
| TinyGPU (Tier3) | 35 (1638 GPUs) | GPU | No | Yes (SSDs) | nodes equipped with NVIDIA GPUs (mostly with 4 GPUs per node) |
| TinyFat (Tier3) | 47 | large memory requirements | No | Yes (SSDs) | for applications requiring large amounts of memory; each node has 256 or 512 gigabytes of main memory |
  • This cluster consists of many different hosts that have Nvidia GPUs (Nvidia RTX 2080 Ti (11 GB), Nvidia Tesla V100 (32 GB), Nvidia Geforce RTX 3080 (10 GB), Nvidia A100 (40 GB)) and different CPUs (Intel Xeon Gold 6134, Intel Xeon Gold 6226R, AMD EPYC 7662).
  • When running a job, you need to specify how many GPUs you want to run on.
  • It is possible to run interactive jobs (see the sketch below).
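A minimal sketch of requesting such an interactive GPU session with plain Slurm; the exact command or wrapper on TinyGPU may differ (compare sbatch.tinygpu above), so check the NHR@FAU documentation:

```bash
# Request one GPU on one node for one hour and open an interactive shell
salloc --gres=gpu:1 --time=01:00:00
```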

Are we able to train an LLM?

  • As students we have the Tier3 Grundversorgung (basic service tier). There are restrictions on which clusters we can access, but we do have access to TinyGPU (tinyx), which is mainly what we need for our application, since training an LLM is mostly done on GPUs.
  • We need to stick to the maximum runtime for a job, which is 24 h; an individual training cycle cannot exceed this. (This is not a problem, just something to keep in mind; see the example after this list.)
  • The training we are proposing does not seem to be prohibited by NHR@FAU, since it is part of our student work for AMOS.
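In a job script, that 24 h limit corresponds to the --time directive; requesting the maximum would look like this:

```bash
#!/bin/bash
#SBATCH --time=24:00:00   # maximum walltime per job, as noted above
# ... remaining directives and commands as in the hello_world example ...
```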

Important Links:

  • AMOS HPC Info
  • HPC Homepage
