# Monitoring Usage
The monitoring system tracks disk, CPU, and memory (RAM) utilization. Disk space is tracked per user and per group, while CPU and RAM are tracked per job, per user, and per account.
## Disk Space
### Directory or File Size
#### Via CLI (SSH)
- Log in to the cluster via SSH, or through the Open OnDemand Shell Access feature.
- Using the "du" utility along with a file path, you can get the size of a folder or file. To make the output easier to read, add the "-hs" options, which give a human-readable summary. For example, to get the usage of "/home/users/jtyocum/TEST":
```
du -hs /home/users/jtyocum/TEST
```
This will give you an output that looks like:
```
19M /home/users/jtyocum/TEST/
```
Alternatively, if you'd like a breakdown of the usage within the first level of that directory, you can add an "*" to the path:
```
du -hs /home/users/jtyocum/TEST/*
```
This gives an output that looks like:
```
4.0K /home/users/jtyocum/TEST/2024_inventory.yml
19M /home/users/jtyocum/TEST/ansible_playbooks
```
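If a directory contains many entries, it can help to sort that breakdown by size. Below is a minimal sketch, assuming the login node provides GNU coreutils (so "sort" supports the "-h" option); the path is just the example directory from above:

```bash
# Summarize each first-level entry, then sort by human-readable size,
# so the largest items appear at the end of the list.
du -hs /home/users/jtyocum/TEST/* | sort -h
```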
### User Quota
TODO: Will depend on production storage system.
### Group Quota
TODO: Will depend on production storage system.
## CPU & RAM Utilization
### Usage by Job
The "jobstats" group of tools shows usage both during and after a job has run. The system regularly records the job's usage as it runs, allowing you to view the usage over time.
#### Via CLI (SSH)
The "jobstats" command line tool will show you a text-based overview of your job's CPU and RAM usage. It also performs basic analysis, comparing the usage with your requested resources. For example, let's look up the usage for job ID 208:
- Connect to the cluster's login node via SSH or use the Shell Access feature in Open OnDemand.
- From the prompt, we can run the "jobstats" utility:
```
/opt/jobstats/bin/jobstats 208
```
The utility will generate output similar to the example below. At the top is an overview of the requested resources for the job, runtime information, etc. Following that is an overall summary of usage, along with a more detailed breakdown. Finally, at the end are some notes about the usage, for example, whether the job is only using a single CPU core. If you want to know more about the usage, you can follow the link to OnDemand, where you can view the data as graphs.
```
================================================================================
Slurm Job Statistics
================================================================================
Job ID: 208
NetID/Account: jtyocum/sph
Job Name: sys/dashboard/sys/rstudio
State: RUNNING
Nodes: 1
CPU Cores: 4
CPU Memory: 4GB (1GB per CPU-core)
QOS/Partition: normal/12c128g
Cluster: sph-demo
Start Time: Mon Oct 7, 2024 at 10:09 AM
Run Time: 04:50:29 (in progress)
Time Limit: 08:00:00
Overall Utilization
================================================================================
CPU utilization [ 0%]
CPU memory usage [ 0%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
sph-n01: 00:00:56/19:21:59 (efficiency=0.1%)
CPU memory usage per node - used/allocated
sph-n01: 8.1MB/4.0GB (2.0MB/1.0GB per core of 4)
Notes
================================================================================
* The overall CPU utilization of this job is less than 1%. This value is low
compared to the target range of 90% and above. Please investigate the
reason for the low efficiency.
* For additional job metrics including metrics plotted against time:
https://ondemand.hpc.sph.washington.edu/pun/sys/jobstats (VPN required)
* Have a nice day!
```
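If you don't know the job ID to pass to "jobstats", the standard Slurm client tools can list your jobs. Below is a minimal sketch, assuming "squeue" and "sacct" are available on the login node:

```bash
# Currently queued or running jobs belonging to your user.
squeue -u "$USER"

# Jobs that started today (including completed ones), with their IDs and states.
sacct -u "$USER" --starttime today --format=JobID,JobName%20,State,Elapsed
```

Once you have the numeric job ID, pass it to the "jobstats" command shown above.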
#### Via OnDemand (Web Interface)
OnDemand has views for currently running jobs, as well as the ability to query for past jobs. In addition to job-specific resource usage, you can also get a view of overall cluster and node utilization.
##### Currently Running Job
- Log in to Open OnDemand (requires Husky OnNet).
- From the "Jobs" menu, select "Active Jobs".
- If you have a large number of jobs, you can use the "Filter" field to narrow down the list.
- Click on the arrow adjacent to your job to expand the view. From here, you'll see some details about the job (requested resources, nodes, account, etc.). Near the bottom is a series of graphs showing the CPU and memory utilization over time. You can also select "Detailed Metrics" for additional information.
##### Already Completed Job
- Log in to Open OnDemand (requires Husky OnNet).
- From the "Jobs" menu, select "Job Stats Helper".
- In the text field, enter the Job ID.
- Click on the link provided, which will take you to a detailed view of the job's resource utilization.