slurmrestd - hokiegeek2/slurm-cloud-integration GitHub Wiki

Installation

Installing slurmrestd-enabled slurmctld

The Slurm build that enables slurmrestd is detailed starting here.

Create slurm User

Creating a slurm user involves associating the user with an account, which is accomplished via the following commands:

# Execute as the slurm user
sacctmgr create user name=<username> account=<account name> admin=<admin level>

An example is as follows:

sacctmgr create user name=kjyost account=root admin=Admin

Create slurmrestd User

As of slurm 21.08.5, slurmrestd must be run as a user other than slurm or root. Accordingly, create the slurmrestd user:

useradd -m slurmrestd

Generate JWT Key and Set Permissions

Since slurmrestd uses JSON Web Tokens (JWT) for authentication, a JWT key must be generated and (1) configured in slurmctld to enable generation of JWT tokens for slurm users and (2) configured in slurmctld and slurmdbd to validate user tokens submitted with each REST request. The JWT key is generated and locked down on the slurmctld host:

dd if=/dev/random of=/etc/slurm/jwt_hs256.key bs=32 count=1
chown slurm:slurm /etc/slurm/jwt_hs256.key
chmod 600 /etc/slurm/jwt_hs256.key
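Equivalently, when provisioning from a Python script, the key-generation and permission steps can be sketched as follows. write_jwt_key is an illustrative helper, not Slurm tooling, and the chown to slurm:slurm still has to be performed separately since it requires root:

```python
import os
import stat

def write_jwt_key(path: str, num_bytes: int = 32) -> None:
    """Generate an HS256 JWT signing key, mirroring the dd/chmod steps above."""
    key = os.urandom(num_bytes)  # cryptographically secure randomness, like /dev/random
    with open(path, "wb") as fh:
        fh.write(key)
    # 600: readable and writable by the owner only
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

For example, `write_jwt_key("/etc/slurm/jwt_hs256.key")` reproduces the dd and chmod commands above.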

Configure slurmctld and slurmdbd to Start with JWT Authentication Enabled

The following configuration entries are required in the slurm.conf and slurmdbd.conf files to enable HS256 JWT authentication in slurmctld and slurmdbd, respectively:

AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key

To enable RS256 JWT authentication, the following entries are required in the slurm.conf and slurmdbd.conf files, respectively:

AuthAltTypes=auth/jwt
AuthAltParameters=jwks=/etc/slurm/jwks.json
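For reference, a jwks.json file is a standard RFC 7517 JSON Web Key Set; a minimal single-key example has the following shape (all values below are placeholders, not working key material):

```json
{
  "keys": [
    {
      "kty": "RSA",
      "alg": "RS256",
      "use": "sig",
      "kid": "<key id referenced in token headers>",
      "n": "<base64url-encoded RSA modulus>",
      "e": "AQAB"
    }
  ]
}
```

The kid value here is what the getKeyId convenience method later in this page retrieves, and it must match the kid header of the tokens presented to slurmrestd.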

Starting slurmrestd

As stated above, as of slurm 21.08.5 slurmrestd must be executed by a user other than slurm or root. Either start slurmrestd as a service (service file example here), or run the following command sequence from the slurm-cloud-integration README as the slurmrestd user:

export SLURMRESTD_HOST=0.0.0.0
export SLURMRESTD_PORT=6820
export SLURM_JWT=daemon

slurmrestd -vvvv -a rest_auth/jwt $SLURMRESTD_HOST:$SLURMRESTD_PORT
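For the service option, a minimal systemd unit consistent with the command sequence above might look like the following sketch; the ExecStart path is an assumption and should match wherever slurmrestd was actually installed:

```ini
# /etc/systemd/system/slurmrestd.service -- illustrative sketch
[Unit]
Description=Slurm REST daemon
After=network.target slurmctld.service

[Service]
User=slurmrestd
Environment=SLURM_JWT=daemon
ExecStart=/usr/local/sbin/slurmrestd -a rest_auth/jwt 0.0.0.0:6820
Restart=on-failure

[Install]
WantedBy=multi-user.target
```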

New for Slurm 22.05.x

Updated configure Command Execution

./configure --prefix=/storage-restd/slurm-build --sysconfdir=/etc/slurm --enable-pam  --with-pam_dir=/lib/x86_64-linux-gnu/security/ --with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/ 

Usage Requirements

User

The user attempting to access the slurmrestd API must have a valid JWT token and must be a slurm user.

General Structure of REST Call

export SLURMRESTD_HOST=0.0.0.0
export SLURMRESTD_PORT=6820
 
export SLURM_JWT={JWT token provided to user}
 
username$ curl -H "X-SLURM-USER-NAME:username" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" http://${SLURMRESTD_HOST}:${SLURMRESTD_PORT}/<api call>
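The same call structure can be expressed in Python with only the standard library. slurmrestd_headers and slurmrestd_get below are illustrative helper names, not part of any Slurm package:

```python
import json
import urllib.request

def slurmrestd_headers(username: str, token: str) -> dict:
    """Build the two headers slurmrestd expects on every request."""
    return {"X-SLURM-USER-NAME": username, "X-SLURM-USER-TOKEN": token}

def slurmrestd_get(host: str, port: int, path: str, username: str, token: str) -> dict:
    """GET an endpoint such as /openapi/v3 and parse the JSON response."""
    url = f"http://{host}:{port}{path}"
    req = urllib.request.Request(url, headers=slurmrestd_headers(username, token))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

For example, `slurmrestd_get("einhorn", 6820, "/slurm/v0.0.37/ping", "slurm", token)` would correspond to the curl call above against a running slurmrestd.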

Example Commands

Determining slurmrestd Version

curl -H "X-SLURM-USER-NAME:slurm" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" http://${SLURMRESTD_HOST}:${SLURMRESTD_PORT}/openapi/v3

Get Slurm Nodes

slurm@einhorn:/home/kjyost$ curl -H "X-SLURM-USER-NAME:slurm" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" http://einhorn:6820/slurm/v0.0.37/nodes
{
   "meta": {
     "plugin": {
       "type": "openapi\/v0.0.37",
       "name": "Slurm OpenAPI v0.0.37"
     },
     "Slurm": {
       "version": {
         "major": 21,
         "micro": 5,
         "minor": 8
       },
       "release": "21.08.5"
     }
   },
   "errors": [
   ],
   "nodes": [
     {
       "architecture": "x86_64",
       "burstbuffer_network_address": "",
       "boards": 1,
       "boot_time": 1642155129,
       "comment": "",
       "cores": 4,
       "cpu_binding": 0,
       "cpu_load": 345,
       "extra": "",
       "free_memory": 13243,
       "cpus": 8,
       "last_busy": 1642417782,
       "features": "",
       "active_features": "",
       "gres": "",
       "gres_drained": "N\/A",
       "gres_used": "",
       "mcs_label": "",
       "name": "ace",
       "next_state_after_reboot": "invalid",
       "address": "ace",
       "hostname": "ace",
       "state": "idle",
       "state_flags": [
         "NOT_RESPONDING"
       ],
       "next_state_after_reboot_flags": [
       ],
       "operating_system": "Linux 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021",
       "owner": null,
       "partitions": [
         "debug"
       ],
       "port": 6818,
       "real_memory": 8192,
       "reason": "",
       "reason_changed_at": 0,
       "reason_set_by_user": "root",
       "slurmd_start_time": 1642417782,
       "sockets": 1,
       "threads": 2,
       "temporary_disk": 0,
       "weight": 1,
       "tres": "cpu=8,mem=8G,billing=8",
       "slurmd_version": "21.08.5",
       "alloc_memory": 0,
       "alloc_cpus": 0,
       "idle_cpus": 8,
       "tres_used": null,
       "tres_weighted": 0
     },
     {
       "architecture": "x86_64",
       "burstbuffer_network_address": "",
       "boards": 1,
       "boot_time": 1642155158,
       "comment": "",
       "cores": 2,
       "cpu_binding": 0,
       "cpu_load": 125,
       "extra": "",
       "free_memory": 3861,
       "cpus": 4,
       "last_busy": 1642419861,
       "features": "",
       "active_features": "",
       "gres": "",
       "gres_drained": "N\/A",
       "gres_used": "",
       "mcs_label": "",
       "name": "einhorn",
       "next_state_after_reboot": "invalid",
       "address": "einhorn",
       "hostname": "einhorn",
       "state": "idle",
       "state_flags": [
       ],
       "next_state_after_reboot_flags": [
       ],
       "operating_system": "Linux 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021",
       "owner": null,
       "partitions": [
         "debug"
       ],
       "port": 6818,
       "real_memory": 8192,
       "reason": "",
       "reason_changed_at": 0,
       "reason_set_by_user": null,
       "slurmd_start_time": 1642168816,
       "sockets": 1,
       "threads": 2,
       "temporary_disk": 0,
       "weight": 1,
       "tres": "cpu=4,mem=8G,billing=4",
       "slurmd_version": "21.08.5",
       "alloc_memory": 0,
       "alloc_cpus": 0,
       "idle_cpus": 4,
       "tres_used": null,
       "tres_weighted": 0
     },
     {
       "architecture": "x86_64",
       "burstbuffer_network_address": "",
       "boards": 1,
       "boot_time": 1642155165,
       "comment": "",
       "cores": 2,
       "cpu_binding": 0,
       "cpu_load": 165,
       "extra": "",
       "free_memory": 2562,
       "cpus": 4,
       "last_busy": 1642417600,
       "features": "",
       "active_features": "",
       "gres": "",
       "gres_drained": "N\/A",
       "gres_used": "",
       "mcs_label": "",
       "name": "finkel",
       "next_state_after_reboot": "invalid",
       "address": "finkel",
       "hostname": "finkel",
       "state": "idle",
       "state_flags": [
       ],
       "next_state_after_reboot_flags": [
       ],
       "operating_system": "Linux 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021",
       "owner": null,
       "partitions": [
         "debug"
       ],
       "port": 6818,
       "real_memory": 8192,
       "reason": "",
       "reason_changed_at": 0,
       "reason_set_by_user": null,
       "slurmd_start_time": 1642338798,
       "sockets": 1,
       "threads": 2,
       "temporary_disk": 0,
       "weight": 1,
       "tres": "cpu=4,mem=8G,billing=4",
       "slurmd_version": "21.08.5",
       "alloc_memory": 0,
       "alloc_cpus": 0,
       "idle_cpus": 4,
       "tres_used": null,
       "tres_weighted": 0
     }
  ]
}

submit: submits a job via a JSON payload:

curl -H "X-SLURM-USER-NAME:${SLURMRESTD_USER}" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" \
-H "Content-Type: application/json" --request POST --data @simple-slurm-json.slurm \
http://localhost:6820/slurm/v0.0.37/job/submit

The JSON payload is as follows:

{
    "job": {
    "name": "slurm-rest-test",
    "ntasks":3,
    "nodes": 3,
    "memory_per_cpu": 500,
    "current_working_directory": "/tmp",
    "standard_input": "/dev/null",
    "standard_output": "/tmp/slurm-rest-test.out",
    "standard_error": "/tmp/slurm-rest-test_error.out",
    "environment": {
        "PATH": "/bin:/usr/bin/:/usr/local/bin/",
        "LD_LIBRARY_PATH": "/lib/:/lib64/:/usr/local/lib"}
    },
    "script": "#!/bin/bash\necho 'I am from the REST API'"
}
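For scripted submissions, the payload above can be assembled in Python before POSTing. build_job_payload is a hypothetical helper that mirrors the JSON shape shown above:

```python
import json

def build_job_payload(name: str, nodes: int, ntasks: int, script: str,
                      cwd: str = "/tmp") -> str:
    """Assemble a job-submission payload matching the JSON structure above."""
    return json.dumps({
        "job": {
            "name": name,
            "ntasks": ntasks,
            "nodes": nodes,
            "memory_per_cpu": 500,
            "current_working_directory": cwd,
            "standard_input": "/dev/null",
            "standard_output": f"/tmp/{name}.out",
            "standard_error": f"/tmp/{name}_error.out",
            # slurmrestd expects an environment block with the job
            "environment": {
                "PATH": "/bin:/usr/bin/:/usr/local/bin/",
                "LD_LIBRARY_PATH": "/lib/:/lib64/:/usr/local/lib",
            },
        },
        "script": script,
    })
```

The returned string can be passed as the `--data` body of the curl command above.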

Slurm Python Interface

Background

Since Slurm uses JSON Web Token (JWT) authentication, a Python interface to Slurm has two elements: (1) a Slurm REST client and (2) a JWT Python library.

JWT Python Libraries

There are three Python libraries used to generate and retrieve the RS256 JWT authentication artifacts that provide access to slurmrestd:

  1. jwcrypto--contains several JWT-related modules, including JSON Web Key (JWK) support, which is used to sign JWT tokens
  2. python-jwt--encapsulates JWT RS256 token generation logic
  3. jwt--contains JWT HS256 token generation utilities

RS256 JWT Python Convenience Methods

Below are several convenience methods that leverage the jwcrypto, python-jwt, and jwt libraries to generate and manage all RS256 authentication artifacts.

Retrieving Key ID (kid)

The Key ID (kid) is retrieved by the slurm user from the jwks.json file as follows:

def getKeyId(jwksFilePath : str) -> str:
    import json
    with open(jwksFilePath, 'rb') as fh:
        jwksString = fh.read()
    jwks = json.loads(jwksString)
    return jwks['keys'][0]['kid']

Retrieving RS256 JSON Web Key (JWK)

The JSON Web Key (JWK) is retrieved by the slurm user from the private.pem file generated above:

from jwcrypto.jwk import JWK

def getJwk(pemFilePath : str) -> JWK:
    with open(pemFilePath, 'rb') as fh:
        pem = fh.read()
    return JWK.from_pem(pem)

Generating RS256-Signed User Tokens

The following code, executed as the slurm user, generates user tokens for the supplied username, session length in minutes, and signing key:

from jwcrypto.jwk import JWK

def getUserToken(userName : str, sessionInMinutes : int, signingKey : JWK, kid : str) -> str:
    import datetime
    import python_jwt as jwt
    # the JWT exp claim is expressed in seconds since the epoch;
    # slurmrestd reads the 'sun' (slurm username) and 'exp' claims
    expiration = int((datetime.datetime.now() + \
                              datetime.timedelta(minutes=sessionInMinutes)).timestamp())
    try:
        return jwt.generate_jwt(claims={'sun': userName, 'exp': expiration},
               priv_key=signingKey, algorithm='RS256', other_headers={'kid': kid})
    except Exception as e:
        raise ValueError('in generating user token {}'.format(e))
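A token produced this way consists of three base64url segments separated by dots, and its claims can be inspected with the standard library alone, which is handy when debugging expiry problems. decode_claims is an illustrative helper, not part of any Slurm tooling, and it does not verify the signature:

```python
import base64
import json

def decode_claims(token: str) -> dict:
    """Decode the (unverified) claims segment of a JWT for inspection."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
```

Running this against a token from getUserToken should show the sun and exp claims set above.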

Python slurm-rest Client

SchedMD has an official slurmrestd client, which became available on PyPI in November 2021.

slurm-rest Install

pip install slurm-rest

Instantiating a slurm-rest Client

The slurm_rest client is composed of the following object graph: Configuration, ApiClient, and SlurmApi; the resulting client is of type SlurmApi. An example convenience method is shown below. Importantly, the RS256 JWT token string passed into getSlurmClient is generated with the getUserToken method detailed above.
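A minimal sketch of such a convenience method, assuming the PyPI slurm-rest package and the header-based JWT authentication shown earlier. getSlurmClient is an illustrative name, and the exact module layout of the generated client may differ between package versions:

```python
def getSlurmClient(host: str, userName: str, jwtToken: str):
    """Wire Configuration -> ApiClient -> SlurmApi with JWT auth headers attached."""
    # imported inside the function so this sketch loads without slurm-rest installed
    from slurm_rest import ApiClient, Configuration
    from slurm_rest.api import slurm_api

    configuration = Configuration(host=host)
    client = ApiClient(configuration)
    # slurmrestd authenticates every request via these two headers
    client.set_default_header('X-SLURM-USER-NAME', userName)
    client.set_default_header('X-SLURM-USER-TOKEN', jwtToken)
    return slurm_api.SlurmApi(client)
```

The returned SlurmApi instance plays the role of api_instance/client in the request examples that follow.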

Representative slurm-rest Requests

ping:

>>> api_instance.slurmctld_ping()
{'errors': [],
 'meta': {'Slurm': {'release': '21.08.5',
                    'version': {'major': 21, 'micro': 5, 'minor': 8}},
          'plugin': {'name': 'Slurm OpenAPI v0.0.37',
                     'type': 'openapi/v0.0.37'}},
 'pings': [{'hostname': 'localhost',
            'mode': 'primary',
            'ping': 'UP',
            'status': 0}]}

diagnostics:

>>> api_instance.slurmctld_diag()
{'errors': [],
 'meta': {'Slurm': {'release': '21.08.5',
                    'version': {'major': 21, 'micro': 5, 'minor': 8}},
          'plugin': {'name': 'Slurm OpenAPI v0.0.37',
                     'type': 'openapi/v0.0.37'}},
 'statistics': {'agent_count': 0,
                'agent_queue_size': 0,
                'agent_thread_count': 0,
                'bf_active': False,
                'bf_backfilled_het_jobs': 0,
                'bf_backfilled_jobs': 1,
                'bf_cycle_counter': 1,
                'bf_cycle_last': 7338,
                'bf_cycle_max': 7338,
                'bf_cycle_mean': 7338,
                'bf_depth_mean': 1,
                'bf_depth_mean_try': 1,
                'bf_last_backfilled_jobs': 1,
                'bf_queue_len': 1,
                'bf_queue_len_mean': 1,
                'bf_when_last_cycle': 1643809050,
                'dbd_agent_queue_size': 0,
                'gettimeofday_latency': 27,
                'job_states_ts': 1643820717,
                'jobs_canceled': 0,
                'jobs_completed': 1,
                'jobs_failed': 0,
                'jobs_pending': 0,
                'jobs_running': 0,
                'jobs_started': 1,
                'jobs_submitted': 1,
                'parts_packed': 1,
                'req_time': 1643820721,
                'req_time_start': 1643797885,
                'schedule_cycle_last': 23,
                'schedule_cycle_max': 808,
                'schedule_cycle_mean': 46,
                'schedule_cycle_mean_depth': 0,
                'schedule_cycle_per_minute': 1,
                'schedule_cycle_total': 383,
                'schedule_queue_length': 0,
                'server_thread_count': 3}}

Submitting a Job

Submitting jobs via slurm_rest.slurmctld_submit_job involves two objects: JobProperties and JobSubmission:

  1. JobProperties encapsulates environment variables (such as the number of nodes) within a dict, as well as key/value pairs that define job parameters such as memory per node, CPUs per task, etc.

  2. JobSubmission is composed of a string detailing the script to be executed along with the aforementioned JobProperties object.

There are several important potential gotchas:

  1. The JobSubmission script element must start with #!/bin/bash\n or the script will fail, since the slurm job is started as a shell script.

  2. Setting JobProperties current_working_directory to a directory on which the submitting user has read-write-execute privileges is a good idea, as hard-to-trace slurm errors occur otherwise.

The example job submission below specifies the following:

  1. execute the job across three nodes with one task per node
  2. one CPU core and 500 MB of RAM for each task
  3. locations for stdout and stderr output
  4. a script that prints the output of the whoami, id, and hostname commands to /tmp/slurm_rest.out

props = JobProperties({'nodes':3,'cpus_per_task':1}, current_working_directory="/tmp", standard_output="/tmp/slurm_rest.out", 
 standard_error="/tmp/slurm-rest-test_error.out",cpus_per_task=1,tasks_per_node=1, memory_per_node=500,name='slurm-rest-test')

sub = JobSubmission(script="#!/bin/bash\nwhoami\nid\nhostname", job=props)

result = client.slurmctld_submit_job(v0037_job_submission=sub)
result
{'errors': [],
 'job_id': 720,
 'job_submit_user_msg': '',
 'meta': {'Slurm': {'release': '21.08.5',
                    'version': {'major': 21, 'micro': 5, 'minor': 8}},
          'plugin': {'name': 'Slurm OpenAPI v0.0.37',
                     'type': 'openapi/v0.0.37'}},
 'step_id': 'BATCH'}

Cancelling a Job

>>> client.slurmctld_get_jobs().to_dict()['jobs'][0]['job_id']
721

>>> client.slurmctld_cancel_job(721)

Submitting Arkouda-on-Slurm job

arkouda_props = JobProperties(
    environment={"nodes":3, "CHPL_COMM_SUBSTRATE":"udp", "GASNET_MASTERIP":"einhorn",
                 "SSH_SERVERS":"einhorn finkel shickadance", "GASNET_SPAWNFN":"S"},
    current_working_directory='/opt/arkouda/',
    standard_output='/tmp/arkouda_rest.out',
    standard_error='/tmp/arkouda_rest.error',
    cpus_per_task=1, tasks_per_node=1, memory_per_node=4096,
    name='arkouda-rest-test', tasks=3,
    comment='{"service":{"name":"arkouda","namespace":"testing", "port": 5555, "target-port":5555},"endpoint": {"name":"arkouda","addresses":"ace","ports":5555}}')
sub = JobSubmission(script='#!/bin/bash\n/opt/arkouda/arkouda_server -nl 3', job=arkouda_props)
client.slurmctld_submit_job(v0037_job_submission=sub)
{'errors': [],
 'job_id': 757,
 'job_submit_user_msg': '',
 'meta': {'Slurm': {'release': '21.08.5',
                    'version': {'major': 21, 'micro': 5, 'minor': 8}},
          'plugin': {'name': 'Slurm OpenAPI v0.0.37',
                     'type': 'openapi/v0.0.37'}},
 'step_id': 'BATCH'}