# slurmrestd
The Slurm build that enables slurmrestd is detailed starting here
Creating a slurm user involves associating the user with an account, which is accomplished via the following command:
# Execute as the slurm user
sacctmgr create user name=<username> account=<account name> admin=<admin level>
An example is as follows:
sacctmgr create user name=kjyost account=root admin=Admin
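The resulting user record and its account association can be verified with sacctmgr (shown here for the example user above):

```bash
# List the user record along with its account association
sacctmgr show user name=kjyost withassoc
```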
As of Slurm 21.08.5, slurmrestd must be run as a user other than slurm or root. Accordingly, create the slurmrestd user:
useradd -m slurmrestd
Since slurmrestd uses JSON Web Tokens (JWT) for authentication, a JWT key must be generated and (1) configured in slurmctld to enable generation of JWT tokens for slurm users and (2) configured in slurmctld and slurmdbd to validate user tokens submitted with each REST request. The JWT key is generated and permissioned on the slurmctld host as follows:
dd if=/dev/random of=/etc/slurm/jwt_hs256.key bs=32 count=1
chown slurm:slurm /etc/slurm/jwt_hs256.key
chmod 600 /etc/slurm/jwt_hs256.key
The following configuration entries are required in the slurm.conf and slurmdbd.conf files to enable HS256 JWT authentication in slurmctld and slurmdbd, respectively:
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key
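With auth/jwt enabled and slurmctld restarted, HS256 tokens can be generated for a user directly via scontrol (the lifespan is in seconds):

```bash
# Run on the slurmctld host as root or the SlurmUser
scontrol token username=kjyost lifespan=3600
```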
To enable RS256 JWT authentication instead, the following entries are required in both the slurm.conf and slurmdbd.conf files:
AuthAltTypes=auth/jwt
AuthAltParameters=jwks=/etc/slurm/jwks.json
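For RS256, Slurm only validates tokens against the public keys in jwks.json; the key pair itself is generated externally. A minimal sketch using the jwcrypto library (the file names and kid value are assumptions; the same private.pem and jwks.json files are used by the Python token utilities later on this page):

```python
# Hedged sketch: generate an RSA key pair and write the private signing key
# (private.pem) plus the public JWKS document (jwks.json) that slurmctld
# and slurmdbd validate user tokens against
import json
from jwcrypto.jwk import JWK

key = JWK.generate(kty='RSA', size=2048, kid='slurmrestd', alg='RS256', use='sig')
with open('private.pem', 'wb') as fh:
    fh.write(key.export_to_pem(private_key=True, password=None))
with open('jwks.json', 'w') as fh:
    json.dump({'keys': [json.loads(key.export_public())]}, fh, indent=2)
```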
As stated above, as of Slurm 21.08.5 slurmrestd must be executed by a user other than slurm or root. Either start slurmrestd as a service (service file example here; a minimal sketch also appears below), or run the following command sequence from the slurm-cloud-integration README as the slurmrestd user:
export SLURMRESTD_HOST=0.0.0.0
export SLURMRESTD_PORT=6820
export SLURM_JWT=daemon
slurmrestd -vvvv -a rest_auth/jwt $SLURMRESTD_HOST:$SLURMRESTD_PORT
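For the service option, a minimal sketch of a slurmrestd systemd unit might look like the following (the binary path, unit ordering, and bind address are assumptions, not the linked example):

```ini
[Unit]
Description=Slurm REST daemon
After=network.target slurmctld.service

[Service]
# As noted above, slurmrestd must not run as the slurm or root user
User=slurmrestd
Environment=SLURM_JWT=daemon
ExecStart=/usr/local/sbin/slurmrestd -a rest_auth/jwt 0.0.0.0:6820
Restart=on-failure

[Install]
WantedBy=multi-user.target
```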
For reference, slurmrestd support requires Slurm to be built against the HTTP parser, YAML, and JWT libraries; the configure command from the build referenced above is:
./configure --prefix=/storage-restd/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/
The user attempting to access the slurmrestd API must have a valid JWT token and must be a slurm user.
export SLURMRESTD_HOST=0.0.0.0
export SLURMRESTD_PORT=6820
export SLURM_JWT={JWT token provided to user}
username$ curl -H "X-SLURM-USER-NAME:username" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" http://${SLURMRESTD_HOST}:${SLURMRESTD_PORT}/<api call>
curl -H "X-SLURM-USER-NAME:slurm" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" http://${SLURMRESTD_HOST}:${SLURMRESTD_PORT}/openapi/v3
slurm@einhorn:/home/kjyost$ curl -H "X-SLURM-USER-NAME:slurm" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" http://einhorn:6820/slurm/v0.0.37/nodes
{
"meta": {
"plugin": {
"type": "openapi\/v0.0.37",
"name": "Slurm OpenAPI v0.0.37"
},
"Slurm": {
"version": {
"major": 21,
"micro": 5,
"minor": 8
},
"release": "21.08.5"
}
},
"errors": [
],
"nodes": [
{
"architecture": "x86_64",
"burstbuffer_network_address": "",
"boards": 1,
"boot_time": 1642155129,
"comment": "",
"cores": 4,
"cpu_binding": 0,
"cpu_load": 345,
"extra": "",
"free_memory": 13243,
"cpus": 8,
"last_busy": 1642417782,
"features": "",
"active_features": "",
"gres": "",
"gres_drained": "N\/A",
"gres_used": "",
"mcs_label": "",
"name": "ace",
"next_state_after_reboot": "invalid",
"address": "ace",
"hostname": "ace",
"state": "idle",
"state_flags": [
"NOT_RESPONDING"
],
"next_state_after_reboot_flags": [
],
"operating_system": "Linux 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021",
"owner": null,
"partitions": [
"debug"
],
"port": 6818,
"real_memory": 8192,
"reason": "",
"reason_changed_at": 0,
"reason_set_by_user": "root",
"slurmd_start_time": 1642417782,
"sockets": 1,
"threads": 2,
"temporary_disk": 0,
"weight": 1,
"tres": "cpu=8,mem=8G,billing=8",
"slurmd_version": "21.08.5",
"alloc_memory": 0,
"alloc_cpus": 0,
"idle_cpus": 8,
"tres_used": null,
"tres_weighted": 0
},
{
"architecture": "x86_64",
"burstbuffer_network_address": "",
"boards": 1,
"boot_time": 1642155158,
"comment": "",
"cores": 2,
"cpu_binding": 0,
"cpu_load": 125,
"extra": "",
"free_memory": 3861,
"cpus": 4,
"last_busy": 1642419861,
"features": "",
"active_features": "",
"gres": "",
"gres_drained": "N\/A",
"gres_used": "",
"mcs_label": "",
"name": "einhorn",
"next_state_after_reboot": "invalid",
"address": "einhorn",
"hostname": "einhorn",
"state": "idle",
"state_flags": [
],
"next_state_after_reboot_flags": [
],
"operating_system": "Linux 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021",
"owner": null,
"partitions": [
"debug"
],
"port": 6818,
"real_memory": 8192,
"reason": "",
"reason_changed_at": 0,
"reason_set_by_user": null,
"slurmd_start_time": 1642168816,
"sockets": 1,
"threads": 2,
"temporary_disk": 0,
"weight": 1,
"tres": "cpu=4,mem=8G,billing=4",
"slurmd_version": "21.08.5",
"alloc_memory": 0,
"alloc_cpus": 0,
"idle_cpus": 4,
"tres_used": null,
"tres_weighted": 0
},
{
"architecture": "x86_64",
"burstbuffer_network_address": "",
"boards": 1,
"boot_time": 1642155165,
"comment": "",
"cores": 2,
"cpu_binding": 0,
"cpu_load": 165,
"extra": "",
"free_memory": 2562,
"cpus": 4,
"last_busy": 1642417600,
"features": "",
"active_features": "",
"gres": "",
"gres_drained": "N\/A",
"gres_used": "",
"mcs_label": "",
"name": "finkel",
"next_state_after_reboot": "invalid",
"address": "finkel",
"hostname": "finkel",
"state": "idle",
"state_flags": [
],
"next_state_after_reboot_flags": [
],
"operating_system": "Linux 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021",
"owner": null,
"partitions": [
"debug"
],
"port": 6818,
"real_memory": 8192,
"reason": "",
"reason_changed_at": 0,
"reason_set_by_user": null,
"slurmd_start_time": 1642338798,
"sockets": 1,
"threads": 2,
"temporary_disk": 0,
"weight": 1,
"tres": "cpu=4,mem=8G,billing=4",
"slurmd_version": "21.08.5",
"alloc_memory": 0,
"alloc_cpus": 0,
"idle_cpus": 4,
"tres_used": null,
"tres_weighted": 0
}
]
}
submit: submits a job via a JSON payload:
curl -H "X-SLURM-USER-NAME:${SLURMRESTD_USER}" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" \
-H "Content-Type: application/json" --request POST --data @simple-slurm-json.slurm \
http://localhost:6820/slurm/v0.0.37/job/submit
The JSON payload is as follows:
{
"job": {
"name": "slurm-rest-test",
"ntasks":3,
"nodes": 3,
"memory_per_cpu": 500,
"current_working_directory": "/tmp",
"standard_input": "/dev/null",
"standard_output": "/tmp/slurm-rest-test.out",
"standard_error": "/tmp/slurm-rest-test_error.out",
"environment": {
"PATH": "/bin:/usr/bin/:/usr/local/bin/",
"LD_LIBRARY_PATH": "/lib/:/lib64/:/usr/local/lib"}
},
"script": "#!/bin/bash\necho 'I am from the REST API'"
}
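After a successful submission, the new job (and its job_id) can be inspected via the jobs endpoint:

```bash
curl -H "X-SLURM-USER-NAME:username" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" \
     http://${SLURMRESTD_HOST}:${SLURMRESTD_PORT}/slurm/v0.0.37/jobs
```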
Since Slurm uses JSON Web Token (JWT) authentication, there are two elements to a Python interface to Slurm: (1) a Slurm REST client and (2) a JWT Python library.
There are three Python libraries used to generate and retrieve the RS256 JWT authentication artifacts that provide access to slurmrestd:
- jwcrypto: contains several JWT-related modules, including support for JSON Web Keys (JWK), which are used to sign JWT tokens
- python-jwt: encapsulates JWT RS256 token generation logic
- jwt: contains JWT HS256 token generation utilities
Below are several convenience methods that leverage the jwcrypto, python-jwt, and jwt libraries to generate and manage all RS256 authentication artifacts.
The Key ID (kid) is retrieved by the slurm user from the jwks.json file as follows:
import json

def getKeyId(jwksFilePath: str) -> str:
    # Return the key id (kid) of the first key in the JWKS document; user
    # tokens must carry a kid header that matches an entry in this file
    with open(jwksFilePath, 'rb') as fh:
        jwks = json.load(fh)
    return jwks['keys'][0]['kid']
The JSON Web Key (JWK) is retrieved by the slurm user from the private.pem file generated above:
from jwcrypto.jwk import JWK

def getJwk(pemFilePath: str) -> JWK:
    # Load the RSA private key (private.pem) used to sign user tokens
    with open(pemFilePath, 'rb') as fh:
        pem = fh.read()
    return JWK.from_pem(pem)
The following code, executed as the slurm user, generates user tokens for the supplied username, session length in minutes, and signing key:
import datetime

import python_jwt as jwt
from jwcrypto.jwk import JWK

def getUserToken(userName: str, sessionInMinutes: int, signingKey: JWK, kid: str) -> str:
    # Slurm reads the username from the 'sun' (slurm user name) claim; the
    # exp claim is seconds since the epoch (the JWT NumericDate format)
    expiration = int((datetime.datetime.now() +
                      datetime.timedelta(minutes=sessionInMinutes)).timestamp())
    try:
        return jwt.generate_jwt(claims={'sun': userName, 'exp': expiration},
                                priv_key=signingKey, algorithm='RS256',
                                other_headers={'kid': kid})
    except Exception as e:
        raise ValueError('error generating user token: {}'.format(e))
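Putting the three methods together (the file locations are assumptions matching the RS256 configuration above):

```python
# Sign a 60-minute RS256 token for the example user created earlier
signingKey = getJwk('/etc/slurm/private.pem')
kid = getKeyId('/etc/slurm/jwks.json')
token = getUserToken('kjyost', 60, signingKey, kid)
```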
SchedMD has an official slurmrestd client, which became available on PyPI in November 2021:
pip install slurm-rest
## Instantiating a slurm_rest Client

The slurm_rest client is composed of the following object graph: Configuration, ApiClient, and SlurmApi; the client itself is of type SlurmApi. An example convenience method is sketched below. Importantly, the RS256 JWT token string passed into getSlurmClient is generated with the getUserToken method detailed above.
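A minimal sketch, assuming the import paths and header-based authentication of the OpenAPI-generated slurm_rest package:

```python
from slurm_rest import Configuration, ApiClient
from slurm_rest.api.slurm_api import SlurmApi

def getSlurmClient(host: str, userName: str, token: str) -> SlurmApi:
    # host is the slurmrestd endpoint, e.g. http://einhorn:6820
    config = Configuration(host=host)
    client = ApiClient(config)
    # slurmrestd authenticates every request via these two headers
    client.set_default_header('X-SLURM-USER-NAME', userName)
    client.set_default_header('X-SLURM-USER-TOKEN', token)
    return SlurmApi(client)
```

The api_instance used in the examples below can then be created via, for example, api_instance = getSlurmClient('http://einhorn:6820', 'slurm', token).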
ping:
>>> api_instance.slurmctld_ping()
{'errors': [],
'meta': {'Slurm': {'release': '21.08.5',
'version': {'major': 21, 'micro': 5, 'minor': 8}},
'plugin': {'name': 'Slurm OpenAPI v0.0.37',
'type': 'openapi/v0.0.37'}},
'pings': [{'hostname': 'localhost',
'mode': 'primary',
'ping': 'UP',
'status': 0}]}
diagnostics:
>>> api_instance.slurmctld_diag()
{'errors': [],
'meta': {'Slurm': {'release': '21.08.5',
'version': {'major': 21, 'micro': 5, 'minor': 8}},
'plugin': {'name': 'Slurm OpenAPI v0.0.37',
'type': 'openapi/v0.0.37'}},
'statistics': {'agent_count': 0,
'agent_queue_size': 0,
'agent_thread_count': 0,
'bf_active': False,
'bf_backfilled_het_jobs': 0,
'bf_backfilled_jobs': 1,
'bf_cycle_counter': 1,
'bf_cycle_last': 7338,
'bf_cycle_max': 7338,
'bf_cycle_mean': 7338,
'bf_depth_mean': 1,
'bf_depth_mean_try': 1,
'bf_last_backfilled_jobs': 1,
'bf_queue_len': 1,
'bf_queue_len_mean': 1,
'bf_when_last_cycle': 1643809050,
'dbd_agent_queue_size': 0,
'gettimeofday_latency': 27,
'job_states_ts': 1643820717,
'jobs_canceled': 0,
'jobs_completed': 1,
'jobs_failed': 0,
'jobs_pending': 0,
'jobs_running': 0,
'jobs_started': 1,
'jobs_submitted': 1,
'parts_packed': 1,
'req_time': 1643820721,
'req_time_start': 1643797885,
'schedule_cycle_last': 23,
'schedule_cycle_max': 808,
'schedule_cycle_mean': 46,
'schedule_cycle_mean_depth': 0,
'schedule_cycle_per_minute': 1,
'schedule_cycle_total': 383,
'schedule_queue_length': 0,
'server_thread_count': 3}}
## Submitting a Job
Submitting jobs via slurm_rest.slurmctld_submit_job involves two objects, JobProperties and JobSubmission:

- JobProperties encapsulates environment variables within a dict, as well as key/value pairs that define job parameters such as number of nodes, memory per node, CPUs per task, etc.
- JobSubmission is composed of a string detailing the script to be executed along with the aforementioned JobProperties object.

There are two important potential gotchas:

- The JobSubmission script element must start with #!/bin/bash\n or the script will fail, since the slurm job is started as a shell script.
- Setting JobProperties current_working_directory to a directory where the submitting user has read-write-execute privileges is a good idea, as hard-to-trace slurm errors occur otherwise.
The example job submission below specifies the following:

- execute the job across three nodes with one task per node
- one CPU core and 500 MB of RAM for each task
- locations for stdout and stderr output
- a script that writes the output of the whoami, id, and hostname commands to /tmp/slurm_rest.out
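The snippet assumes the v0.0.37 model classes are imported under aliased names (a guess at the generated package layout, implied by the v0037_job_submission keyword used in the submit call):

```python
from slurm_rest.models import V0037JobProperties as JobProperties
from slurm_rest.models import V0037JobSubmission as JobSubmission
```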
props = JobProperties(environment={'nodes': 3, 'cpus_per_task': 1},
                      current_working_directory="/tmp",
                      standard_output="/tmp/slurm_rest.out",
                      standard_error="/tmp/slurm-rest-test_error.out",
                      cpus_per_task=1, tasks_per_node=1,
                      memory_per_node=500, name='slurm-rest-test')
sub = JobSubmission(script="#!/bin/bash\nwhoami\nid\nhostname", job=props)
result = client.slurmctld_submit_job(v0037_job_submission=sub)
result
{'errors': [],
'job_id': 720,
'job_submit_user_msg': '',
'meta': {'Slurm': {'release': '21.08.5',
'version': {'major': 21, 'micro': 5, 'minor': 8}},
'plugin': {'name': 'Slurm OpenAPI v0.0.37',
'type': 'openapi/v0.0.37'}},
'step_id': 'BATCH'}
## Cancelling a Job
>>> client.slurmctld_get_jobs().to_dict()['jobs'][0]['job_id']
721
>>> client.slurmctld_cancel_job(721)
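The client call issues an HTTP DELETE against the job endpoint, so the equivalent curl request is:

```bash
curl -X DELETE -H "X-SLURM-USER-NAME:${SLURMRESTD_USER}" -H "X-SLURM-USER-TOKEN:${SLURM_JWT}" \
     http://localhost:6820/slurm/v0.0.37/job/721
```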
## Submitting an Arkouda-on-Slurm Job
arkouda_props = JobProperties(environment={"nodes": 3,
                                           "CHPL_COMM_SUBSTRATE": "udp",
                                           "GASNET_MASTERIP": "einhorn",
                                           "SSH_SERVERS": "einhorn finkel shickadance",
                                           "GASNET_SPAWNFN": "S"},
                              current_working_directory='/opt/arkouda/',
                              standard_output='/tmp/arkouda_rest.out',
                              standard_error='/tmp/arkouda_rest.error',
                              cpus_per_task=1, tasks_per_node=1,
                              memory_per_node=4096, name='arkouda-rest-test',
                              tasks=3,
                              comment='{"service":{"name":"arkouda","namespace":"testing", "port": 5555, "target-port":5555},"endpoint": {"name":"arkouda","addresses":"ace","ports":5555}}')
sub = JobSubmission(script='#!/bin/bash\n/opt/arkouda/arkouda_server -nl 3', job=arkouda_props)
client.slurmctld_submit_job(v0037_job_submission=sub)
{'errors': [],
'job_id': 757,
'job_submit_user_msg': '',
'meta': {'Slurm': {'release': '21.08.5',
'version': {'major': 21, 'micro': 5, 'minor': 8}},
'plugin': {'name': 'Slurm OpenAPI v0.0.37',
'type': 'openapi/v0.0.37'}},
'step_id': 'BATCH'}