# Resetting GPUs on the compute nodes

Some sites reset GPUs periodically or between jobs because of observed degradation in job performance over time; resetting the GPUs alleviates this degradation. Starting in Torque 6.0.4 and 6.1.1, pbs_mom daemons no longer maintain constant access to the GPUs on their local compute nodes, which allows the GPUs to be reset even when the mom is configured to use NVML. Prior to these versions, it was not possible to reset the GPUs while a pbs_mom daemon was active unless the moms were configured to use nvidia-smi to gather GPU information.

This sample code resets a GPU:

```
# Reset GPU 0 on this node
nvidia-smi --gpu-reset --id=0
```

Note: the mom daemon will still access the GPU from time to time. If the mom happens to be gathering status information for the GPU (by default this is done every 45 seconds), changing the GPU's mode, or otherwise interacting with the GPU, the reset command will fail with a message stating that the GPU is being used by another process. However, as long as you are running a version with this fix (6.0.4, 6.1.1, or later), the mom releases the GPU once it is done, so retrying once or twice after a failure will usually allow the reset to succeed.
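
Because the mom may be touching the GPU at the moment the reset command runs, a simple retry loop is usually sufficient. The following sketch is one way a site might do this; the GPU ID, attempt count, and sleep interval are arbitrary illustrative values, not anything prescribed by Torque.

```
#!/bin/bash
# Attempt to reset a GPU, retrying a few times in case the mom is
# briefly accessing it (e.g. during its periodic status poll).
GPU_ID=0
MAX_ATTEMPTS=3

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if nvidia-smi --gpu-reset --id="$GPU_ID"; then
        echo "GPU $GPU_ID reset succeeded on attempt $attempt"
        exit 0
    fi
    echo "GPU $GPU_ID reset failed (attempt $attempt); retrying shortly..."
    sleep 5
done

echo "GPU $GPU_ID could not be reset after $MAX_ATTEMPTS attempts" >&2
exit 1
```

A script like this could be run from cron for periodic resets, or from a per-job cleanup hook if the site prefers to reset GPUs between jobs.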