GPU Support - dmwm/WMCore GitHub Wiki

Starting with WMCore release 1.5.3.pre4, support for GPU workflows has been added to Request Manager for the ReReco and TaskChain workflow types. Implementation at the job description level, such that the request GPU parameters can be consumed, is planned for the coming weeks and will be made available in the next stable WMAgent release (likely in the 1.5.4 series).

This wiki is meant to provide a short overview of how GPU workflows are supported in WMCore. Most of the discussion and specification happened in GitHub issue #10388, which links to an even more important discussion under the CMSSW repository, cmssw/33057.

Note that these new GPU parameters have to be defined during the workflow creation process and cannot be changed at a later stage. Further discussion is needed to understand the use cases and whether more flexibility should be provided, either during workflow assignment or workflow resubmission (ACDC).

GPU Parameter Specification

Two new optional request parameters have been created to support GPUs in central production workflows. They are:

  • RequiresGPU: this parameter defines whether the workflow requires GPU resources or not. It's expected to be a python string data type and the allowed values are:
    • forbidden: the workflow will not request GPU resources, thus cmsRun will not use any GPUs during data processing (default)
    • optional: use of GPU resources is optional. If there are any GPUs available on the worker node, the cmsRun process shall leverage them, otherwise only CPUs will be used.
    • required: GPU resources must be provisioned and the cmsRun process expects them to be available on the worker node.
  • GPUParams: a set of key/value pairs defining the GPU hardware requirements, to be used during resource provisioning and job matchmaking in the grid. It's expected to be a JSON-encoded python dictionary (with a JSON-encoded None as the default value). This parameter can have up to 6 key/value pairs, where the mandatory arguments are:
    • GPUMemoryMB: the minimum amount of GPU memory, in MB, that must be available on the worker node. It's expected to be a python integer data type, greater than 0.
    • CUDACapabilities: a list of CUDA Compute Capabilities that must be available on the worker node. It's expected to be a list of python strings, where each CUDA capability is up to 100 characters and matches the regular expression r"^\d+\.\d+(\.\d+)?$".
    • CUDARuntime: defines which CUDA Runtime version must be available on the worker node. It's expected to be a python string data type, up to 100 characters and matching the regular expression r"^\d+\.\d+(\.\d+)?$".

while these 3 extra optional arguments, also within GPUParams, are supported as well (a sketch combining both request parameters follows this list):

  • GPUName: the GPU name that must be available on the worker node. It's expected to be a python string data type, up to 100 characters.
  • CUDADriverVersion: defines which CUDA Driver version must be available on the worker node. It's expected to be a python string data type, up to 100 characters and matching the regular expression r"^\d+\.\d+(\.\d+)?$".
  • CUDARuntimeVersion: defines which CUDA Runtime version must be available on the worker node. It's expected to be a python string data type, up to 100 characters and matching the regular expression r"^\d+\.\d+(\.\d+)?$".
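
For illustration, here is a minimal, hypothetical sketch of how these two request parameters could be filled in. The surrounding request fields and the specific hardware values are assumptions for this example, not a real production request.

```python
import json

# Hypothetical fragment of a request dictionary carrying the GPU parameters
# described above; all other request fields are omitted and the hardware
# values are only examples.
gpu_params = {
    "GPUMemoryMB": 8000,                 # mandatory: integer greater than 0
    "CUDACapabilities": ["7.5", "8.0"],  # mandatory: list of strings matching r"^\d+\.\d+(\.\d+)?$"
    "CUDARuntime": "11.2",               # mandatory: string matching the same pattern
    # optional arguments
    "GPUName": "Tesla T4",
    "CUDADriverVersion": "455.32.00",
    "CUDARuntimeVersion": "11.2",
}

request_fragment = {
    "RequiresGPU": "required",            # one of: forbidden, optional, required
    "GPUParams": json.dumps(gpu_params),  # JSON-encoded python dictionary
}
```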

GPU Parameter Validation

There are two levels of parameter validation, both enforced by Request Manager; workflow creation will fail if any of these input parameter validations is not successful.

High level validation

This top-level validation is performed between RequiresGPU and GPUParams. Their relationship is such that, if RequiresGPU is set to either optional or required, then the GPUParams parameter must be provided at the request level. When RequiresGPU is set to forbidden, GPUParams should not be provided and its default value will be assigned.
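
The following is a minimal sketch of that consistency check; the function name and error messages are hypothetical, and the actual validation code in Request Manager may differ.

```python
import json

def validate_gpu_high_level(requires_gpu, gpu_params_json):
    """Hypothetical sketch of the RequiresGPU/GPUParams consistency check."""
    gpu_params = json.loads(gpu_params_json) if gpu_params_json else None
    if requires_gpu in ("optional", "required"):
        if not gpu_params:
            raise ValueError("GPUParams must be provided when RequiresGPU is optional or required")
    elif requires_gpu == "forbidden":
        if gpu_params:
            raise ValueError("GPUParams must not be provided when RequiresGPU is forbidden")
    else:
        raise ValueError("RequiresGPU must be one of: forbidden, optional, required")
```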

GPU requirements validation

The GPU requirements are provided through the GPUParams request parameter. They go through data type and regular expression checks, where each key/value pair is validated according to its specification in the section above. The validation enforces that the 3 mandatory arguments are present within GPUParams, that they have the correct data type, and that their content matches the regular expression; the optional key/value pairs are validated as well, in case they are provided. If any of these checks fail, workflow creation fails in Request Manager as well.
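
As an illustration, the sketch below shows how such per-key checks could look; the function name and error messages are hypothetical, and the real checks are implemented inside Request Manager.

```python
import json
import re

# version-like pattern used for the CUDA related keys
CUDA_VERSION_RE = re.compile(r"^\d+\.\d+(\.\d+)?$")

def validate_gpu_params(gpu_params_json):
    """Hypothetical sketch of the per-key GPUParams validation."""
    params = json.loads(gpu_params_json)
    # mandatory arguments
    if not isinstance(params.get("GPUMemoryMB"), int) or params["GPUMemoryMB"] <= 0:
        raise ValueError("GPUMemoryMB must be an integer greater than 0")
    caps = params.get("CUDACapabilities")
    if not isinstance(caps, list) or not all(
            isinstance(c, str) and len(c) <= 100 and CUDA_VERSION_RE.match(c) for c in caps):
        raise ValueError("CUDACapabilities must be a list of version-like strings")
    runtime = params.get("CUDARuntime")
    if not isinstance(runtime, str) or len(runtime) > 100 or not CUDA_VERSION_RE.match(runtime):
        raise ValueError("CUDARuntime must be a version-like string")
    # optional arguments, validated only when present
    for key in ("GPUName", "CUDADriverVersion", "CUDARuntimeVersion"):
        if key in params:
            value = params[key]
            if not isinstance(value, str) or len(value) > 100:
                raise ValueError("%s must be a string of up to 100 characters" % key)
            if key != "GPUName" and not CUDA_VERSION_RE.match(value):
                raise ValueError("%s must match the CUDA version pattern" % key)
```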

GPU Support in ReReco Workflows

ReReco workflows are specified as a flat python dictionary, so these two new parameters can be provided at the top level of the request specification, and the GPU settings will be applied to the data processing task (and to skims, if defined in the workflow).
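
A hypothetical ReReco request fragment could look like the sketch below; all other mandatory ReReco parameters are omitted and the values are only examples.

```python
import json

# Hypothetical ReReco request fragment: the GPU parameters are plain
# top-level keys in the flat request dictionary.
rereco_request = {
    "RequestType": "ReReco",
    "RequiresGPU": "required",
    "GPUParams": json.dumps({
        "GPUMemoryMB": 8000,
        "CUDACapabilities": ["7.5"],
        "CUDARuntime": "11.2",
    }),
    # ... all other ReReco parameters omitted
}
```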

Any of the other WMCore-specific tasks, such as Merge, LogCollect, Cleanup and Harvesting, shall remain with the default GPU parameter values (thus keeping the default forbidden value). For informational purposes, this support has been implemented through PR #10799.

GPU Support in TaskChain Workflows

TaskChain workflows are specified as a nested python dictionary, such that specific tasks can have their own settings. With that said, these two new GPU parameters can now be defined at both the top level and the task level (or actually only at the task level, to avoid confusion!). If GPU parameters are not defined for a given task, it will take the default values, thus RequiresGPU=forbidden.
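
A hypothetical TaskChain fragment with task-level GPU settings could look like the sketch below; the task names, values and the omitted mandatory fields are assumptions for illustration only.

```python
import json

# Hypothetical TaskChain fragment: GPU parameters defined per task, so only
# Task1 requests GPUs while Task2 keeps the defaults (RequiresGPU=forbidden).
taskchain_request = {
    "RequestType": "TaskChain",
    "TaskChain": 2,
    "Task1": {
        "TaskName": "GPUProcessing",
        "RequiresGPU": "required",
        "GPUParams": json.dumps({
            "GPUMemoryMB": 8000,
            "CUDACapabilities": ["7.5"],
            "CUDARuntime": "11.2",
        }),
    },
    "Task2": {
        "TaskName": "CPUProcessing",
        # no GPU parameters: defaults apply, i.e. RequiresGPU=forbidden
    },
    # ... all other TaskChain parameters omitted
}
```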

Any of the other WMCore-specific tasks, such as Merge, LogCollect, Cleanup and Harvesting, shall remain with the default GPU parameter values (thus keeping the default forbidden value). For informational purposes, this support has been implemented through PR #10805.

GPU Support in StepChain Workflows

Requirements still need to be clarified with the stakeholders.

GPU Job Description in WMCore x GlideinWMS

This section describes how the workflow-level GPU configuration gets reflected in the job classad description. Note that, as agreed with the Submission Infrastructure team, we decided not to support the optional value at this moment; if a workflow is marked as RequiresGPU=optional, WMAgent will map that to not required.

We have implemented 5 new job parameters (a sketch of the resulting attributes follows this list); they are:

  • RequestGPUs: corresponds to the number of GPUs requested by the job. For this initial phase, we are hard-coding it to 1 whenever GPUs are required, otherwise it has the value 0.
  • RequiresGPU: a string with the value 1 in case GPUs are required by that specific job. Otherwise it has the value 0 and GPUs won't be required.
  • GPUMemoryMB: a string with the GPU memory (in MB) required by that job. It defaults to undefined.
  • CUDACapability: a comma-separated string with the CUDA capabilities required by that job. It defaults to undefined.
  • CUDARuntime: a string with the CUDA runtime required by that job. It defaults to undefined.
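
As an illustration, the sketch below shows how the workflow-level GPU settings could translate into these five job attributes; the function and the mapping details are hypothetical, while the actual translation happens inside WMAgent when the job description is created.

```python
import json

def gpu_classad_attrs(requires_gpu, gpu_params_json):
    """Hypothetical sketch of mapping workflow GPU settings to the five
    job-level attributes described above."""
    required = (requires_gpu == "required")  # 'optional' is mapped to not required
    attrs = {
        "RequestGPUs": 1 if required else 0,
        "RequiresGPU": "1" if required else "0",
        "GPUMemoryMB": "undefined",
        "CUDACapability": "undefined",
        "CUDARuntime": "undefined",
    }
    if required and gpu_params_json:
        params = json.loads(gpu_params_json)
        attrs["GPUMemoryMB"] = str(params["GPUMemoryMB"])
        attrs["CUDACapability"] = ",".join(params["CUDACapabilities"])
        attrs["CUDARuntime"] = params["CUDARuntime"]
    return attrs
```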

Once a new stable WMAgent branch is released (likely in the 1.5.4 series), these 5 new classads will be present in every single central production job.