Possible performance degradation on ALinux2 when using ParallelCluster 2.11.0 and custom AMIs from 2.6.0 to 2.11.0
The issue
The performance of tightly coupled / MPI workloads on clusters with Amazon Linux 2 operating system may be impacted by enabling CloudWatch logging.
Our preliminary analysis has found that this is likely related to CloudWatch Agent version 1.247348.0b251302. You can check which version you have installed by running the command: yum list amazon-cloudwatch-agent
This performance issue may affect workloads differently depending on cluster size and applications used.
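As a minimal sketch, the version check can also be scripted, for example to run it on several nodes. The function names below are hypothetical, and the sketch assumes that the RPM version field of the affected package is exactly 1.247348.0b251302, which the text above does not confirm.

```shell
#!/bin/bash
# Version that the preliminary analysis associates with the slowdown.
AFFECTED_VERSION="1.247348.0b251302"

# Return 0 if the given version string is the affected one (hypothetical helper).
is_affected_version() {
    [ "$1" = "${AFFECTED_VERSION}" ]
}

# Print the installed agent version, or nothing if rpm or the package is missing.
installed_agent_version() {
    if command -v rpm >/dev/null 2>&1 && rpm -q amazon-cloudwatch-agent >/dev/null 2>&1; then
        rpm -q --queryformat '%{VERSION}' amazon-cloudwatch-agent
    fi
}

current="$(installed_agent_version)"
if is_affected_version "${current}"; then
    echo "amazon-cloudwatch-agent ${current} is the affected version"
else
    echo "amazon-cloudwatch-agent version '${current}' is not the affected version (or is not installed)"
fi
```

If the script reports the affected version, one of the workarounds below may apply.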
The workaround
There are several ways to work around the issue.
Option 1: Downgrade CloudWatch agent with a post-install script
This option can be applied to new clusters, or to existing clusters through an update operation. Follow these steps:
- Create a bash script, e.g. disable-cw-script.sh, with the following content (or add the code to your existing post-install script):
#!/bin/bash
# Load the ParallelCluster node settings, including cfn_node_type
. "/etc/parallelcluster/cfnconfig"
case "${cfn_node_type}" in
    ComputeFleet)
        # Downgrade the CloudWatch agent on compute nodes only
        sudo systemctl stop amazon-cloudwatch-agent.service
        sudo yum -y downgrade amazon-cloudwatch-agent-1.247347.4-1.amzn2
        sudo systemctl start amazon-cloudwatch-agent.service
        ;;
    *)
        ;;
esac
- Upload the script to an S3 bucket with the correct permissions (see https://docs.aws.amazon.com/parallelcluster/latest/ug/pre_post_install.html), e.g.:
aws s3 cp disable-cw-script.sh s3://yourbucket/
- Add the following setting to your cluster configuration:
[cluster yourcluster]
post_install = s3://yourbucket/disable-cw-script.sh
...
- Either create a new cluster, or update an existing cluster as described below.
To update an existing cluster so that it uses the post-install script configured in the previous steps:
- Stop the cluster with the pcluster stop command
- Update the cluster with the pcluster update command
- Restart the cluster with the pcluster start command
All compute nodes will then start with a version of the CloudWatch agent that is not affected by the issue.
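The stop, update, and start sequence for an existing cluster can be sketched as a small wrapper. This is only an illustration: the helper names, the CLUSTER_NAME and DRY_RUN variables, and the dry-run default are assumptions added here, and the sketch does not wait for the compute fleet to finish stopping, which you may need to do before running the update.

```shell
#!/bin/bash
# Hypothetical settings: cluster name and a dry-run switch (1 = only print commands).
CLUSTER_NAME="${CLUSTER_NAME:-yourcluster}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop, update, and restart the cluster; stop at the first failing step.
update_cluster() {
    run pcluster stop "$1" &&
    run pcluster update "$1" &&
    run pcluster start "$1"
}

update_cluster "${CLUSTER_NAME}"
```

With the default DRY_RUN=1 the script only prints the three pcluster commands. Setting DRY_RUN=0 executes them against the named cluster.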
Option 2: Create the cluster with CloudWatch logging disabled
This option applies only to new clusters.
Create a cluster with the following configuration:
[cluster yourcluster]
cw_log_settings = custom-cw
...
[cw_log custom-cw]
enable = false
With this configuration, CloudWatch logging and the CloudWatch Agent service are disabled, which avoids the possible performance degradation.
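Before creating the cluster, it can help to confirm that the configuration file really contains the disabling setting. The following is a rough sketch, not part of ParallelCluster: cw_log_disabled is a hypothetical helper, the default path ~/.parallelcluster/config is an assumption about where your configuration lives, and the parsing only handles the simple layout shown above.

```shell
#!/bin/bash
# Return 0 if section [cw_log <name>] in the given file sets "enable = false".
cw_log_disabled() {
    local config_file="$1" section_name="$2"
    awk -v target="[cw_log ${section_name}]" '
        /^\[/ { in_section = ($0 == target) }       # track the current section header
        in_section && /^[[:space:]]*enable[[:space:]]*=/ {
            value = $0
            sub(/^[^=]*=[[:space:]]*/, "", value)   # keep the text after "="
            sub(/[[:space:]]+$/, "", value)         # trim trailing whitespace
            if (tolower(value) == "false") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${config_file}"
}

# Assumed location of the cluster configuration; adjust to your setup.
CONFIG_FILE="${CONFIG_FILE:-$HOME/.parallelcluster/config}"
if [ -f "${CONFIG_FILE}" ]; then
    if cw_log_disabled "${CONFIG_FILE}" "custom-cw"; then
        echo "CloudWatch logging is disabled in [cw_log custom-cw]"
    else
        echo "CloudWatch logging is NOT disabled in [cw_log custom-cw]"
    fi
fi
```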
Option 3: Create a custom AMI with a downgraded CloudWatch Agent
This option applies only to new clusters.
- Follow the official documentation to modify an existing ParallelCluster AMI
- As part of the AMI customization step, connect to the instance and run the following command:
sudo yum -y downgrade amazon-cloudwatch-agent-1.247347.4-1.amzn2
- Complete the steps to create a custom AMI
- Create a cluster using the generated AMI with the custom_ami parameter.
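Before completing the AMI creation, you may want to confirm on the instance that the downgrade took effect. The sketch below assumes that the package spec used in the yum command matches the installed package's name-version-release string; is_pinned and installed_nvr are hypothetical helper names.

```shell
#!/bin/bash
# Package spec used in the downgrade command above.
TARGET_NVR="amazon-cloudwatch-agent-1.247347.4-1.amzn2"

# Return 0 if the given name-version-release string is the downgraded package.
is_pinned() {
    [ "$1" = "${TARGET_NVR}" ]
}

# Print name-version-release of the installed agent, or nothing if unavailable.
installed_nvr() {
    if command -v rpm >/dev/null 2>&1 && rpm -q amazon-cloudwatch-agent >/dev/null 2>&1; then
        rpm -q --queryformat '%{NAME}-%{VERSION}-%{RELEASE}' amazon-cloudwatch-agent
    fi
}

nvr="$(installed_nvr)"
if is_pinned "${nvr}"; then
    echo "OK: ${nvr} is installed"
else
    echo "WARNING: expected ${TARGET_NVR}, found '${nvr}'"
fi
```

A build script could call is_pinned and stop before the AMI is created if it returns a non-zero status.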
Option 4: Downgrade CloudWatch agent within individual jobs (Slurm only)
This option can be applied to existing Slurm clusters.
Customize your job submission script by adding the steps that downgrade the CloudWatch agent on each allocated node. Example:
#!/bin/bash
#SBATCH --job-name=yourjob
# add your options

# downgrade the CloudWatch agent on every node allocated to the job
for i in $(scontrol show hostnames "$SLURM_JOB_NODELIST")
do
    ssh "$i" "sudo systemctl stop amazon-cloudwatch-agent.service"
    ssh "$i" "sudo yum -y downgrade amazon-cloudwatch-agent-1.247347.4-1.amzn2"
    ssh "$i" "sudo systemctl start amazon-cloudwatch-agent.service"
done

# start your application (sleep is a placeholder)
sleep 100