
Troubleshooting

Failed to build cluster

The following resource(s) failed to create: [MasterServerWaitCondition, ComputeFleet].

This has been seen with three main causes:

  1. The AMI is not compatible with ParallelCluster; see [this GitHub issue][ami_parallel_cluster_issue].
  2. The pre_install script has failed to run successfully.
  3. The post_install script has failed to run successfully.

If you used the `--no-rollback` flag in your start_cluster.py script, you should be able to log into the master node via SSM. From there, check `/var/log/cfn-init-cmd.log` to see where the start-up failed.
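A minimal sketch of that check, assuming you have SSM access to the master node (the instance ID below is a placeholder):

```bash
# Open a shell on the master node via SSM (replace with your instance ID)
aws ssm start-session --target i-0123456789abcdef0

# Once on the node, inspect the cfn-init log for the step that failed
sudo less /var/log/cfn-init-cmd.log
```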

It's taking a long time for my job to start

Head to the EC2 console and check whether a new compute node is running. Ensure you can see the compute node's console output (select the instance and view its system log); if you can't, the compute node probably hasn't finished launching yet, so give it another few minutes.
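If you prefer the CLI, a rough equivalent looks like this (the instance ID is a placeholder):

```bash
# List running instances and their launch times
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId,LaunchTime,State.Name]" \
  --output table

# Fetch the console output for a compute node (replace with its instance ID)
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text
```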

If, however, you can see the logs and everything seems okay, it may be worth doing the following (see the sketch after this list):

  1. Run the `sacct` command to see the status of your job.
  2. Check that the compute node (or the node you've submitted to) is not in drain mode: `scontrol show partition=compute`.
  3. If you have used `srun --pty bash` to log in to the node, use `sinteractive` instead due to a known bug.
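Taken together, a quick check sequence might look like this (`sinteractive` is assumed to be available on the cluster, e.g. set up by the post_install script):

```bash
# Show the status of your recent jobs
sacct

# Look for nodes stuck in "drain" or "down" states
sinfo
scontrol show partition=compute

# Start an interactive session on a compute node (preferred over `srun --pty bash`)
sinteractive
```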

Cannot log into AWS SSO whilst in pcluster env

Run `which aws` to see where `aws` sits in your PATH. It may be installed in your pcluster conda environment; if so, feel free to delete it, as it is not required there.
If you are still having issues, run `aws --version` to determine which version of the AWS CLI you're using.
You will need version 2.0.0 or higher to log into AWS via SSO.
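A minimal sketch of that check (the conda environment name `pcluster` and its path are assumptions based on this setup):

```bash
# Which aws binary wins on the PATH?
which aws
aws --version    # SSO login requires AWS CLI v2 (>= 2.0.0)

# If a v1 `aws` inside the pcluster conda env is shadowing v2, remove it
rm "$(conda info --base)/envs/pcluster/bin/aws"
hash -r          # clear bash's command cache so the correct aws is picked up
```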

If you have a reason to keep AWS CLI v1 as your default `aws`, you may wish to symlink your AWS CLI v2 installation to `/usr/local/bin/aws2` and check out my SSO shortcut for logging in.
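A sketch of that symlink, assuming AWS CLI v2 was installed to its default location (adjust the source path if yours differs; the profile name is a placeholder):

```bash
# Expose AWS CLI v2 as `aws2` while leaving v1 as the default `aws`
sudo ln -s /usr/local/aws-cli/v2/current/bin/aws /usr/local/bin/aws2
aws2 --version

# SSO login then goes through the v2 binary
aws2 sso login --profile my-sso-profile
```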