TroubleShooting - umccr/aws_parallel_cluster GitHub Wiki
Troubleshooting
Failed to build cluster
The following resource(s) failed to create: [MasterServerWaitCondition, ComputeFleet].
This has been seen with three main causes:
- The AMI is not compatible with parallel cluster; see [this GitHub issue][ami_parallel_cluster_issue].
- The pre_install script has failed to run successfully.
- The post_install script has failed to run successfully.
If you have used the `--no-rollback` flag in your `start_cluster.py` script, you should be able to log into the master node via SSM.
From here, check the file `/var/log/cfn-init-cmd.log` to see where your startup failed.
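As a rough sketch of that log check, the helper below prints the lines of the log that look like failures, with their line numbers. The function name `show_init_failures` and the `error|fail` search pattern are assumptions made for illustration; the actual failure message in your log may be worded differently, in which case reading the end of the file directly may be more useful.

```shell
# Hypothetical helper: print lines from a cfn-init log that look like failures.
# The default path is the one named above; the search pattern is an assumption.
show_init_failures() {
    log_file="${1:-/var/log/cfn-init-cmd.log}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    # -n adds line numbers, -i ignores case, -E enables the alternation.
    grep -n -i -E 'error|fail' "$log_file"
}
```

On the master node this could be called as `show_init_failures` (possibly with `sudo` if the log is not world-readable), or with another log path as its first argument.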
It's taking a long time for my job to start
Head to the EC2 console and check whether a new compute node is running. Ensure that you can see the logs of the compute node by clicking on the console; if not, the compute node has probably not finished launching yet, so give it another few minutes.
If, however, you can see the logs and everything seems okay, it may be worth doing the following.
- Run the `sacct` command to see the status of your job.
- Check that the `compute` node (or the node you've submitted to) is not in drain mode: `scontrol show partition=compute`.
- If you have used `srun --pty bash` to log in to the node, use `sinteractive` instead due to a known bug.
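The drain check can be sketched as a small filter over the `scontrol` output. This assumes the output contains a whitespace-separated `State=...` field (for example `State=UP` or `State=DRAIN`); the function name `partition_state` is hypothetical.

```shell
# Hypothetical helper: read `scontrol show partition` output on stdin and
# print the value of the first field that starts with "State=".
partition_state() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^State=/) {
                sub(/^State=/, "", $i)
                print $i
                exit
            }
        }
    }'
}

# Usage on the cluster (partition name taken from the text above):
#   scontrol show partition=compute | partition_state
```

Any value other than `UP` suggests the partition is not accepting or starting jobs normally.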
Cannot log into AWS SSO whilst in pcluster env
Run `which aws` to see where aws is in your path. It may be installed in your pcluster conda path.
If this is the case, feel free to delete it; it is not required.
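A minimal sketch of that check, assuming the pcluster conda environment is currently active so that `CONDA_PREFIX` points at it. The helper name `path_under_prefix` is hypothetical.

```shell
# Hypothetical helper: succeed when path $1 sits under directory prefix $2.
path_under_prefix() {
    # An empty prefix would match every absolute path, so reject it.
    [ -n "$2" ] || return 1
    case "$1" in
        "$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage sketch:
#   if path_under_prefix "$(command -v aws)" "$CONDA_PREFIX"; then
#       echo "aws is provided by the active conda environment"
#   fi
```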
If you are still having issues, run `aws --version` to determine which version of aws you're using.
You will need version 2.0.0 or higher in order to log into AWS via SSO.
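The version check can be scripted along these lines. The sketch assumes the version string begins with `aws-cli/<major>.<minor>.<patch>`; the function name `aws_major_version` is hypothetical. The `2>&1` in the usage line is there because some older v1 installations may print the version to stderr.

```shell
# Hypothetical helper: read `aws --version` output on stdin and print the
# major version number, assuming the string starts with "aws-cli/X.Y.Z".
aws_major_version() {
    sed -n 's|^aws-cli/\([0-9][0-9]*\)\..*|\1|p'
}

# Usage sketch:
#   major=$(aws --version 2>&1 | aws_major_version)
#   if [ "${major:-0}" -lt 2 ]; then
#       echo "aws v2 is needed for SSO login" >&2
#   fi
```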
If you have a reason for requiring aws v1 as your default aws, you may wish to symlink your aws2 installation path to `/usr/local/bin/aws2` and check out my sso shortcut for logging in.