Page Index - mindvaultdev/aws-parallelcluster GitHub Wiki
112 pages in this GitHub Wiki:
- Home
- System Administration
- Schedulers ⏱🗓
- Development 👨💻
- Best Practices 👩💻
- Debugging 🕷
- Logging 🖨
- Ninja Hacks 🚀
- Known Issues 3.x 🚨
- Known Issues (3.x and 2.x) 🚨
- Fixed
- Known Issues 2.x 🚨
- Deprecated 🦴
- (2.10.0‐2.11.4) Tags in number interpreted as integer instead of string possible cause value error in Compute resource launch template
- (2.11.2 and earlier) Custom AMI creation (pcluster createami) fails when building SGE
- (2.11.7 and earlier) Cluster creation fails with awsbatch scheduler
- (2.2.1‐3.3.0) Risk of deletion of managed FSx for Lustre file system when updating a cluster
- (2.8.1 and earlier) Slurm node list misconfiguration breaks termination of idle compute nodes
- (3.0.0‐3.2.1) ParallelCluster API cannot create new cluster
- (3.0.0‐3.1.2) build image stack deletion failed after image successfully created
- (3.0.0‐3.1.3) AWSBatch Multi node Parallel jobs fail if no EBS defined in cluster
- (3.0.0‐3.1.3) build image creates invalid images when using aws cdk.aws imagebuilder 1.153.0
- (3.0.0‐3.1.3) Unable to create cluster or custom image when using API or CLI with documented user policies
- (3.0.0‐3.1.4) ParallelCluster API Stack Upgrade Fails for ECR resources
- (3.0.0‐3.1.4) Unable to perform cluster update when using API or documented user policies
- (3.0.0‐3.2.1) Running nodes might be mistakenly replaced when new jobs are scheduled
- (3.0.0‐3.5.1) ParallelCluster CLI raises exception “module 'flask.json' has no attribute 'JSONEncoder'”
- (3.0.0‐3.6.0) Compute nodes belonging to more than one partition causes compute to overscale
- (3.0.0‐3.6.0) Ptrace_scope not disabled for Ubuntu compute nodes
- (3.0.0 and later) Build image CloudFormation stacks fail to delete after images are successfully built
- (3.0.0‐3.7.2) Cluster update rollback can fail when modifying the list of instance types declared in the Compute Resources
- (3.0.0‐3.8.0) Interactive job submission through srun can fail after increasing the number of compute nodes in the cluster
- (3.0.2, 2.11.3 and earlier) Custom AMI creation (pcluster build image or createami) fails for centos7 and ubuntu1804
- (3.1.1‐3.1.2) Profiles not loaded when connected through NICE DCV session
- (3.1.1) Issue with clusters in isolated networks
- (3.1.x) Termination of idle dynamic compute nodes potentially broken after performing a cluster update
- (3.10.0) Build image fails in China regions
- (3.11.0) Job submission failure caused by race condition in Pyxis configuration
- (3.11.x) Job submission failure with Amazon Linux 2023
- (3.2.0‐3.5.1) GPU nodes not coming back online after scontrol reboot
- (3.3.0‐3.4.0) Slurm cluster NodeName and NodeAddr mismatch after cluster scaling
- (3.3.0‐3.4.1) Custom AMI creation fails on Ubuntu 20.04 during MySQL packages installation
- (3.3.0‐3.5.0) Update cluster to remove shared EBS volumes can potentially cause node launching failures
- (3.3.0‐3.5.1) Cluster updates can break Slurm accounting functionality
- (3.3.0‐3.9.0) Potential data loss issue when removing storage with update‐cluster in AWS ParallelCluster 3.3.0‐3.9.0
- (3.4.0‐3.9.0) Updating a cluster to include an EFS fs with encryption in transit fails
- (3.5.0 and earlier) DCV virtual session on Ubuntu 20.04 might show a black screen
- (3.6.0) NVIDIA GPU nodes fail to start with custom AMI built from DLAMI
- (3.6.0‐3.6.1) Slurm NodeHostName and NodeAddr mismatch for MultiNIC instance when managed DNS is disabled and EC2 Hostnames are used
- (3.7.0‐3.12.0) Cluster creation failure on custom Ubuntu AMIs shipping OpenSSH 9.7 , caused by unsupported DSA keys
- (3.7.0‐3.8.0) ParallelCluster API Deployment fails due to IAM Policy size exceeding service limits
- (3.8.0 ‐ 3.9.3) ParallelCluster Build Image Failing during Installation of Minitar Ruby Gem Dependency
- (3.8.0‐3.9.1) SharedStorageType: Efs not working on arm instances
- (3.8.0‐3.9.3) Slurmctld Does not Start with EFS SharedStorageType on reboot
- (3.9.0‐3.10.1) Cluster update intermittently fails because some compute nodes don’t execute update procedure
- (3.9.0‐3.9.1) Default ThreadsPerCore Slurm setting causes reduced CPU utilization
- (3.9.0‐current) Cluster creation fails on Rocky 9.4
- (3.9.0‐latest) SSH bootstrap cannot launch processes on remote host when using Intel MPI with Slurm 23.11
- (3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors
- AWS Batch with a custom Dockerfile
- Batch cluster creation in China regions fails with ParallelCluster 2.10.4 and earlier
- Best Practices
- Best Practices for Upgrading a Cluster
- CloudWatch Logs
- Cluster creation fails if enable_intel_hpc_platform=true is in the configuration file
- Cluster Update when EBS Snapshot Used at Cluster Creation doesn't Exist Anymore
- Configuration validation failure: architecture of AMI and instance type does not match
- Configuring all_or_nothing_batch launches
- Create cluster with encrypted root volumes
- Create Ubuntu AMI with Unattended Upgrades disabled
- Creating an Archive of a Cluster's Logs
- Custom AMI creation (pcluster createami or build image) fails with ARM architecture
- Custom AMI creation (pcluster createami) fails with ParallelCluster 2.9.1 and earlier
- Custom AMI creation (pcluster createami) fails with ParallelCluster versions from 2.6.0 to 2.10.3
- DCV Connection Through Web Browsers Does Not Work
- Deleting API Infrastructure produces CFN Stacks failure
- Deprecation of SGE and Torque in ParallelCluster
- EFS: best practices and known issues
- FSx: best practices and known issues
- Git Pull Request Instructions
- How to disable Intel Hyper Threading Technology
- How to enable slurmrestd on ParallelCluster
- Installing Alternate CUDA Versions on AWS ParallelCluster
- Intel HPC platform specification issue
- Intel MPI 2019 installation on 3.1 clusters
- Interactive Jobs with qlogin, qrsh (sge) or srun (slurm)
- Issue running Ubuntu 18 ARM AMI on first generation AWS Graviton instances
- Issue with CentOS 8 Custom AMI creation
- Issue with Ubuntu 18.04 Custom AMI creation
- Launch instances with ODCR (On Demand Capacity Reservations)
- Multi User Support
- Newer Linux kernels are no longer compatible with EFA and closed Source Nvidia drivers in instances with GPU Direct RDMA support
- NICE DCV integration
- NVIDIA Fabric Manager stops running on Ubuntu 18.04 and Ubuntu 20.04
- OpenMPI Install from Source and Uninstall
- P4d support on Amazon Linux 1
- ParallelCluster 3.0.0 on Ubuntu 18 and 20: scaling daemon is down after a head node reboot
- ParallelCluster Awesomeness
- ParallelCluster: Launching a Login Node
- Possible performance degradation due to Log4j CVE-2021-44228 hotpatch service on Amazon Linux 2
- Possible performance degradation on ALinux2 when using ParallelCluster 2.11.0 and custom AMIs from 2.6.0 to 2.11.0
- Public Private Networking
- Self patch a Cluster Used for Submitting Multi node Parallel Jobs through AWS Batch
- Slurm Issues
- Stack Creation Failures
- Transition from SGE to SLURM
- Ubuntu 22.04 does not support RSA keys by default any longer. Upgrade to ed25519.
- Upgrade Slurm in an AWS ParallelCluster cluster
- Upgrade the NVIDIA GPU driver on a Slurm cluster managed with AWS ParallelCluster
- Upgrade the OpenPMIx package on a Slurm cluster managed with AWS ParallelCluster
- Use an Existing Elastic IP
- Using a multi‐NIC instance as single NIC