
Shared File Systems

Staging input and reference data

You will likely need to download your input data and software.
Ensure that inputs are accessible to all nodes by placing them in the ${SHARED_DIR} folder.

If you specified --file-system-type as efs (the default) in your start_cluster.py command, the SHARED_DIR environment variable will be set to /efs. Alternatively, if --file-system-type is set to fsx, SHARED_DIR will be set to /fsx.
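
As a quick sanity check, you can confirm the shared directory from the head node or any compute node (a minimal sketch; the mount point will be /efs or /fsx depending on the flag above):

# Confirm where the shared file system is mounted
echo "${SHARED_DIR}"    # expect /efs (default) or /fsx
# Verify the EFS/FSx mount is present and has free space
df -h "${SHARED_DIR}"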

By default you will have read-only access to the s3 buckets linked to your AWS account.
Use sbatch --wrap "aws s3 sync s3://<bucket_path> \"${SHARED_DIR}/local_path\"" to download data into the shared file system.

The compute nodes have much higher network bandwidth than the head node, which is why the command above is wrapped in an sbatch job.
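
A slightly fuller version of the same download job is sketched below; the bucket path and destination folder are illustrative placeholders, and the copy partition is the same one used for uploads later on this page:

# Keep job logs on the shared file system so they are visible from every node
mkdir -p "${SHARED_DIR}/logs"
# Run the download on a compute node, which has far more bandwidth than the head node
sbatch \
  --partition="copy" \
  --job-name="stage-refdata" \
  --output="${SHARED_DIR}/logs/stage-refdata-%j.log" \
  --wrap "aws s3 sync s3://<bucket_path> \"${SHARED_DIR}/refdata\""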

Uploading data back to s3

Assumes yawsso is in your PATH (yawsso is installed in the pcluster conda env).

By default, parallel cluster does not have write access to s3 buckets.
A workaround is to take your short-term local SSO credentials and import them into the parallel cluster.

To do this you must:

  1. Be logged in to AWS on your local computer via SSO
  2. Have your parallel cluster environment activated, OR at least have aws2-wrap or yawsso in your PATH
  3. Have the ssm_run function sourced from [this GitHub repo][alexiswl_bashrc] (a quick check of these prerequisites is sketched after this list)
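
A minimal sketch for checking these prerequisites before proceeding (it assumes only the command and function names listed above):

# Confirm an SSO credential helper is available
command -v yawsso || command -v aws2-wrap || echo "yawsso/aws2-wrap not found in PATH"
# Confirm the ssm_run helper function has been sourced into this shell
type ssm_run >/dev/null 2>&1 || echo "ssm_run is not sourced"
# Confirm you are logged in via SSO for the profile you intend to use
aws sts get-caller-identity --profile "${AWS_PROFILE}"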

From your local computer run:

# ID of the cluster's master (head) node EC2 instance
master_instance_id="<master_ec2_instance_id>"
# Path on the shared file system (under /efs or /fsx) containing the outputs to upload
shared_fs_path="</path/to/outputs>"
# Destination bucket (and optional prefix) to upload to
path_to_s3_bucket="<s3://bucket>"
# Export the short-term SSO credentials for the current profile as a comma-separated
# list of NAME=VALUE pairs, suitable for passing to sbatch --export
export_env_vars="$(yawsso --export-vars --profile "${AWS_PROFILE}" | \
                   sed 's/export //g' | \
                   tr '\n' ',' | \
                   sed 's/,$//')"

# Build the sbatch command (note the leading space before sbatch, explained below) and
# run it on the head node via SSM, passing the SSO credentials through with --export
echo " sbatch \
         --partition=\"copy\" \
         --export \"${export_env_vars},ALL\" \
         --wrap \"aws s3 sync \\\"${shared_fs_path}\\\" \\\"${path_to_s3_bucket}\\\" \"" | \
 ssm_run \
    --instance-id "${master_instance_id}"
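
Once submitted, the upload can be monitored in the usual Slurm way (a sketch, assuming sbatch's default output naming):

# On the head node, check the state of the copy job
squeue --partition=copy
# sbatch writes job output to slurm-<jobid>.out in the submission directory by default
tail -f slurm-<jobid>.out
# From your local computer, confirm the objects arrived, using the same SSO profile
aws s3 ls "${path_to_s3_bucket}" --recursive --profile "${AWS_PROFILE}" | head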

The leading space before sbatch is there for security reasons.
Be aware that you are running a command on a shared parallel cluster with your personal access tokens.
Prefixing the command with a space prevents the tokens from being written to the ec2-user's bash history (this relies on HISTCONTROL containing ignorespace or ignoreboth, which is the default on most distributions).
Please note this is not a foolproof method.
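
If you want to confirm that the leading-space trick will actually suppress history on the head node, a quick check (assuming a bash shell as ec2-user):

# A leading space only keeps a command out of history when HISTCONTROL includes
# "ignorespace" (or "ignoreboth")
echo "${HISTCONTROL}"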