Scalable Compute (HPC and Cloud)

What is Scalable Compute?

Scalable compute refers to clusters of networked computers that increase our capacity to run jobs involving big data, many cores, and/or large amounts of random access memory (RAM) for holding intermediate data mid-stream in a processing pipeline. Physically, HPC and "cloud" are the same thing! HPC stands for High-Performance Computing: supercomputers owned by a single organization and typically available only to that organization's employees. "Cloud" usually refers to commercial services that offer supercomputer capability to anyone with the funds to pay for them.

Why use scalable compute?

  • Parallel CPU/GPU intensive applications: you are running a job that requires many CPU/GPU cores running in parallel (see the sketch after this list).
  • High memory applications: you are reading a huge dataset into memory (because there is no way to chunk it into smaller tasks), and you need a lot of RAM.
  • GPU-enabled, linear-algebra-heavy/deep learning applications: you need access to a machine with one or more GPUs. For example, you're training a model in PyTorch or TensorFlow.
  • Long-running applications: your workload needs to run for hours or days.
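
As a concrete illustration of the first case, here is a minimal sketch of task-level parallelism on a many-core machine. The process_tile.sh script is a hypothetical placeholder for your own single-task workload; only find and xargs are standard tools here.

# Run a hypothetical per-tile processing script on 8 cores at once.
# process_tile.sh is a placeholder for your own single-task command.
find data -name '*.tif' | xargs -n 1 -P 8 ./process_tile.sh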

When should I use HPC vs Cloud?

Cloud Advantages and Disadvantages

  • Advantages: Anyone with funds can use it, and there are lower firewall barriers for cross-organization collaboration. Users can share data easily without moving it around, simply by opening permissions for their collaborators (or the public). There is also a seemingly bottomless supply of compute you can request at any given time, provided you have the funds to pay for it.
  • Disadvantages: It can be expensive, and it often has complex cost models that result in hidden costs and surprise bills.

Use AWS for:

  • Forward-processing and streaming for real-time compute
  • Access to government data already in AWS, without downloading or moving data around
  • Data storage needs that exceed what HPC provides

Use Google Earth Engine for:

  • Analyzing datasets that are already stored in Google Earth Engine
  • LIMITS: downloading large amounts of raster data is challenging, and the learning curve can be steep because there are many Earth Engine-specific syntax rules (even though the base languages are JavaScript and Python). However, there are libraries of other users' code, as well as dedicated tutorials, that are constantly improving and becoming more available. You may find this article helpful.

HPC Advantages and Disadvantages

  • Advantages: The organization hosting the HPC system often offers discounted use of these services to its members, and HPC systems have built-in firewalls to protect users.
  • Disadvantages: There are often limits on who can access the system, which can limit collaboration. There are also limits on the supercomputing resources the organization has available, so you may spend time waiting in the queue to run your job or waiting for a renewed compute allocation. Furthermore, you may have to spend time moving data around between separately owned systems, as it is often hard to expose just part of the system to an outside collaborator.

Use HPC for:

  • Big compute (MCMC, simulations, future projections) that doesn’t require a ton of data for batch processing
  • Access to PetaLibrary and Globus

Trainings

HPC offered by CU Research Computing

Other

Earth Lab AWS Cloud Computing

Getting Started

To set up an account, please work with the Analytics Hub staff. We will need to identify a project for the work you expect to perform. Note that every project must have its own account, since the account is linked to how your usage is paid for. We use the CU Federated AWS, with each grant linked to its own AWS project. Analytics Hub will help you set up your account through CU to ensure proper project tagging. Once you are set up, you can follow the instructions below on how to set up an instance and get to work!

There are many advantages to using the CU Federated AWS:

  1. Simplified user onboarding and account maintenance. You can use your CU IdentiKey login and Duo MFA (two-factor authentication). You can also have several AWS accounts under your single sign-on.
  2. Improved security: CU-supported IT security (e.g., removal of public access keys posted to GitHub)
  3. Each account is linked to a single grant workspace, which helps with reproducibility: you can easily [dockerize](https://github.com/earthlab/analytics-hub/wiki/Docker) only the relevant code for the workflows specific to a grant
  4. It's cheaper! CU is part of a consortium that has negotiated a bulk discount from AWS, on the order of 5%, and the federated setup reduces the overhead to Earth Lab staff for monthly charging.
  5. There is a direct line of communication between AWS and CU High Performance Computing, so you can work with Analytics Hub staff to create a cost-optimized data processing and storage setup that draws on both types of scalable compute.

Some caveats:

  1. You must be diligent about using the appropriate account for the work you are doing. It is illegal to spend federal grant dollars on work for another project.
  2. AWS costs money, so we don't want to use it for long-term data storage (see the HPC vs. AWS use cases above)
  3. Please notify the Analytics Hub Director if you plan to do any big computing pushes, as this costs money and we want to make sure you have the funds available.

Research Computing Resources about AWS

Set-Up

Get AWS credentials. To start using Amazon Web Services, you'll need AWS credentials. Contact the Analytics Hub directly to get these. Once your request is processed, you'll receive a CSV file with your user name, a temporary password, an access key ID, a secret access key, and a login link. Save this CSV file in a secure location on your computer. Your AWS credentials should be kept secret! Do not release them to the public, especially your password, access key ID, or secret access key. In particular, never commit these to a git repository. We have had credentials accidentally posted on GitHub in the past, and this is a major security risk.

Log in to the AWS console. Once you have your user name and password, log in to the AWS console here. From here, you can access specific AWS services using the search bar. We will cover two:

  • EC2 ("Elastic Compute Cloud") - Amazon EC2 is basically a computer rental service. In other words, if you need access to a computer with 8 CPUs and 32 GB of RAM, or a computer with a GPU, you can use EC2 to temporarily gain access, do your analysis, and pay by the hour only for the resources that you've used. Learn more on AWS EC2
  • S3 ("Simple Storage Service") Amazon S3 is basically a place to store data, kind of like an external hard drive. If you need a space to store data temporarily, S3 is relatively cheap and the transfer speeds from S3 to EC2 and from EC2 to S3 are very fast. Learn more on AWS S3 Launch and connect to an EC2 instance. Next, we'll walk through our most common use case: setting up an RStudio or Jupyter Notebook instance on EC2. The goal here is to deploy a familiar development environment in the cloud, and use a web browser to connect and do your work. From the AWS console, find EC2 in the search bar and click on it. This will bring you to the EC2 Dashboard. Select the Oregon region. Make sure that "Oregon"/US West (Oregon) is selected as your AWS region in the upper right. If this is your first time logging in, your default region may be "Ohio"/US East (Ohio). Launching an instance. Next, click the big blue button "Launch Instance". Choose an Amazon Machine Image (AMI). On the left panel, choose "My AMIs", and select the most recent YYYY-mm-dd-earthlab-docker image (see the AMI Name field). Choose an Instance Type. Select the instance type that you need here. It is helpful to know the requirements for CPU cores (vCPUs), and memory. Choose the smallest instance that satisfies your requirements (instances that underutilize resources are automatically shut down). Learn more information on instance types.. Here are some suggestions for instance types:
  • m5.xlarge for general purpose work (4 vCPUs, 16 GB of memory)
  • r5.xlarge for memory intensive work (4 vCPUs, 32 GB of memory)
  • c5.xlarge for compute intensive work (4 vCPUs, 8 GB of memory) Once you've selected your instance type, click "Next: Configure Instance Details". Configure Instance Details. The defaults can be left alone at this step. Click "Next: Add Storage". Add Storage. This is where you specify the size of the hard disk associated with the instance. You can change the "Size" field to set the number of GB you need. Note that anything up to 30 GB is free. It is important to estimate the amount of disk space needed ahead of time. Once you've set your hard disk size, click "Next: Add Tags". Add Tags. This is where you can name your instance. Click "click to add a Name tag", and name your instance, e.g., goes-active-fire-model. Tags are key/value pairs, so that your name tag has key "Name" and value "goes-active-fire-model" for example. Then, click Next: Configure Security Group Configure Security Group. Security groups set the rules for how you and others can connect to your instance. Click "Select an existing security group", and choose one of the following groups:
  • jupyter-notebook to use a Jupyter notebook
  • rstudio to use RStudio Please do not create new security groups unless absolutely necessary. And if you do, give them an informative name. Then click the blue button "Review and Launch". Review Instance Launch. Confirm that your instance type, security group, etc. are correct then click "Launch".
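
For reference, the same launch can be scripted with the AWS CLI. This is only a minimal sketch, not our standard workflow: the AMI ID, key pair name, and Name tag below are placeholders you would replace with your own values, and it assumes your credentials are already configured via aws configure using the access key ID and secret access key from your credentials CSV.

# One-time credential setup: prompts for the access key ID, secret
# access key, default region (us-west-2 for Oregon), and output format.
aws configure

# Launch one instance from a chosen AMI (ami-0123456789abcdef0 is a
# placeholder), sized and tagged as in the console walkthrough above.
aws ec2 run-instances \
  --region us-west-2 \
  --image-id ami-0123456789abcdef0 \
  --instance-type m5.xlarge \
  --key-name fflinstone \
  --security-groups rstudio \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=goes-active-fire-model}]'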

Select/Create a key pair

Key pairs are necessary to connect to your instance. If you have never created a key pair, follow the instructions to create a new one, and give it the same name as your username (e.g., if your AWS username is fflinstone, call your key pair fflinstone). Save your key pair in a secure location on your hard disk. Do not commit this file to GitHub, or otherwise make it public. If you already have a key pair, select it from the drop-down list. Check the box to acknowledge that you have the key pair, then click "Launch Instances". This will take you to a Launch Status page where you can see that your instances are being launched. Click the instance ID link or the "View Instances" button to see your instance in the list.
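
If you prefer the command line, a key pair can also be created with the AWS CLI. A small sketch, with fflinstone standing in for your own username as above:

# Create a key pair and save the private key locally.
aws ec2 create-key-pair \
  --key-name fflinstone \
  --query 'KeyMaterial' \
  --output text > fflinstone.pem

# Restrict permissions so ssh will accept the key.
chmod 400 fflinstone.pem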


Connect to your instance

Once you see that the Instance State is "running", you can connect to your instance. To do so, open a bash terminal and navigate to the directory where your key pair's .pem file is located. If you have never used the file before, you will need to change its permissions so it is not publicly viewable:

cd directory/to/pem/file
chmod 400 fflinstone.pem

replacing fflinstone with your username. Then, from the EC2 console in your web browser, highlight your instance and copy the Public DNS (IPv4) address from the Description information at the bottom half of the page. This will be something like ec2-34-217-71-152.us-west-2.compute.amazonaws.com. To connect via ssh, from your terminal run the command:

ssh -i fflinstone.pem [email protected]

substituting your username for fflinstone and your instance's Public DNS address for ec2-34-217-71-152.us-west-2.compute.amazonaws.com. You will probably see a message in your terminal that the authenticity of host ec2... can't be established, and be prompted to enter yes/no to continue. Type yes and hit enter to connect. You should now be connected to your EC2 instance. Once you are connected, you are ready to launch your RStudio or Jupyter server via Docker. Continue to the Docker instructions.
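
If you'd rather not copy the address from the console, the Public DNS name can also be looked up with the AWS CLI. A small sketch, assuming the instance was tagged goes-active-fire-model as in the walkthrough above:

# Look up the public DNS name of a running instance by its Name tag.
aws ec2 describe-instances \
  --filters 'Name=tag:Name,Values=goes-active-fire-model' \
            'Name=instance-state-name,Values=running' \
  --query 'Reservations[].Instances[].PublicDnsName' \
  --output text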


Terminate an instance

To totally remove an instance and all of its data, you can terminate the instance.
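
Termination can be done from the EC2 console (via the instance's Actions menu) or with the AWS CLI. A sketch, where the instance ID is a placeholder:

# Permanently delete an instance and its attached disk.
# i-0123456789abcdef0 is a placeholder; find yours in the EC2 console.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0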


Stop an instance

If you are done working with an instance for now but want to return to it later, you have the option to stop the instance and start it again later. While stopped, charges accrue only for the disk space being held for the instance. Instances left stopped for multiple weeks will be terminated. Instance hard disks are meant only as temporary storage: to save data for the long term, pull it down locally or save it in another cloud location. For medium-term storage, you have the option to use Amazon S3.
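
A minimal CLI sketch of the stop/start cycle and of copying results to S3 for medium-term storage; the instance ID and bucket name are placeholders:

# Pause the instance (disk is preserved; hourly compute charges stop).
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Resume it later; note the public DNS name usually changes on restart.
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# Copy a results directory to S3 (bucket name is a placeholder).
aws s3 sync ./results s3://my-earthlab-bucket/results/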