1. Introduction
Questions:
- "When joining EI, what are the key things to know about our HPC environment?"
Objectives:
- "Learn about the general structure of these training materials."
Keypoints:
- "Set out the reason for these lessons and how to use them."
The High Performance Computing (HPC) facilities at EI are an invaluable resource for computational biology. This episode introduces those facilities, outlining how the system is structured and how it supports research.
The HPC is a dedicated multi-user system comprising over 100 interconnected nodes (individual servers). Researchers access these nodes remotely from their local laptops or desktops to make use of the system's computational power. Nodes differ in their CPU, RAM and local storage configurations, but graphical capabilities are limited, so interaction is via the command line interface (CLI). Because the system is shared by many users, stringent measures are in place to protect the integrity of users' files: access to specific areas of the HPC is strictly controlled and requires approval from a group leader or data champion.
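As a minimal sketch of what remote access looks like, connecting from your local machine is usually a single `ssh` command. The username and hostname below are placeholders for illustration, not the real EI addresses:

```bash
# Connect from your local laptop/desktop to an HPC login (submission) node.
# Replace the username and hostname with the details you are given;
# these values are placeholders only.
ssh jsmith@hpc.example.ac.uk
```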
Management of user access is centralised through a small number of submission nodes, also known as login or head nodes. These nodes represent potential bottlenecks within the system, and it is very important that resource-intensive applications are not run directly on them. Instead, users should submit tasks to the job scheduling system, SLURM (Simple Linux Utility for Resource Management), to allocate resources for their tasks. SLURM efficiently provisions required resources on worker nodes based on job specifications provided by users. Should a job exceed its allocated resources (such as run time, RAM or number of CPUs), SLURM intervenes by terminating it to prevent disruptions to other users' work.
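To illustrate how work is handed to SLURM rather than run on a submission node, a minimal batch script might look like the sketch below. The partition name, resource values, module name and input file are placeholders and will differ on the real system:

```bash
#!/bin/bash
# my_job.sh -- minimal SLURM batch script (illustrative values only)
#SBATCH --job-name=align_reads        # a short name for the job
#SBATCH --partition=ei-medium         # placeholder partition name
#SBATCH --cpus-per-task=4             # number of CPU cores requested
#SBATCH --mem=16G                     # RAM requested; exceeding this gets the job killed
#SBATCH --time=02:00:00               # wall-clock limit; exceeding this also kills the job
#SBATCH --output=align_reads_%j.log   # %j expands to the job ID

# The commands below run on a worker node allocated by SLURM,
# not on the submission node where you typed 'sbatch'.
module load samtools                  # assumes an environment-modules setup is available
samtools sort -@ "$SLURM_CPUS_PER_TASK" input.bam -o sorted.bam
```

The script would be submitted with `sbatch my_job.sh`, and `squeue -u $USER` shows whether it is queued or running.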
Each node has a small amount of local hard drive space; this is very fast and can be used to speed up I/O-intensive tasks (those that read and write files frequently). The vast majority of storage, however, sits on shared filesystems separate from the HPC nodes. We'll return to this later; for now it is enough to know that when a job writes a file, that file normally ends up on the shared storage rather than on the node itself, and most of that storage can also be accessed without logging on to the HPC at all.
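To make the distinction concrete, an I/O-heavy job can stage its data onto the node's local disk, work there, and copy the results back to shared storage at the end. The sketch below assumes a node-local scratch area exposed as `$TMPDIR` and uses a placeholder path for the shared project area; the real locations at EI will differ:

```bash
#!/bin/bash
#SBATCH --job-name=local_scratch_demo
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Placeholder paths -- substitute the shared storage and scratch
# locations used on the actual system.
SHARED=/path/to/shared/project
SCRATCH=${TMPDIR:-/tmp}/$SLURM_JOB_ID

mkdir -p "$SCRATCH"
cp "$SHARED/reads.fastq.gz" "$SCRATCH/"   # stage input onto the fast local disk

cd "$SCRATCH"
gunzip reads.fastq.gz                     # I/O-heavy work happens on local storage

cp reads.fastq "$SHARED/"                 # copy results back to shared storage
rm -rf "$SCRATCH"                         # clean up the node-local space
```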
Nodes are organised into partitions (also called queues) based on criteria such as expected runtime, memory requirements, or the need for GPU architecture. Partition limits can be queried on the system itself (a sketch of how follows the list below), but as of October 2022, the available resources at EI include:
- General work nodes: 82 nodes with 64 cores, 513GB RAM and approximately 2TB of local storage, plus 3 nodes with 64 cores, 4TB RAM and similar local storage capacity.
- Large memory partition: 5 nodes with varying configurations, including 15TB or 18TB of RAM and a correspondingly large number of CPUs.
- GPUs: 4 nodes, each equipped with A100 GPUs (80GB VRAM) and 1TB RAM.
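If you want to check the current partition limits yourself rather than rely on the snapshot above, SLURM can report them directly. A couple of useful commands are sketched here; the partition name is a placeholder, not necessarily one that exists at EI:

```bash
# List each partition with its time limit, memory per node, CPUs per node
# and any generic resources (e.g. GPUs).
sinfo -o "%P %l %m %c %G"

# Show the full configuration of a single partition; 'ei-largemem'
# is a placeholder name.
scontrol show partition ei-largemem
```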
In summary, the HPC facilities at EI offer a sophisticated infrastructure tailored to meet the demanding computational needs of researchers in the field of biology.
Questions:
What is the primary function of the High Performance Computing (HPC) facilities at EI, and how do researchers access these resources?
What is the role of submission nodes within the HPC system, and why is it important for users to avoid running resource-intensive applications directly on them?
How does SLURM contribute to the efficient allocation of resources on the HPC system, and what actions does it take if a job exceeds its allocated resources?