MARCC - BenLangmead/jhu-compute GitHub Wiki
The Maryland Advanced Research Computing Center (MARCC) is a facility in east Baltimore on the Bayview campus of JHU. Bluecrab, the cluster housed at MARCC, is a large compute cluster that serves the schools of JHU as well as UMD. "MARCC functions are to generate, store, backup, analyze or visualize large datasets for members of the Hopkins and University of Maryland at College Park communities, and to provide the necessary infrastructure to transfer and share data at high bandwidths (100GB/s), storage and powerful computing processing resources."
The computing platform comprises approximately 19,000 cores and 96 GPUs for accelerated computing, with a combined peak performance of over 900 Teraflops. A Lustre parallel file system provides over 2 Petabytes of disk storage, and 56 Gb/s Infiniband connectivity is used for all parallel applications messaging and I/O. A second ZFS file system with approximately 14 Petabytes of capacity is available for storing and processing big data. MARCC is a shared system and will initially be used by 5 schools within Hopkins and UMCP.
MARCC is separate from HHPC and all other clusters currently at JHU. It is not replacing any of them.
Request an account: https://www.marcc.jhu.edu/request-access/request-an-account/
See also: MARCC running jobs
- 13 TB under `/scratch/groups/blangme2` and `/scratch/users/*` combined; Lustre (distributed) scratch for larger jobs
  - Not backed up
  - In theory, can be increased to 50 TB
- 50 TB of ZFS storage on `/work-zfs`; for large files that are not frequently read/written
  - 50 TB is the theoretical limit, but the filesystem is journaled and the journals can take a large chunk (up to 10 TB) of this
- About 400 GB of local scratch in `/tmp` on each node, but it's shared among all users of the node
- 2-to-3-day time limit on most queues (`shared`, `parallel`, `lrgmem`); shorter on others (`scavenger`, `gpup100`)
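The tiers above suggest a common job pattern: do I/O-intensive work in node-local `/tmp`, then copy the finished outputs to Lustre scratch. A minimal sketch, assuming the `blangme2` group path from the notes above; the fallback directory and the file name `result.txt` are illustrative so the script also runs off-cluster:

```shell
#!/bin/bash
set -euo pipefail

# Group Lustre scratch (not backed up); fall back to a temp dir when
# running somewhere the MARCC path does not exist.
SCRATCH_DIR="/scratch/groups/blangme2/${USER:-nobody}"
mkdir -p "$SCRATCH_DIR" 2>/dev/null || SCRATCH_DIR="$(mktemp -d)"

# Node-local scratch: ~400 GB in /tmp, shared with other users of the node.
LOCAL_TMP="$(mktemp -d /tmp/myjob.XXXXXX)"
trap 'rm -rf "$LOCAL_TMP"' EXIT

# ... I/O-intensive work happens against $LOCAL_TMP ...
echo "example result" > "$LOCAL_TMP/result.txt"

# Copy finished outputs back to Lustre scratch in one pass.
cp "$LOCAL_TMP/result.txt" "$SCRATCH_DIR/"
echo "results in $SCRATCH_DIR"
```

Staging through `/tmp` keeps small, frequent writes off the shared Lustre filesystem, which handles large sequential I/O much better.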
There are a number of potentially useful partition/queues defined in MARCC, a few of which are:
- lrgmem (up to 1TB memory)
- parallel (can ask for exclusive access)
- shared
- express (limited to <=4 cores, <=14 GB memory, <=12 hours)
- skylake (these are hit/miss, but are the newer architecture as the name implies)
The express queue is for small, short jobs, but uses the newer skylake architecture and tends to be idle.
The skylake queue itself comprises a number of skylake-architecture nodes, but several of them lack network access (compute0685, compute0686, compute0688, compute0689, compute0694, compute0704), and compute0702 appears to have memory problems. These nodes also tend to be idle. Their configured maximum memory is ~88 GB.
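One way to steer around the problem nodes listed above is SLURM's `--exclude` flag. A sketch, not runnable off-cluster; `myjob.sh` is a placeholder and the node list is as of this writing, so check it before relying on it:

```shell
# Hypothetical submission to skylake that avoids the nodes without
# network access plus the one with apparent memory problems:
sbatch -p skylake \
  --exclude=compute0685,compute0686,compute0688,compute0689,compute0694,compute0702,compute0704 \
  myjob.sh
```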
Also, MARCC, as of 10/31/2017, added the notion of constraints to allow for choices of architecture, e.g. Intel Broadwell vs. Intel Haswell as well as GPUs, within a partition:
https://www.marcc.jhu.edu/gpu-driver-updates-completed-for-cuda-9-updated-partitions/
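Per that announcement, a constraint can be combined with a partition to pin a job to a particular architecture. A sketch of a batch-script header, assuming the constraint names follow the linked post; the partition and resource numbers are illustrative:

```shell
#!/bin/bash
#SBATCH -p shared
#SBATCH --constraint=haswell   # or broadwell; see the linked post for GPU constraints
#SBATCH --time=2:00:00
#SBATCH --mem=8g

# ... job commands ...
```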
The Lustre and ZFS partitions are discussed above. Besides these, we also have a lab-specific 66-TB disk array accessible from all MARCC machines at `/net/langmead-bigmem-ib.bluecrab.cluster/storage`.
If we run out of allocation, or just want to save it while testing, we can use the "scavenger" queue, on which jobs can be preempted and are limited to 12 hours.
To do this, add the following to (or use it to replace the normal queue designation in) your sbatch command:
-p scavenger --qos=scavenger
As of 9/18/2015, use of the scavenger queue still deducts from the PI's quota of core-hours, but we're also allowed to "go negative", though it's not clear whether that applies just to scavenger or to all queues.
A standard compute node (C6320) consists of:
- Two Intel Haswell E5-2680v3 12-core 2.5 GHz CPUs
- 128 GB RAM
- Single-port Infiniband card and Infiniband cables
- One port (out of 18) on an Infiniband switch (SX6025F)
- Access to the Infiniband director switch (SX6518) and IB cards
- Two ports on 1 Gbps management nodes (7048R) and cables
- Slots on PDU and power whips
- Slot on rack
- Warranty for 5 years
You can SSH into any compute node; however, your SSH session and any child processes will be killed in ~5-10 minutes. This happens even if you are running a job on the node via SLURM and SSH into it in addition.
Screen/nohup/disown does not get around this.
- [langmead-bigmem](MARCC bigmem)
Your login should be your university ID @ university.edu (where university ID is your JHED ID or Directory ID), for example [email protected] or [email protected].
Your password should have been set during your account request. If it does not work, passwords can be reset at https://password.marcc.jhu.edu/
If you need any assistance, please contact [email protected]
SLURM <> SGE Command Mapping:
http://slurm.schedmd.com/rosetta.pdf
Basic interactive mode:
salloc -J interact -N min#nodes-max#nodes --ntasks-per-node=1 --cpus-per-task=#cpus --time=DD-HH:MM:SS --mem=#g -p queuename srun --pty bash
ex.:
salloc -J interact -N 1-1 --ntasks-per-node=1 --cpus-per-task=1 --time=1:00:00 --mem=4g -p debug srun --pty bash
To check your running/queued jobs:
squeue -l -u <userlogin>
ex:
squeue -l -u [email protected]
To get statistics on completed/running jobs for a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
ex.:
sacct -u [email protected] --format=JobID,JobName,MaxRSS,Elapsed
Refer to the manual at https://computing.llnl.gov/linux/slurm/man_index.html for more details on Slurm commands.
James Taylor says basic rsync is really slow.
The data transfer nodes should be used instead:
dtn4.marcc.jhu.edu
dtn5.marcc.jhu.edu
Any files/filesets larger than a few hundred MB should be copied to `/scratch/groups/blangme2` rather than the home directory, as the disk quota for home directories is quite low.
While an individual rsync tops out at ~10 megabytes/s, you can always manually split the file list up and start many parallel rsync jobs. This worked for me to transfer part of the geuvadis set from HHPC to MARCC, as well as from HHPC to JHPCE.
Globus is available, but probably requires setup at both the source and the destination.