Notes - parkeraddison/nasa-filesystem-benchmarks GitHub Wiki
Table of contents
- IOR
- IO-500
- Custom image for IO500 dependencies
- FIO benchmark
- Understanding IOR output
- Pleiades PBS Hello World
- Running IOR in a PBS job
- Nautilus namespace and IOR units
- Re-trying IOR on Pleiades
- Automated parameter sweeps
- Parsing and graphing IOR outputs
- Values for initial parameter sweeps
- FIO on NAS
- Darshan on NAS
- What MPI module to use
- Conducting initial parameter sweeps
- Initial sweeps on /nobackupp12
- MPI on PRP
- PRP SeaweedFS
- Single node CephFS parameter sweep
- IO Hints
- Trying out nbp2 and memory hogging
- Better understanding of the filesystem hardware
- No-cache read performance
- Designing stripe tests
- Some useful things to know about IOR
- The real reason for transfer size performance drops
- A quick multi-node test on PRP
- Darshan on PRP
- Darshan to observe an ML application
- Replicating Chowdhury et al IO Evaluation of BeeGFS for Deep Learning
- Darshan on NAS
- Pseudo pipeline to observe with Darshan
- Fire detection setup
- Validating Darshan outputs
- Flood detection profiling
First deploy the volume and a pod.
k create -f volumes/block.yml
k create -f minimal-deploy.yml
These commands are run in the pod.
Dependencies. See: https://github.com/hpc/ior/blob/main/testing/docker/ubuntu16.04/Dockerfile
apt-get update
apt-get install -y libopenmpi-dev openmpi-bin mpich git pkg-config gcc vim less curl wget
apt-get install -y sudo
Downloading
wget -qO- https://github.com/hpc/ior/releases/download/3.3.0/ior-3.3.0.tar.gz | tar -zxv
Configuration. See `./configure --help`.
./configure
Installation
make
See: https://ior.readthedocs.io/en/latest/userDoc/tutorial.html
cd src
./ior ...
or
mpirun ...
Not sure how to really use it yet.
When I run `ior` with no arguments it seems to run a tiny test instantly.
When I tried doing
mpirun -n 64 ./ior -t 1m -b 16m -s 16
I got a ton of:
ior ERROR: open64("testFile", 66, 0664) failed, errno 13, Permission denied (aiori-POSIX.c:412)
...
[filebench-574869c787-pdn62:07749] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2193
...
Also note that I ran `useradd testu` and `su testu` because mpirun doesn't want to be run as the root user. But this user has no permissions! I think that's the issue.
It seems like a `chmod -R 777 .` as root fixed this!
For example, run 10 tasks with a transfer size of 1m (which IOR interprets as 1 MiB, per the output below), a block size of 16m, and a segment count of 16:
mpirun -n 10 ./src/ior -t 1m -b 16m -s 16
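As a quick sanity check on how those three parameters combine (my own arithmetic; the aggregate filesize reported in the output below agrees):

```bash
# aggregate file size = tasks * block size * segments
echo "$((10 * 16 * 16)) MiB"   # 2560 MiB = 2.50 GiB
```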
Output:
IOR-3.3.0: MPI Coordinated Test of Parallel I/O
Began : Mon Mar 8 23:08:17 2021
Command line : ./src/ior -t 1m -b 16m -s 16
Machine : Linux filebench-574869c787-pdn62
TestID : 0
StartTime : Mon Mar 8 23:08:17 2021
Path : /storage/ior-3.3.0
FS : 8.0 GiB Used FS: 0.5% Inodes: 4.0 Mi Used Inodes: 0.0%
Options:
api : POSIX
apiVersion :
test filename : testFile
access : single-shared-file
type : independent
segments : 16
ordering in a file : sequential
ordering inter file : no tasks offsets
nodes : 1
tasks : 10
clients per node : 10
repetitions : 1
xfersize : 1 MiB
blocksize : 16 MiB
aggregate filesize : 2.50 GiB
Results:
access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
write 678.06 678.07 0.118931 16384 1024.00 0.770489 3.78 3.59 3.78 0
read 4233 4234 0.019146 16384 1024.00 0.000035 0.604695 0.298354 0.604706 0
remove - - - - - - - - 3.20 0
Max Write: 678.06 MiB/sec (711.00 MB/sec)
Max Read: 4233.46 MiB/sec (4439.11 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 678.06 678.06 678.06 0.00 678.06 678.06 678.06 0.00 3.77548 NA NA 0 10 10 1 0 0 1 0 0 16 16777216 1048576 2560.0 POSIX 0
read 4233.46 4233.46 4233.46 0.00 4233.46 4233.46 4233.46 0.00 0.60471 NA NA 0 10 10 1 0 0 1 0 0 16 16777216 1048576 2560.0 POSIX 0
Finished : Mon Mar 8 23:08:25 2021
The hpc/ior:ubuntu16.04 image (built locally and pushed to Docker Hub as parkeraddison/ior:ubuntu16.04) almost passes `./prepare.sh` for the io500 repo -- it just needs to `apt-get install -y autoconf`.
Set up permissions
groupadd stor
chgrp -R stor /storage
chmod -R g+rwx /storage
useradd usr -G stor
su usr
mpiexec -np 2 ./io500 config-minimal.ini
Output:
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/LJJ3QEF6WLMS4VWPVV2XKL6JYS:/var/lib/docker/overlay2/l/24YRTXTAGXIULRRXZY4JB5WTGG:/var/lib/docker/overlay2/l/42MQ3LQTYO2IUBUARVOT7IPDQP:/var/lib/docker/overlay2/l/YGOPACSWOGMEMKHTJYCM6UU3FH:/var/lib/docker/overlay2/l/KPZVRXHJW6K2V5FL24TRWFWO6B:/var/lib/docker/overlay2/l/HX22FHOBPYU4GIEFU6V5JWP2FJ:/var/lib/docker/overlay2/l/GJP2A7A4T3XQYPZHHNZR3LC76R:/var/lib/docker/overlay2/l/TZLAGOYFXJHZETSVMY4KIDZ543:/var/lib/docker/overlay2/l/4NTW7PG2N53XK'
IO500 version io500-sc20_v3-6-gd25ea80d54c7
ERROR: write(12, 0x225c000, 2097152) failed, (aiori-POSIX.c:563)
Oof.
Looks like: https://stackoverflow.com/questions/46138549/docker-openmpi-and-unexpected-end-of-proc-mounts-line
I'm trying a flattened image now (and including autoconf). https://tuhrig.de/flatten-a-docker-container-or-image/
Still getting
./ior: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
Seems like a dependency problem -- openmpi3 is needed. Tried the centos7 image and the same thing happened. Note that to use mpirun in CentOS you need to first run `module load mpi`. This image seems to work (locally). For some reason it caused an error when I tried to deploy on PRP. May try again.
Repo: https://github.com/joshuarobinson/docker_ior_mpi
I created a custom image to hold the ior/io500 dependencies so I'll have finer control over it. Then I went ahead and edited the `./ior --list > config-all.ini` output to disable all but the two easy IOR tests. I've put this into `config.ini`.
I also changed the transfer and block size to very small values (proof of concept). I believe in the past when I was trying to run it I was using the defaults (very large values!).
bash-4.3$ mpiexec -np 2 ./io500 config.ini
ERROR INVALID (src/phase_dbg.c)stonewall-time != 300s
IO500 version io500-sc20_v3-6-gd25ea80d54c7
[RESULT-invalid] ior-easy-write 0.650314 GiB/s : time 0.008 seconds
[RESULT-invalid] mdtest-easy-write 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] ior-hard-write 0.000000 GiB/s : time 0.000 seconds
[RESULT-invalid] mdtest-hard-write 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] find 0.000000 kIOPS : time 0.000 seconds
[RESULT] ior-easy-read 3.403407 GiB/s : time 0.003 seconds
[RESULT-invalid] mdtest-easy-stat 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] ior-hard-read 0.000000 GiB/s : time 0.000 seconds
[RESULT-invalid] mdtest-hard-stat 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] mdtest-easy-delete 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] mdtest-hard-read 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] mdtest-hard-delete 0.000000 kIOPS : time 0.000 seconds
[SCORE-invalid] Bandwidth 0.000000 GiB/s : IOPS 0.000000 kiops : TOTAL 0.000000
The result files are stored in the directory: ./results/2021.03.14-23.26.49
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 150 RUNNING AT filebench-78c6c98d98-nrdlr
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Despite the warnings, I'm pretty sure it actually worked.
Was able to get FIO running by downloading the tar.gz from https://github.com/axboe/fio, installing the dependencies here (Alpine), then running `make` (ignoring a warning) and `make install`.
Finally, I created a simple job file as `write.fio`, then ran `fio write.fio`.
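For reference, a job file consistent with the job summary in the output below would look roughly like this -- a reconstruction, not necessarily the exact `write.fio` I used:

```bash
# Hypothetical write.fio: random 4k writes with the psync engine, iodepth 1,
# against a single 128 MiB file (matching the fio output below)
cat > write.fio <<'EOF'
[job1]
rw=randwrite
bs=4k
ioengine=psync
iodepth=1
size=128m
EOF
fio write.fio
```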
Output:
bash-4.3$ fio write.fio
job1: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.26
Starting 1 process
job1: Laying out IO file (1 file / 128MiB)
job1: (groupid=0, jobs=1): err= 0: pid=4189: Sun Mar 14 23:51:53 2021
write: IOPS=273k, BW=1067MiB/s (1118MB/s)(128MiB/120msec); 0 zone resets
clat (nsec): min=1130, max=190684, avg=3212.52, stdev=8403.44
lat (nsec): min=1200, max=190734, avg=3271.35, stdev=8403.92
clat percentiles (nsec):
| 1.00th=[ 1304], 5.00th=[ 1352], 10.00th=[ 1384], 20.00th=[ 1464],
| 30.00th=[ 1544], 40.00th=[ 1624], 50.00th=[ 1688], 60.00th=[ 1768],
| 70.00th=[ 1896], 80.00th=[ 2096], 90.00th=[ 2512], 95.00th=[ 3280],
| 99.00th=[55552], 99.50th=[58624], 99.90th=[77312], 99.95th=[84480],
| 99.99th=[91648]
lat (usec) : 2=76.55%, 4=19.21%, 10=0.49%, 20=1.31%, 50=0.86%
lat (usec) : 100=1.58%, 250=0.01%
cpu : usr=10.08%, sys=89.92%, ctx=57, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,32768,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1067MiB/s (1118MB/s), 1067MiB/s-1067MiB/s (1118MB/s-1118MB/s), io=128MiB (134MB), run=120-120msec
Disk stats (read/write):
rbd2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
It looks like FIO is pretty popular! There are lots of repositories that have tools to work with FIO (both inputs and outputs).
Here are some repos which chart fio outputs or do other helpful things with FIO. They may all help get a better idea of how to understand the outputs and how to set up useful jobs!
- https://github.com/khailey/fio_scripts (example output)
- https://github.com/louwrentius/fio-plot
- https://github.com/wallnerryan/fio-tools
- https://github.com/intel/fiovisualizer
- https://github.com/xridge/fio-docker
- https://github.com/pcuzner/fio-tools
- https://github.com/javigon/fio_tests
- https://github.com/meganerd/fio-examples
- https://github.com/jan--f/fio_graphs
- https://github.com/amefs/fio-bench
- https://github.com/mcgrof/fio-tests
- https://github.com/mchad1/fio-parser
- https://github.com/storpool/fio-tests
- https://github.com/perftool-incubator/bench-fio
This is probably a good search: https://github.com/search?q=fio+benchmark&type=Repositories
"Exploration of IOR and FIO benchmarks; Noteful wiki" | HEAD -> main | 2021-03-14
Time to figure out how to start making sense of and plotting the outputs. That way I can make sure that IO500 and/or FIO are good choices to pursue.
Once that's done, we can start to figure out how to run this on Pleiades. Henry mentioned that a Python virtualenv would be one way to get specific software (I think one of the repos above is a Python wrapper...). Some packages should already be available. Also, I'd expect that as an HPC environment lots of the software needed for these HPC filesystem benchmarks should already be present!
Some description of IOR output: https://gitlab.msu.edu/reyno392/good-practices-in-IO/blob/dfcff70e9b9e39f1199f918d1a4000f44bc1b384/benchmark/IOR/USER_GUIDE#L686
Looks like the charts seen in some of the papers I came across earlier (e.g. this one) were made using an I/O profiler "Darshan". I'm sure there must be a profiler used at NAS.
Seems like a hopeful reference: https://cug.org/5-publications/proceedings_attendee_lists/2007CD/S07_Proceedings/pages/Authors/Shan/Shan_slides.pdf.
The useful outputs of IOR are simply read and write bandwidth (reported in both MiB/s and MB/s) and operations per second.
The charts seen in papers and presentations, such as here, are the result of multiple runs of IOR with different parameters.
For example, useful charts may demonstrate how bandwidth changes as transfer size, effective file size per processor, or number of processors increases.
This is something I could (hopefully easily) whip up and have it be useful -- run a bunch of IOR tests on a parameter grid. The Lustre docs do this exact thing in their example, going from 1,2,4,8 processors.
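A sketch of what such a grid could look like as a plain shell loop (values are placeholders, modeled on the Lustre docs' 1/2/4/8-processor example):

```bash
# Sweep process count and transfer size with a fixed block size and segment
# count, saving each run's output for later parsing
for np in 1 2 4 8; do
  for xfer in 1m 4m 16m; do
    mpirun -n "$np" ./src/ior -t "$xfer" -b 16m -s 16 \
      | tee "ior_np${np}_t${xfer}.out"
  done
done
```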
I might try this out right now on Nautilus... let me go ahead and set up a slightly larger volume.
- Talk with John/Dima about how large of a volume and how many pods I can set up for future benchmarking on Nautilus
Things are starting to make more sense and work more consistently with IOR and FIO runs in Nautilus.
One thing I'm not fully sure about is the importance of specifying a file in IOR, or how to use that option... For instance, if I create a file of random bytes like seen here, is there any point to using that as an existing file to read from? Ah... perhaps there is a point. I could create multiple small files or one very large file... this coupled with filePerProc... maybe that's the point.
"Minimal IOR test script; Repo organization" | HEAD -> main | 2021-03-16
When I spoke with John he was interested in the IO500 leaderboard and FIO benchmark. He mentioned a few cool things:
- I can create a namespace to run the benchmarks on Nautilus
- They have used FIO a lot before!
- I should talk to Igor about the benchmarks/IO500
It's about time that I run a job on Pleiades! Then I'll try to run a minimal IOR and FIO run.
Alright, let's give this a go.
Log in to the enclave (secure front-end), then log in to a Pleiades front-end
ssh sfe
ssh pfe
Explanation of PBS on the HECC knowledge base: https://www.nas.nasa.gov/hecc/support/kb/portable-batch-system-(pbs)-overview_126.html
Batch jobs run on compute nodes, not the front-end nodes. A PBS scheduler allocates blocks of compute nodes to jobs to provide exclusive access. You will submit batch jobs to run on one or more compute nodes using the qsub command from an interactive session on one of Pleiades front-end systems (PFEs).
Normal batch jobs are typically run by submitting a script. A "jobid" is assigned after submission. When the resources you request become available, your job will execute on the compute nodes. When the job is complete, the PBS standard output and standard error of the job will be returned in files available to you.
When porting job submission scripts from systems outside of the NAS environment or between the supercomputers, be careful to make changes to your existing scripts to make them work properly on these systems.
A job is submitted to PBS using `qsub`. Typing `man qsub` gives a nice description of the expected job script format and capabilities. Here are some useful parts:
- The script can run Python, Sh, Csh, Batch, Perl
- A script consists of: 1) An optional shell specification, 2) PBS directives, 3) User tasks, programs, commands, applications, 4) Comments
- A shebang can be used to specify the shell, or the `-S` command line option can be used
  - E.g. Python can be used by having the first line of the script as `#!/usr/bin/python3`
These directives are needed in a job script, and are written as `#PBS`-prefixed lines at the top of the script file, or can be passed in as arguments to the `qsub` command. It's probably best to include them in the script though! With that said, the shell could be specified with `#PBS -S`, too.
Common directives can be found here: https://www.nas.nasa.gov/hecc/support/kb/commonly-used-qsub-command-options_175.html. And other directives (options) can be seen with `man qsub`.
Here's a basic script seen in the man pages, but I modified 'print' to 'echo' instead to avoid an invalid command!
#!/bin/sh
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -N HelloJob
echo "Hello"
The script will be executed using the shell based on the first-line shebang. The `PBS -l` directive specifies resources. It asks for 1 'chunk' of resources with 1 cpu and 1 gb of memory. Here is also where we could specify the specific compute nodes we want (model=), the number of mpi processes we want (mpiprocs=), and the filesystem (?). See `man pbs_resources`. Finally, the `PBS -N` directive specifies the job name.
Let's try running it!
qsub hello-job.sh
Alright, it was rejected because the node model was not specified. I'll specify Pleiades Sandy Bridge with `model=san` in the resource line.
Also worth noting that there is a Pleiades development queue that I think this work would fall under (testing the commands that is, not the final benchmarks!).
- I should ask Henry about the billing and mission shares.
I just added `-q devel` and `-l model=san` to the script, trying again.
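So the revised `hello-job.sh` presumably now looks like this (with the model added to the select line, per the rejection message):

```bash
#!/bin/sh
#PBS -q devel
#PBS -l select=1:ncpus=1:mem=1gb:model=san
#PBS -N HelloJob
echo "Hello"
```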
qsub hello-job.sh
Output: 10791518.pbspl1.nas.nasa.gov
Running `qstat -u paddison` lists the jobs I've submitted. This is a pretty quick job on a fast-turnaround queue, so it'll go by quickly. But three quick runs of that command showed the job in three different states. The fourth time running qstat the output was empty -- the job was complete.
qstat
paddison@pfe24:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- -------- --- --- ----- - ----- ---
10791518.pbspl1 paddison devel HelloJob 1 1 02:00 Q 00:00 --
paddison@pfe24:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- -------- --- --- ----- - ----- ---
10791518.pbspl1 paddison devel HelloJob 1 1 02:00 R 00:00 50%
paddison@pfe24:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- -------- --- --- ----- - ----- ---
10791518.pbspl1 paddison devel HelloJob 1 1 02:00 E 00:00 50%
Two files are now present in the directory where I ran `qsub`.
HelloJob.o10791518
Job 10791518.pbspl1.nas.nasa.gov started on Sun Mar 21 20:12:29 PDT 2021
The job requested the following resources:
mem=1gb
ncpus=1
place=scatter:excl
walltime=02:00:00
PBS set the following environment variables:
FORT_BUFFERED = 1
TZ = PST8PDT
On *****:
Current directory is /home6/paddison
Hello
____________________________________________________________________
Job Resource Usage Summary for 10791518.pbspl1.nas.nasa.gov
CPU Time Used : 00:00:02
Real Memory Used : 2732kb
Walltime Used : 00:00:02
Exit Status : 0
Memory Requested : 1gb
Number of CPUs Requested : 1
Walltime Requested : 02:00:00
Execution Queue : devel
Charged To : *****
Job Stopped : Sun Mar 21 20:12:36 2021
____________________________________________________________________
The `e` file was empty. Here is that file from a previous run where an invalid command was used.
HelloJob.e10791410
/var/spool/pbs/mom_priv/jobs/10791410.pbspl1.nas.nasa.gov.SC: line 5: print: command not found
The job summary and output are shown in the `o` file, and it appears that stderr is shown in the `e` file.
Nice!
"Hello World PBS job run on Pleiades" | HEAD -> main | 2021-03-21
Let's get an IOR benchmark running as a PBS job.
This is going to involve:
- Ensure software dependencies exist... and learn how to load modules/packages
- Learn how to install software dependencies if need be!
- Download the IOR executable to /home(?) and try executing it in a PBS job
Software modules: https://www.nas.nasa.gov/hecc/support/kb/using-software-modules_115.html -- I'll probably need to `module load mpi...`.
Software directories: https://www.nas.nasa.gov/hecc/support/kb/software-directories_113.html -- since `/u/scicon/tools` is used by the APP group, I have a feeling a handful of dependencies will be there already. These should already be in PATH.
Also good to know that the pfe nodes can load these modules and it's fine to use them for quick testing/debugging! So I'll be able to test the minimal IOR (and work out all of the dependency, module load, etc. steps) before submitting a PBS job :) Never mind -- MPI jobs are not permitted on the pfe nodes. Still, I should be able to run the `./configure` script, which checks all dependencies.
Starting by downloading the IOR release from https://github.com/hpc/ior/releases/ to my pfe home directory.
wget -O- https://github.com/hpc/ior/releases/download/3.3.0/ior-3.3.0.tar.gz | tar zxf -
cd ior-3.3.0
Now we need to make sure that the necessary dependencies are loaded by running `./configure`.
Trying to run it results in
checking for mpicc... no
checking for mpixlc_r... no
...
configure: error: in `/home6/paddison/ior-3.3.0':
configure: error: MPI compiler requested, but could not use MPI.
Which I think I can fix by running a `module load mpi...`. First let's check what MPI modules are available using `module avail mpi`. Alright, I'll try `module load mpi-sgi`.
Let's try the configure script again.
Sweet! It worked fully this time! So we know that we'll need to ==module load mpi-sgi==.
Now I can run `make`. Seems to have worked fine.
I cannot run `make install` at the moment because I don't have permission to install the binary to `/usr/local/bin` -- but I can change the installation path when running `./configure`. Not necessary though, I can just run the binary from src directly.
Alright. I honestly think that all other dependencies are met. I suppose it's time to run a PBS job! I've written the following `minimal-ior.sh` file:
#!/bin/sh
#PBS -q devel
#PBS -l select=1:ncpus=8:mpiprocs=8:mem=2gb:model=san
#PBS -N MinimalIOR
module load mpi-sgi
cd "$PBS_O_WORKDIR/ior-3.3.0"
# Should write and read a total of 2 GiB (8 procs * 16 segments * 16 MiB blocks)
mpirun -np 8 ./src/ior -t 1m -b 16m -s 16
Let's try it out! Huh, it complained that I didn't specify the model!? Oh. It was because I had mistyped the comment on the shebang, so it probably didn't read any of the directives.
qsub minimal-ior.sh
Out: `10799119.pbspl1.nas.nasa.gov`, and running `qstat` shows us the job move from Queued, to Running, to Exiting.
qstat
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 Q 00:01 --
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 R 00:00 4%
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 E 00:01 4%
Unfortunately, the `e` file resulted in `/var/spool/pbs/mom_priv/jobs/10799119.pbspl1.nas.nasa.gov.SC: line 11: mpirun: command not found`. Looks like our module load didn't give us the mpirun command. Hmmmmm.
Sure enough, on the pfe I can see an mpiexec command, but no mpirun command. I seem to be able to access this command by ==module load mpi-hpcx==.
Let's try adding that module and run the job again.
Also, this is pretty handy: watch qstat -u paddison
Alright, this time we got: `mpirun: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory`. A quick look at Stack Exchange shows that this is an Intel math library. There are some `comp-intel` modules available, but a `module help comp-intel` shows only libfftw files... still, I will try it.
Ah! I can test `mpirun` (without any arguments, so it won't actually do anything) on a pfe; that way I can check if it complains about dependencies. Sure enough, it does complain about missing libimf. Fortunately, after a ==module load comp-intel== it no longer complains!
Let's try this in a PBS job again.
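Presumably the module line in `minimal-ior.sh` now just loads all three:

```bash
module load mpi-sgi mpi-hpcx comp-intel
```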
Output
MinimalIOR.e10799410
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node r327i7n6 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
MinimalIOR.o10799410
Job 10799410.pbspl1.nas.nasa.gov started on Mon Mar 22 15:52:46 PDT 2021
The job requested the following resources:
mem=2gb
ncpus=8
place=scatter:excl
walltime=02:00:00
PBS set the following environment variables:
FORT_BUFFERED = 1
TZ = PST8PDT
On *****:
Current directory is /home6/paddison
____________________________________________________________________
Job Resource Usage Summary for 10799410.pbspl1.nas.nasa.gov
CPU Time Used : 00:00:04
Real Memory Used : 2280kb
Walltime Used : 00:00:04
Exit Status : 139
Memory Requested : 2gb
Number of CPUs Requested : 8
Walltime Requested : 02:00:00
Execution Queue : devel
Charged To : *****
Job Stopped : Mon Mar 22 15:52:58 2021
____________________________________________________________________
Hmmm, so it didn't work fully, but it didn't not work at all at least :')
"Minimal IOR test almost capable of running on Pleiades. Faced segfault" | HEAD -> main | 2021-03-22
Just went ahead and created a `usra-hpc` namespace on Nautilus, and set up a larger volume and new deployment to test out IOR over there. I checked the file sizes and sure enough they're all mebibytes and whatnot. So I was correct before that a command of `mpirun -np 8 ior -t 1m -b 16m -s 16` does in fact produce an aggregate file size of 2 GiB -- actually, the IOR output says this pretty nicely!
Also worth noting that once `make install` is run (this is done already in the images I set up, e.g. parkeraddison/io500), wherever ior is run from serves as the filesystem under test -- so I merely need to navigate to `/storage` then run ior to test it on that volume.
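In other words, testing a mounted volume is just a matter of (paths as in my deployment):

```bash
cd /storage
mpirun -np 8 ior -t 1m -b 16m -s 16
```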
Finally (and most relevant right now), I did not see any segfault errors when I ran it on Nautilus. Let's try it again on Pleiades.
I'm going to modify the minimal script to simply run `ior` without any arguments -- this writes/reads only one mebibyte of data and is practically instant. It's truly minimal!
Same message as before.
Perhaps:
Or... the error output says "Per user-direction, the job has been aborted" -- this sounds like maybe the PBS job was aborted because it saw a non-zero exit code. Is there some way to specify that I don't want the job aborted?
To make figuring this out easier, we can run the PBS job interactively! This is basically like exec'ing into a compute node shell, in a k8s way of thinking about it. Running `qsub -I minimal-ior.sh` will request the resources by reading the PBS directives, then attach the terminal. I can run each line of the script manually.
After loading the mpi-sgi, mpi-hpcx, and comp-intel modules, here's what running ior shows:
PBS *****:~> cd ior-3.3.0/
PBS *****:~/ior-3.3.0> ./src/ior
[*****:20237:0:20237] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe5)
==== backtrace ====
0 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1d98c) [0x2aaabb7a498c]
1 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1dbfb) [0x2aaabb7a4bfb]
2 /nasa/hpcx/2.4.0_mt/ompi-mt-icc/lib/libmpi.so(MPI_Comm_rank+0) [0x2aaaab668e00]
3 ./src/ior() [0x40d58c]
4 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab935a35]
5 ./src/ior() [0x403209]
===================
Segmentation fault (core dumped)
PBS *****:~/ior-3.3.0/src> mpirun -n 1 ./ior
[*****:20556:0:20556] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe5)
==== backtrace ====
0 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1d98c) [0x2aaabb7a498c]
1 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1dbfb) [0x2aaabb7a4bfb]
2 /nasa/hpcx/2.4.0_mt/ompi-mt-icc/lib/libmpi.so(MPI_Comm_rank+0) [0x2aaaab668e00]
3 ./ior() [0x40d58c]
4 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab935a35]
5 ./ior() [0x403209]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ***** exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Looks like the abort happens due to mpirun -- but the truth of the matter is we are getting a segfault from IOR itself. Now we just need to figure out why!
"More minimal testing on Pleiades" | HEAD -> main | 2021-03-23
Sweet, over the past few days I've been unable to replicate the segfault -- in other words, IOR has been working fine as a PBS job!
Furthermore, I've gone ahead and cleaned up/commented the code and ran it on the Lustre (/nobackup) filesystem!
It's working great :) It's a super minimal example to just confirm it works. I'll run a slightly larger example parameter sweep as I finish that code.
"Working IOR on Pleiades NFS and Lustre" | HEAD -> main | 2021-03-27
There are a few things I've been working on:
- Code to run parameter sweeps
- Code to parse IOR outputs and graph them
- Better installation and setup of this repository (e.g. automate downloading IOR)
There are some examples online, such as in the Lustre docs, of running parameter sweeps in a shell script. That's fine, and I've worked on one... but I can't help but feel things would be a lot easier (and more readable) to just script it in Python with some `subprocess` calls and much easier iteration/logic flow.
What we can do that (I think) makes things easiest is to (1) use a shell script to submit the PBS job, load all dependencies, and load in the correct Python module then (2) call a Python script from within the PBS job which calls the IOR tests for the different parameter ranges.
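A sketch of that two-layer structure (the file names and the Python module name here are placeholders, not the exact ones in the repo):

```bash
#!/bin/sh
#PBS -q devel
#PBS -l select=1:ncpus=8:mpiprocs=8:model=san
#PBS -N IORSweep

# Hypothetical sweep-job.sh: the PBS wrapper loads dependencies, then hands the
# parameter iteration off to a Python script, which shells out to mpiexec/ior
# and writes one output file per run.
module load mpi-hpe comp-intel python3   # module names are assumptions
cd "$PBS_O_WORKDIR"
python3 sweep.py
```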
Right now I have some hard-coded values in the Python script itself for the parameter sweeps... not sure if it pays to make the script read from a configuration file.
To more easily validate and see the effect of the different parameters, it would help to have code that converts the IOR outputs into tables then graphs them!
"Facilitate running IOR with a parameter sweep" | HEAD -> main | 2021-03-30
Following https://www.nas.nasa.gov/hecc/support/kb/secure-setup-for-using-jupyter-notebook-on-nas-systems_622.html and https://www.nas.nasa.gov/hecc/support/kb/using-jupyter-notebook-for-machine-learning-development-on-nas-systems_576.html, I can do my visualization work in a Jupyter notebook running on a compute node.
When going through the setup steps, I used the pyt1_8 environment -- I'm not sure if 'pyt' stands for anything besides 'Python' or what the number denotes; I imagine the tf... environments are for TensorFlow. But regardless, I checked and pyt1_8 has Jupyter and Python 3.9, along with pandas, numpy, scipy, and matplotlib, so it'll work well!
Okay, yesterday I ended up giving up on NAS Jupyter because I kept running into SSL errors when trying to actually go to localhost and access the lab. After multiple attempts today, trying different environments, following all steps again, etc, I've realized the problem was Chrome -- after switching to Firefox to view Jupyter all is fine.
Also, now that I can finally use Jupyter for development, it's worth remembering the following helpful CSS rule to inject to add an 80ch ruler to the JupyterLab code editor:
.CodeMirror-line::after {
content: '';
position: absolute;
left: 88ex;
border-left: 1px dashed gray;
}
Here are the descriptions of the different NAS ML conda environments: https://www.nas.nasa.gov/hecc/support/kb/machine-learning-overview_572.html. Looks like 'pyt' stands for PyTorch (d'oh). It also looks like the /nasa `jupyterlab` environment doesn't have matplotlib. The machine learning environments do, however. So in the future I'll start up the lab from that environment. I could also go ahead and create my own virtual environment probably... but I really don't need to! Having PyTorch or TensorFlow is overkill, but that is fine by me ;)
Turns out IOR has a few different output formats -- including JSON and CSV which make life a lot easier -- I've been trying to parse the human-readable output but ran into some issues with whitespace delimiting. It looks like the JSON output is the best (in my opinion) since it's easy to access exactly what you need and it doesn't hide any information. Side note - I wonder if YAML will ever take over JSON's place in society...
Now that I'm using the JSON output from IOR, everything is much more straightforward when it comes to parsing. I polished up a file to parse and plot outputs. I think it's time now to actually do some larger-scale runs so we can make sure what we're getting as a result makes sense.
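For reference, asking IOR for the machine-readable summary and peeking at its structure looks roughly like this (a sketch; the exact JSON field names are easiest to confirm by just inspecting the file):

```bash
# Run a small test and write a JSON summary (summaryFormat/summaryFile are
# IOR's -O options for choosing the output format and destination)
mpiexec -np 8 ./src/ior -t 1m -b 16m -s 16 \
  -O summaryFormat=JSON -O summaryFile=results.json

# Pretty-print the JSON to see what's available before parsing it in pandas
python3 -m json.tool results.json | head -40
```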
"Output parsing and plotting functions complete" | HEAD -> main | 2021-04-06
Great news is (after some tweaking/bug squashing) the parameter sweep job is working like a charm. Furthermore, the output file was instantly able to be parsed and visualized with the functions written!
The current steps remaining are:
- Come up with some good parameter values to test
- Make the code a bit easier to use and adjust
"More robust parameter sweep code and pbs job script; Add sweep and visualization to readme" | HEAD -> main | 2021-04-07
For parameter sweeps, we have the following guidelines:
- We should do multiple iterations of each test for consistency's sake. We can change this depending on how long tests take, but for now maybe `-i 5` or so
- We're pretty interested in how each system scales with concurrency -- so sweeping the number of tasks is an important test
- We're interested in how ^ might change with different data size and different access patterns -- so we should test various file size (combination of block and segment size) and transfer size
- Key importance: to explore, to understand what bottlenecks we might be experiencing
Largely, this is exploratory -- the parameter values and sweeps aren't fixed by any means; rather, we should try some out and if we see something interesting then we should dive into it further. Fortunately, the tests (at the current scale I've tested) don't take too long (although now that multiple iterations are being run, expect them to take proportionally longer).
To come up with the initial values, though, I've been drawing inspiration mostly from some papers which have used IOR to evaluate HPC performance:
- Using IOR to Analyze the I/O performance for HPC Platforms by Hongzhang Shan, John Shalf
  - Slides: https://cug.org/5-publications/proceedings_attendee_lists/2007CD/S07_Proceedings/pages/Authors/Shan/Shan_slides.pdf
- Conducted user survey for typical IO access patterns at NERSC. Findings:
- Mostly sequential IO (rather than random)
- Mostly writes -- really? I would assume that most scientific projects are more write-once, read-many...
- Transfer size varies a lot -- "1KB to tens of MB"
- Typical IO patterns: one processor, one file per processor (both POSIX), MPI-IO single shared file
- File per process can lead to lots of files being written, especially if there are restarts. This doesn't scale well in terms of data management!
- Small transactions and random accesses lead to poor performance... but lots of poorly designed applications do this
- Important IOR parameters:
- API -- POSIX, MPI-IO, HDF5, or NetCDF
- ReadFile/WriteFile -- whether to measure read/write operations
- SegmentCount (s) -- number of 'datasets' in the file
- Each dataset is composed of NumTasks (N) blocks of BlockSize (b), read/written by the processor in chunks of TransferSize (t)
- To avoid caching, filesize per processor (= BlockSize) should be large enough to exhaust the memory buffers on each node. BlockSize was swept from 16MB to 8GB to see where caching effects (for read performance) were mitigated. In their words: "where the derivative of the performance was asymptotically zero"
- Curious, can't IOR's reorder option mitigate caching? We should test a block size sweep with and without reordering. This would only apply for tests on more than one node -- we're doing this so that we can trust the rest of the tests which only involve a single node.
- For this test, only one node was used and TransferSize was fixed at 2MB with one segment.
- TransferSize was swept from 1KiB to 256MiB (using a power of 4 in KiB) to get a sense of if the system is optimized for larger transfer size/the system overhead.
- Using the ideal parameters seen above, file-per-process versus shared file were both evaluated as NumTasks was swept from 8 to 256/1024 (depending on how many nodes were available on each system)
- On their systems, read and write performance were very similar.
- The theoretical peak IO bandwidth of each system was calculated/known before hand... for the Lustre system it was calculated as the number of DDN couplets times the bandwidth of each couplet
- What is the theoretical peak IO bandwidth on Pleiades?
- It's important to compare systems "on the basis of performance rather than raw performance" due to differences in scale
- The paper also explains the physical topology of the systems it tested -- while looking into that I stumbled upon ANL's CODES project for simulating the impact of different topologies... outside the scope of this project, but perhaps worth ==NOTE==ing
- We should see what performance a single node is capable of -- this'll let us measure speedup (fixed work per processor, as is default with IOR) and maybe also scaleup (if we adjust parameters to fix aggregate work done)
- Truthfully, a speedup chart would be more effective at comparing different systems than a shared plot of raw performance!
- I/O Performance on Cray XC30 by Zhengji Zhao, Doug Petesch, David Knaak, and Tina Declerck
Darshan is an IO profiler which intercepts IO calls to collect statistics which can be viewed on a timeline or summarized later -- things like bandwidth, IO size, etc. Basically, it's a way to get all of those useful measurements which a finished IOR/FIO run tells us but on any arbitrary mpirun jobs (including scientific application benchmarks)!
Useful video: https://www.youtube.com/watch?v=7cDoBusXK5Q; slides: https://pop-coe.eu/sites/default/files/pop_files/darshan_io_profiling_webinar.pdf
Definitely worth getting this to run on NAS -- even for IOR runs. The video mentions looking at how well the percentage of metadata IO scales, because that was a bottleneck they faced.
It came to my attention that I've found a lot of academic papers which reference IOR, but not a lot of widespread 'internet' popularity. FIO, however, is immensely popular in terms of internet points -- plenty of blog posts, technical pages (from Microsoft, Google, Oracle, etc)... I wonder if there are some HPC papers which reference FIO?
In order to run FIO on NAS, the release can be downloaded and unpacked like so:
wget -O- https://github.com/axboe/fio/archive/refs/tags/fio-3.26.tar.gz | tar zxf -
Before we `make`, however, we need a gcc version of at least 4.9:
module avail gcc
...
module load gcc/8.2
cd fio-fio-3.26/
make
The minimal job in `readwrite.fio` can be run with
path/to/fio path/to/readwrite.fio
Hmmmm, I came across Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer by Shawn Strande, Pietro Cicotti, et al. (and it's out of SDSC -- it's a small world after all ;) ). But I also think I came across the reason why I'm not seeing HPC papers that use FIO: I'm not so sure that FIO can do single-shared-file workloads, https://github.com/axboe/fio/issues/631. So it might be really easy to set up a job script and get baseline readings for your filesystems, but not when there are multiple nodes involved.
"FIO works on NAS" | HEAD -> main | 2021-04-09
To view some documentation PDFs and to prepare for viewing plots generated by Darshan, I went ahead and went through the (really easy!) process of setting up a VNC server/connection to a graphical interface. Following https://www.nas.nasa.gov/hecc/support/kb/vnc-a-faster-alternative-to-x11_257.html was straightforward, and boiled down to:
# On pfe
vncserver -localhost
# > "New desktop is at pfe:XX"
~C
-L 5900:localhost:59XX
# Connect to localhost:5900 with local VNC client
vncserver -kill :XX
I should be using `mpi-sgi/mpt` (or `mpi-hpe`?) rather than mpi-hpcx. This includes `mpicc`.
Trying to set up Darshan has proven a challenge! But, here's what I've come up with so far, trying to follow https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html:
- Download and untar
wget -O- ftp://ftp.mcs.anl.gov/pub/darshan/releases/darshan-3.2.1.tar.gz | tar zxf -
- Load in mpi-hpe for mpicc
module load mpi-hpe comp-intel
- Configure and make the darshan-runtime
cd darshan-runtime && ./configure --with-log-path=~/profiler/darshan-logs --with-jobid-env=PBS_JOBID CC=mpicc && make
Now is where I get stuck. I can't `make install` since I don't have write permissions to `/usr/local/lib`, but I can do something like `make install DESTDIR=~/` to install it to my home directory... I can even add `~/usr/local/bin` to my path. But what about the `lib` and `share` directories? How do I make sure those are accessible?
The reason I ask is that when I try to run an mpiexec that is monitored by Darshan, I face an error:
paddison@pfe20:~> LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior
mpiexec: symbol lookup error: /home6/paddison/usr/local/lib/libdarshan.so: undefined symbol: darshan_variance_reduce
I just tried `export LD_LIBRARY_PATH=~/usr/local/lib:$LD_LIBRARY_PATH` as well, to no avail.
To be honest, I've spent some time reading about libraries and linking, but I don't truly understand how it all works and what specifically is breaking here. Perhaps I need to set some paths in `./configure`. For instance, `--prefix`.
Using `--prefix ~/usr/local` lets me run `make install` without messing with Makefile variables (whoops, shoulda just looked at the `./configure --help` to begin with!). And my hope is that it'll also let me actually run the thing!
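So the full reconfigure is something like this (same flags as before, plus the prefix pointed at my home directory):

```bash
cd darshan-runtime
./configure --prefix=$HOME/usr/local \
            --with-log-path=$HOME/profiler/darshan-logs \
            --with-jobid-env=PBS_JOBID CC=mpicc
make && make install
```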
paddison@pfe20:~/profiler/darshan-3.2.1/darshan-runtime> LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior
Can't open proc file /proc/arsess
: function completed normally
Can't open proc file /proc/arsess
: function completed normally
asallocash failed: array services not available
Can't open proc file /proc/arsess
: array services not available
mpiexec: all_launch.c:737: newash: Assertion `new_ash != old_ash' failed.
Aborted (core dumped)
Hey! At least it's different than before :') Oh whoops, that might be because I tried running an mpiexec command on a front-end node rather than a compute node. Let's try it again in an interactive qsub.
Hmmmm... it hung on me. Gotta figure out how to terminate a PBS job. I tried `qsig -s INT jobid` and `-s` (which should be SIGTERM), then I tried `qdel jobid`, but it hasn't worked yet :o After a while (~10 minutes or so), my qdel timed out, then trying it again said "qdel: Server could not connect to MOM ...", then after a bit more time I tried it again and it worked. Maybe some backend server was down temporarily or something...
As if I didn't learn my lesson, I'm going to try again.
Aw shucks, here we go again. Something about `LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior` is hanging. The same exact thing happened -- `qdel` timed out after 12 minutes, then a subsequent call returned no connection to MoM, then a third call a few seconds later succeeded. Not sure what's going on.
I'll re-examine Darshan in the future, or perhaps while waiting for some parameter sweeps to conclude. For now, it's time to use the parameter values from the paper and start running some initial tests!
Huh, so actually when I was getting things set up to run the parameter sweeps, I realized that I can't run IOR using mpi-hpe/mpt nor mpi-sgi/mpt... only mpi-hpcx + comp-intel, it seems. Otherwise I'm met with `error while loading shared libraries: libopen-rte.so.40: cannot open shared object file: No such file or directory`...
Maybe it's because I ran `make` with hpcx loaded? That would make sense. I've gone ahead and re-downloaded, re-configured, and re-made IOR with mpi-hpe loaded -- it works this time with mpi-hpe as the only required module :)
Let's try Darshan super quick? Damn. It hung again. Alright, I'll give up on Darshan for now and just move on with the parameter sweep finally.
Additional IOR documentation can be found here https://github.com/hpc/ior/blob/main/doc/USER_GUIDE. It includes some things that aren't on the website. Based on this, I could have written Python code to generate IOR scripts then have the PBS job script run that, rather than execute commands within Python. Oh well, maybe I will change to that in the future.
I've gone ahead and done the parameter sweeps. The results are plotted and commented on in the 1_Parameter_sweeps.ipynb
notebook (on NAS pfe). Most notably, there was an interesting dip in performance at a transferSize of 4MiB and performance decreased with more nodes.
It's important to figure out if that behavior is consistent, then if so figure out what is causing it. The hardware? The network topology? The software, like Lustre stripe sizes?
I ran all of the benchmarks on /nobackupp18 but supposedly that filesystem is not fully set up yet. It also has different hardware (SSDs) than /nobackupp12. I will attempt to run the same set of tests on /nobackupp12 and compare the results.
"Initial parameter sweeps; Configurable sweeps; Parsing/plotting" | HEAD -> main | 2021-04-14
Henry warned me that the progressive Lustre striping on /nobackupp12 is broken, and I should make sure that a fixed stripe count is being used instead. To see what stripe count is currently being used, I can run `lfs getstripe [options] path`. So for instance, I ran a small test with the `keepFile` directive enabled so I could see what striping is being done on the written `testFile`.
`lfs getstripe testFile` confirms that progressive striping is taking place. Whereas if I specify a new file with a fixed stripe count (or size):
lfs setstripe -c 2 testFile2
cp testFile testFile2
lfs getstripe testFile2
I see that fixed number! Fortunately, I can specify stripe size by using IOR directive options!
Huh... when I tried to run IOR with a Lustre-specific directive it complained
ior ERROR: ior was not compiled with Lustre support, errno 34, Numerical result out of range (parse_options.c:248)
I compiled this version of IOR with the mpi-hpe module... I'll try ./configure again to see if Lustre is shown as supported. This time around I ran `./configure --with-lustre`, then `make`. Let's see if it works. I suppose if it doesn't, I can always just add an explicit `lfs setstripe` command before each test.
Didn't work. Maybe I need to compile it on a Lustre filesystem? Like, move it to a /nobackup and then re-configure/compile?
Maybe it's related: https://github.com/hpc/ior/issues/189
Shucks, as a workaround I tried an explicit `lfs setstripe` on `testFile` before running IOR, but the `getstripe` afterwards showed that it didn't work. I think this is because IOR deletes the file before writing it.
Here are some great resources about Lustre striping, IO benchmarks, etc:
- https://www.nics.tennessee.edu/computing-resources/file-systems/lustre-striping-guide
- https://www.nics.tennessee.edu/computing-resources/file-systems/io-lustre-tips
These explain that performance greatly benefits from stripe alignment, in which OST contention is minimized by ensuring each processor is requesting parts of a file from different OSTs -- this can be done by setting the number of stripes to the number of processes, for instance. Performance is also optimized by a stripe size similar to the transfer size.
Honestly, this document has some incredible tips and insight. NICS is co-located on the ORNL campus so has ties to the DoE.
Ah! Looks like IOR Lustre options not working is potentially a known issue: https://github.com/hpc/ior/issues/353
Perhaps this is a workaround to pre-stripe and keep the file: https://github.com/hpc/ior/issues/273. Basically, use the `-E` (existing file) option :) And that works!
lfs setstripe -c 2 testFile
mpiexec -np 2 ~/benchmarks/ior/ior-3.3.0/src/ior -a MPIIO -E -k
lfs getstripe testFile
So we can explicitly run `lfs setstripe` and create the testFile beforehand, as long as we also make sure to use the existing-file flag!
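Combining that with the stripe-alignment advice from the NICS pages, a single aligned test would look something like this (counts and sizes here are illustrative, not tuned values):

```bash
# Pre-create the test file with one stripe per process and a stripe size equal
# to the IOR transfer size, then have IOR reuse (-E) and keep (-k) the file
lfs setstripe -c 8 -S 4m testFile
mpiexec -np 8 ~/benchmarks/ior/ior-3.3.0/src/ior -a MPIIO -t 4m -b 64m -E -k -o testFile
```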
Woohoo, let's run those tests again!
I'm taking a closer look at some more test runs... I think perhaps part of the reason behind the high variance is due to the data sizes being relatively small? I'm not sure... but the variation between two consecutive repetitions can be huge. For instance, I ran another transfer size test and the first repetition read time was 1.3 seconds -- then the next was 0.3 seconds.
Actually, I've looked at all of the individual tests now (not just the summary) and it looks like the first repetition always takes considerably (~3x) longer than the rest of the repetitions. The next repetition or two are usually the best, then read time starts to climb again.
This is not true for writes -- though there is a lot of variation in write time... I'm not sure why there would be.
Perhaps there is truly some caching effect happening when I run the repetitions?
Looking into running an MPI job (IOR) across multiple nodes on the PRP.
I think having some tests from the PRP to compare to will be nice. I'm puzzled by a bit of the NAS results... trying to formalize some visualizations and run some more tests to get a better grasp of the I/O performance behavior that's going on...
"sync changes" | HEAD -> main | 2021-04-24
After installing Helm (a software manager for Kubernetes) it is time to start following https://github.com/everpeace/kube-openmpi#quick-start.
Note that the Helm version has changed and the `--name` option is gone, so the deploy command should now be:
helm template $MPI_CLUSTER_NAME chart --namespace $KUBE_NAMESPACE ...
I took a peek at what this command outputs by redirecting to a file (`> OUT`) -- it produces a nice Kubernetes yaml which defines:
- A Secret containing the generated ssh key and authorized keys variable
- A ConfigMap containing a script to generate the hostfile
- A Service -- this is "an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service)". Basically a way to group up Pods into an application with frontend and backend pods, and a way to network between them.
- A Pod containing the openmpi container with our desired image and a hostfile init container
- A StatefulSet which manages the pods -- this is like a Deployment in which all pods (including replicas) are uniquely identified and supports persistent storage
Aw, attempting to create that resource led to:
Error from server (Forbidden): statefulsets.apps "nautilus-worker" is forbidden: User "system:serviceaccount:usra-hpc:default" cannot get resource "statefulsets" in API group "apps" in the namespace "usra-hpc"
+ cluster_size=
+ rm -f /kube-openmpi/generated/hostfile_new
stream closed
I'll need to ask Dima about permission for that resource. Perhaps the API group has just shifted... Or, perhaps it is because I haven't added the rolebindings yet. The rolebinding command is using the GitLab /blob/ instead of the /raw/, but after fixing that I did not face the 'cannot get resource' issue! I still did face an issue though:
Error from server (NotFound): statefulsets.apps "nautilus-worker" not found
Ah, I think that was just due to me not fully tearing down my previous attempt. After deleting all the pods and re-running the resource creation -- it's working!
I should now be able to run mpiexec via a `kubectl exec` to the master pod.
Sweet! The example command works!
kubectl exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile --display-map -n 4 -npernode 1 sh -c 'echo $(hostname):hello'
Worth noting for the future in case I need to do some careful node selection or mess with some mpiexec options: some of my nodes (currently master and worker-0) are on nysernet and some aren't. In the JOB MAP section of the output, the ones not on nysernet show
Data for node: nautilus-worker-1.nautilus Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 2 Bound: UNBOUND
whereas the ones on nysernet show Bound to a bunch of sockets
Data for node: nautilus-worker-0.nautilus Num slots: 96 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
Regardless, I should be setting up node affinities so that I get nodes with 16 cores for the closest comparison to Sandy Bridge.
Before we do that, though, let's get a custom image with IOR on it to do a minimal test run. Locally, this was as easy as just downloading IOR, running `./configure` and `make`. It worked fine without needing to mess with any additional dependencies :) Let's try it on the cluster.
Alright, IOR runs, but not without some issues. When running a POSIX API test, the following warnings showed up in the results section of both write and read:
ior WARNING: inconsistent file size by different tasks.
WARNING: Expected aggregate file size = 4194304.
WARNING: Stat() of aggregate file size = 1048576.
WARNING: Using actual aggregate bytes moved = 4194304.
Then, when using MPIIO as the API, IOR will not run fully, as we're met with:
[nautilus-worker-2:00058] [3]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-0:00057] [1]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-1:00057] [2]mca_sharedfp_lockedfile_file_open: Error during file open
Oh. Probably because I'm not working on a shared volume, duh. So each node can only see its own file. Well, anyway, IOR is technically working!
"kube-openmpi running with IOR on PRP" | HEAD -> main | 2021-04-24
I can use the `rook-cephfs` storage class -- it uses CephFS and supports ReadWriteMany -- once Dima gives me the okay. See: https://pacificresearchplatform.org/userdocs/storage/ceph-posix/
Basically all I need to do is change my volume yaml to specify:
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
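For completeness, the full PVC that spec belongs to would be along these lines (the metadata name is my assumption, taken from the claimName referenced below):

```bash
# Sketch of the complete shared CephFS PVC
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-cephfs
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
EOF
```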
Then, I believe I can adjust `values.yaml` to:
volumes:
  - name: shared-cephfs
    persistentVolumeClaim:
      claimName: shared-cephfs
volumeMounts:
  - mountPath: /shared
    name: shared-cephfs
for both `mpiMaster` and `mpiWorkers`... we'll see!
Wonderful! The shared storage was successfully mounted to all nodes. I tried running IOR with just a single process on the master node in the shared directory and it worked -- now let's go ahead and try a multi-node job.
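The multi-node run was essentially the hello-world mpiexec from before with ior swapped in -- roughly this (the task count and the ior path inside the image may differ):

```bash
kubectl exec -it $MPI_CLUSTER_NAME-master -- \
  mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile \
  -n 4 -npernode 1 ior -t 1m -b 16m -s 16 -o /shared/testFile
```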
Woohoo! It worked!
Performance was really bad (~0.5 MiB/s wr), probably due to huge separation between nodes -> high latency (0.25s for write performance :o ). To be honest, I didn't even check what region the storage is assigned to. But nevertheless -- it worked :)
I want to make sure I'm requesting nodes that have 16 cores -- just like Sandy Bridge.
To do so, I can do a couple of things in values.yaml:
- Specify resources.requests/limits
- Specify nodeSelector with nautilus.io/sockets: 2 as the required label, to avoid being assigned to nodes with more CPUs. Nevermind. I just checked k get nodes -l nautilus.io/sockets=2 -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu and the vast majority of nodes labeled as having 2 sockets have tons of CPUs. In general, it looks like most nodes on the cluster have more than 16 CPUs, so it doesn't make sense to try to get dedicated 16-CPU nodes.
In light of the high latency, I think I'll go ahead and request specific nodes at first. My storage is US West (probably at UCSD), so I'll want to request some other nodes also at UCSD to limit communication latency between pods and between storage.
I'll need to talk to Dima about which nodes to use, but I should be able to ask for:
- Pods of type general (avoid testing, system, osg, etc.)
- calit2.optiputer.net nodes (these should be at UCSD, whereas calit2.uci.edu nodes are at Irvine)
- sdsc.optiputer.net nodes
- ucsd.edu nodes
- suncave nodes
Ah, I can use these nodes:
k get nodes -l topology.kubernetes.io/zone=ucsd
(with the possible exception of a .ucsb.edu node which might have been labeled by mistake)
This means I can use nodeSelector in values.yaml. Couple it with my resource requests:
resources:
limits:
cpu: 8
memory: 8Gi
requests:
cpu: 8
memory: 8Gi
nodeSelector:
topology.kubernetes.io/zone: ucsd
Uhhh ohhhh.
Error from server: error when creating "STDIN": admission webhook "pod.nautilus.optiputer.net" denied the request: PODs without controllers are limited to 2 cores and 12 GB of RAM
Gotta figure that one out.
"IOR working on shared cephfs filesystem with node selection" | HEAD -> main | 2021-04-26
Dima mentioned there are some issues with CephFS at the moment and heavy usage is causing the OSDs to run out of memory and crash. In the meantime, he mentioned I can check out SeaweedFS.
https://pacificresearchplatform.org/userdocs/storage/seaweedfs/
"SeaweedFS volume" | HEAD -> main | 2021-04-27
Running into some issues with SeaweedFS. I created the PVC, but when I created a deployment the pod failed to mount the PVC.
Later in the day I tried again and the PVC itself failed to be provisioned:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 2m46s (x26 over 8m46s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "seaweedfs-csi-driver" or manually created by system administrator
Normal Provisioning 15s (x10 over 8m46s) seaweedfs-csi-driver_csi-seaweedfs-controller-0_7da00a20-3339-4cce-a620-44a28c9b6d7d External provisioner is provisioning volume for claim "usra-hpc/shared-seaweedfs"
Warning ProvisioningFailed 15s (x10 over 8m46s) seaweedfs-csi-driver_csi-seaweedfs-controller-0_7da00a20-3339-4cce-a620-44a28c9b6d7d failed to provision volume with StorageClass "seaweedfs-storage": rpc error: code = Unknown desc = Error setting bucket metadata: mkdir /buckets/pvc-439d990a-501a-4801-99b8-d5163aedbdf8: CreateEntry: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.98.219.214:18888: connect: connection refused"
Then after a while of waiting it magically worked. Then the deployment failed to mount again, then after a while that too managed to work...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> Successfully assigned usra-hpc/filebench-seaweedfs-85cf7589b5-6f4pd to suncave-11
Normal SuccessfulAttachVolume 7m39s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-439d990a-501a-4801-99b8-d5163aedbdf8"
Warning FailedMount 3m45s (x9 over 7m13s) kubelet, suncave-11 MountVolume.SetUp failed for volume "pvc-439d990a-501a-4801-99b8-d5163aedbdf8" : rpc error: code = Internal desc = Timeout waiting for mount
Warning FailedMount 3m19s (x2 over 5m36s) kubelet, suncave-11 Unable to attach or mount volumes: unmounted volumes=[shared-seaweedfs, unattached volumes=[shared-seaweedfs default-token-nqkfj: timed out waiting for the condition
Normal Pulling 101s kubelet, suncave-11 Pulling image "localhost:30081/parkeraddison/kube-openmpi-ior"
Normal Pulled 100s kubelet, suncave-11 Successfully pulled image "localhost:30081/parkeraddison/kube-openmpi-ior" in 1.209529308s
Normal Created 100s kubelet, suncave-11 Created container filebench-seaweedfs
Normal Started 100s kubelet, suncave-11 Started container filebench-seaweedfs
Both times I eventually ran into an issue where performing any filesystem operation (e.g. ls) would hang. It seemed that sometimes these operations would complete after a while... sometimes I got impatient and killed the process, and that seemed to un-hang things. Really not sure what's going on there.
When doing my single node parameter sweep, I came across a bunch of helpful things to keep in mind for the future:
This is very useful for editing files inside a pod without needing to install vim on that pod. With the VS Code Kubernetes extension installed, we can:
- Command palette: View: Show Kubernetes
- Kubernetes cluster panel: nautilus > Workloads > Pods
- Right click the pod name: Attach Visual Studio Code
This'll open up a new window, take care of all the port forwarding, and allow you to open remote folders and files just as you would any other remote ssh host!
Only caution I've noticed so far: the integrated terminal doesn't handle text wrapping well. As always, I recommend using a separate terminal window in general.
Rather than using the Python parameter_sweep script, I just whipped up a very tiny amount of code to populate an IOR script, then ran that via ior -f path/to/script. This is similar to how it's done in Glenn Lockwood's TOKIO-ABC. Using IOR scripts is the way to go: multiple tests with different parameters can be defined at once in a portable file and shared between systems without a Python dependency.
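As a rough sketch (not the exact file I used -- the path and values here are placeholders), a multi-test script looks something like this, with options carrying over between RUN statements:
IOR START
    api=POSIX
    testFile=/shared/ior-testfile
    blockSize=16m
    segmentCount=16
    repetitions=3
    transferSize=1m
RUN
    transferSize=4m
RUN
IOR STOP
Running it is then just mpiexec -np <procs> ior -f path/to/script.ior.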
On NAS at the moment I still need to use the Python orchestration in order to set the Lustre stripe sizes/counts before each run... at least until the IOR Lustre options are fixed.
- I'm curious... does my previous Lustre striping workaround still work when there are multiple repetitions? I don't recall checking... Yes, it works :) And it works for multiple tests too, so I actually don't need to use the Python script at all, as long as I remember to manually set the stripe count before running the test.
Also latency output (json) is measured in seconds.
"Run using IOR scripts; PRP ceph and seaweed" | HEAD -> main | 2021-04-28
I am going to be showing the NAS and Ceph findings so far to the NAS APP group in an attempt to figure out what's going on with the drop in performance at 4mb transfer size on NAS, and to ask about the hardware/software stack at NAS, Lustre monitoring, etc.
So, I'm re-running a bunch of the parameter sweeps on NAS (and PRP) to make sure my results are consistent. At the same time, I'd like to experiment with I/O hints. This should be useful: http://www.idris.fr/media/docs/docu/idris/idris_patc_hints_proj.pdf, and https://github.com/hpc/ior/blob/main/doc/USER_GUIDE#L649. I was able to use a hints file that looks like this:
# File: hints.ior
IOR_HINT__MPI__romio_cb_write=enable
IOR_HINT__MPI__romio_cb_read=enable
Coupled with hintsFileName (-U) set to the path to that file, and showHints (-H), it worked! Now let's do some parameter sweeps and see if it actually makes a difference.
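For reference, the invocation I'd expect to work looks roughly like this (test sizes and process count are placeholders; -a selects the API, -U points at the hints file, -H prints the hints in use):
mpiexec -np 16 ior -a MPIIO -t 1m -b 16m -s 16 -U /path/to/hints.ior -H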
==NOTE== The collective option in IOR causes a massive drop in performance -- bandwidth on the order of single mebibytes per second.
Hogging memory on the node (-M) seems to affect the blockSize performance, as Mahmoud suggested. Trying to run on /nobackupp2 with 85% memory hogging leads to an out-of-memory error at some point when testing the 1.5Gi block size... not sure why this didn't happen when testing on /nobackupp12 -- the requested compute nodes were the same.
Transfer size exhibited no drop-off when memory hogging was used; read performance was pretty level at around 200 MiB/s, and write performance was consistently greater than read.
I'd like to run a read test on an existing file that is for sure not in the Lustre OSS cache.
These links are useful.
- https://www.nas.nasa.gov/hecc/support/kb/pleiades-lustre-filesystems_225.html
- https://www.nas.nasa.gov/hecc/support/kb/pleiades-configuration-details_77.html
- https://www.nas.nasa.gov/hecc/support/kb/sandy-bridge-processors_301.html
However, the hardware of the OSTs is not discussed -- and that's where we'd find out the theoretical performance of our filesystems by looking at the OST drive performance and networking overhead.
I created some files the other day, and I'm now trying to do read-only IOR tests on these existing files. On my small-scale tests it seems to have worked -- I get much lower read bandwidth.
Here are the useful things to know about such a test:
- keepFile (-k) absolutely must be enabled (otherwise the data will be deleted after the test, meaning you'll need to create a new file and wait a while again -- whoops)
- It is still important to use memory hogging (-M %) for multiple tests, otherwise the read file will be in the local cache.
  - 85% seemed to work well. I wouldn't be surprised if going too high risks crashing the node due to OOM, however (just like what happened on an nbp2 test earlier).
- We can read just a portion of the file just fine; a warning will show: WARNING: Expected aggregate file size = 1073741824. WARNING: Stat() of aggregate file size = 17179869184. WARNING: Using actual aggregate bytes moved = 1073741824.
Probably a good idea to just create a handful of very large files (to support our largest multi-node tests) and keep those lying around.
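Putting that together, a no-cache read test over an existing file looks roughly like this (path, sizes, and process count are placeholders; -r makes it read-only, and I believe -E / useExistingTestFile keeps IOR from touching the file beforehand):
mpiexec -np 16 ior -r -k -E -M 85% -t 1m -b 4g -o /path/to/existing-testfile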
We're interested in exploring our hypothesis that performance drops at certain transfer sizes are related to the Lustre stripe sizes.
- We can pass a script into IOR via stdin like so:
cat | ior -f /dev/stdin << EOF
IOR START
# rest of script
RUN
IOR STOP
EOF
- Warnings are printed when they occur, so using summaryFormat=JSON without a corresponding summaryFile will produce invalid JSON on stdout if anything else gets logged there. (See the snippet below.)
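A hedged workaround is to send the JSON summary to its own file so stray warnings on stdout can't corrupt it -- I believe both options can be set with -O on the command line (or equivalently inside the script); paths are placeholders:
ior -f /path/to/script.ior -O summaryFormat=JSON -O summaryFile=/path/to/results.json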
Snippet for plotting the write test until I can figure out a better way
for rn in df12.ReferenceNumber.unique(): quick_plot(df12[df12.ReferenceNumber==rn],'transferSize','bandwidth')
I'm suspecting that the real reason for transfer size performance drops has something to do with memory. I believe this because we're seeing the effect while reading from a file that has been cached in local memory. Observe that the read speeds are astronomical -- but only after the very first iteration for a file. Somehow I need to avoid local caching -- and I was using memory hogging at 85% but that wasn't enough.
I'm thinking of the following possible workarounds:
- Avoid repetitions in IOR itself -- too likely to re-read from cache
- Make the sweep round-robin style doing each parameter value for all files before moving on to the next, coupled with memory hogging to ensure only one file fits in memory
- Try to manually drop the file(s) from the memory cache
I tried the manual cache dropping. First I ran free -h to see my memory and cache usage, then read a 4Gi file with IOR and saw memory usage jump up. Sure enough, the next IOR read test had orders-of-magnitude better read bandwidth.
Some testing with memory hogging shows that it definitely lowers performance, but by no means does it prevent the caching effects wholly.
Then I tried running
dd of=FILE_name oflag=nocache conv=notrunc,fdatasync count=0
Looking at free confirmed that memory usage went down, and the next IOR run had similar performance to the very first run!
This is a helpful read to understand caches: http://arighi.blogspot.com/2007/04/how-to-bypass-buffer-cache-in-linux.html
Some more links related to avoiding/dropping the file cache:
- https://man7.org/linux/man-pages/man2/posix_fadvise.2.html -- the functionality in the Linux kernel
- https://github.com/lamby/python-fadvise -- a Python interface to posix_fadvise
- https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache -- if you want to clear the entire cache
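For completeness, the sledgehammer from that last link -- clearing the node's entire page cache -- looks like the following. It needs root, so it probably won't fly on NAS compute nodes, but it may be an option inside a privileged container:
sync
echo 3 > /proc/sys/vm/drop_caches    # 1 = page cache, 2 = dentries/inodes, 3 = both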
I'm editing the values.yaml used by kube-openmpi to use YAML node anchors and aliases, which let me re-use keys: I write my desired resources and volumes once and share them across all workers.
Now that that's done, I've lowered the resources to within the allowed PRP settings for interactive pods -- 2 CPUs and 8 GB RAM -- and I'll run a multi-node test.
Yes! I created a nice alias, omexec, which takes care of the run-as-root and hostfile considerations, and now I can run it just fine.
I requested a limit of 2 CPUs, but that is a quota -- it does not mean the container cannot access the rest of the cores. So I can still execute mpiexec with more than 2 processes (assuming we want only 1 process per core). Setting -npernode adjusts how many processes per node we want.
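For the record, omexec is nothing fancy -- roughly something like the following (the hostfile path depends on where the kube-openmpi chart generates it, so treat that path as a guess; the IOR parameters are placeholders):
alias omexec='mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile'
# e.g. 8 IOR processes per node:
omexec -npernode 8 ior -t 1m -b 16m -s 16 -o /shared/testfile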
Useful resource:
- https://www.golinuxcloud.com/kubernetes-resources/#Understanding_resource_units
- https://www.golinuxcloud.com/kubernetes-resources/#How_pods_with_resource_limits_are_managed
I don't think I've run into the quota throttling yet... and monitoring the Grafana dashboard shows I'm well within limits overall. I think Dima was explaining to someone in the support room that bursty behavior is fine, it just can't consistently exceed the limits.
There are a handful of ways to look at the CPU information; lscpu is nice. Notice that it claims we have access to 16 CPUs, but it's actually 8 cores with 2 threads per core. On a single node, if I run IOR with 8 procs, I get really fast performance (and I notice the same caching effect in the read speeds as before). When I try to run it with 16 procs the performance is significantly worse all around -- the two hardware threads on a core share execution resources rather than behaving like two independent cores.
So... the performance is pretty bad.
After downloading and untarring Darshan, I'm trying to ./configure the darshan-runtime. But I got an error that no zlib headers could be found. This can be fixed by installing zlib1g-dev -- the non-dev package will not do.
Then we can configure it. We'll need to pass --with-log-path and --with-jobid-env. The first is easy because I can set it to wherever I want to store logs. The latter I don't know: on NAS I knew that PBS was used, so I knew the environment variable; here, I'm trying to figure it out by running mpiexec env and seeing what variables are populated. I'll pass NONE for now, but it might be PMIX_ID or something like that... we'll see later when I do multi-node Darshan.
./configure --with-log-path=/shared/darshan-logs --with-jobid-env=NONE
Finally, make and make install both did the trick! Then follow it up with mkdir /shared/darshan-logs and darshan-mk-log-dirs.pl as noted in the documentation.
Now let's actually try to use it.
mpiexec -np 2 -x LD_PRELOAD=/usr/local/lib/libdarshan.so ior
since OpenMPI uses -x (instead of -env) to pass environment variables.
Welp, it didn't crash like it did on NAS. However, it was unable to create the Darshan log:
darshan_library_warning: unable to create log file /shared/darshan-logs/2021/5/18/root_ior_id5293_5-18-81444-18062306667005854292.darshan_partial.
My guess is permissions? Oh... it was pointing to the wrong path. For some reason I had changed the path and re-configured, but that error still came up even though darshan-config --log-path showed the right path. I simply created a soft link between the actual and expected paths and re-ran -- it worked! Let's peek at these logs, shall we?
I needed to install Python (odd, that wasn't listed in the requirements), and I'll need to install some other things to get graphical outputs, but for now the ./configure, make, and make install went fine, and I can get a textual description of a log by running:
darshan-parser <path/to/file>
Sweet! The file is well-documented and understandable.
Trying to get a test run of Darshan observing some ML application like image analysis.
Turns out, Darshan was working, but there are a few things to consider:
- The environment variable DARSHAN_ENABLE_NONMPI needs to be set (it can be empty)
- I think UTC is used, so sometimes you need to look at the next day of log data
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=/usr/local/lib/libdarshan.so python script.py
"A whole bunch of work on PRP; Want image registry" | HEAD -> main | 2021-05-24
I don't have Docker on the nasmac, so I'm trying to get the Nautilus GitLab container registry working.
"Update Darshan images" | HEAD -> main | 2021-05-24
"Fix file treated as command" | HEAD -> main | 2021-05-24
"Install python3" | HEAD -> main | 2021-05-24
"-y" | HEAD -> main | 2021-05-24
"Use multiple stages for multiple images" | HEAD -> main | 2021-05-24
"Prompt image build" | HEAD -> main | 2021-05-24
Alright, I've figured out the images, I have a deployment with PyTorch and Darshan running and I've copied over the flight anomaly code and data. Let's run it once to make sure it does indeed run.
python main_CCLP.py -e 1 -v 1
Ha! It does!
Okie dokes, now time to try monitoring it with Darshan.
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=/usr/local/lib/libdarshan.so python main_CCLP.py -e 1
Sweet, now to examine the Darshan logs.
I can create a human readable text dump of the log with darshan-parser
, but I should also have PyDarshan installed in this image, so let's try to use it! Hmm, trying to import darshan complained. When I install darshan-util I should ./configure it with --enable-pydarshan --enable-shared
.
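So a hedged re-install recipe for darshan-util with PyDarshan support looks roughly like this (configure flags per the note above; the prefix matches where I've been installing things, and the pip step is my assumption about how PyDarshan itself gets installed):
cd darshan-3.3.0/darshan-util
./configure --enable-pydarshan --enable-shared --prefix=/usr/local
make && make install
pip install darshan    # PyDarshan also lives in the pydarshan/ dir of the source tree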
Then I can read in a report as the following in a Python shell, and tell it to read in all records (POSIX, MPI, STDIO, etc):
import darshan
report = darshan.DarshanReport('filename',read_all=False)
report.read_all_generic_records()
Within the report there are multiple records. We can see what records we have with report.info(), access them through report.records['API'], and then run things like record.info(plot=True).
However, this relies on an implied IPython environment, since it uses display. I'll try installing Jupyter into this image. Seems to be working like a charm!
pip install jupyter # Pod
jupyter notebook
k port-forward podname 8888:8888 # Local
Oh my I'm always reminded just how much I absolutely love working in Jupyter notebooks :')
Great! This is awesome. I have the data and can play around with it.
Okay, now to do a run on multiple epochs. Also it's worth noting this is just the training process we're monitoring -- the preprocessing stage is entirely separate.
I'd be really interested in seeing how much of the total runtime was spent waiting for I/O.
Looks like I'll want to use their experimental aggregators: https://www.mcs.anl.gov/research/projects/darshan/docs/pydarshan/api/pydarshan/darshan.experimental.aggregators.html. They don't return plots (so actually I guess we don't need jupyter, but I'm still going to use it), so we'll want to write some plotting code to visualize.
darshan.enable_experimental(True)
# IO Size Histogram, given the API ('module')
report.mod_agg_iohist('POSIX')
# Cumulative number of operations
report.agg_ioops()
It seems like I can basically call them all using .summarize(), then access the results with .summary:
report.summarize()
report.summary
Plotting a hist/bar of access sizes is easy enough. How about the timeline? Here are the plots I want to replicate: https://www.mcs.anl.gov/research/projects/darshan/docs/ssnyder_ior-hdf5_id3655016_9-23-29011-12333993518351519212_1.darshan.pdf
"Fix PyDarshan installation" | HEAD -> main | 2021-05-25
Here's the current WIP Python plotting implementation: https://github.com/darshan-hpc/darshan/blob/1ade6cc05c86b2bcab887bf8db96a24f920f6954/darshan-util/pydarshan/darshan/cli/summary.py
I know it feels a bit late in the process to do this -- but it's really about time that I actually do real research and consult more papers.
In an effort to follow https://dl.acm.org/doi/pdf/10.1145/3337821.3337902 (IO analysis of BeeGFS for deep learning) I am trying to set up similar conditions on the PRP and run the same experiments.
"Fix bash alias" | HEAD -> main | 2021-06-01
Again. Let's figure this out.
wget -O- ...3.3.0 | tar zxf -
cd darshan-3.3.0/darshan-runtime
module load mpi-hpe
./configure --with-log-path=$HOME/darshan-logs --with-jobid-env=PBS_JOBID --prefix=$HOME/usr/local
make
make install
mkdir -p ~/darshan-logs
chmod +x darshan-mk-log-dirs.pl
./darshan-mk-log-dirs.pl
So far that worked without any issue. We can install the util the same way.
cd ../darshan-util
./configure --prefix=$HOME/usr/local
make
make install
That worked too. Time to launch a compute node and test if I can get it to monitor without crashing.
I should be able to use
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=$HOME/usr/local/lib/libdarshan.so <my_command>
At first I got:
/bin/sh: error while loading shared libraries: libmpi.so: cannot open shared object file: No such file or directory
but that's just complaining that I didn't do module load mpi-hpe.
Then I ran it again. I was initially scared because the test script I wrote just reads and writes 4k random bytes a few times (very quick), but when I ran it with Darshan nothing showed up -- I assumed it had crashed again.
It hadn't. Rather, it just took ages to do each operation -- why? I thought Darshan had low overhead?
Ah... it generated a separate summary for every individual operation -- so in this case I have 5 files in the directory now: one for the sh invocation (the script itself), then three dd and one ls. Interesting.
Good news: darshan-parser works on the files! So, ultimately, Darshan is working on NAS!
The run took 75 seconds total, when typically it takes only a fraction of a second. Fortunately, each operation is still logged as taking only a fraction of a second, so Darshan didn't distort the operations themselves -- the extra time was purely overhead.
While we wait for real NASA data-intensive ML apps to become available, we can run Darshan on other ML models or artificial pipelines made to mimic the app (e.g. using IOR).
When using IOR, Darshan creates a separate log for every run -- one per operation per iteration, and in fact one per MPI process invocation of IOR. mpiexec gets a log too (somehow there are three of those). I think that in order to see the whole thing I need to read all the logs that share the same ID (see the loop below).
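Following that same-ID hunch, something like this should dump every log from one run (the date directory and ID are placeholders patterned after the log filename seen earlier):
RUNID=5293    # placeholder -- whatever id the logs of interest share
for f in /shared/darshan-logs/2021/5/18/*_id${RUNID}_*.darshan*; do
    echo "=== $f"
    darshan-parser "$f"
done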
The GOES data is on NAS. The MTBS data can be downloaded online.
A conda environment can be made with all requirements. Start by making an empty environment with conda create -n geo, then conda install -c conda-forge gdal, then cartopy, xarray, rioxarray.
- https://gdal.org/download.html#conda
- https://scitools.org.uk/cartopy/docs/latest/installing.html#conda-pre-built-binaries
The Jupyter notebook provided to me doesn't need to import gdal (it's used as a command-line utility), nor the from utils import (not used in the file).
All other imports in the notebook work except for geopandas, which has not been installed yet. There is a conflict when I try to install geopandas... let's wait a long time and find out why. I need geopandas just to import the MTBS shapefiles. Maybe I should have installed geopandas before cartopy and the rest...
I made a new environment with conda create -n ... gdal geopandas cartopy (I did not specify the Python version; let conda figure that out) and it made one with Python 3.9.5. Then xarray, then -c conda-forge rioxarray.
Thought that worked... but then importing anything which imports numpy results in
File "/nobackup/paddison/.conda/envs/geofire/lib/python3.9/site-packages/numpy/__init__.py", line 148, in <module>
from . import lib
File "/nobackup/paddison/.conda/envs/geofire/lib/python3.9/site-packages/numpy/lib/__init__.py", line 44, in <module>
__all__ += type_check.__all__
NameError: name 'type_check' is not defined
Fixed it by explicitly installing numpy as well.
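In hindsight, solving for the whole environment in one go would probably have avoided both the geopandas conflict and the numpy breakage -- a hedged one-liner (package list taken from the imports above, Python version from what conda picked):
conda create -n geofire -c conda-forge python=3.9 gdal geopandas cartopy xarray rioxarray numpy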
I added some multiprocessing to the code. Good.
When trying to profile it with Darshan, though, I ran into a symbol error for llapi_layout_get_by_xattr. Looks like https://github.com/darshan-hpc/darshan/issues/399. I never ran into this issue on NAS before, but oh well. A fix has been merged, so I'm installing the latest Darshan from the repo to try to get things working.
Seemed to work on a test Python invocation. Let's run the preprocessing now!
Damn. I interrupted the program, and I think in doing so I kept it from terminating cleanly. Let me run it again on just the first handful of samples :/
Sheesh. It didn't work that time either. Darshan is giving me an 'objdump' log rather than a python log... probably because the program didn't fully terminate properly. I'm trying one more time just going through the data on a single specified date. I'm going to run into the walltime though before this run completes! Noooooooooooo, lol
Hmmmm... so this time the Python ran through all of the specified data just fine, and Darshan produced both an objdump log and a Python log, but the Python log is "incomplete" when viewed with darshan-parser. That's fine -- just use --show-incomplete. It looks like it worked! Later today I'll load up the results in Python and plot them.
When I looked at the timings in the Darshan logs using pydarshan, it looked like I/O contributed less than a percent of the runtime. The multiprocessing wasn't actually working correctly, so the compute time should be substantially quicker, but still it seemed odd that I/O took so little time when the files are actually quite large and plentiful.
To ensure Darshan is showing me the right things, I wrote a series of Python scripts which performed different read and write access patterns, such as reading 1GB chunks of a large file and writing to a new file, or writing random bytes.
I used cProfile to compare the timings seen by Python natively against those seen by Darshan:
python -m cProfile -o out.profile script.py
import pstats
p = pstats.Stats('out.profile')
p.print_stats()
They matched up just fine. In a first test I was skeptical when Darshan showed my write-file test as being over 50% computation, but using cProfile I figured out that my os.urandom generation was indeed taking about half the time!
Over the last week I've been working with a DELTA training pipeline which will soon get some data and a model specification to train a flood detection algorithm. I was able to make a geoflood conda environment on top of geofire.
(Side note: I realized that we have flight, fire, and flood algorithms. That alliteration is kinda cool!)
In running the Landsat-8 example training and attempting to monitor it with Darshan, all seemed well. However, the logs showed up as incomplete, and attempting to analyze them with pydarshan resulted in a segfault and core dump...
# *ERROR*: The POSIX module contains incomplete data!
# This happens when a module runs out of
# memory to store new record data.
# To avoid this error, consult the darshan-runtime
# documentation and consider setting the
# DARSHAN_EXCLUDE_DIRS environment variable to prevent
# Darshan from instrumenting unecessary files.
# You can display the (incomplete) data that is
# present in this log using the --show-incomplete
# option to darshan-parser.
I've got to figure this out before moving forwards.
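The error message itself points at DARSHAN_EXCLUDE_DIRS, so a hedged first attempt is to exclude the conda environment (thousands of tiny Python files) from instrumentation; the darshan-runtime docs also mention a DARSHAN_MODMEM setting for record memory that may be worth checking. The training command below is just a placeholder for the actual DELTA invocation:
env DARSHAN_ENABLE_NONMPI= \
    DARSHAN_EXCLUDE_DIRS=/nobackup/paddison/.conda \
    LD_PRELOAD=$HOME/usr/local/lib/libdarshan.so \
    python train.py    # placeholder for the DELTA training command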
"Add profiling work" | HEAD -> main | 2021-07-01