Notes - parkeraddison/nasa-filesystem-benchmarks GitHub Wiki
Table of contents
- IOR
- IO-500
- Custom image for IO500 dependencies
- FIO benchmark
- Understanding IOR output
- Pleiades PBS Hello World
- Running IOR in a PBS job
- Nautilus namespace and IOR units
- Re-trying IOR on Pleiades
- Automated parameter sweeps
- Parsing and graphing IOR outputs
- Values for initial parameter sweeps
- FIO on NAS
- Darshan on NAS
- What MPI module to use
- Conducting initial parameter sweeps
- Initial sweeps on /nobackupp12
- MPI on PRP
- PRP SeaweedFS
- Single node CephFS parameter sweep
- IO Hints
- Trying out nbp2 and memory hogging
- Better understanding of the filesystem hardware
- No-cache read performance
- Designing stripe tests
- Some useful things to know about IOR
- The real reason for transfer size performance drops
- A quick multi-node test on PRP
- Darshan on PRP
- Darshan to observe an ML application
- Replicating Chowdhury et al IO Evaluation of BeeGFS for Deep Learning
- Darshan on NAS
- Pseudo pipeline to observe with Darshan
- Fire detection setup
- Validating Darshan outputs
- Flood detection profiling
First deploy the volume and a pod.
k create -f volumes/block.yml
k create -f minimal-deploy.yml
These commands are run in the pod.
Dependencies. See: https://github.com/hpc/ior/blob/main/testing/docker/ubuntu16.04/Dockerfile
apt-get update
apt-get install -y libopenmpi-dev openmpi-bin mpich git pkg-config gcc vim less curl wget
apt-get install -y sudo
Downloading
wget -qO- https://github.com/hpc/ior/releases/download/3.3.0/ior-3.3.0.tar.gz | tar -zxv
Configuration. See `./configure --help`.
./configure
Installation
make
See: https://ior.readthedocs.io/en/latest/userDoc/tutorial.html
cd src
./ior ...
or
mpirun ...
Not sure how to really use it yet.
When I run `ior` with no arguments it seems to run a tiny test instantly.
When I tried doing
mpirun -n 64 ./ior -t 1m -b 16m -s 16
I got a ton of:
ior ERROR: open64("testFile", 66, 0664) failed, errno 13, Permission denied (aiori-POSIX.c:412)
...
[filebench-574869c787-pdn62:07749] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2193
...
Also note that I ran `useradd testu` and `su testu` because mpirun doesn't want to be run as the root user. But this user has no permissions! I think that's the issue.
It seems like a `chmod -R 777 .` as root fixed this!
For example, run 10 tasks with a transfer size of 1m (which IOR interprets as 1 MiB, per the output below), a block size of 16m, and a segment count of 16:
mpirun -n 10 ./src/ior -t 1m -b 16m -s 16
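As a quick sanity check on how those three parameters combine (my own arithmetic; the aggregate filesize reported in the output below agrees):

```bash
# aggregate file size = tasks * block size * segments
echo "$((10 * 16 * 16)) MiB"   # 2560 MiB = 2.50 GiB
```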
Output:
IOR-3.3.0: MPI Coordinated Test of Parallel I/O
Began : Mon Mar 8 23:08:17 2021
Command line : ./src/ior -t 1m -b 16m -s 16
Machine : Linux filebench-574869c787-pdn62
TestID : 0
StartTime : Mon Mar 8 23:08:17 2021
Path : /storage/ior-3.3.0
FS : 8.0 GiB Used FS: 0.5% Inodes: 4.0 Mi Used Inodes: 0.0%
Options:
api : POSIX
apiVersion :
test filename : testFile
access : single-shared-file
type : independent
segments : 16
ordering in a file : sequential
ordering inter file : no tasks offsets
nodes : 1
tasks : 10
clients per node : 10
repetitions : 1
xfersize : 1 MiB
blocksize : 16 MiB
aggregate filesize : 2.50 GiB
Results:
access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
write 678.06 678.07 0.118931 16384 1024.00 0.770489 3.78 3.59 3.78 0
read 4233 4234 0.019146 16384 1024.00 0.000035 0.604695 0.298354 0.604706 0
remove - - - - - - - - 3.20 0
Max Write: 678.06 MiB/sec (711.00 MB/sec)
Max Read: 4233.46 MiB/sec (4439.11 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 678.06 678.06 678.06 0.00 678.06 678.06 678.06 0.00 3.77548 NA NA 0 10 10 1 0 0 1 0 0 16 16777216 1048576 2560.0 POSIX 0
read 4233.46 4233.46 4233.46 0.00 4233.46 4233.46 4233.46 0.00 0.60471 NA NA 0 10 10 1 0 0 1 0 0 16 16777216 1048576 2560.0 POSIX 0
Finished : Mon Mar 8 23:08:25 2021
The hpc/ior:ubuntu16.04 image (built locally and pushed to Docker Hub as parkeraddison/ior:ubuntu16.04) almost passes `./prepare.sh` for the io500 repo -- it just needs to `apt-get install -y autoconf`.
Set up permissions
groupadd stor
chgrp -R stor /storage
chmod -R g+rwx /storage
useradd usr -G stor
su usr
mpiexec -np 2 ./io500 config-minimal.ini
Output:
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/LJJ3QEF6WLMS4VWPVV2XKL6JYS:/var/lib/docker/overlay2/l/24YRTXTAGXIULRRXZY4JB5WTGG:/var/lib/docker/overlay2/l/42MQ3LQTYO2IUBUARVOT7IPDQP:/var/lib/docker/overlay2/l/YGOPACSWOGMEMKHTJYCM6UU3FH:/var/lib/docker/overlay2/l/KPZVRXHJW6K2V5FL24TRWFWO6B:/var/lib/docker/overlay2/l/HX22FHOBPYU4GIEFU6V5JWP2FJ:/var/lib/docker/overlay2/l/GJP2A7A4T3XQYPZHHNZR3LC76R:/var/lib/docker/overlay2/l/TZLAGOYFXJHZETSVMY4KIDZ543:/var/lib/docker/overlay2/l/4NTW7PG2N53XK'
IO500 version io500-sc20_v3-6-gd25ea80d54c7
ERROR: write(12, 0x225c000, 2097152) failed, (aiori-POSIX.c:563)
Oof.
Looks like: https://stackoverflow.com/questions/46138549/docker-openmpi-and-unexpected-end-of-proc-mounts-line
I'm trying a flattened image now (and including autoconf). https://tuhrig.de/flatten-a-docker-container-or-image/
Still getting
./ior: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
Seems like a dependency problem -- openmpi3 is needed. Tried the centos7 image and the same thing happened. Note that to use mpirun in CentOS you need to first run `module load mpi`. This image seems to work (locally). For some reason it caused an error when I tried to deploy on PRP. May try again.
Repo: https://github.com/joshuarobinson/docker_ior_mpi
I created a custom image to hold the ior/io500 dependencies so I'll have finer control over it. Then I went ahead and edited the `./ior --list > config-all.ini` output to disable all but the two easy IOR tests. I've put this into `config.ini`.
I also changed the transfer and block size to very small values (proof of concept). I believe in the past when I was trying to run it I was using the defaults (very large values!).
bash-4.3$ mpiexec -np 2 ./io500 config.ini
ERROR INVALID (src/phase_dbg.c)stonewall-time != 300s
IO500 version io500-sc20_v3-6-gd25ea80d54c7
[RESULT-invalid] ior-easy-write 0.650314 GiB/s : time 0.008 seconds
[RESULT-invalid] mdtest-easy-write 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] ior-hard-write 0.000000 GiB/s : time 0.000 seconds
[RESULT-invalid] mdtest-hard-write 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] find 0.000000 kIOPS : time 0.000 seconds
[RESULT] ior-easy-read 3.403407 GiB/s : time 0.003 seconds
[RESULT-invalid] mdtest-easy-stat 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] ior-hard-read 0.000000 GiB/s : time 0.000 seconds
[RESULT-invalid] mdtest-hard-stat 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] mdtest-easy-delete 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] mdtest-hard-read 0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid] mdtest-hard-delete 0.000000 kIOPS : time 0.000 seconds
[SCORE-invalid] Bandwidth 0.000000 GiB/s : IOPS 0.000000 kiops : TOTAL 0.000000
The result files are stored in the directory: ./results/2021.03.14-23.26.49
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 150 RUNNING AT filebench-78c6c98d98-nrdlr
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Despite the warnings, I'm pretty sure it actually worked.
Was able to get FIO running by downloading the tar.gz from https://github.com/axboe/fio, installing the dependencies here (Alpine), then running `make` (ignoring a warning) and `make install`.
Finally, I created a simple job file as `write.fio`, then ran `fio write.fio`.
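For reference, a job file consistent with the job summary in the output below would look roughly like this -- a reconstruction, not necessarily the exact `write.fio` I used:

```bash
# Hypothetical write.fio: random 4k writes with the psync engine, iodepth 1,
# against a single 128 MiB file (matching the fio output below)
cat > write.fio <<'EOF'
[job1]
rw=randwrite
bs=4k
ioengine=psync
iodepth=1
size=128m
EOF
fio write.fio
```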
Output:
bash-4.3$ fio write.fio
job1: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.26
Starting 1 process
job1: Laying out IO file (1 file / 128MiB)
job1: (groupid=0, jobs=1): err= 0: pid=4189: Sun Mar 14 23:51:53 2021
write: IOPS=273k, BW=1067MiB/s (1118MB/s)(128MiB/120msec); 0 zone resets
clat (nsec): min=1130, max=190684, avg=3212.52, stdev=8403.44
lat (nsec): min=1200, max=190734, avg=3271.35, stdev=8403.92
clat percentiles (nsec):
| 1.00th=[ 1304], 5.00th=[ 1352], 10.00th=[ 1384], 20.00th=[ 1464],
| 30.00th=[ 1544], 40.00th=[ 1624], 50.00th=[ 1688], 60.00th=[ 1768],
| 70.00th=[ 1896], 80.00th=[ 2096], 90.00th=[ 2512], 95.00th=[ 3280],
| 99.00th=[55552], 99.50th=[58624], 99.90th=[77312], 99.95th=[84480],
| 99.99th=[91648]
lat (usec) : 2=76.55%, 4=19.21%, 10=0.49%, 20=1.31%, 50=0.86%
lat (usec) : 100=1.58%, 250=0.01%
cpu : usr=10.08%, sys=89.92%, ctx=57, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,32768,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1067MiB/s (1118MB/s), 1067MiB/s-1067MiB/s (1118MB/s-1118MB/s), io=128MiB (134MB), run=120-120msec
Disk stats (read/write):
rbd2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
It looks like FIO is pretty popular! There are lots of repositories that have tools to work with FIO (both inputs and outputs).
Here are some repos which chart fio outputs or do other helpful things with FIO. They may all help get a better idea of how to understand the outputs and how to set up useful jobs!
- https://github.com/khailey/fio_scripts (example output)
- https://github.com/louwrentius/fio-plot
- https://github.com/wallnerryan/fio-tools
- https://github.com/intel/fiovisualizer
- https://github.com/xridge/fio-docker
- https://github.com/pcuzner/fio-tools
- https://github.com/javigon/fio_tests
- https://github.com/meganerd/fio-examples
- https://github.com/jan--f/fio_graphs
- https://github.com/amefs/fio-bench
- https://github.com/mcgrof/fio-tests
- https://github.com/mchad1/fio-parser
- https://github.com/storpool/fio-tests
- https://github.com/perftool-incubator/bench-fio
This is probably a good search: https://github.com/search?q=fio+benchmark&type=Repositories
"Exploration of IOR and FIO benchmarks; Noteful wiki" | HEAD -> main | 2021-03-14
Time to figure out how to start making sense of and plotting the outputs. That way I can make sure that IO500 and/or FIO are good choices to pursue.
Once that's done, we can start to figure out how to run this on Pleiades. Henry mentioned that a Python virtualenv would be one way to get specific software (I think one of the repos above is a Python wrapper...). Some packages should already be available. Also, I'd expect that as an HPC environment lots of the software needed for these HPC filesystem benchmarks should already be present!
Some description of IOR output: https://gitlab.msu.edu/reyno392/good-practices-in-IO/blob/dfcff70e9b9e39f1199f918d1a4000f44bc1b384/benchmark/IOR/USER_GUIDE#L686
Looks like the charts seen in some of the papers I came across earlier (e.g. this one) were made using an I/O profiler "Darshan". I'm sure there must be a profiler used at NAS.
Seems like a hopeful reference: https://cug.org/5-publications/proceedings_attendee_lists/2007CD/S07_Proceedings/pages/Authors/Shan/Shan_slides.pdf.
The useful outputs of IOR are simply read and write bandwidth (reported in both MiB/s and MB/s) and operations per second.
The charts seen in papers and presentations, such as here, are the result of multiple runs of IOR with different parameters.
For example, useful charts may demonstrate how bandwidth changes as transfer size, effective file size per processor, or number of processors increases.
This is something I could (hopefully easily) whip up and have it be useful -- run a bunch of IOR tests on a parameter grid. The Lustre docs do this exact thing in their example, going from 1,2,4,8 processors.
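A sketch of what such a grid could look like as a plain shell loop (values are placeholders, modeled on the Lustre docs' 1/2/4/8-processor example):

```bash
# Sweep process count and transfer size with a fixed block size and segment
# count, saving each run's output for later parsing
for np in 1 2 4 8; do
  for xfer in 1m 4m 16m; do
    mpirun -n "$np" ./src/ior -t "$xfer" -b 16m -s 16 \
      | tee "ior_np${np}_t${xfer}.out"
  done
done
```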
I might try this out right now on Nautilus... let me go ahead and set up a slightly larger volume.
- Talk with John/Dima about how large of a volume and how many pods I can set up for future benchmarking on Nautilus
Things are starting to make more sense and work more consistently with IOR and FIO runs in Nautilus.
One thing I'm not fully sure about is the importance of specifying a file in IOR, or how to use that option... For instance, if I create a file of random bytes like seen here, is there any point to using that as an existing file to read from? Ah... perhaps there is a point. I could create multiple small files or one very large file... this coupled with filePerProc... maybe that's the point.
"Minimal IOR test script; Repo organization" | HEAD -> main | 2021-03-16
When I spoke with John he was interested in the IO500 leaderboard and FIO benchmark. He mentioned a few cool things:
- I can create a namespace to run the benchmarks on Nautilus
- They have used FIO a lot before!
- I should talk to Igor about the benchmarks/IO500
It's about time that I run a job on Pleiades! Then I'll try to run a minimal IOR and FIO run.
Alright, let's give this a go.
Log in to the enclave (secure front-end), then log in to a Pleiades front-end
ssh sfe
ssh pfe
Explanation of PBS on the HECC knowledge base: https://www.nas.nasa.gov/hecc/support/kb/portable-batch-system-(pbs)-overview_126.html
Batch jobs run on compute nodes, not the front-end nodes. A PBS scheduler allocates blocks of compute nodes to jobs to provide exclusive access. You will submit batch jobs to run on one or more compute nodes using the qsub command from an interactive session on one of Pleiades front-end systems (PFEs).
Normal batch jobs are typically run by submitting a script. A "jobid" is assigned after submission. When the resources you request become available, your job will execute on the compute nodes. When the job is complete, the PBS standard output and standard error of the job will be returned in files available to you.
When porting job submission scripts from systems outside of the NAS environment or between the supercomputers, be careful to make changes to your existing scripts to make them work properly on these systems.
A job is submitted to PBS using `qsub`. Typing `man qsub` gives a nice description of the expected job script format and capabilities. Here are some useful parts:
- The script can run Python, Sh, Csh, Batch, Perl
- A script consists of: 1) An optional shell specification, 2) PBS directives, 3) User tasks, programs, commands, applications, 4) Comments
- A shebang can be used to specify the shell, or the `-S` command line option can be used
  - E.g. Python can be used by having the first line of the script as `#!/usr/bin/python3`
These directives are needed in a job script, and are written as `#PBS`-prefixed lines at the top of the script file, or can be passed in as arguments to the `qsub` command. It's probably best to include them in the script though! With that said, the shell could be specified with `#PBS -S`, too.
Common directives can be found here: https://www.nas.nasa.gov/hecc/support/kb/commonly-used-qsub-command-options_175.html. And other directives (options) can be seen with `man qsub`.
Here's a basic script seen in the man pages, but I modified 'print' to 'echo' instead to avoid an invalid command!
#!/bin/sh
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -N HelloJob
echo "Hello"
The script will be executed using the shell based on the first-line shebang. The `PBS -l` directive specifies resources. It asks for 1 'chunk' of resources with 1 cpu and 1 gb of memory. Here is also where we could specify the specific compute nodes we want (model=), the number of mpi processes we want (mpiprocs=), and the filesystem (?). See `man pbs_resources`. Finally, the `PBS -N` directive specifies the job name.
Let's try running it!
qsub hello-job.sh
Alright, it was rejected because the node model was not specified. I'll specify Pleiades Sandy Bridge with `model=san` in the resource line.
Also worth noting that there is a Pleiades development queue that I think this work would fall under (testing the commands that is, not the final benchmarks!).
- I should ask Henry about the billing and mission shares.
I just added `-q devel` and `-l model=san` to the script, trying again.
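So the revised `hello-job.sh` presumably now looks like this (with the model added to the select line, per the rejection message):

```bash
#!/bin/sh
#PBS -q devel
#PBS -l select=1:ncpus=1:mem=1gb:model=san
#PBS -N HelloJob
echo "Hello"
```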
qsub hello-job.sh
Output: 10791518.pbspl1.nas.nasa.gov
Running `qstat -u paddison` lists the jobs I've submitted. This is a pretty quick job on a fast-turnaround queue, so it'll go by quickly. But three quick runs of that command showed the job in three different states. The fourth time running qstat the output was empty -- the job was complete.
qstat
paddison@pfe24:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- -------- --- --- ----- - ----- ---
10791518.pbspl1 paddison devel HelloJob 1 1 02:00 Q 00:00 --
paddison@pfe24:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- -------- --- --- ----- - ----- ---
10791518.pbspl1 paddison devel HelloJob 1 1 02:00 R 00:00 50%
paddison@pfe24:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- -------- --- --- ----- - ----- ---
10791518.pbspl1 paddison devel HelloJob 1 1 02:00 E 00:00 50%
Two files are now present in the directory where I ran `qsub`.
HelloJob.o10791518
Job 10791518.pbspl1.nas.nasa.gov started on Sun Mar 21 20:12:29 PDT 2021
The job requested the following resources:
mem=1gb
ncpus=1
place=scatter:excl
walltime=02:00:00
PBS set the following environment variables:
FORT_BUFFERED = 1
TZ = PST8PDT
On *****:
Current directory is /home6/paddison
Hello
____________________________________________________________________
Job Resource Usage Summary for 10791518.pbspl1.nas.nasa.gov
CPU Time Used : 00:00:02
Real Memory Used : 2732kb
Walltime Used : 00:00:02
Exit Status : 0
Memory Requested : 1gb
Number of CPUs Requested : 1
Walltime Requested : 02:00:00
Execution Queue : devel
Charged To : *****
Job Stopped : Sun Mar 21 20:12:36 2021
____________________________________________________________________
The `e` file was empty. Here is that file from a previous run where an invalid command was used.
HelloJob.e10791410
/var/spool/pbs/mom_priv/jobs/10791410.pbspl1.nas.nasa.gov.SC: line 5: print: command not found
The job summary and output are shown in the `o` file, and it appears that stderr is shown in the `e` file.
Nice!
"Hello World PBS job run on Pleiades" | HEAD -> main | 2021-03-21
Let's get an IOR benchmark running as a PBS job.
This is going to involve:
- Ensure software dependencies exist... and learn how to load modules/packages
- Learn how to install software dependencies if need be!
- Download the IOR executable to /home(?) and try executing it in a PBS job
Software modules: https://www.nas.nasa.gov/hecc/support/kb/using-software-modules_115.html -- I'll probably need to `module load mpi...`.
Software directories: https://www.nas.nasa.gov/hecc/support/kb/software-directories_113.html -- since `/u/scicon/tools` is used by the APP group, I have a feeling a handful of dependencies will be there already. These should already be in PATH.
Also good to know that the pfe nodes can load these modules and it's fine to use them for quick testing/debugging! So I'll be able to test the minimal IOR (and work out all of the dependency, module load, etc. steps) before submitting a PBS job :) Never mind -- MPI jobs are not permitted on the pfe nodes. Still, I should be able to run the `./configure` script, which checks all dependencies.
Starting by downloading the IOR release from https://github.com/hpc/ior/releases/ to my pfe home directory.
wget -O- https://github.com/hpc/ior/releases/download/3.3.0/ior-3.3.0.tar.gz | tar zxf -
cd ior-3.3.0
Now we need to make sure that the necessary dependencies are loaded by running `./configure`.
Trying to run it results in
checking for mpicc... no
checking for mpixlc_r... no
...
configure: error: in `/home6/paddison/ior-3.3.0':
configure: error: MPI compiler requested, but could not use MPI.
Which I think I can fix by running a `module load mpi...`. First let's check what MPI modules are available using `module avail mpi`. Alright, I'll try `module load mpi-sgi`.
Let's try the configure script again.
Sweet! It worked fully this time! So we know that we'll need to ==module load mpi-sgi==.
Now I can run `make`. Seems to have worked fine.
I cannot run `make install` at the moment because I don't have permission to install the binary to `/usr/local/bin` -- but I can change the installation path when running `./configure`. Not necessary though, I can just run the binary from src directly.
Alright. I honestly think that all other dependencies are met. I suppose it's time to run a PBS job! I've written the following `minimal-ior.sh` file:
#!/bin/sh
#PBS -q devel
#PBS -l select=1:ncpus=8:mpiprocs=8:mem=2gb:model=san
#PBS -N MinimalIOR
module load mpi-sgi
cd "$PBS_O_WORKDIR/ior-3.3.0"
# Should write and read a total of 2 GiB (8 procs * 16 segments * 16 MiB blocks)
mpirun -np 8 ./src/ior -t 1m -b 16m -s 16
Let's try it out! Huh, it complained that I didn't specify the model!? Oh. It was because I had mistyped the comment on the shebang, so it probably didn't read any of the directives.
qsub minimal-ior.sh
Out: `10799119.pbspl1.nas.nasa.gov`, and running `qstat` shows us the job move from Queued, to Running, to Exiting.
qstat
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 Q 00:01 --
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 R 00:00 4%
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 E 00:01 4%
Unfortunately, the `e` file resulted in `/var/spool/pbs/mom_priv/jobs/10799119.pbspl1.nas.nasa.gov.SC: line 11: mpirun: command not found`. Looks like our module load didn't give us the mpirun command. Hmmmmm.
Sure enough, on the pfe I can see an mpiexec command, but no mpirun command. I seem to be able to access this command by ==module load mpi-hpcx==.
Let's try adding that module and run the job again.
Also, this is pretty handy: watch qstat -u paddison
Alright, this time we got: `mpirun: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory`. A quick look at Stack Exchange shows that this is an Intel math library. There are some `comp-intel` modules available, but a `module help comp-intel` shows only libfftw files... still, I will try it.
Ah! I can test `mpirun` (without any arguments, so it won't actually do anything) on a pfe; that way I can check if it complains about dependencies. Sure enough, it does complain about missing libimf. Fortunately, after a ==module load comp-intel== it no longer complains!
Let's try this in a PBS job again.
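Presumably the module line in `minimal-ior.sh` now just loads all three:

```bash
module load mpi-sgi mpi-hpcx comp-intel
```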
Output
MinimalIOR.e10799410
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node r327i7n6 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
MinimalIOR.o10799410
Job 10799410.pbspl1.nas.nasa.gov started on Mon Mar 22 15:52:46 PDT 2021
The job requested the following resources:
mem=2gb
ncpus=8
place=scatter:excl
walltime=02:00:00
PBS set the following environment variables:
FORT_BUFFERED = 1
TZ = PST8PDT
On *****:
Current directory is /home6/paddison
____________________________________________________________________
Job Resource Usage Summary for 10799410.pbspl1.nas.nasa.gov
CPU Time Used : 00:00:04
Real Memory Used : 2280kb
Walltime Used : 00:00:04
Exit Status : 139
Memory Requested : 2gb
Number of CPUs Requested : 8
Walltime Requested : 02:00:00
Execution Queue : devel
Charged To : *****
Job Stopped : Mon Mar 22 15:52:58 2021
____________________________________________________________________
Hmmm, so it didn't work fully, but it didn't not work at all at least :')
"Minimal IOR test almost capable of running on Pleiades. Faced segfault" | HEAD -> main | 2021-03-22
Just went ahead and created a `usra-hpc` namespace on Nautilus, and set up a larger volume and new deployment to test out IOR over there. I checked the file sizes and sure enough they're all mebibytes and whatnot. So I was correct before that a command of `mpirun -np 8 ior -t 1m -b 16m -s 16` does in fact produce an aggregate file size of 2 GiB -- actually, the IOR output says this pretty nicely!
Also worth noting that once `make install` is run (this is done already in the images I set up, e.g. parkeraddison/io500), wherever ior is run from serves as the filesystem under test -- so I merely need to navigate to `/storage` then run ior to test it on that volume.
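In other words, testing a mounted volume is just a matter of (paths as in my deployment):

```bash
cd /storage
mpirun -np 8 ior -t 1m -b 16m -s 16
```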
Finally (and most relevant right now), I did not see any segfault errors when I ran it on Nautilus. Let's try it again on Pleiades.
I'm going to modify the minimal script to simply run `ior` without any arguments -- this writes/reads only one mebibyte of data and is practically instant. It's truly minimal!
Same message as before.
Perhaps:
Or... the error output says "Per user-direction, the job has been aborted" -- this sounds like maybe the PBS job was aborted because it saw a non-zero exit code. Is there some way to specify that I don't want the job aborted?
To make figuring this out easier, we can run the PBS job interactively! This is basically like exec'ing into a compute node shell, in a k8s way of thinking about it. Running `qsub -I minimal-ior.sh` will request the resources by reading the PBS directives, then attach the terminal. I can run each line of the script manually.
After loading the mpi-sgi, mpi-hpcx, and comp-intel modules, here's what running ior shows:
PBS *****:~> cd ior-3.3.0/
PBS *****:~/ior-3.3.0> ./src/ior
[*****:20237:0:20237] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe5)
==== backtrace ====
0 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1d98c) [0x2aaabb7a498c]
1 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1dbfb) [0x2aaabb7a4bfb]
2 /nasa/hpcx/2.4.0_mt/ompi-mt-icc/lib/libmpi.so(MPI_Comm_rank+0) [0x2aaaab668e00]
3 ./src/ior() [0x40d58c]
4 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab935a35]
5 ./src/ior() [0x403209]
===================
Segmentation fault (core dumped)
PBS *****:~/ior-3.3.0/src> mpirun -n 1 ./ior
[*****:20556:0:20556] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe5)
==== backtrace ====
0 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1d98c) [0x2aaabb7a498c]
1 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1dbfb) [0x2aaabb7a4bfb]
2 /nasa/hpcx/2.4.0_mt/ompi-mt-icc/lib/libmpi.so(MPI_Comm_rank+0) [0x2aaaab668e00]
3 ./ior() [0x40d58c]
4 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab935a35]
5 ./ior() [0x403209]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ***** exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Looks like the abort happens due to mpirun -- but the truth of the matter is we are getting a segfault from IOR itself. Now we just need to figure out why!
"More minimal testing on Pleiades" | HEAD -> main | 2021-03-23
Sweet, over the past few days I've been unable to replicate the segfault -- in other words, IOR has been working fine as a PBS job!
Furthermore, I've gone ahead and cleaned up/commented the code and ran it on the Lustre (/nobackup) filesystem!
It's working great :) It's a super minimal example to just confirm it works. I'll run a slightly larger example parameter sweep as I finish that code.
"Working IOR on Pleiades NFS and Lustre" | HEAD -> main | 2021-03-27
There are a few things I've been working on:
- Code to run parameter sweeps
- Code to parse IOR outputs and graph them
- Better installation and setup of this repository (e.g. automate downloading IOR)
There are some examples online, such as in the Lustre docs, of running parameter sweeps in a shell script. That's fine, and I've worked on one... but I can't help but feel things would be a lot easier (and more readable) to just script it in Python with some `subprocess` calls and much easier iteration/logic flow.
What we can do that (I think) makes things easiest is to (1) use a shell script to submit the PBS job, load all dependencies, and load in the correct Python module then (2) call a Python script from within the PBS job which calls the IOR tests for the different parameter ranges.
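A sketch of that two-layer structure (the file names and the Python module name here are placeholders, not the exact ones in the repo):

```bash
#!/bin/sh
#PBS -q devel
#PBS -l select=1:ncpus=8:mpiprocs=8:model=san
#PBS -N IORSweep

# Hypothetical sweep-job.sh: the PBS wrapper loads dependencies, then hands the
# parameter iteration off to a Python script, which shells out to mpiexec/ior
# and writes one output file per run.
module load mpi-hpe comp-intel python3   # module names are assumptions
cd "$PBS_O_WORKDIR"
python3 sweep.py
```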
Right now I have some hard-coded values in the Python script itself for the parameter sweeps... not sure if it pays to make the script read from a configuration file.
To more easily validate and see the effect of the different parameters, it would help to have code that converts the IOR outputs into tables then graphs them!
"Facilitate running IOR with a parameter sweep" | HEAD -> main | 2021-03-30
Following https://www.nas.nasa.gov/hecc/support/kb/secure-setup-for-using-jupyter-notebook-on-nas-systems_622.html and https://www.nas.nasa.gov/hecc/support/kb/using-jupyter-notebook-for-machine-learning-development-on-nas-systems_576.html, I can do my visualization work in a Jupyter notebook running on a compute node.
When going through the setup steps, I used the pyt1_8 environment -- I'm not sure if 'pyt' stands for anything besides 'Python' or what the number denotes; I imagine the tf... environments are for TensorFlow. But regardless, I checked and pyt1_8 has Jupyter and Python 3.9, along with pandas, numpy, scipy, and matplotlib, so it'll work well!
Okay, yesterday I ended up giving up on NAS Jupyter because I kept running into SSL errors when trying to actually go to localhost and access the lab. After multiple attempts today, trying different environments, following all steps again, etc, I've realized the problem was Chrome -- after switching to Firefox to view Jupyter all is fine.
Also, now that I can finally use Jupyter for development, it's worth remembering the following helpful CSS rule to inject to add an 80ch ruler to the JupyterLab code editor:
.CodeMirror-line::after {
content: '';
position: absolute;
left: 88ex;
border-left: 1px dashed gray;
}
Here are the descriptions of the different NAS ML conda environments: https://www.nas.nasa.gov/hecc/support/kb/machine-learning-overview_572.html. Looks like 'pyt' stands for PyTorch (d'oh). It also looks like the /nasa `jupyterlab` environment doesn't have matplotlib. The machine learning environments do, however. So in the future I'll start up the lab from that environment. I could also go ahead and create my own virtual environment probably... but I really don't need to! Having PyTorch or TensorFlow is overkill, but that is fine by me ;)
Turns out IOR has a few different output formats -- including JSON and CSV which make life a lot easier -- I've been trying to parse the human-readable output but ran into some issues with whitespace delimiting. It looks like the JSON output is the best (in my opinion) since it's easy to access exactly what you need and it doesn't hide any information. Side note - I wonder if YAML will ever take over JSON's place in society...
Now that I'm using the JSON output from IOR, everything is much more straightforward when it comes to parsing. I polished up a file to parse and plot outputs. I think it's time now to actually do some larger-scale runs so we can make sure what we're getting as a result makes sense.
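For reference, asking IOR for the machine-readable summary and peeking at its structure looks roughly like this (a sketch; the exact JSON field names are easiest to confirm by just inspecting the file):

```bash
# Run a small test and write a JSON summary (summaryFormat/summaryFile are
# IOR's -O options for choosing the output format and destination)
mpiexec -np 8 ./src/ior -t 1m -b 16m -s 16 \
  -O summaryFormat=JSON -O summaryFile=results.json

# Pretty-print the JSON to see what's available before parsing it in pandas
python3 -m json.tool results.json | head -40
```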
"Output parsing and plotting functions complete" | HEAD -> main | 2021-04-06
Great news is (after some tweaking/bug squashing) the parameter sweep job is working like a charm. Furthermore, the output file was instantly able to be parsed and visualized with the functions written!
The current steps remaining are:
- Come up with some good parameter values to test
- Make the code a bit easier to use and adjust
"More robust parameter sweep code and pbs job script; Add sweep and visualization to readme" | HEAD -> main | 2021-04-07
For parameter sweeps, we have the following guidelines:
- We should do multiple iterations of each test for consistency's sake. We can change this depending on how long tests take, but for now maybe `-i 5` or so
- We're pretty interested in how each system scales with concurrency -- so sweeping the number of tasks is an important test
- We're interested in how ^ might change with different data size and different access patterns -- so we should test various file size (combination of block and segment size) and transfer size
- Key importance: to explore, to understand what bottlenecks we might be experiencing
Largely, this is exploratory -- the parameter values and sweeps aren't fixed by any means; rather, we should try some out and if we see something interesting then we should dive into it further. Fortunately, the tests (at the current scale I've tested) don't take too long (although now that multiple iterations are being run, expect them to take proportionally longer).
To come up with the initial values, though, I've been drawing inspiration mostly from some papers which have used IOR to evaluate HPC performance:
- Using IOR to Analyze the I/O performance for HPC Platforms by Hongzhang Shan, John Shalf
  - Slides: https://cug.org/5-publications/proceedings_attendee_lists/2007CD/S07_Proceedings/pages/Authors/Shan/Shan_slides.pdf
- Conducted user survey for typical IO access patterns at NERSC. Findings:
- Mostly sequential IO (rather than random)
- Mostly writes -- really? I would assume that most scientific projects are more write-once, read-many...
- Transfer size varies a lot -- "1KB to tens of MB"
- Typical IO patterns: one processor, one file per processor (both POSIX), MPI-IO single shared file
- File per process can lead to lots of files being written, especially if there are restarts. This doesn't scale well in terms of data management!
- Small transactions and random accesses lead to poor performance... but lots of poorly designed applications do this
- Important IOR parameters:
- API -- POSIX, MPI-IO, HDF5, or NetCDF
- ReadFile/WriteFile -- whether to measure read/write operations
- SegmentCount (s) -- number of 'datasets' in the file
- Each dataset is composed of NumTasks (N) blocks of BlockSize (b), read/written by the processor in chunks of TransferSize (t)
- To avoid caching, filesize per processor (= BlockSize) should be large enough to exhaust the memory buffers on each node. BlockSize was swept from 16MB to 8GB to see where caching effects (for read performance) were mitigated. In their words: "where the derivative of the performance was asymptotically zero"
- Curious, can't IOR's reorder option mitigate caching? We should test a block size sweep with and without reordering. This would only apply for tests on more than one node -- we're doing this so that we can trust the rest of the tests which only involve a single node.
- For this test, only one node was used and TransferSize was fixed at 2MB with one segment.
- TransferSize was swept from 1KiB to 256MiB (using a power of 4 in KiB) to get a sense of if the system is optimized for larger transfer size/the system overhead.
- Using the ideal parameters seen above, file-per-process versus shared file were both evaluated as NumTasks was swept from 8 to 256/1024 (depending on how many nodes were available on each system)
- On their systems, read and write performance were very similar.
- The theoretical peak IO bandwidth of each system was calculated/known before hand... for the Lustre system it was calculated as the number of DDN couplets times the bandwidth of each couplet
- What is the theoretical peak IO bandwidth on Pleiades?
- It's important to compare systems "on the basis of performance rather than raw performance" due to differences in scale
- The paper also explains the physical topology of the systems it tested -- while looking into that I stumbled upon ANL's CODES project for simulating the impact of different topologies... outside the scope of this project, but perhaps worth ==NOTE==ing
- We should see what performance a single node is capable of -- this'll let us measure speedup (fixed work per processor, as is default with IOR) and maybe also scaleup (if we adjust parameters to fix aggregate work done)
- Truthfully, a speedup chart would be more effective at comparing different systems than a shared plot of raw performance!
- I/O Performance on Cray XC30 by Zhengji Zhao, Doug Petesch, David Knaak, and Tina Declerck
Darshan is an IO profiler which intercepts IO calls to collect statistics which can be viewed on a timeline or summarized later -- things like bandwidth, IO size, etc. Basically, it's a way to get all of those useful measurements which a finished IOR/FIO run tells us but on any arbitrary mpirun jobs (including scientific application benchmarks)!
Useful video: https://www.youtube.com/watch?v=7cDoBusXK5Q; slides: https://pop-coe.eu/sites/default/files/pop_files/darshan_io_profiling_webinar.pdf
Definitely worth getting this to run on NAS -- even for IOR runs. The video mentions looking at how well the percentage of metadata IO scales, because that was a bottleneck they faced.
It came to my attention that I've found a lot of academic papers which reference IOR, but not a lot of widespread 'internet' popularity. FIO, however, is immensely popular in terms of internet points -- plenty of blog posts, technical pages (from Microsoft, Google, Oracle, etc)... I wonder if there are some HPC papers which reference FIO?
In order to run FIO on NAS, the release can be downloaded and unpacked like so:
wget -O- https://github.com/axboe/fio/archive/refs/tags/fio-3.26.tar.gz | tar zxf -
Before we `make`, however, we need a gcc version of at least 4.9:
module avail gcc
...
module load gcc/8.2
cd fio-fio-3.26/
make
The minimal job in `readwrite.fio` can be run with
path/to/fio path/to/readwrite.fio
Hmmmm, I came across Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer by Shawn Strande, Pietro Cicotti, et al. (and it's out of SDSC -- it's a small world after all ;) ). But I also think I came across the reason why I'm not seeing HPC papers that use FIO: I'm not so sure that FIO can do single-shared-file workloads, https://github.com/axboe/fio/issues/631. So it might be really easy to set up a job script and get baseline readings for your filesystems, but not when there are multiple nodes involved.
"FIO works on NAS" | HEAD -> main | 2021-04-09
To view some documentation PDFs and to prepare for viewing plots generated by Darshan, I went ahead and went through the (really easy!) process of setting up a VNC server/connection to a graphical interface. Following https://www.nas.nasa.gov/hecc/support/kb/vnc-a-faster-alternative-to-x11_257.html was straightforward, and boiled down to:
# On pfe
vncserver -localhost
# > "New desktop is at pfe:XX"
~C
-L 5900:localhost:59XX
# Connect to localhost:5900 with local VNC client
vncserver -kill :XX
I should be using `mpi-sgi/mpt` (or `mpi-hpe`?) rather than mpi-hpcx. This includes `mpicc`.
Trying to set up Darshan has proven a challenge! But, here's what I've come up with so far, trying to follow https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html:
- Download and untar
wget -O- ftp://ftp.mcs.anl.gov/pub/darshan/releases/darshan-3.2.1.tar.gz | tar zxf -
- Load in mpi-hpe for mpicc
module load mpi-hpe comp-intel
- Configure and make the darshan-runtime
cd darshan-runtime && ./configure --with-log-path=~/profiler/darshan-logs --with-jobid-env=PBS_JOBID CC=mpicc && make
Now is where I get stuck. I can't `make install` since I don't have write permissions to `/usr/local/lib`, but I can do something like `make install DESTDIR=~/` to install it to my home directory... I can even add `~/usr/local/bin` to my path. But what about the `lib` and `share` directories? How do I make sure those are accessible?
The reason I ask is that when I try to run an mpiexec that is monitored by Darshan, I face an error:
paddison@pfe20:~> LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior
mpiexec: symbol lookup error: /home6/paddison/usr/local/lib/libdarshan.so: undefined symbol: darshan_variance_reduce
I just tried `export LD_LIBRARY_PATH=~/usr/local/lib:$LD_LIBRARY_PATH` as well, to no avail.
To be honest, I've spent some time reading about libraries and linking, but I don't truly understand how it all works and what specifically is breaking here. Perhaps I need to set some paths in `./configure`. For instance, `--prefix`.
Using `--prefix ~/usr/local` lets me run `make install` without messing with Makefile variables (whoops, shoulda just looked at the `./configure --help` to begin with!). And my hope is that it'll also let me actually run the thing!
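So the full reconfigure is something like this (same flags as before, plus the prefix pointed at my home directory):

```bash
cd darshan-runtime
./configure --prefix=$HOME/usr/local \
            --with-log-path=$HOME/profiler/darshan-logs \
            --with-jobid-env=PBS_JOBID CC=mpicc
make && make install
```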
paddison@pfe20:~/profiler/darshan-3.2.1/darshan-runtime> LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior
Can't open proc file /proc/arsess
: function completed normally
Can't open proc file /proc/arsess
: function completed normally
asallocash failed: array services not available
Can't open proc file /proc/arsess
: array services not available
mpiexec: all_launch.c:737: newash: Assertion `new_ash != old_ash' failed.
Aborted (core dumped)
Hey! At least it's different than before :') Oh whoops, that might be because I tried running an mpiexec command on a front-end node rather than a compute node. Let's try it again in an interactive qsub.
Hmmmm... it hung on me. Gotta figure out how to terminate a PBS job. I tried `qsig -s INT jobid` and `-s` (which should be SIGTERM), then I tried `qdel jobid`, but it hasn't worked yet :o After a while (~10 minutes or so), my qdel timed out, then trying it again said "qdel: Server could not connect to MOM ...", then after a bit more time I tried it again and it worked. Maybe some backend server was down temporarily or something...
As if I didn't learn my lesson, I'm going to try again.
Aw shucks, here we go again. Something about `LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior` is hanging. The same exact thing happened -- `qdel` timed out after 12 minutes, then a subsequent call returned no connection to MoM, then a third call a few seconds later succeeded. Not sure what's going on.
I'll re-examine Darshan in the future, or perhaps while waiting for some parameter sweeps to conclude. For now, it's time to use the parameter values from the paper and start running some initial tests!
Huh, so actually when I was getting things set up to run the parameter sweeps, I realized that I can't run IOR using mpi-hpe/mpt nor mpi-sgi/mpt... only mpi-hpcx + comp-intel, it seems. Otherwise I'm met with `error while loading shared libraries: libopen-rte.so.40: cannot open shared object file: No such file or directory`...
Maybe it's because I ran `make` with hpcx loaded? That would make sense. I've gone ahead and re-downloaded, re-configured, and re-made IOR with mpi-hpe loaded -- it works this time with mpi-hpe as the only required module :)
Let's try Darshan super quick? Damn. It hung again. Alright, I'll give up on Darshan for now and just move on with the parameter sweep finally.
Additional IOR documentation can be found here https://github.com/hpc/ior/blob/main/doc/USER_GUIDE. It includes some things that aren't on the website. Based on this, I could have written Python code to generate IOR scripts then have the PBS job script run that, rather than execute commands within Python. Oh well, maybe I will change to that in the future.
I've gone ahead and done the parameter sweeps. The results are plotted and commented on in the 1_Parameter_sweeps.ipynb
notebook (on NAS pfe). Most notably, there was an interesting dip in performance at a transferSize of 4MiB and performance decreased with more nodes.
It's important to figure out if that behavior is consistent, then if so figure out what is causing it. The hardware? The network topology? The software, like Lustre stripe sizes?
I ran all of the benchmarks on /nobackupp18 but supposedly that filesystem is not fully set up yet. It also has different hardware (SSDs) than /nobackupp12. I will attempt to run the same set of tests on /nobackupp12 and compare the results.
"Initial parameter sweeps; Configurable sweeps; Parsing/plotting" | HEAD -> main | 2021-04-14
Henry warned me that the progressive Lustre striping on /nobackupp12 is broken, and I should make sure that a fixed stripe count is being used instead. To see what stripe count is currently being used, I can run `lfs getstripe [options] path`. So for instance, I ran a small test with the `keepFile` directive enabled so I could see what striping is being done on the written `testFile`.
`lfs getstripe testFile` confirms that progressive striping is taking place. Whereas if I specify a new file with a fixed stripe count (or size):
lfs setstripe -c 2 testFile2
cp testFile testFile2
lfs getstripe testFile2
I see that fixed number! Fortunately, I can specify stripe size by using IOR directive options!
Huh... when I tried to run IOR with a Lustre-specific directive it complained
ior ERROR: ior was not compiled with Lustre support, errno 34, Numerical result out of range (parse_options.c:248)
I compiled this version of IOR with the mpi-hpe module... I'll try ./configure again to see if Lustre is shown as supported. This time around I ran `./configure --with-lustre`, then `make`. Let's see if it works. I suppose if it doesn't, I can always just add an explicit `lfs setstripe` command before each test.
Didn't work. Maybe I need to compile it on a Lustre filesystem? Like, move it to a /nobackup and then re-configure/compile?
Maybe it's related: https://github.com/hpc/ior/issues/189
Shucks, as a workaround I tried an explicit `lfs setstripe` on `testFile` before running IOR, but the `getstripe` afterwards showed that it didn't work. I think this is because IOR deletes the file before writing it.
Here are some great resources about Lustre striping, IO benchmarks, etc:
- https://www.nics.tennessee.edu/computing-resources/file-systems/lustre-striping-guide
- https://www.nics.tennessee.edu/computing-resources/file-systems/io-lustre-tips
These explain that performance greatly benefits from stripe alignment, in which OST contention is minimized by ensuring each processor is requesting parts of a file from different OSTs -- this can be done by setting the number of stripes to the number of processes, for instance. Performance is also optimized by a stripe size similar to the transfer size.
Honestly, this document has some incredible tips and insight. NICS is co-located on the ORNL campus so has ties to the DoE.
Ah! Looks like IOR Lustre options not working is potentially a known issue: https://github.com/hpc/ior/issues/353
Perhaps this is a workaround to pre-stripe and keep the file: https://github.com/hpc/ior/issues/273. Basically, use the `-E` (existing file) option :) And that works!
lfs setstripe -c 2 testFile
mpiexec -np 2 ~/benchmarks/ior/ior-3.3.0/src/ior -a MPIIO -E -k
lfs getstripe testFile
So we can explicitly run `lfs setstripe` and create the testFile beforehand, as long as we also make sure to use the existing-file flag!
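Combining that with the stripe-alignment advice from the NICS pages, a single aligned test would look something like this (counts and sizes here are illustrative, not tuned values):

```bash
# Pre-create the test file with one stripe per process and a stripe size equal
# to the IOR transfer size, then have IOR reuse (-E) and keep (-k) the file
lfs setstripe -c 8 -S 4m testFile
mpiexec -np 8 ~/benchmarks/ior/ior-3.3.0/src/ior -a MPIIO -t 4m -b 64m -E -k -o testFile
```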
Woohoo, let's run those tests again!
I'm taking a closer look at some more test runs... I think perhaps part of the reason behind the high variance is due to the data sizes being relatively small? I'm not sure... but the variation between two consecutive repetitions can be huge. For instance, I ran another transfer size test and the first repetition read time was 1.3 seconds -- then the next was 0.3 seconds.
Actually, I've looked at all of the individual tests now (not just the summary) and it looks like the first repetition always takes considerably (~3x) longer than the rest of the repetitions. The next repetition or two are usually the best, then read time starts to climb again.
This is not true for writes -- though there is a lot of variation in write time... I'm not sure why there would be.
Perhaps there is truly some caching effect happening when I run the repetitions?
Looking into running an MPI job (IOR) across multiple nodes on the PRP.
I think having some tests from the PRP to compare to will be nice. I'm puzzled by a bit of the NAS results... trying to formalize some visualizations and run some more tests to get a better grasp of the I/O performance behavior that's going on...
"sync changes" | HEAD -> main | 2021-04-24
After installing Helm (a software manager for Kubernetes) it is time to start following https://github.com/everpeace/kube-openmpi#quick-start.
Note that the Helm version has changed and the `--name` option is gone, so the deploy command should now be:
helm template $MPI_CLUSTER_NAME chart --namespace $KUBE_NAMESPACE ...
I took a peek at what this command outputs by redirecting to a file (`> OUT`) -- it produces a nice Kubernetes yaml which defines:
- A Secret containing the generated ssh key and authorized keys variable
- A ConfigMap containing a script to generate the hostfile
- A Service -- this is "an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service)". Basically a way to group up Pods into an application with frontend and backend pods, and a way to network between them.
- A Pod containing the openmpi container with our desired image and a hostfile init container
- A StatefulSet which manages the pods -- this is like a Deployment in which all pods (including replicas) are uniquely identified and supports persistent storage
Aw, attempting to create that resource led to:
Error from server (Forbidden): statefulsets.apps "nautilus-worker" is forbidden: User "system:serviceaccount:usra-hpc:default" cannot get resource "statefulsets" in API group "apps" in the namespace "usra-hpc"
+ cluster_size=
+ rm -f /kube-openmpi/generated/hostfile_new
stream closed
I'll need to ask Dima about permission for that resource. Perhaps the API group has just shifted... Or, perhaps it is because I haven't added the rolebindings yet. The rolebinding command is using the GitLab /blob/ instead of the /raw/, but after fixing that I did not face the 'cannot get resource' issue! I still did face an issue though:
Error from server (NotFound): statefulsets.apps "nautilus-worker" not found
Ah, I think that was just due to me not fully tearing down my previous attempt. After deleting all the pods and re-running the resource creation -- it's working!
I should now be able to run mpiexec via a `kubectl exec` to the master pod.
Sweet! The example command works!
kubectl exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile --display-map -n 4 -npernode 1 sh -c 'echo $(hostname):hello'
Worth noting for the future in case I need to do some careful node selection or mess with some mpiexec options: some of my nodes (currently master and worker-0) are on nysernet and some aren't. In the JOB MAP section of the output, the ones not on nysernet show
Data for node: nautilus-worker-1.nautilus Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 2 Bound: UNBOUND
whereas the ones on nysernet show Bound to a bunch of sockets
Data for node: nautilus-worker-0.nautilus Num slots: 96 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
Regardless, I should be setting up node affinities so that I get nodes with 16 cores for the closest comparison to Sandy Bridge.
Before we do that, though, let's get a custom image with IOR on it to do a minimal test run. Locally, this was as easy as just downloading IOR, running `./configure` and `make`. It worked fine without needing to mess with any additional dependencies :) Let's try it on the cluster.
Alright, IOR runs, but not without some issues. When running a POSIX API test, the following warnings showed up in the results section of both write and read:
ior WARNING: inconsistent file size by different tasks.
WARNING: Expected aggregate file size = 4194304.
WARNING: Stat() of aggregate file size = 1048576.
WARNING: Using actual aggregate bytes moved = 4194304.
Then, when using MPIIO as the API, IOR will not run fully, as we're met with:
[nautilus-worker-2:00058] [3]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-0:00057] [1]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-1:00057] [2]mca_sharedfp_lockedfile_file_open: Error during file open
Oh. Probably because I'm not working on a shared volume, duh. So each node can only see its own file. Well, anyway, IOR is technically working!
"kube-openmpi running with IOR on PRP" | HEAD -> main | 2021-04-24
I can use the `rook-cephfs` storage class -- it uses CephFS and supports ReadWriteMany -- once Dima gives me the okay. See: https://pacificresearchplatform.org/userdocs/storage/ceph-posix/
Basically all I need to do is change my volume yaml to specify:
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
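For completeness, the full PVC that spec belongs to would be along these lines (the metadata name is my assumption, taken from the claimName referenced below):

```bash
# Sketch of the complete shared CephFS PVC
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-cephfs
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
EOF
```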
Then, I believe I can adjust `values.yaml` to:
volumes:
  - name: shared-cephfs
    persistentVolumeClaim:
      claimName: shared-cephfs
volumeMounts:
  - mountPath: /shared
    name: shared-cephfs
for both `mpiMaster` and `mpiWorkers`... we'll see!
Wonderful! The shared storage was successfully mounted to all nodes. I tried running IOR with just a single process on the master node in the shared directory and it worked -- now let's go ahead and try a multi-node job.
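The multi-node run was essentially the hello-world mpiexec from before with ior swapped in -- roughly this (the task count and the ior path inside the image may differ):

```bash
kubectl exec -it $MPI_CLUSTER_NAME-master -- \
  mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile \
  -n 4 -npernode 1 ior -t 1m -b 16m -s 16 -o /shared/testFile
```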
Woohoo! It worked!
Performance was really bad (~0.5 MiB/s wr), probably due to huge separation between nodes -> high latency (0.25s for write performance :o ). To be honest, I didn't even check what region the storage is assigned to. But nevertheless -- it worked :)
I want to make sure I'm requesting nodes that have 16 cores -- just like Sandy Bridge.
To do so, I can do a couple of things in values.yaml:
- Specify resources.requests/limits
- Specify nodeSelector with nautilus.io/sockets: 2 as the required label, to avoid being assigned to nodes with more CPUs. Nevermind. I just checked k get nodes -l nautilus.io/sockets=2 -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu and the vast majority of nodes labeled as having 2 sockets have tons of CPUs. In general, it looks like most nodes on the cluster have more than 16 CPUs, so it doesn't make sense to try to get dedicated 16-CPU nodes.
In light of the high latency, I think I'll go ahead and request specific nodes at first. My storage is US West (probably at UCSD), so I'll want to request some other nodes also at UCSD to limit communication latency between pods and between storage.
I'll need to talk to Dima about which nodes to use, but I should be able to ask for:
- Pods of type general (avoid testing, system, osg, etc.)
- calit2.optiputer.net nodes (these should be at UCSD, whereas calit2.uci.edu nodes are at Irvine)
- sdsc.optiputer.net nodes
- ucsd.edu nodes
- suncave nodes
Ah, I can use these nodes:
k get nodes -l topology.kubernetes.io/zone=ucsd
(with the possible exception of a .ucsb.edu node which might have been labeled by mistake)
This means I can use nodeSelector in values.yaml. Couple it with my resource requests:
resources:
limits:
cpu: 8
memory: 8Gi
requests:
cpu: 8
memory: 8Gi
nodeSelector:
topology.kubernetes.io/zone: ucsd
Uhhh ohhhh.
Error from server: error when creating "STDIN": admission webhook "pod.nautilus.optiputer.net" denied the request: PODs without controllers are limited to 2 cores and 12 GB of RAM
Gotta figure that one out.
"IOR working on shared cephfs filesystem with node selection" | HEAD -> main | 2021-04-26
Dima mentioned there are some issues with CephFS at the moment and heavy usage is causing the OSDs to run out of memory and crash. In the meantime, he mentioned I can check out SeaweedFS.
https://pacificresearchplatform.org/userdocs/storage/seaweedfs/
"SeaweedFS volume" | HEAD -> main | 2021-04-27
Running into some issues with SeaweedFS. I created the PVC, but when I created a deployment the pod failed to mount the PVC.
Later in the day I tried again and the PVC itself failed to be provisioned:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 2m46s (x26 over 8m46s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "seaweedfs-csi-driver" or manually created by system administrator
Normal Provisioning 15s (x10 over 8m46s) seaweedfs-csi-driver_csi-seaweedfs-controller-0_7da00a20-3339-4cce-a620-44a28c9b6d7d External provisioner is provisioning volume for claim "usra-hpc/shared-seaweedfs"
Warning ProvisioningFailed 15s (x10 over 8m46s) seaweedfs-csi-driver_csi-seaweedfs-controller-0_7da00a20-3339-4cce-a620-44a28c9b6d7d failed to provision volume with StorageClass "seaweedfs-storage": rpc error: code = Unknown desc = Error setting bucket metadata: mkdir /buckets/pvc-439d990a-501a-4801-99b8-d5163aedbdf8: CreateEntry: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.98.219.214:18888: connect: connection refused"
Then after a while of waiting it magically worked. Then the deployment failed to mount again, then after a while that too managed to work...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> Successfully assigned usra-hpc/filebench-seaweedfs-85cf7589b5-6f4pd to suncave-11
Normal SuccessfulAttachVolume 7m39s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-439d990a-501a-4801-99b8-d5163aedbdf8"
Warning FailedMount 3m45s (x9 over 7m13s) kubelet, suncave-11 MountVolume.SetUp failed for volume "pvc-439d990a-501a-4801-99b8-d5163aedbdf8" : rpc error: code = Internal desc = Timeout waiting for mount
Warning FailedMount 3m19s (x2 over 5m36s) kubelet, suncave-11 Unable to attach or mount volumes: unmounted volumes=[shared-seaweedfs, unattached volumes=[shared-seaweedfs default-token-nqkfj: timed out waiting for the condition
Normal Pulling 101s kubelet, suncave-11 Pulling image "localhost:30081/parkeraddison/kube-openmpi-ior"
Normal Pulled 100s kubelet, suncave-11 Successfully pulled image "localhost:30081/parkeraddison/kube-openmpi-ior" in 1.209529308s
Normal Created 100s kubelet, suncave-11 Created container filebench-seaweedfs
Normal Started 100s kubelet, suncave-11 Started container filebench-seaweedfs
Both times I eventually ran into an issue where performing any filesystem operation (e.g. ls) would hang. It seemed that sometimes these operations would complete after a while... sometimes I got impatient and killed the process, and that seemed to un-hang things. Really not sure what's going on there.
When doing my single node parameter sweep, I came across a bunch of helpful things to keep in mind for the future:
This is very useful for editing files inside a pod without needing to install vim on that pod. With the VS Code Kubernetes extension installed, we can:
- Command palette: View: Show Kubernetes
- Kubernetes cluster panel: nautilus > Workloads > Pods
- Right click the pod name: Attach Visual Studio Code
This'll open up a new window, take care of all the port forwarding, and allow you to open remote folders and files just as you would any other remote ssh host!
Only caution I've noticed so far: the integrated terminal doesn't handle text wrapping well. As always, I recommend using a separate terminal window in general.
Rather than using the Python parameter_sweep script, I just whipped up a very tiny amount of code to populate an IOR script, then ran that via ior -f path/to/script. This is similar to how it's done in Glenn Lockwood's TOKIO-ABC. Using IOR scripts is the way to go: multiple tests with different parameters can be defined at once in a portable file and shared between systems without a Python dependency.
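As a rough sketch (not the exact file I used -- the path and values here are placeholders), a multi-test script looks something like this, with options carrying over between RUN statements:
IOR START
    api=POSIX
    testFile=/shared/ior-testfile
    blockSize=16m
    segmentCount=16
    repetitions=3
    transferSize=1m
RUN
    transferSize=4m
RUN
IOR STOP
Running it is then just mpiexec -np <procs> ior -f path/to/script.ior.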
On NAS at the moment I still need to use the Python orchestration in order to set the Lustre stripe sizes/counts before each run... at least until the IOR Lustre options are fixed.
- I'm curious... does my previous Lustre striping workaround still work when there are multiple repetitions? I don't recall checking... Yes, it works :) And it works for multiple tests too, so I actually don't need to use the Python script at all, as long as I remember to manually set the stripe count before running the test.
Also latency output (json) is measured in seconds.
"Run using IOR scripts; PRP ceph and seaweed" | HEAD -> main | 2021-04-28
I am going to be showing the NAS and Ceph findings so far to the NAS APP group in an attempt to figure out what's going on with the drop in performance at 4mb transfer size on NAS, and to ask about the hardware/software stack at NAS, Lustre monitoring, etc.
So, I'm re-running a bunch of the parameter sweeps on NAS (and PRP) to make sure my results are consistent. At the same time, I'd like to experiment with I/O hints. This should be useful: http://www.idris.fr/media/docs/docu/idris/idris_patc_hints_proj.pdf, and https://github.com/hpc/ior/blob/main/doc/USER_GUIDE#L649. I was able to use a hints file that looks like this:
# File: hints.ior
IOR_HINT__MPI__romio_cb_write=enable
IOR_HINT__MPI__romio_cb_read=enable
Coupled with hintsFileName (-U) set to the path to that file, and showHints (-H), it worked! Now let's do some parameter sweeps and see if it actually makes a difference.
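For reference, the invocation I'd expect to work looks roughly like this (test sizes and process count are placeholders; -a selects the API, -U points at the hints file, -H prints the hints in use):
mpiexec -np 16 ior -a MPIIO -t 1m -b 16m -s 16 -U /path/to/hints.ior -H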
==NOTE== The collective option in IOR causes a massive drop in performance -- bandwidth on the order of single mebibytes per second.
Hogging memory on the node (-M) seems to affect the blockSize performance, as Mahmoud suggested. Trying to run on /nobackupp2 with 85% memory hogging leads to an out-of-memory error at some point when testing the 1.5Gi block size... not sure why this didn't happen when testing on /nobackupp12 -- the requested compute nodes were the same.
Transfer size exhibited no drop-off when memory hogging was used; read performance was pretty level at around 200 MiB/s, and write performance was consistently greater than read.
I'd like to run a read test on an existing file that is for sure not in the Lustre OSS cache.
These links are useful.
- https://www.nas.nasa.gov/hecc/support/kb/pleiades-lustre-filesystems_225.html
- https://www.nas.nasa.gov/hecc/support/kb/pleiades-configuration-details_77.html
- https://www.nas.nasa.gov/hecc/support/kb/sandy-bridge-processors_301.html
However, the hardware of the OSTs is not discussed -- and that's where we'd find out the theoretical performance of our filesystems by looking at the OST drive performance and networking overhead.
I created some files the other day, and I'm now trying to do read-only IOR tests on these existing files. On my small-scale tests it seems to have worked -- I get much lower read bandwidth.
Here are the useful things to know about such a test:
- keepFile (-k) absolutely must be enabled (otherwise the data will be deleted after the test, meaning you'll need to create a new file and wait a while again -- whoops)
- It is still important to use memory hogging (-M %) for multiple tests, otherwise the read file will be in the local cache.
  - 85% seemed to work well. I wouldn't be surprised if going too high risks crashing the node due to OOM, however (just like what happened on an nbp2 test earlier).
- We can read just a portion of the file just fine; a warning will show: WARNING: Expected aggregate file size = 1073741824. WARNING: Stat() of aggregate file size = 17179869184. WARNING: Using actual aggregate bytes moved = 1073741824.
Probably a good idea to just create a handful of very large files (to support our largest multi-node tests) and keep those lying around.
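Putting that together, a no-cache read test over an existing file looks roughly like this (path, sizes, and process count are placeholders; -r makes it read-only, and I believe -E / useExistingTestFile keeps IOR from touching the file beforehand):
mpiexec -np 16 ior -r -k -E -M 85% -t 1m -b 4g -o /path/to/existing-testfile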
We're interested in exploring our hypothesis that performance drops at certain transfer sizes are related to the Lustre stripe sizes.
- We can pass a script into IOR via stdin like so:
cat | ior -f /dev/stdin << EOF
IOR START
# rest of script
RUN
IOR STOP
EOF
- Warnings are printed when they occur, so using summaryFormat=JSON without a corresponding summaryFile will produce invalid JSON on stdout if anything else gets logged there. (See the snippet below.)
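A hedged workaround is to send the JSON summary to its own file so stray warnings on stdout can't corrupt it -- I believe both options can be set with -O on the command line (or equivalently inside the script); paths are placeholders:
ior -f /path/to/script.ior -O summaryFormat=JSON -O summaryFile=/path/to/results.json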
Snippet for plotting the write test until I can figure out a better way
for rn in df12.ReferenceNumber.unique(): quick_plot(df12[df12.ReferenceNumber==rn],'transferSize','bandwidth')
I'm suspecting that the real reason for transfer size performance drops has something to do with memory. I believe this because we're seeing the effect while reading from a file that has been cached in local memory. Observe that the read speeds are astronomical -- but only after the very first iteration for a file. Somehow I need to avoid local caching -- and I was using memory hogging at 85% but that wasn't enough.
I'm thinking of the following possible workarounds:
- Avoid repetitions in IOR itself -- too likely to re-read from cache
- Make the sweep round-robin style doing each parameter value for all files before moving on to the next, coupled with memory hogging to ensure only one file fits in memory
- Try to manually drop the file(s) from the memory cache
I tried the manual cache dropping. First I ran free -h to see my memory and cache usage, then read a 4Gi file with IOR and saw memory usage jump up. Sure enough, the next IOR read test had orders-of-magnitude better read bandwidth.
Some testing with memory hogging shows that it definitely lowers performance, but by no means does it prevent the caching effects wholly.
Then I tried running
dd of=FILE_name oflag=nocache conv=notrunc,fdatasync count=0
Looking at free confirmed that memory usage went down, and the next IOR run had similar performance to the very first run!
This is a helpful read to understand caches: http://arighi.blogspot.com/2007/04/how-to-bypass-buffer-cache-in-linux.html
Some more links related to avoiding/dropping the file cache:
- https://man7.org/linux/man-pages/man2/posix_fadvise.2.html -- the functionality in the Linux kernel
- https://github.com/lamby/python-fadvise -- a Python interface to posix_fadvise
- https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache -- if you want to clear the entire cache
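For completeness, the sledgehammer from that last link -- clearing the node's entire page cache -- looks like the following. It needs root, so it probably won't fly on NAS compute nodes, but it may be an option inside a privileged container:
sync
echo 3 > /proc/sys/vm/drop_caches    # 1 = page cache, 2 = dentries/inodes, 3 = both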
I'm editing the values.yaml used by kube-openmpi to use YAML node anchors and aliases, which let me re-use keys: I write my desired resources and volumes once and share them across all workers.
Now that that's done, I've lowered the resources to within the allowed PRP settings for interactive pods -- 2 CPUs and 8 GB RAM -- and I'll run a multi-node test.
Yes! I created a nice alias, omexec, which takes care of the run-as-root and hostfile considerations, and now I can run it just fine.
I requested a limit of 2 CPUs, but that is a quota -- it does not mean the container cannot access the rest of the cores. So I can still execute mpiexec with more than 2 processes (assuming we want only 1 process per core). Setting -npernode adjusts how many processes per node we want.
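For the record, omexec is nothing fancy -- roughly something like the following (the hostfile path depends on where the kube-openmpi chart generates it, so treat that path as a guess; the IOR parameters are placeholders):
alias omexec='mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile'
# e.g. 8 IOR processes per node:
omexec -npernode 8 ior -t 1m -b 16m -s 16 -o /shared/testfile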
Useful resource:
- https://www.golinuxcloud.com/kubernetes-resources/#Understanding_resource_units
- https://www.golinuxcloud.com/kubernetes-resources/#How_pods_with_resource_limits_are_managed
I don't think I've run into the quota throttling yet... and monitoring the Grafana dashboard shows I'm well within limits overall. I think Dima was explaining to someone in the support room that bursty behavior is fine, it just can't consistently exceed the limits.
There are a handful of ways to look at the CPU information; lscpu is nice. Notice that it claims we have access to 16 CPUs, but it's actually 8 cores with 2 threads per core. On a single node, if I run IOR with 8 procs, I get really fast performance (and I notice the same caching effect in the read speeds as before). When I try to run it with 16 procs the performance is significantly worse all around -- the two hardware threads on a core share execution resources rather than behaving like two independent cores.
So... the performance is pretty bad.
After downloading and untarring Darshan, I'm trying to ./configure the darshan-runtime. But I got an error that no zlib headers could be found. This can be fixed by installing zlib1g-dev -- the non-dev package will not do.
Then we can configure it. We'll need to pass --with-log-path and --with-jobid-env. The first is easy because I can set it to wherever I want to store logs. The latter I don't know: on NAS I knew that PBS was used, so I knew the environment variable; here, I'm trying to figure it out by running mpiexec env and seeing what variables are populated. I'll pass NONE for now, but it might be PMIX_ID or something like that... we'll see later when I do multi-node Darshan.
./configure --with-log-path=/shared/darshan-logs --with-jobid-env=NONE
Finally, make and make install both did the trick! Then follow it up with mkdir /shared/darshan-logs and darshan-mk-log-dirs.pl as noted in the documentation.
Now let's actually try to use it.
mpiexec -np 2 -x LD_PRELOAD=/usr/local/lib/libdarshan.so ior
since OpenMPI uses -x (instead of -env) to pass environment variables.
Welp, it didn't crash like it did on NAS. However, it was unable to create the Darshan log:
darshan_library_warning: unable to create log file /shared/darshan-logs/2021/5/18/root_ior_id5293_5-18-81444-18062306667005854292.darshan_partial.
My guess is permissions? Oh... it was pointing to the wrong path. For some reason I had changed the path and re-configured, but that error still came up even though darshan-config --log-path showed the right path. I simply created a soft link between the actual and expected paths and re-ran -- it worked! Let's peek at these logs, shall we?
I needed to install Python (odd, that wasn't listed in the requirements), and I'll need to install some other things to get graphical outputs, but for now the ./configure, make, and make install went fine, and I can get a textual description of a log by running:
darshan-parser <path/to/file>
Sweet! The file is well-documented and understandable.
Trying to get a test run of Darshan observing some ML application like image analysis.
Turns out, Darshan was working, but there are a few things to consider:
- The environment variable DARSHAN_ENABLE_NONMPI needs to be set (it can be empty)
- I think UTC is used, so sometimes you need to look at the next day of log data
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=/usr/local/lib/libdarshan.so python script.py
"A whole bunch of work on PRP; Want image registry" | HEAD -> main | 2021-05-24
I don't have Docker on the nasmac, so I'm trying to get the Nautilus GitLab container registry working.
"Update Darshan images" | HEAD -> main | 2021-05-24
"Fix file treated as command" | HEAD -> main | 2021-05-24
"Install python3" | HEAD -> main | 2021-05-24
"-y" | HEAD -> main | 2021-05-24
"Use multiple stages for multiple images" | HEAD -> main | 2021-05-24
"Prompt image build" | HEAD -> main | 2021-05-24
Alright, I've figured out the images, I have a deployment with PyTorch and Darshan running and I've copied over the flight anomaly code and data. Let's run it once to make sure it does indeed run.
python main_CCLP.py -e 1 -v 1
Ha! It does!
Okie dokes, now time to try monitoring it with Darshan.
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=/usr/local/lib/libdarshan.so python main_CCLP.py -e 1
Sweet, now to examine the Darshan logs.
I can create a human readable text dump of the log with darshan-parser
, but I should also have PyDarshan installed in this image, so let's try to use it! Hmm, trying to import darshan complained. When I install darshan-util I should ./configure it with --enable-pydarshan --enable-shared
.
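So a hedged re-install recipe for darshan-util with PyDarshan support looks roughly like this (configure flags per the note above; the prefix matches where I've been installing things, and the pip step is my assumption about how PyDarshan itself gets installed):
cd darshan-3.3.0/darshan-util
./configure --enable-pydarshan --enable-shared --prefix=/usr/local
make && make install
pip install darshan    # PyDarshan also lives in the pydarshan/ dir of the source tree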
Then I can read in a report as the following in a Python shell, and tell it to read in all records (POSIX, MPI, STDIO, etc):
import darshan
report = darshan.DarshanReport('filename',read_all=False)
report.read_all_generic_records()
Within the report there are multiple records. We can see what records we have with report.info(), access them through report.records['API'], and then run things like record.info(plot=True).
However, this relies on an implied IPython environment, since it uses display. I'll try installing Jupyter into this image. Seems to be working like a charm!
pip install jupyter # Pod
jupyter notebook
k port-forward podname 8888:8888 # Local
Oh my I'm always reminded just how much I absolutely love working in Jupyter notebooks :')
Great! This is awesome. I have the data and can play around with it.
Okay, now to do a run on multiple epochs. Also it's worth noting this is just the training process we're monitoring -- the preprocessing stage is entirely separate.
I'd be really interested in seeing how much of the total runtime was spent waiting for I/O.
Looks like I'll want to use their experimental aggregators: https://www.mcs.anl.gov/research/projects/darshan/docs/pydarshan/api/pydarshan/darshan.experimental.aggregators.html. They don't return plots (so actually I guess we don't need jupyter, but I'm still going to use it), so we'll want to write some plotting code to visualize.
darshan.enable_experimental(True)
# IO Size Histogram, given the API ('module')
report.mod_agg_iohist('POSIX')
# Cumulative number of operations
report.agg_ioops()
It seems like I can basically call them all using .summarize(), then access the results with .summary:
report.summarize()
report.summary
Plotting a hist/bar of access sizes is easy enough. How about the timeline? Here are the plots I want to replicate: https://www.mcs.anl.gov/research/projects/darshan/docs/ssnyder_ior-hdf5_id3655016_9-23-29011-12333993518351519212_1.darshan.pdf
"Fix PyDarshan installation" | HEAD -> main | 2021-05-25
Here's the current WIP Python plotting implementation: https://github.com/darshan-hpc/darshan/blob/1ade6cc05c86b2bcab887bf8db96a24f920f6954/darshan-util/pydarshan/darshan/cli/summary.py
I know it feels a bit late in the process to do this -- but it's really about time that I actually do real research and consult more papers.
In an effort to follow https://dl.acm.org/doi/pdf/10.1145/3337821.3337902 (IO analysis of BeeGFS for deep learning) I am trying to set up similar conditions on the PRP and run the same experiments.
"Fix bash alias" | HEAD -> main | 2021-06-01
Again. Let's figure this out.
wget -O- ...3.3.0 | tar zxf -
cd darshan-3.3.0/darshan-runtime
module load mpi-hpe
./configure --with-log-path=$HOME/darshan-logs --with-jobid-env=PBS_JOBID --prefix=$HOME/usr/local
make
make install
mkdir -p ~/darshan-logs
chmod +x darshan-mk-log-dirs.pl
./darshan-mk-log-dirs.pl
So far that worked without any issue. We can install the util the same way.
cd ../darshan-util
./configure --prefix=$HOME/usr/local
make
make install
That worked too. Time to launch a compute node and test if I can get it to monitor without crashing.
I should be able to use
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=$HOME/usr/local/lib/libdarshan.so <my_command>
At first I got:
/bin/sh: error while loading shared libraries: libmpi.so: cannot open shared object file: No such file or directory
but that's just complaining that I didn't do module load mpi-hpe.
Then I ran it again. I was initially scared because the test script I wrote just reads and writes 4k random bytes a few times (very quick), but when I ran it with Darshan nothing showed up -- I assumed it had crashed again.
It hadn't. Rather, it just took ages to do each operation -- why? I thought Darshan had low overhead?
Ah... it generated a separate summary for every individual operation -- so in this case I have 5 files in the directory now: one for the sh invocation (the script itself), then three dd and one ls. Interesting.
Good news: darshan-parser works on the files! So, ultimately, Darshan is working on NAS!
The run took 75 seconds total, when typically it takes only a fraction of a second. Fortunately, each operation is still logged as taking only a fraction of a second, so Darshan didn't distort the operations themselves -- the extra time was purely overhead.
While we wait for real NASA data-intensive ML apps to become available, we can run Darshan on other ML models or artificial pipelines made to mimic the app (e.g. using IOR).
When using IOR, Darshan creates a separate log for every run -- one per operation per iteration, and in fact one per MPI process invocation of IOR. mpiexec gets a log too (somehow there are three of those). I think that in order to see the whole thing I need to read all the logs that share the same ID (see the loop below).
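Following that same-ID hunch, something like this should dump every log from one run (the date directory and ID are placeholders patterned after the log filename seen earlier):
RUNID=5293    # placeholder -- whatever id the logs of interest share
for f in /shared/darshan-logs/2021/5/18/*_id${RUNID}_*.darshan*; do
    echo "=== $f"
    darshan-parser "$f"
done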
The GOES data is on NAS. The MTBS data can be downloaded online.
A conda environment can be made with all requirements. Start by making an empty environment with conda create -n geo, then conda install -c conda-forge gdal, then cartopy, xarray, rioxarray.
- https://gdal.org/download.html#conda
- https://scitools.org.uk/cartopy/docs/latest/installing.html#conda-pre-built-binaries
The Jupyter notebook provided to me doesn't need to import gdal (it's used as a command-line utility), nor the from utils import (not used in the file).
All other imports in the notebook work except for geopandas, which has not been installed yet. There is a conflict when I try to install geopandas... let's wait a long time and find out why. I need geopandas just to import the MTBS shapefiles. Maybe I should have installed geopandas before cartopy and the rest...
I made a new environment with conda create -n ... gdal geopandas cartopy (I did not specify the Python version; let conda figure that out) and it made one with Python 3.9.5. Then xarray, then -c conda-forge rioxarray.
Thought that worked... but then importing anything which imports numpy results in
File "/nobackup/paddison/.conda/envs/geofire/lib/python3.9/site-packages/numpy/__init__.py", line 148, in <module>
from . import lib
File "/nobackup/paddison/.conda/envs/geofire/lib/python3.9/site-packages/numpy/lib/__init__.py", line 44, in <module>
__all__ += type_check.__all__
NameError: name 'type_check' is not defined
Fixed it by explicitly installing numpy as well.
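In hindsight, solving for the whole environment in one go would probably have avoided both the geopandas conflict and the numpy breakage -- a hedged one-liner (package list taken from the imports above, Python version from what conda picked):
conda create -n geofire -c conda-forge python=3.9 gdal geopandas cartopy xarray rioxarray numpy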
I added some multiprocessing to the code. Good.
When trying to profile it with Darshan, though, I ran into a symbol error for llapi_layout_get_by_xattr. Looks like https://github.com/darshan-hpc/darshan/issues/399. I never ran into this issue on NAS before, but oh well. A fix has been merged, so I'm installing the latest Darshan from the repo to try to get things working.
Seemed to work on a test Python invocation. Let's run the preprocessing now!
Damn. I interrupted the program, and I think in doing so I kept it from terminating cleanly. Let me run it again on just the first handful of samples :/
Sheesh. It didn't work that time either. Darshan is giving me an 'objdump' log rather than a python log... probably because the program didn't fully terminate properly. I'm trying one more time just going through the data on a single specified date. I'm going to run into the walltime though before this run completes! Noooooooooooo, lol
Hmmmm... so this time the Python ran through all of the specified data just fine, and Darshan produced both an objdump log and a Python log, but the Python log is "incomplete" when viewed with darshan-parser. That's fine -- just use --show-incomplete. It looks like it worked! Later today I'll load up the results in Python and plot them.
When I looked at the timings in the Darshan logs using pydarshan, it looked like I/O contributed less than a percent of the runtime. The multiprocessing wasn't actually working correctly, so the compute time should be substantially quicker, but still it seemed odd that I/O took so little time when the files are actually quite large and plentiful.
To ensure Darshan is showing me the right things, I wrote a series of Python scripts which performed different read and write access patterns, such as reading 1GB chunks of a large file and writing to a new file, or writing random bytes.
I used cProfile to compare the timings seen by Python natively against those seen by Darshan:
python -m cProfile -o out.profile script.py
import pstats
p = pstats.Stats('out.profile')
p.print_stats()
They matched up just fine. In a first test I was skeptical when Darshan showed my write-file test as being over 50% computation, but using cProfile I figured out that my os.urandom generation was indeed taking about half the time!
Over the last week I've been working with a DELTA training pipeline which will soon get some data and a model specification to train a flood detection algorithm. I was able to make a geoflood conda environment on top of geofire.
(Side note: I realized that we have flight, fire, and flood algorithms. That alliteration is kinda cool!)
In running the Landsat-8 example training and attempting to monitor it with Darshan, all seemed well. However, the logs showed up as incomplete, and attempting to analyze them with pydarshan resulted in a segfault and core dump...
# *ERROR*: The POSIX module contains incomplete data!
# This happens when a module runs out of
# memory to store new record data.
# To avoid this error, consult the darshan-runtime
# documentation and consider setting the
# DARSHAN_EXCLUDE_DIRS environment variable to prevent
# Darshan from instrumenting unecessary files.
# You can display the (incomplete) data that is
# present in this log using the --show-incomplete
# option to darshan-parser.
I've got to figure this out before moving forwards.
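The error message itself points at DARSHAN_EXCLUDE_DIRS, so a hedged first attempt is to exclude the conda environment (thousands of tiny Python files) from instrumentation; the darshan-runtime docs also mention a DARSHAN_MODMEM setting for record memory that may be worth checking. The training command below is just a placeholder for the actual DELTA invocation:
env DARSHAN_ENABLE_NONMPI= \
    DARSHAN_EXCLUDE_DIRS=/nobackup/paddison/.conda \
    LD_PRELOAD=$HOME/usr/local/lib/libdarshan.so \
    python train.py    # placeholder for the DELTA training command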
"Add profiling work" | HEAD -> main | 2021-07-01