Meeting 2023 08 03 - openpmix/openpmix GitHub Wiki
August 3, 2023 OpenPMIx-devel call notes
Attendees
- Ralph Castain (Nanook)
- Howard Pritchard (LANL)
- Samuel Gutierrez (LANL)
- Rajat Bhattarai
- Thomas Naughton (ORNL)
Notes
- pmix-4.4 release
- pmix-4.5 likely to be released to include a few bugfixes that arose
- then plan to move on to pmix-5.0
- prrte is kind of in a holding pattern
- some work on ompi side to adjust the docs
- trying to reorder the docs to help organize use of man pages, while keeping the context-sensitive help
- may release prrte-3.1 soon and delay the doc sorting
- a few items have come up that will be fixed in prrte-3.1
- a dynamic resource mgmt issue arose, so likely to be in prrte-4.x
- two apps trying to run
- appA adds nodes and then appB tries to launch with 3 ppn; there is a race on the ordering. If appA goes first, appB might get confusing results because appA influences appB's info
- would like to be deterministic, but not easy to resolve
- discussion of possible ideas; some options could resolve it but would need a lot of bookkeeping
- should appB see the extended allocation, or should appB only deal with the allocation that it started with? Or is another flag needed to notify about growth?
- in the current context, the spawn error is really a matter of resources being added in the midst of an asynchronous resource change: the daemons change and the spawn fails because it is caught in the middle of this asynchronous update.
- The PMIx standard is silent on this.
- Actually this is not part of the standard; it is on the implementation side
- Current thinking is that you might serialize things when a DVM modification occurs. But what happens if a change occurs after a job has already been mapped?
- Several corner cases that need to be worked out.
- Some thinking/discussion on the concept of growth with SLURM: in that context the new nodes would be a new session, and you can link the old and new sessions so that spawns go across the combined sessions.
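
To make the serialization idea and the "which allocation does appB see" question concrete, here is a minimal toy sketch. It is not PRRTE or PMIx code: `ToyDVM`, `add_nodes`, `map_job`, `snapshot`, and the `epoch` counter are all hypothetical names invented for illustration. It assumes a single lock that serializes allocation changes and mapping, plus an epoch counter that a mapper could use to notice that the allocation changed after a job was mapped.

```python
import threading


class ToyDVM:
    """Hypothetical stand-in for a DVM; not PRRTE code."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._lock = threading.Lock()  # serializes modifications and mapping
        self._epoch = 0                # bumped on every allocation change

    def snapshot(self):
        """Return a copy of the current node list (e.g. an app's starting allocation)."""
        with self._lock:
            return list(self._nodes)

    def add_nodes(self, new_nodes):
        """Grow the allocation; returns the new epoch."""
        with self._lock:
            self._nodes.extend(new_nodes)
            self._epoch += 1
            return self._epoch

    def map_job(self, ppn, nodes_snapshot=None):
        """Map ppn procs per node against one consistent view of the node list.

        With nodes_snapshot=None the job sees the current (possibly extended)
        allocation; otherwise it only sees the nodes in the given snapshot.
        """
        with self._lock:
            nodes = self._nodes if nodes_snapshot is None else nodes_snapshot
            return {"epoch": self._epoch, "placement": {n: ppn for n in nodes}}


dvm = ToyDVM(["n0", "n1"])
start_view = dvm.snapshot()      # appB's allocation when it started
dvm.add_nodes(["n2"])            # appA grows the allocation
extended = dvm.map_job(ppn=3)    # policy 1: appB sees the extended allocation
original = dvm.map_job(ppn=3, nodes_snapshot=start_view)  # policy 2: starting allocation only
```

Because every operation takes the same lock, a mapping never observes a half-applied node addition. The open corner case from the discussion remains: if the recorded `epoch` of a mapped job is older than the current one, something still has to decide whether to remap, notify, or ignore.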
- Discussion about different scenarios/options for the new scheduler capability
- Possibly useful to have a follow-up meeting to discuss some of the elastic resource mgmt items