Meeting 2023 06 01 - openpmix/openpmix GitHub Wiki
June 1, 2023 OpenPMIx-devel call notes
Attendees
- Aurelien Bouteiller (UTK)
- Howard Pritchard (LANL)
- Tim Wickberg (SchedMD)
- Samuel Gutierrez (LANL)
- Ralph Castain (Nanook)
- Michael Karo (Altair)
- Thomas Naughton (ORNL)
Notes
-
Admin: Thomas to look into removing the "host admit" setting for the Teams meeting
-
openpmix v4.2 release
- No known issues/objections
- DStore problem with multiple session init/fini, but we are having
trouble reproducing the issue
- https://github.com/open-mpi/ompi/issues/11543
- Workaround: use gds=hash for the older version being used
- Cannot find the location where the unlink is occurring, so the suggestion is to wait for the update if gds=hash still does not work
-
prrte v3.0
- A spawn issue when using the HAN collective component; not sure
exactly what is going on
- https://github.com/open-mpi/ompi/issues/11724
- George @ UTK is hitting the issue
- Aurelien suggests we can probably not hold up the release, and fix this in a subsequent release
- Aurelien mentioned the issue comes from locality info being lost/missing, which might be in failure paths
- Ralph: something related to interleaving the spawn/split etc.; the locality of the underlying process gets confused. Not sure why/where. If the splits/spawns are removed, it works fine.
- Aurelien: there is some packing of the info during the split/spawn and it is a subset of the COMM_WORLD during the split
- The basic passing of data is working. But maybe something environment specific?
- George is working with OMPI main; Ralph's tests were with the latest PRRTE/PMIx head
- Maybe defer other items to v3.0.1 or v3.1.0
-
Question: Can we update the submodule pointers in OMPI?
-
TODO: Update the PRRTE/PMIx submodule pointers in Open MPI
-
TODO: TJN to create an OMPI ticket to update the submodule pointers
-
A few OMPI tickets with failures are addressed/fixed by updating the submodule pointers
-
Recap of discussion related to NIC/GPU distance selection
- GDR needs to be on the same root complex
- But still need to pick the NIC/GPU affinity
- Need to ensure that locality questions are considered w.r.t. GPU/NIC process placement
-
Aurelien: Lots of PMIx standard folks on this call
- There is drift between the standard and the implementation
- There are a set of tickets to resolve the issues and standardize them
- https://github.com/pmix/pmix-standard/issues?q=is%3Aissue+is%3Aopen+drift%3A
- Ralph: Starting to move the markers for exceptions (drift items) in the implementation into the documentation area, so the standard may be able to leverage that verbiage