Meeting 2024 11 07 - openpmix/openpmix GitHub Wiki
November 7, 2024 OpenPMIx-devel call notes
Attendees
- Ralph Castain
- Tim Wickberg
- Thomas Naughton
- Aurelien Bouteiller
Agenda
-
We have release candidates out for both PMIx and PRRTE that are scheduled for full release over the weekend. These will be the LAST releases of their respective series - and likely the last production releases for the foreseeable future. No further features will be allowed in either of those series. I may (repeat, may) occasionally backport bug fixes to them for anyone wanting to clone the branch. So please do check the release candidates in the immediate future!!
-
Both projects are progressing towards their designated “end-of-project” state - i.e., an initial release of a final major series, with only rare bug-fix releases within those series going forward. I had hoped to finish that by Thanksgiving in the US (late Nov), but it now looks like it will be early Dec before it is done. No hard timeline, as this is purely an “as time permits” effort.
Notes
-
release candidates out
- openpmix-v5.0.4rc1
- prrte-v3.0.7rc1
- Will cut final after the next OMPI release
- some open issues remain, will be addressed in the next releases
-
next releases
- openpmix-6.0, prrte-4.0
- will be no earlier than Dec
- some issues will
-
known bugs
- Data exchange (e.g., group construct)
- Related issue: pmix resolve-peers https://github.com/openpmix/openpmix/issues/3359
- The issue is also described in https://github.com/openpmix/openpmix/issues/3360, which was closed only because the change will not be backported to that version. The underlying issue remains.
- Ralph has a reproducer from the ParaStation folks, working to simplify and
- VM shared memory inconsistency
-
other items
- Slingshot/Slurm, https://github.com/openpmix/prrte/issues/2004 (waiting on details from HowardP, see ticket)
- add-host issue, https://github.com/openpmix/prrte/issues/1773: likely to be addressed in the dynamic scheduler work, so this PRRTE issue is rather low priority
-
See "stable landing point" tickets for outstanding issues
-
Discussion on slurm-ext-launcher library
- Cannot use the GPL license
- Could redevelop things to avoid using Slurm internals and release under a different license, but that would be a longer-term item.
- Does it seem like workarounds are OK for now?
- Note: the PRRTE side will be more focused on research efforts, while the OMPI fork of PRRTE will serve as the more stable, dedicated runtime in OMPI. The point being that the problem will persist, and the current stop-gap measure will be OK but not without some limits. There will likely be some interest in a longer-term solution.
- Releasing slurm-ext as-is does not seem like a good idea, since this limitation means it will not be consumable by others. So likely pull this out and not release it.
- TODO: Raise point with OMPI to get requirements and establish timeline for a more permanent solution
-
FYI SC24 panel on metrics
- Thomas: discussion planned on software project metrics; will gather some input from others to try to capture things of interest. Seems like a good avenue for putting out info that informs DOE or other agencies considering supporting HPC community software
-
Some discussion on CI testing: multi-node (or multi-container) testing is needed, and there may be some resources we could leverage from existing projects/orgs. Nothing concrete at the moment, but looking into things.
-
Some discussion at the end about interest from others running AI workloads in possibly using prte/pmix (runtime) to speed up startup times. This might be a way to expand the community.