Meeting 2023 12 07 - openpmix/openpmix GitHub Wiki

December 7, 2023 OpenPMIx-devel call notes

Attendees

  • Samuel Gutierrez (LANL)
  • Ralph Castain (Nanook)
  • Aurelien Bouteiller (UTK)
  • Thomas Naughton (ORNL)
  • Michael Karo (Altair)
  • Tim Wickberg (SchedMD)
  • Rajat Bhattarai (TNTech)

Notes

  • Open MPI v5.0.1 release coming soon

    • Open MPI will take a release candidate of the next openpmix-4.2 and prte-3.0 for use in openmpi-5.0.1
    • A possible envvar issue that hit Aurora needs to be double-checked; it is believed to be fixed.
    • TODO: confirm Aurora HEAD of openpmix-4.2/prte-3.0 and openmpi-5.0.1
  • Client/Server version back compat problem (openpmix-5.0.0 and future)

    • Due to shm key
    • Plan to resolve this issue (see mailing list for warning)
    • Q: How should we handle this known issue in the past release (v5.0)?
    • Suggestion is to keep the tag, but remove the release
    • Suggestion to version both the API and protocols in future
      • In ALPS it was helpful to have compatible server/client versioning, with the ability to talk both current and current-1, etc.
      • In practice this was very helpful
    • Should this possibly have been part of ABI versioning?
      • Not really, b/c this is a detail of how things are stored internally
    • Plan is to insert the dictionary being used at the head of each SHM region; then, when accessing data, interpret it relative to that dictionary. The past example used mmap and this uses shm, but it seems very relevant to the previously mentioned ALPS experience (Michael Karo)
    • Sam hoping to have a prototype soon
    • The suggestion to delete the release tarball was frowned upon
    • So plan to mark it as "DO NOT USE" b/c there is an ABI break between 5.0.0 and 5.0.. (just add remarks to the release text)
    • Will add a comment with this detail to clarify the issue and why people may not want to use the release
    • Discussion on how to do handshaking and handle things cleanly
    • A main concern is that the error is silent. Suggestion is to detect the mismatch and hard fail to avoid silent issues.
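
A minimal sketch of how the dictionary-at-the-head plan and the hard-fail suggestion could fit together. This is an illustration only, not the OpenPMIx layout or the prototype discussed on the call: Python is used for brevity (OpenPMIx itself is C), a bytes buffer stands in for the shm region, and every name here (`MAGIC`, `HEADER`, `SUPPORTED_LAYOUTS`, `pack_region`, `unpack_region`, `IncompatibleRegion`, the example keys) is hypothetical.

```python
import struct

MAGIC = b"PMXD"                    # hypothetical marker, not an OpenPMIx constant
HEADER = struct.Struct("<4sHH")    # magic, layout version, number of keys
ENTRY = struct.Struct("<Hq")       # key index into the dictionary, integer value
SUPPORTED_LAYOUTS = {1, 2}         # assumed: "current" and "current-1"


class IncompatibleRegion(RuntimeError):
    """Raised instead of silently misreading a region."""


def pack_region(layout_version, dictionary, entries):
    """Write the key dictionary at the head of the region, then the data."""
    buf = bytearray(HEADER.pack(MAGIC, layout_version, len(dictionary)))
    for key in dictionary:
        raw = key.encode("utf-8")
        buf += struct.pack("<H", len(raw)) + raw
    for key, value in entries:
        # Data refers to keys by their index in the embedded dictionary.
        buf += ENTRY.pack(dictionary.index(key), value)
    return bytes(buf)


def unpack_region(buf):
    """Read a region using the dictionary stored in it, or fail loudly."""
    if len(buf) < HEADER.size:
        raise IncompatibleRegion("region too small to hold a header")
    magic, layout_version, nkeys = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise IncompatibleRegion("region has no dictionary header")
    if layout_version not in SUPPORTED_LAYOUTS:
        raise IncompatibleRegion(f"unsupported layout version {layout_version}")
    offset = HEADER.size
    dictionary = []
    for _ in range(nkeys):
        (length,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        dictionary.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    result = {}
    while offset < len(buf):
        index, value = ENTRY.unpack_from(buf, offset)
        # Resolve the key through the writer's dictionary, not the reader's.
        result[dictionary[index]] = value
        offset += ENTRY.size
    return result


# The writer's key order need not match anything compiled into the reader.
region = pack_region(2, ["example.size", "example.rank"],
                     [("example.rank", 3), ("example.size", 8)])
values = unpack_region(region)
```

Because the reader resolves every key through the dictionary stored in the region, it does not depend on the writer's internal key ordering, and a region with an unknown layout version raises an error rather than returning wrong data.
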
  • Would be good to end the openpmix-4.2 series (e.g., dstore problem) and move on to the openpmix-5.0 series

  • Bad interface issue

    • A user reported long delays during startup that were due to a bad network interface being used, which eventually hit a long timeout
    • Suggestion is to report an error when a long timeout occurs
    • Seems like a good idea to print a clear msg about the timeout
    • An error msg with the name of the interface that is timing out would seem good; the user can then just restrict the set of interfaces.
    • It might be useful to try all interfaces at once to avoid long serial timeouts. Open to a pull request/prototype
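
A rough sketch of the "try all interfaces at once" idea, combined with an error message that names the interface. It is only an illustration in Python (OpenPMIx is C) and does not reflect existing OpenPMIx code: `probe_interfaces` and the caller-supplied `probe(name, timeout)` callable are hypothetical, and the probe is assumed to raise `OSError` (which includes `TimeoutError`) on failure.

```python
import concurrent.futures
import sys


def probe_interfaces(interfaces, probe, timeout):
    """Probe all interfaces concurrently.

    probe(name, timeout) is a caller-supplied (hypothetical) function that
    returns normally on success and raises OSError on failure.
    Returns (sorted usable names, {failed name: reason}).
    """
    usable, failed = [], {}
    workers = max(1, len(interfaces))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(probe, name, timeout): name for name in interfaces}
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                future.result()
                usable.append(name)
            except OSError as exc:
                failed[name] = str(exc) or type(exc).__name__
    for name in sorted(failed):
        # Name the interface so the user knows which one to exclude.
        print(f"interface {name} failed: {failed[name]}; "
              "consider restricting the set of interfaces", file=sys.stderr)
    return sorted(usable), failed
```

With one worker per interface, the total wait is roughly a single timeout rather than the sum of the timeouts of every bad interface.
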
  • Removal of 'stat()' calls

    • Found an issue on a Lustre install: the code calls stat() on a lot of things, and this was taking a lot of time. So started removing uses of stat() to avoid these long delays.
    • Working through the code base; cleaned up the session dir area. Lower priority, but open to input from others -- just coordinate to avoid duplication of work.
  • European projects needing scheduler integration

    • Have milestones in Summer 2024
    • So need scheduler integration work
  • Succession planning

    • Met at SC23 and have a reasonable plan
    • Plan to run ideas by the PMIx Standard group at the Dec PMIx monthly
    • General idea is to have PMIx and OpenPMIx under the same umbrella. Envision new ideas feeding into standardization, and there would still be someone in charge of the OpenPMIx efforts to handle releases, etc.
    • Will keep the option of alternate implementations of PMIx, but there are currently no known alt. implementations
    • Next week will have an open discussion to voice any concerns
    • If all are on board, will propose it as official change at next quarterly (Jan 2024). That would include adjustment to the PMIx standard governance as appropriate.
    • Need to also handle the commit/merge-request issue into master branch
    • Do not want to add long delays on the openpmix devel side, namely for the scheduler integration (do not want to have that work off in a "side" branch)
    • May need to think about how to maintain a roadmap for release management
    • There is value in having things in the official PMIx standard, but do not want added delays in implementation; need a good plan to get the standard updated in a timely fashion to best support the projects consuming these new interfaces (i.e., scheduler interfaces).
    • Takeaway: make sure we have a good plan to avoid impacts to scheduler users
    • Q: Is part of the succession plan going to talk about the development side?
      • We talked about this a bit at SC23
      • There will be some changes, likely a bit more formal than today, possibly more like Open MPI, to get approvals/sign-offs on protected branches.
      • Main concern is that adding APIs or attributes could carry a bit more overhead, since the standard would have to be involved
    • Q: Will there be people who will review/approve PRs?
      • Yes
      • There will be some people assigned to manage releases and that will help to oversee these new processes.
    • Expectation is that this will help longer term
    • Will try things, adjust as we go
  • DOE IRI

    • SC23 workshop info (Debbie Bard talk)
    • Ralph: Help to move between DOE facilities
    • Ralph: Spoke to Debbie Bard and has a phone call planned to talk further. The feeling is that PMIx could help make this work.
    • Possibly some support from IRI program or informal representation
    • Howard: Attended confab23 where this was discussed. This is much larger than just PMIx; it includes things related to data movement, etc. There is existing infrastructure on the cloud side. The mechanism that will most likely provide funding is FOAs. A portion of this will include IOT to DOE/Cloud things.
    • Ralph: Workshop was primarily focused on malleable computing and moving between facilities
  • Misc items

    • Flux+PMIx
      • Flux support for PMIx is minimal
      • Not opposed to PMIx, but do not have resources/time to invest in PMIx at the moment. So would need someone to provide code if more functionality is wanted.
    • NERSC issue
      • Ralph: NERSC is having a versioning issue, need to sync with Howard
      • Howard: Has some possible suggestions related to build/config.
  • TODO: Better publicize the OpenPMIx zoom link for 2024

  • TODO: Create new CY2024 zoom link