Meeting 2022 03 03

03/03/2022 OpenPMIx-devel call notes

Attendees

  • Samuel Gutierrez (LANL)
  • Aurelien Bouteiller (UTK)
  • Matthew Baker (ORNL)
  • Ralph Castain (Nanook)
  • Brian Barrett (Amazon)
  • Thomas Naughton (ORNL)

Notes

  • See the summary of Ralph's recent changes/agenda here: https://groups.google.com/g/pmix/c/EEHh2qfgNec

  • Ralph: refactoring code in prrte/pmix to eliminate duplicated code

    • started with CLI/arg parsing
    • moved much of the util and class pieces out of prrte (now using the pmix copies)
    • working on a rework of the messaging (oob/rml) layer
  • Aurelien: some concern about the oob/rml rework happening without discussion

    • concern about messages being sent without routing, and about losing features from the rml
    • would be good to review prior capability discussions to revisit these messaging features
  • Ralph: the rework is based on having a fixed fan-out that matches the switch patterns.

  • Aurelien: we are talking about the algorithms within the messaging layer, and there is interest in keeping them versatile, e.g., allgather algorithms

  • Ralph: the intent in the past with prte was to have out-of-band rte messages follow the shape of the switch topology: pass data back up to the root and then broadcast out. It may not be optimized, but the desire is to avoid that behavior and instead move toward improving the broadcast. (A toy illustration of the fixed fan-out tree follows.)
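
For context, here is a toy illustration (not prte's actual routing code) of a fixed fan-out routing tree: each daemon has one parent and up to K children, with K chosen to match the switch radix. The K=4 below is an arbitrary assumption for the sketch.

```c
/* Toy sketch of a fixed fan-out (k-ary) routing tree; prte's real
 * routing component is more involved, this just shows the shape. */
#include <stdio.h>

#define K 4  /* assumed fan-out; would be chosen to match the switch radix */

static int parent_of(int rank)      { return (rank == 0) ? -1 : (rank - 1) / K; }
static int first_child_of(int rank) { return rank * K + 1; }

int main(void)
{
    int nprocs = 13;
    for (int r = 0; r < nprocs; r++) {
        printf("daemon %2d: parent=%2d children=", r, parent_of(r));
        for (int c = first_child_of(r); c < first_child_of(r) + K && c < nprocs; c++) {
            printf("%d ", c);
        }
        printf("\n");
    }
    return 0;
}
```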

  • Ralph: at small scale, not a big deal; at large scale, when doing the modex operation you are paying the cost of the allgather.

  • Ralph: Point being, the intent is not to change behavior but to clean up the code. If we want to extend functionality or add features, we can look into that after the cleanup.

  • Aurelien: discussion about the approach; there may be challenges if the ability to do point-to-point operations is removed (e.g., past orte behaviour)

  • Ralph: in prte we are only doing xcast, not pt2pt

  • Ralph: the many-task users may be the ones exposing these heavy xcast costs

  • Aurelien: pipelining the xcasts will probably be good for these, to help order them. The BMG is not a good fit for the generic xcast; the BMG's advantage is that it helps with reconnecting to the parent multiple times (during failure cases).

  • Ralph: prte has always assumed communication goes through the routing tree. The current efforts are not trying to change that, just to clean up the code base.

  • Aurelien: my immediate goal is to have something that works with ompi-v5 for fault tolerance, even if not optimal. Then there can be a second pass where the performance is improved.

  • Aurelien: There is some work on reconnection for fault tolerance. Will that fit in / benefit?

  • Ralph: OMPI just cares about the PMIx failure event being delivered everywhere, and prrte already does that and will continue to (a sketch of the client-side registration is below). Currently, prte will by default just fail, but there could possibly be an option to avoid failing/tearing down the daemons.
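
For reference, a minimal sketch of how a client learns about a peer failure through the public PMIx event API. PMIX_ERR_PROC_ABORTED is one relevant status code; the exact set of codes OMPI registers for may differ, and error handling is abbreviated here.

```c
#include <pmix.h>
#include <stdio.h>

/* Invoked by the PMIx progress thread when a registered event fires. */
static void failure_cb(size_t evhdlr_registration_id, pmix_status_t status,
                       const pmix_proc_t *source,
                       pmix_info_t info[], size_t ninfo,
                       pmix_info_t results[], size_t nresults,
                       pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
{
    fprintf(stderr, "proc %s:%u reported failed (status %d)\n",
            source->nspace, source->rank, status);
    /* tell the PMIx library we are done handling this event */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

void register_failure_handler(void)
{
    pmix_status_t code = PMIX_ERR_PROC_ABORTED;
    /* NULL completion callback for brevity; a real client would pass one
     * and wait for the registration to complete */
    PMIx_Register_event_handler(&code, 1, NULL, 0, failure_cb, NULL, NULL);
}
```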

  • Aurelien: there is interest in failure-detection latency.

  • TJN: missed some of the discussion here.

  • Ralph: plan is to get things cleaned up in master, and leave the v2.1 branch with the ft pieces for now.

  • Ralph: it would be good to have fault-response code for prte, so that the FT code can build by default.

  • Brian: Austen and Brian are trying to track down a prte bug, and to pin down when the error was introduced; it fails with MPI.

  • Austen: with the head of master for pmix/prte, running the dmodex test gets a segv in the remote case. With MPI, got a hang.

  • Brian: it occurs before the rml changes; the last place that compiled clean with good behaviour was prrte @ 72824f52

  • Can reproduce the issue without OMPI, using the dmodex example test

  • Ralph: maybe some recent changes are accidentally calling PMIX_RELEASE on things that are not actually proper objects, i.e., calling release on a non-pmix object?
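
To illustrate the suspected bug class: PMIX_RELEASE operates on reference-counted objects created with PMIX_NEW via OpenPMIx's internal class system (src/class/pmix_object.h, which is internal, not public API). A toy sketch, not code a PMIx client would write:

```c
#include <stdlib.h>
#include "src/class/pmix_object.h"  /* internal OpenPMIx header */

typedef struct {
    pmix_object_t super;  /* class header: refcount + class pointer */
    int payload;
} my_obj_t;

static void my_con(my_obj_t *p) { p->payload = 0; }
static void my_des(my_obj_t *p) { (void)p; }
PMIX_CLASS_INSTANCE(my_obj_t, pmix_object_t, my_con, my_des);

void ok_case(void)
{
    my_obj_t *obj = PMIX_NEW(my_obj_t);  /* header initialized, refcount = 1 */
    PMIX_RELEASE(obj);                   /* refcount hits 0: destructor runs, memory freed */
}

void suspected_bug(void)
{
    my_obj_t *raw = malloc(sizeof(*raw)); /* header never initialized */
    PMIX_RELEASE(raw);                    /* reads a garbage refcount/class pointer:
                                             heap corruption or segv */
}
```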

  • May need to revisit some of the CI to call more than just pmix init/fini, to exercise things sufficiently for the MPI scenario (a sketch of such a test follows)
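
A minimal sketch of a CI-style smoke test that goes beyond PMIx_Init/PMIx_Finalize: each rank publishes a value, fences with data collection, and reads a peer's value back, which exercises the modex path an MPI job would hit. The key name "ci.test.key" is made up for the example, and error handling is abbreviated.

```c
#include <pmix.h>
#include <stdio.h>
#include <stdbool.h>

int main(void)
{
    pmix_proc_t me, peer;
    pmix_value_t val, *retrieved = NULL;
    pmix_info_t info;
    uint32_t data = 42;
    bool collect = true;

    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) return 1;

    /* publish a value under a test key */
    PMIX_VALUE_LOAD(&val, &data, PMIX_UINT32);
    PMIx_Put(PMIX_GLOBAL, "ci.test.key", &val);
    PMIx_Commit();

    /* collective exchange (the modex) */
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    PMIx_Fence(NULL, 0, &info, 1);
    PMIX_INFO_DESTRUCT(&info);

    /* fetch the value published by rank 0 */
    PMIX_LOAD_PROCID(&peer, me.nspace, 0);
    if (PMIX_SUCCESS == PMIx_Get(&peer, "ci.test.key", NULL, 0, &retrieved)) {
        printf("rank %u read %u from rank 0\n", me.rank, retrieved->data.uint32);
        PMIX_VALUE_RELEASE(retrieved);
    }

    return (PMIX_SUCCESS == PMIx_Finalize(NULL, 0)) ? 0 : 1;
}
```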

  • Ralph: ok, will take a look

  • Ralph: There is an issue with the Python bindings. Maybe something specific to Cython?

  • Matt: OK, I will take a look at 2482 and comment on the ticket

  • Ralph: Sam, I looked at the pieces you cared about and they look pretty trivial

  • Ralph: Howard needed a pmix ID for MPI Sessions. There is an issue with the implementation of fast local IDs: MPI has to exchange the local IDs to build the global-comm-ID to local-comm-ID mapping, and can use the fence to do the exchange. When an MPI process wants the local comm ID for a particular global ID, a pmix_get will fetch the local ID for that specific global comm ID, so a qualifier needs to be added to pmix_get to support this capability. This is in the PMIx spec but was not yet in the implementation; the plan is to have it ready next week (a hedged sketch of the lookup is below). Also, some work has been done to reduce the memory footprint of mapping IDs to strings.
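
A hedged sketch of the qualified lookup described above: fetch a peer's local comm ID for a given global (context) ID by passing the context ID as a qualifier to PMIx_Get. The attribute names and value types below (PMIX_GROUP_CONTEXT_ID, PMIX_GROUP_LOCAL_CID) follow the PMIx spec's group attributes, but exactly which key and qualifier this implementation will use is an assumption here.

```c
#include <pmix.h>

pmix_status_t lookup_local_cid(const pmix_proc_t *peer,
                               size_t global_ctx_id, size_t *local_cid)
{
    pmix_info_t qual;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    /* qualifier: which global context ID we are asking about
     * (attribute choice and type are assumptions for this sketch) */
    PMIX_INFO_LOAD(&qual, PMIX_GROUP_CONTEXT_ID, &global_ctx_id, PMIX_SIZE);

    rc = PMIx_Get(peer, PMIX_GROUP_LOCAL_CID, &qual, 1, &val);
    if (PMIX_SUCCESS == rc) {
        *local_cid = val->data.size;  /* assumed stored as a size_t */
        PMIX_VALUE_RELEASE(val);
    }
    PMIX_INFO_DESTRUCT(&qual);
    return rc;
}
```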